Learn

Logistic Regression

Scikit-Learn

Now that you know the inner workings of how Logistic Regression works, let’s learn how to easily and quickly create Logistic Regression models with `sklearn`

! `sklearn`

is a Python library that helps build, train, and evaluate Machine Learning models.

To take advantage of `sklearn`

‘s abilities, we can begin by creating a `LogisticRegression`

object.

`model = LogisticRegression()`

After creating the object, we need to fit our model on the data. When we fit the model with `sklearn`

it will perform gradient descent, repeatedly updating the coefficients of our model in order to minimize the log-loss. We train — or fit — the model using the `.fit()`

method, which takes two parameters. The first is a matrix of features, and the second is a matrix of class labels.

`model.fit(features, labels)`

Now that the model is trained, we can access a few useful attributes of the `LogisticRegression`

object.

`model.coef_`

is a vector of the coefficients of each feature`model.intercept_`

is the intercept`b_0`

With our trained model we are able to predict whether new data points belong to the positive class using the `.predict()`

method! `.predict()`

takes a matrix of features as a parameter and returns a vector of labels `1`

or `0`

for each sample. In making its predictions, `sklearn`

uses a classification threshold of `0.5`

.

`model.predict(features)`

If we are more interested in the predicted probability of the data samples belonging to the positive class than the actual class, we can use the `.predict_proba()`

method. `predict_proba()`

also takes a matrix of features as a parameter and returns a vector of probabilities, ranging from `0`

to `1`

, for each sample.

`model.predict_proba(features)`

Before proceeding, one important note is that `sklearn`

‘s Logistic Regression implementation requires feature data to be normalized. Normalization scales all feature data to vary over the same range. `sklearn`

‘s Logistic Regression requires normalized feature data due to a technique called *Regularization* that it uses under the hood. Regularization is out of the scope of this lesson, but in order to ensure the best results from our model, we will be using a normalized version of the data from our Codecademy University example.

Let’s build, train and evaluate a Logistic Regression model in `sklearn`

for our Codecademy University data! We’ve imported `sklearn`

and the `LogisiticRegression`

classifier for you. Create a Logistic Regression model named `model`

.

Train the model using `hours_studied_scaled`

as the training features and `passed_exam`

as the training labels.

Save the coefficients of the model to the variable `calculated_coefficients`

, and the intercept of the model to `intercept`

. Print `calculated_coefficients`

and `intercept`

.

The next semester a group of students in the Introductory Machine Learning course want to predict their final exam scores based on how much they intended to study for the exam. The number of hours each student thinks they will study, normalized, is given in `guessed_hours_scaled`

. Use `model`

to predict the probability that each student will pass the final exam, and save the probabilities to `passed_predictions`

.

That same semester, the Data Science department decides to update the final exam passage model to consider two features instead of just one. During the final exam, students were asked to estimate how much time they spent studying, as well as how many previous math courses they have taken. The student responses, along with their exam results, were split into training and test sets. The training features, normalized, are given to you in `exam_features_scaled_train`

, and the students’ results on the final are given in `passed_exam_2_train`

.

Create a new Logistic Regression model named `model_2`

and train it on `exam_features_scaled_train`

and `passed_exam_2_train`

.

Use the model you just trained to predict whether each student in the test set, `exam_features_scaled_test`

, will pass the exam and save the predictions to `passed_predictions_2`

. Print `passed_predictions_2`

.

Compare the predictions to the actual student performance on the exam in the test set. How well did your model do?