Using a trained model, we can predict whether new datapoints belong to the positive class (the group labeled as 1) using the .predict() method. The input is a matrix of features and the output is a vector of predicted labels:

print(model.predict(features)) # Sample output: [0 1 1 0 0]
If we are more interested in the predicted probability of group membership, we can use the .predict_proba() method. The input to .predict_proba() is also a matrix of features, and the output is an array of probabilities ranging from 0 to 1:

print(model.predict_proba(features)[:,1]) # Sample output: [0.32 0.75 0.55 0.20 0.44]
By default, .predict_proba() returns the probability of class membership for both possible groups. In the example code above, we’ve only printed out the probability of belonging to the positive class. Notice that datapoints with predicted probabilities greater than 0.5 (the second and third datapoints in this example) were classified as 1s by the .predict() method. This process is known as thresholding. As we can see here, sklearn sets the default classification threshold probability at 0.5.
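We can verify the thresholding behavior ourselves by comparing .predict() against manually thresholded probabilities. The sketch below uses a small made-up dataset (the feature values and labels are assumptions, not the lesson's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: a single feature (e.g., hours studied)
# and a binary outcome (0 = fail, 1 = pass).
features = np.array([[1.0], [2.0], [3.5], [5.0], [6.0], [8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(features, labels)

# Threshold by hand: probability of the positive class above 0.5 -> 1, else 0.
probs = model.predict_proba(features)[:, 1]
manual_predictions = (probs > 0.5).astype(int)

# This matches sklearn's default .predict() behavior.
print(np.array_equal(manual_predictions, model.predict(features)))
```

Raising or lowering the 0.5 cutoff in the manual version is how you would trade off false positives against false negatives, something .predict() alone does not expose.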
In the workspace, we’ve fit the same logistic regression model on the CodecademyU training data. We’ve also created y_test, which contains the true labels for the testing data.
Use the .predict() method to predict whether the students in the test dataset will pass the final exam, then print out the resulting vector of predictions.
Now, use the .predict_proba() method to calculate the predicted probability that each student in the test dataset will pass the exam. Print out the results.
Print out y_test to see whether the students in the test dataset actually passed the exam. Did the model make accurate predictions? Looking at the probabilities, do the misclassification(s) make sense?