Using a trained model, we can predict whether new datapoints belong to the positive class (the group labeled as 1) using the .predict() method. The input is a matrix of features and the output is a vector of predicted labels, 1 or 0.
print(model.predict(features)) # Sample output: [0 1 1 0 0]
If we are more interested in the predicted probability of group membership, we can use the .predict_proba() method. The input to .predict_proba() is also a matrix of features and the output is an array of probabilities, ranging from 0 to 1:
print(model.predict_proba(features)[:,1]) # Sample output: [0.32 0.75 0.55 0.20 0.44]
By default, .predict_proba() returns the probability of class membership for both possible groups, with one column per class. In the example code above, we’ve only printed out the probability of belonging to the positive class, which is the second column (selected with [:,1]). Notice that datapoints with predicted probabilities greater than 0.5 (the second and third datapoints in this example) were classified as 1s by the .predict() method. This is a process known as thresholding. As we can see here, sklearn sets the default classification threshold probability as 0.5.
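To make the link between the two methods concrete, here is a minimal sketch on made-up data. The names toy_model, hours, and passed are hypothetical and are not part of the lesson workspace. It checks that the two columns of .predict_proba() sum to 1 in each row, and that applying a 0.5 cutoff to the positive-class column gives the same labels as .predict() for this binary model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature (hours studied), label 1 = passed
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(hours, passed)

new_hours = np.array([[2.5], [4.5], [6.5]])

# One row per datapoint, one column per class (ordered as in toy_model.classes_)
both_classes = toy_model.predict_proba(new_hours)
positive_proba = both_classes[:, 1]  # probability of the class labeled 1

# Labels as returned by .predict()
labels = toy_model.predict(new_hours)

# Manual thresholding at 0.5 on the positive-class probability
thresholded = (positive_proba > 0.5).astype(int)

print(toy_model.classes_)                    # column order of predict_proba
print(both_classes.sum(axis=1))              # each row sums to 1
print(np.array_equal(labels, thresholded))   # manual thresholding matches .predict()
```

Because the cutoff is applied by hand in the last step, the same pattern could be reused with a different threshold if 0.5 is not appropriate for a given problem.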
Instructions
In the workspace, we’ve fit the same logistic regression model on the CodecademyU training data. We’ve also created X_test and y_test, which contain the testing data.
Use the .predict() method to predict whether the students in the test dataset will pass the final exam, then print out the resulting vector of predictions.
Now, use the .predict_proba() method to calculate the predicted probability that each student in the test dataset will pass the exam. Print out the results.
Print out y_test to see whether the students in the test dataset actually passed the exam. Did the model make accurate predictions? Looking at the probabilities, do the misclassification(s) make sense?
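As a rough illustration of how predictions, probabilities, and true labels can be lined up, the sketch below uses synthetic data rather than the CodecademyU dataset. All names (toy_X_test, toy_y_test, and so on) and the data-generating rule are assumptions made for the example. Misclassified points often, though not always, have probabilities close to 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the lesson's dataset):
# students who study more are more likely to pass, with some noise
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=(40, 1))
passed = (hours[:, 0] + rng.normal(0, 2, size=40) > 5).astype(int)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    hours, passed, test_size=0.25, random_state=0
)

toy_model = LogisticRegression()
toy_model.fit(toy_X_train, toy_y_train)

predictions = toy_model.predict(toy_X_test)
probabilities = toy_model.predict_proba(toy_X_test)[:, 1]

# Boolean mask marking the test points where prediction and truth disagree
misclassified = predictions != toy_y_test

# Fraction of test points predicted correctly
accuracy = np.mean(predictions == toy_y_test)

print("Predicted:", predictions)
print("Actual:   ", toy_y_test)
print("Accuracy: ", accuracy)
print("Probabilities of misclassified points:", probabilities[misclassified])
```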