As we’ve seen, logistic regression is used to predict the probability of group membership. Once we have this probability, we need to make a decision about what class a datapoint belongs to. This is where the classification threshold comes in!
The default threshold for sklearn is 0.5. If the predicted probability of an observation belonging to the positive class is greater than or equal to the threshold, 0.5, the datapoint is assigned to the positive class.
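We can see this equivalence directly: calling .predict() gives the same result as thresholding the output of .predict_proba() at 0.5. The data below is made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): hours studied and whether the student passed.
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Probability of the positive class (second column of predict_proba).
probs = model.predict_proba(hours)[:, 1]

# .predict() applies the default 0.5 threshold internally...
default_preds = model.predict(hours)

# ...so thresholding the probabilities ourselves gives the same labels.
manual_preds = (probs >= 0.5).astype(int)
print(np.array_equal(default_preds, manual_preds))
```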
We can choose to change the threshold of classification based on the use-case of our model. For example, if we are creating a logistic regression model that classifies whether or not an individual has cancer, we may want to be more sensitive to the positive cases. We wouldn’t want to tell someone they don’t have cancer when they actually do!
In order to ensure that most patients with cancer are identified, we can move the classification threshold down to 0.3 or 0.4, increasing the sensitivity of our model to predicting a positive cancer classification. While this might result in more overall misclassifications, we are now missing fewer of the cases we are trying to detect: actual cancer patients.
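A lowered threshold can be applied by comparing the predicted probabilities to the new cutoff ourselves. The sketch below uses made-up screening data; note that every datapoint classified positive at 0.5 is still positive at 0.3, so lowering the threshold can only add positive predictions, never remove them:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical screening data: one feature (e.g., a test measurement)
# and labels where 1 means the condition is present.
X = np.array([[0.2], [0.8], [1.5], [2.1], [2.4], [3.0], [3.6], [4.2]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]

# Default threshold (0.5) vs. a lower, more sensitive threshold (0.3).
preds_default = (probs >= 0.5).astype(int)
preds_sensitive = (probs >= 0.3).astype(int)

# The set of positives at 0.3 contains the set of positives at 0.5,
# so sensitivity (recall on the positive class) can only stay the same or rise.
print(preds_default.sum() <= preds_sensitive.sum())
```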
Instructions
In the workspace, we've fit the same logistic regression model on the CodecademyU training data. We've also printed the predicted classes and true classes for the test data.
Take a look at the predicted probability of passing the exam for the misclassified datapoint. The .predict() method uses a default threshold of 0.5 for predicting group membership. For this example, we could correctly classify all five datapoints in the test dataset using a different threshold.
Set the value of alternative_threshold to any value that would accomplish this.