Logistic regression predicts not only the class of a sample but also the probability that the sample belongs to each class, so every prediction carries a measure of certainty. In binary classification, the default threshold is 50%: samples with a predicted probability above it are assigned to the positive class, and those below it to the negative class. If two samples have predicted probabilities of 51% and 99%, both are classified as positive under the default threshold. If the threshold is raised to 60%, however, the sample with a predicted probability of 51% is assigned to the negative class instead.
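The 51% and 99% example can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name classify is hypothetical and is not part of any library.

```python
def classify(probabilities, threshold=0.5):
    """Return 1 (positive) where the probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


# The two predicted probabilities from the example in the text.
probs = [0.51, 0.99]

default_labels = classify(probs)                  # default 50% threshold
stricter_labels = classify(probs, threshold=0.6)  # threshold raised to 60%

print(default_labels)   # [1, 1]
print(stricter_labels)  # [0, 1]
```

Only the sample at 51% changes class, because it is the only one lying between the old and the new threshold.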
Consider the histogram of the predicted probabilities for the logistic regression classifier trained above on the breast cancer data set. The benign (negative) class is shown in blue and the malignant (positive) class in orange. The benign cases are heavily clustered around zero, which is good because they will be correctly classified as benign, whereas the malignant cases are heavily clustered around one. The vertical lines mark hypothetical threshold values of 25%, 50%, and 75%. With the highest threshold, almost all the samples above 75% belong to the malignant class, although a few benign cases are still misdiagnosed as malignant (false positives). At the same time, a number of malignant cases fall below the threshold and are missed (false negatives). If the lowest threshold is used instead, almost all the malignant cases are identified, but there are more false positives.
The threshold is therefore an additional lever for tuning a model’s predictions: a higher value generally yields fewer false positives but more false negatives, whereas a lower value yields fewer false negatives but more false positives.
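The trade-off can be made concrete by counting both error types at several thresholds. The sketch below uses a small set of made-up labels and probabilities purely for illustration. They are not taken from the breast cancer data, and count_errors is a hypothetical helper.

```python
# Illustrative toy data (not from the breast cancer data set):
# 1 = positive (malignant), 0 = negative (benign).
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
proba = [0.02, 0.10, 0.30, 0.55, 0.20, 0.80, 0.60, 0.70, 0.90, 0.97]


def count_errors(y_true, proba, threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    fp = sum(1 for y, p in zip(y_true, proba) if y == 0 and p > threshold)
    fn = sum(1 for y, p in zip(y_true, proba) if y == 1 and p <= threshold)
    return fp, fn


errors = {t: count_errors(y_true, proba, t) for t in (0.25, 0.50, 0.75)}
for t, (fp, fn) in errors.items():
    print(f"threshold={t:.2f}  false positives={fp}  false negatives={fn}")
# threshold=0.25  false positives=3  false negatives=1
# threshold=0.50  false positives=2  false negatives=1
# threshold=0.75  false positives=1  false negatives=3
```

As the threshold rises, false positives can only stay the same or decrease, while false negatives can only stay the same or increase.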
Instructions
From the trained logistic regression model, create a new array with the predicted probabilities of each class. Verify that applying a prediction threshold of 0.5 to these probabilities gives the same values as the original array of predicted classes.
Change the prediction threshold to 0.25 and to 0.75, and check how the confusion matrix changes in each case.
Given our dataset, and given that we want to identify as many malignant cases as possible (that is, minimize false negatives), choose a threshold such that no more than 2 in 100 malignancies are missed.
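One possible way to approach these steps is sketched below with scikit-learn. Several details are assumptions rather than the lesson's own setup: the data come from load_breast_cancer, whose labels are flipped so that malignant is the positive class (1); the model is a stand-in pipeline with feature scaling; and the split parameters and variable names are hypothetical. "No more than 2 in 100 malignancies missed" is read here as a recall of at least 0.98 for the malignant class.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's copy of the data codes malignant as 0, so flip
# the labels to make malignant the positive class (1), as in the text.
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Stand-in for the model trained earlier in the lesson.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Column 1 of predict_proba holds P(class == 1), i.e. P(malignant) here.
proba_malignant = model.predict_proba(X_test)[:, 1]

# Step 1: a 0.5 threshold should reproduce the default predictions.
y_pred_50 = (proba_malignant > 0.5).astype(int)
same_as_default = np.array_equal(y_pred_50, y_pred)
print("0.5 threshold matches predict():", same_as_default)

# Step 2: confusion matrices per threshold (rows: true, columns: predicted).
matrices = {}
for t in (0.25, 0.5, 0.75):
    matrices[t] = confusion_matrix(y_test, (proba_malignant > t).astype(int))
    print(f"threshold={t}\n{matrices[t]}")

# Step 3: highest threshold on a 0.01 grid whose recall is at least 0.98.
thresholds = np.linspace(0.0, 1.0, 101)
recalls = np.array(
    [recall_score(y_test, (proba_malignant > t).astype(int)) for t in thresholds]
)
meets_target = recalls >= 0.98
chosen = thresholds[meets_target].max() if meets_target.any() else None
print("chosen threshold:", chosen)
```

In each confusion matrix the entry at row 1, column 0 counts the missed malignancies, so it is the number to watch as the threshold changes. The sketch picks the threshold on the test split for brevity; in practice a separate validation split would usually be a safer basis for that choice.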