When we fit a machine learning model, we need some way to evaluate it. Often, we do this by splitting our data into training and test datasets. We use the training data to fit the model; then we use the test set to see how well the model performs with new data.
As a first step, data scientists often look at a confusion matrix, which shows the number of true positives, false positives, true negatives, and false negatives.
For example, suppose that the true and predicted classes for a logistic regression model are:
y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
We can create a confusion matrix as follows:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_true, y_pred))
Output:
[[3 2]
 [1 4]]
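scikit-learn puts the true classes in rows and the predicted classes in columns, so for the labels 0 and 1 the layout is [[TN, FP], [FN, TP]]. This output tells us that there are 3 true negatives, 1 false negative, 4 true positives, and 2 false positives. Ideally, we want the numbers on the main diagonal (in this case, 3 and 4, which are the true negatives and true positives, respectively) to be as large as possible.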
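To see where the four counts come from, here is a minimal sketch that tallies them by hand from the example labels, without scikit-learn. It assumes 1 is the positive class and 0 the negative class; the name manual_matrix is illustrative and not part of the lesson.

```python
# Example labels from the lesson (assumed: 0 = negative class, 1 = positive class).
y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# Tally each (true, predicted) pair into one of the four cells.
tn = fp = fn = tp = 0
for actual, predicted in zip(y_true, y_pred):
    if actual == 0 and predicted == 0:
        tn += 1  # correctly predicted negative
    elif actual == 0 and predicted == 1:
        fp += 1  # predicted positive, actually negative
    elif actual == 1 and predicted == 0:
        fn += 1  # predicted negative, actually positive
    else:
        tp += 1  # correctly predicted positive

# Arrange the counts as [[TN, FP], [FN, TP]].
manual_matrix = [[tn, fp], [fn, tp]]
print(manual_matrix)  # [[3, 2], [1, 4]]
```

The hand-built matrix matches the counts reported by confusion_matrix for the same labels.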
Instructions
In the workspace, we’ve fit the same logistic regression model on the codecademyU training data and made predictions for the test data. y_test contains the true classes and y_pred contains the predicted classes.
Create and print a confusion matrix for this data. How many incorrect classifications were there (false positives or false negatives)?
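One possible way to count incorrect classifications is to add up the off-diagonal entries of the confusion matrix. The sketch below uses the small example labels from earlier in the lesson as a stand-in, because the workspace's y_test and y_pred are not shown here, so the result will differ from the exercise answer. The names cm and n_incorrect are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Stand-in data: the lesson's earlier example labels,
# NOT the codecademyU workspace data.
y_test = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_test, y_pred)
print(cm)

# Everything off the main diagonal is a misclassification
# (false positives plus false negatives).
n_incorrect = cm.sum() - cm.trace()
print(n_incorrect)  # 3 for the stand-in data
```

Subtracting the diagonal from the total also works when there are more than two classes, which is why it is used here instead of indexing individual cells.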