In machine learning, classification tasks aim to predict categorical values. These are often called labels or classes.
For example, an ML model that tries to determine if a patient has a disease or not is a classification model. The labels or classes are “has the disease” and “does not have the disease”.
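There are two types of classification tasks: binary classification, where there are exactly two classes, and multiclass classification, where there are more than two classes.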
Label encoding is a technique to preprocess categorical data by mapping each category to an integer value.
For example, the code snippet for this review card encodes the letter grades A, B, C, D, and F as 4, 3, 2, 1, and 0.
Label encoding is typically how we’d encode the target of a classification model. It is also useful when the categories are already ordered (like the grades above) as that order can be reflected in the integer encoding.
df['Letter_Grade'] = df['Letter_Grade'].replace({'A':4,'B':3,'C':2,'D':1,'F':0})
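To see what the mapping does without a DataFrame, here is a minimal plain-Python sketch. The list `letter_grades` is made-up sample data, and `int_to_grade` is a hypothetical helper for decoding the integers back into letters.

```python
# Same mapping as the review card: higher grades get larger integers.
grade_to_int = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

# Hypothetical reverse mapping, useful for turning predictions back into letters.
int_to_grade = {value: grade for grade, value in grade_to_int.items()}

letter_grades = ["B", "A", "F", "C"]  # made-up sample data

encoded = [grade_to_int[grade] for grade in letter_grades]
decoded = [int_to_grade[value] for value in encoded]

print(encoded)  # [3, 4, 0, 2]
print(decoded)  # ['B', 'A', 'F', 'C']
```

Because A > B > C > D > F is a real ordering, comparisons such as `encoded[1] > encoded[0]` stay meaningful after encoding.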
One-hot encoding is a technique to preprocess categorical data by creating a new binary column for each category.
For example, the code snippet one-hot encodes a student’s living accommodations: “Rental”, “Dorm”, “With Family”, or “Other”. A 1 in a one-hot encoded column indicates that the student in question lives in that space. For example, the first row has a 1 in the first column. This means the first student lives in a rental.
[Table: the original column alongside its one-hot encoded columns]
df = pd.get_dummies(df,columns=['Accomodation'],dtype=int)
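The idea behind one-hot encoding can be sketched in plain Python. The sample list `accommodations` and the helper `one_hot` are illustrative assumptions, and the column order here is chosen by hand to match the card's description, so it may differ from the order a library produces.

```python
# One binary column per category, in a hand-picked order.
categories = ["Rental", "Dorm", "With Family", "Other"]

def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

accommodations = ["Rental", "Dorm", "With Family", "Other", "Dorm"]  # made-up data
encoded_rows = [one_hot(value, categories) for value in accommodations]

for value, row in zip(accommodations, encoded_rows):
    print(f"{value:12s} {row}")
```

The first row comes out as `[1, 0, 0, 0]`, which is the "1 in the first column" that marks a student living in a rental.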
The sigmoid activation function is a mathematical function that takes in a real number x as input and outputs a value between 0 and 1.
$\text{sigmoid}(x) = \frac{1}{1+e^{-x}}$
For example, the image attached to this review card demonstrates that the sigmoid output for 2.5 is very close to 1 (approximately 0.924).
In binary classification, we’d interpret this as a 92.4% probability that the data point should be labeled 1 (as opposed to 0).
Binary classification neural networks need the sigmoid activation function on the output layer. The sigmoid will transform the neural network output into a probability, allowing us to determine the right label for the input data.
PyTorch implements sigmoid in nn.Sigmoid().
from torch import nn
# 5 input features -> 3 hidden units -> 1 output passed through sigmoid
model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 1), nn.Sigmoid())
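The sigmoid value quoted for 2.5 can be checked by hand. The following is a minimal sketch of the standard sigmoid definition in plain Python, not PyTorch's implementation; it is meant for moderate inputs only, since very large negative inputs would overflow `math.exp`.

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

print(round(sigmoid(2.5), 3))  # 0.924
print(sigmoid(0.0))            # 0.5
```

An input of 0 sits exactly on the 0.5 boundary, positive inputs push the probability toward 1, and negative inputs push it toward 0.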
In binary classification tasks, the neural network outputs a probability that the input data should be labeled 1 (as opposed to 0).
A threshold converts the probability into a label: 1 or 0.
For example, if the threshold is 0.5, any probability greater than or equal to 0.5 results in a prediction of 1. Otherwise, the prediction is 0.
The ML engineer determines the threshold, so the exact threshold may depend on what the goal of a specific model is. A good starting point is a 0.5 threshold.
threshold = 0.5 # threshold may vary
classification = int(probability >= threshold)
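The same rule can be applied to several probabilities at once. In this sketch, the helper `classify` and the list `probabilities` are illustrative assumptions; the second call shows how a stricter threshold changes the labels.

```python
def classify(probability, threshold=0.5):
    """Label 1 if the probability reaches the threshold, otherwise 0."""
    return int(probability >= threshold)

probabilities = [0.92, 0.5, 0.49, 0.08]  # made-up model outputs

print([classify(p) for p in probabilities])                 # [1, 1, 0, 0]
print([classify(p, threshold=0.9) for p in probabilities])  # [1, 0, 0, 0]
```

Raising the threshold makes the model more reluctant to predict 1, which is one way an engineer might trade false positives for false negatives.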
The binary cross-entropy loss (BCELoss) is a common loss function for binary classification tasks. It measures model performance based on how far the predicted probabilities are from their true target labels.
When the true classification is 1, the BCE loss takes the negative logarithm of the sigmoid probability p:
$\text{BCELoss}(p) = -\log(p)$
When the true classification is 0, the BCE loss takes the negative logarithm of 1 - p:
$\text{BCELoss}(p) = -\log(1-p)$
import numpy as np

# by-hand definition of BCELoss for a single probability
def BCELoss(p, y):
    if y == 1:  # if the true classification is 1
        return -np.log(p)
    else:  # if the true classification is 0
        return -np.log(1 - p)

# importing a full BCELoss implementation from PyTorch
from torch import nn
loss = nn.BCELoss()
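To get a feel for how the loss behaves, here is a minimal plain-Python sketch that scores a few made-up predictions and averages the per-sample losses. The helper names `bce_loss` and `mean_bce` are hypothetical, and averaging is an assumption chosen to mirror the default mean reduction of `nn.BCELoss`.

```python
import math

def bce_loss(p, y):
    """Binary cross-entropy for one predicted probability p and true label y."""
    return -math.log(p) if y == 1 else -math.log(1 - p)

def mean_bce(probabilities, labels):
    """Average the per-sample losses over a batch."""
    losses = [bce_loss(p, y) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)

confident_right = bce_loss(0.9, 1)  # about 0.105
confident_wrong = bce_loss(0.1, 1)  # about 2.303

print(round(confident_right, 3), round(confident_wrong, 3))
print(round(mean_bce([0.9, 0.1], [1, 0]), 3))
```

A confident correct prediction is penalized only slightly, while a confident wrong prediction is penalized heavily, which is what pushes the network toward well-calibrated probabilities.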
Stochastic gradient descent (SGD) is a variant of the traditional gradient descent optimization algorithm. Instead of using the entire training set to compute the gradient of the loss function, SGD uses a single randomly chosen data point.
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.001)
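The "one randomly chosen data point per update" idea can be sketched on a toy problem. Everything here is an illustrative assumption rather than PyTorch's implementation: a one-parameter model y = w * x, a squared-error loss, made-up data that follows y = 2x, and a hand-derived gradient.

```python
import random

def sgd_step(w, x, y, lr):
    """One SGD update for the loss (w * x - y) ** 2 on a single data point."""
    grad = 2 * (w * x - y) * x  # derivative of the loss with respect to w
    return w - lr * grad

random.seed(0)  # fixed seed so the run is repeatable
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up points on y = 2x

w = 0.0
for _ in range(200):
    x, y = random.choice(data)  # a single randomly chosen data point
    w = sgd_step(w, x, y, lr=0.05)

print(round(w, 3))  # close to 2.0
```

Each update looks at one point only, yet the parameter still settles near the value that fits the whole dataset.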
Accuracy is a common evaluation metric for classification models. It measures the percent of correct predictions:
$\text{Accuracy} = \frac{\text{\# of correct predictions}}{\text{\# of predictions}}$
We can also think of accuracy in the context of false positives/negatives and true positives/negatives:
$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$
The accuracy of a model can be computed using the accuracy_score function from scikit-learn by passing in the true class labels and the model’s predicted labels.
from sklearn.metrics import accuracy_score
# Calculate accuracy
accuracy = accuracy_score(y_train, predicted_labels)
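Both accuracy formulas can be checked against each other by hand. This plain-Python sketch uses made-up binary labels; the helper names `accuracy` and `confusion_counts` are hypothetical and are not part of scikit-learn.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def confusion_counts(true_labels, predicted_labels):
    """Return (TP, TN, FP, FN) for binary labels 0 and 1."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

true_labels      = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up data
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(true_labels, predicted_labels)
print(tp, tn, fp, fn)                           # 3 3 1 1
print(accuracy(true_labels, predicted_labels))  # 0.75
print((tp + tn) / (tp + tn + fp + fn))          # 0.75
```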
Evaluation metrics other than accuracy include precision, recall, and F1 score.
Precision pays attention to false positives whereas recall pays attention to false negatives:
$\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}} \hspace{1cm} \text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}$
The F1 Score is the harmonic mean of precision and recall that balances concerns for false positives and false negatives:
$\text{F1}=\frac{2*\text{Precision}*\text{Recall}}{\text{Precision} + \text{Recall}}$
The classification report generates a summary of the precision, recall, and F1 scores for each class, which is especially helpful when there are more than two classes:
from sklearn.metrics import classification_report
report = classification_report(true_labels, predicted_labels)
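The precision, recall, and F1 formulas can be worked through with a small made-up example. The counts below (8 true positives, 2 false positives, 8 false negatives) are assumptions chosen so that precision and recall differ noticeably.

```python
def precision(tp, fp):
    """Of everything predicted positive, how much was actually positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything actually positive, how much did the model find?"""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

tp, fp, fn = 8, 2, 8  # made-up counts

p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)

print(p, r, round(f1, 3))  # 0.8 0.5 0.615
```

The F1 score lands closer to the weaker of the two values than a simple average would, which is why it flags a model that does well on only one of them.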
In multiclass classification tasks, the softmax function takes the output of the neural network and forms a probability distribution. This helps us interpret the output by giving a probability that the input datapoint belongs to each potential class.
Many PyTorch functions already have softmax built in (like nn.CrossEntropyLoss), so we often won’t “see” softmax applied directly in a multiclass network.
For example, if the network outputs [0.9, 0.8, 0.4], we use the normalization factor
$e^{0.9} + e^{0.8} + e^{0.4}$
and calculate the softmax output with
$\left[\frac{e^{0.9}}{e^{0.9} + e^{0.8} + e^{0.4}}, \frac{e^{0.8}}{e^{0.9} + e^{0.8} + e^{0.4}}, \frac{e^{0.4}}{e^{0.9} + e^{0.8} + e^{0.4}}\right] \approx [0.40, 0.36, 0.24]$
All of the probabilities in the softmax output sum to 1, or 100%. There’s a 40% chance, according to the network, that the input data belongs to the first class.
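The worked example can be reproduced with a short plain-Python sketch. The helper `softmax` is a by-hand version, not PyTorch's implementation; subtracting the largest value before exponentiating is a common stability trick that is added here as an assumption and does not change the result.

```python
import math

def softmax(logits):
    """Turn a list of raw network outputs into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)                               # the normalization factor
    return [e / total for e in exps]

probs = softmax([0.9, 0.8, 0.4])

print([round(p, 2) for p in probs])  # [0.4, 0.36, 0.24]
print(round(sum(probs), 6))          # 1.0
```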
The argmax function is applied to the softmax output to return the index of the class with the highest probability. When the target labels are integers 0, 1, 2, ..., applying argmax essentially just converts the probability output to a label output, completing the multiclass prediction process!
For example, if the softmax output is [0.4, 0.36, 0.24], the argmax function will return the first index, 0, predicting that the corresponding input data belongs to the class labeled 0.
import torch
softmax_output = torch.tensor([0.4, 0.36, 0.24], dtype=torch.float)
# softmax_output is 1-D, so the class dimension is dim=0
argmax_output = torch.argmax(softmax_output, dim=0)
argmax_output
# output: tensor(0)
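What argmax does can also be sketched without any library. The helper `argmax` and the two-row `batch` below are illustrative assumptions; the second row is made-up data showing a different winning class.

```python
def argmax(values):
    """Index of the largest value (the first one, if there is a tie)."""
    return max(range(len(values)), key=lambda i: values[i])

softmax_output = [0.4, 0.36, 0.24]
print(argmax(softmax_output))  # 0

# One prediction per row when there are several data points.
batch = [[0.4, 0.36, 0.24], [0.1, 0.2, 0.7]]
print([argmax(row) for row in batch])  # [0, 2]
```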
The multiclass version of the cross-entropy loss function can be implemented in PyTorch using nn.CrossEntropyLoss().
PyTorch’s implementation applies the softmax function (or a logarithmic version of it) automatically, which is why we don’t need to apply softmax in a multiclass network directly.
Mathematically, the multiclass version generalizes the binary case: it takes the negative logarithm of the probability that the network assigns to the true class.
import torch.nn as nn
loss = nn.CrossEntropyLoss()
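For a single data point, the calculation can be sketched by hand. This is a plain-Python illustration under stated assumptions, not PyTorch's actual implementation: the hypothetical helper `cross_entropy` takes raw network outputs plus the integer index of the true class, and it combines a logarithmic softmax with the negative log step.

```python
import math

def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the true class."""
    largest = max(logits)
    # log of the softmax normalization factor, computed stably
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    log_prob_of_target = logits[target] - log_sum_exp
    return -log_prob_of_target

logits = [0.9, 0.8, 0.4]  # raw network outputs, before softmax

print(round(cross_entropy(logits, 0), 3))  # 0.921
print(round(cross_entropy(logits, 2), 3))  # 1.421
```

The loss is smaller when the true class is the one the network already favors (index 0) and larger when the true class received a low probability (index 2).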