
PyTorch for Classification

Classification Tasks

In machine learning, classification tasks aim to predict categorical values. These are often called labels or classes.

For example, an ML model that tries to determine if a patient has a disease or not is a classification model. The labels or classes are “has the disease” and “does not have the disease”.

There are two types of classification tasks:

  • Binary classification aims to predict between two classes. For example, predicting whether a patient does or does not have a disease.
  • Multiclass classification aims to predict between more than two classes. For example, predicting whether a patient has the disease, is at high risk of contracting the disease, or is at low risk of contracting the disease.

Label Encoding

Label encoding is a technique to pre-process categorical data by mapping each category to an integer value.

For example, the code snippet for this review card encodes the letter grades A, B, C, D, and F as 4, 3, 2, 1, and 0.

Label encoding is typically how we’d encode the target of a classification model. It is also useful when the categories are already ordered (like the grades above) as that order can be reflected in the integer encoding.

df['Letter_Grade'] = df['Letter_Grade'].replace(
    {'A': 4, 'B': 3, 'C': 2, 'D': 1, 'F': 0})

One-Hot Encoding

One-hot encoding is a technique to pre-process categorical data by creating a new binary column for each category.

For example, the code snippet one-hot encodes a student’s living accommodations: “Rental”, “Dormitory”, “With Family”, or “Other”. A 1 in a one-hot encoded column indicates that the student in question lives in that type of accommodation. For example, the first row has a 1 in the “Rental” column, which means the first student lives in a rental.

Original Column

ID   Accommodation
1    Rental
2    Dormitory
3    With Family
4    Other

One-Hot Encoded Columns

ID   Rental   Dormitory   With Family   Other
1    1        0           0             0
2    0        1           0             0
3    0        0           1             0
4    0        0           0             1

df = pd.get_dummies(
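
As a minimal sketch of the table above, the following builds the four-row example and one-hot encodes it. The DataFrame contents and the column name `Accommodation` are taken from the table; the `columns` and `dtype` arguments are one reasonable choice, not necessarily what the original snippet passed.

```python
import pandas as pd

# Toy data mirroring the "Original Column" table (illustrative only)
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Accommodation": ["Rental", "Dormitory", "With Family", "Other"],
})

# Create one 0/1 column per category; new columns are named
# "Accommodation_<category>" by default
encoded = pd.get_dummies(df, columns=["Accommodation"], dtype=int)

print(encoded)
```

Each row has exactly one 1 among the new columns, marking that student's accommodation.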

Sigmoid Activation Function

The sigmoid activation function is a mathematical function that takes in a numerical real value x as input and outputs a value between 0 and 1.

$$\text{sigmoid}(x) = \frac{1}{1+e^{-x}}$$

For example, the image attached to this review card shows that the sigmoid output for 2.5 is close to 1 (approximately 0.924).

In binary classification, we’d interpret this as a 92.4% probability that the data point should be labeled 1 (as opposed to 0).

[Figure: graph of the sigmoid function for x from -3 to 3. The curve rises from almost 0 on the left, passes through 0.5 at x = 0, and keeps increasing toward 1 without ever reaching it. A vertical line at x = 2.5 meets the curve at a point whose height is very close to 1.]
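
The value quoted for x = 2.5 can be checked with a few lines of plain Python. This is only a sketch of the formula above; the helper name `sigmoid` is chosen here for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-x))

value = sigmoid(2.5)
print(round(value, 3))  # 0.924
```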

Sigmoid Activation for Binary Classification

Binary classification neural networks typically apply the sigmoid activation function on the output layer. The sigmoid transforms the raw network output into a probability between 0 and 1, which we can then convert into a predicted label for the input data.

PyTorch implements sigmoid in nn.Sigmoid().

model = nn.Sequential(
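
A minimal sketch of such a network is shown below. The layer sizes (4 input features, 8 hidden units) and the random input batch are assumptions made purely for illustration; the point is that the final `nn.Sigmoid()` keeps every output between 0 and 1.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the illustrative run repeatable

# Hypothetical architecture: 4 input features -> 8 hidden units -> 1 output
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),  # squashes the single output into a probability
)

batch = torch.randn(3, 4)      # 3 made-up data points
probabilities = model(batch)   # shape (3, 1), each value in (0, 1)
print(probabilities)
```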

Threshold for Binary Classification

In binary classification tasks, the neural network outputs a probability that the input data should be labeled 1 (as opposed to 0).

A threshold converts the probability into a label: 1 or 0.

For example, if the threshold is 0.5, any probability greater than or equal to 0.5 results in a prediction of 1. Otherwise, the prediction is 0.

The ML engineer determines the threshold, so the exact threshold may depend on what the goal of a specific model is. A good starting point is a 0.5 threshold.

threshold = 0.5 # threshold may vary
classification = int(probability >= threshold)
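
The same idea extends to a whole batch of probabilities at once. The sketch below assumes the probabilities are already in a PyTorch tensor; the values are made up for illustration.

```python
import torch

# Made-up sigmoid outputs for three data points
probabilities = torch.tensor([0.12, 0.50, 0.87])
threshold = 0.5  # threshold may vary

# Elementwise comparison gives booleans; .int() turns them into 0/1 labels
labels = (probabilities >= threshold).int()
print(labels.tolist())  # [0, 1, 1]
```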

Binary Cross-Entropy Loss Function in PyTorch

The binary cross-entropy loss (BCELoss) is a common loss function for binary classification tasks. It measures model performance based on how far the predicted probabilities are from their true target label.

When the true classification is 1, the BCE loss uses the negative logarithm on the sigmoid probability p:

$$\text{BCELoss}(p) = -\log(p)$$

When the true classification is 0, the BCE loss uses the negative logarithm on 1-p:

$$\text{BCELoss}(p) = -\log(1-p)$$

import numpy as np

# by-hand definition of BCELoss for a single probability
def BCELoss(p, y):
    if y == 1:  # if the true classification is 1
        return -np.log(p)
    else:  # if the true classification is 0
        return -np.log(1 - p)

# importing a full BCELoss implementation from PyTorch
from torch import nn
loss = nn.BCELoss()
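
To connect the two definitions, the sketch below compares a by-hand calculation with `nn.BCELoss` on two made-up predictions. It relies on the default `reduction='mean'`, which averages the per-example losses.

```python
import math
import torch
from torch import nn

# Made-up predicted probabilities and their true labels
probabilities = torch.tensor([0.9, 0.2])
targets = torch.tensor([1.0, 0.0])

# By hand: -log(p) when the label is 1, -log(1 - p) when it is 0, then average
by_hand = (-math.log(0.9) - math.log(1 - 0.2)) / 2

loss_fn = nn.BCELoss()  # default reduction is the mean over examples
torch_loss = loss_fn(probabilities, targets).item()

print(by_hand, torch_loss)  # the two values agree up to float precision
```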

Stochastic Gradient Descent in PyTorch

Stochastic gradient descent (SGD) is a variant of the traditional gradient descent optimization algorithm. Instead of using the entire training set to compute the gradient of the loss function at each step, SGD estimates it from a single randomly chosen data point (or, in common practice, a small random mini-batch).

import torch.optim as optim
optimizer = optim.SGD(
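
A minimal training-loop sketch follows, updating the weights from one randomly ordered data point at a time. The tiny dataset, the one-layer model, the learning rate of 0.1, and the 100 epochs are all assumptions chosen for illustration, not values from the original snippet.

```python
import torch
from torch import nn
import torch.optim as optim

torch.manual_seed(0)

# Toy data: the label simply equals the second feature
X = torch.tensor([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

with torch.no_grad():
    initial_loss = loss_fn(model(X), y).item()

for epoch in range(100):
    # Visit the data points in a fresh random order each epoch
    for i in torch.randperm(len(X)).tolist():
        optimizer.zero_grad()                      # clear old gradients
        loss = loss_fn(model(X[i:i + 1]), y[i:i + 1])
        loss.backward()                            # gradient from one point
        optimizer.step()                           # update the weights

with torch.no_grad():
    final_loss = loss_fn(model(X), y).item()

print(initial_loss, final_loss)
```

In larger projects the per-point loop is usually replaced by a `DataLoader` with `shuffle=True` that yields mini-batches.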

Classification Metric - Accuracy

Accuracy is a common evaluation metric for classification models. It measures the percent of correct predictions:

Accuracy=# of correct predictions# of predictions\text{Accuracy} = \frac{\text{\# of correct predictions}}{\text{\# of predictions}}

We can also think of accuracy in the context of false positives/negatives and true positives/negatives:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

The accuracy of a model can be computed using the accuracy_score function from Scikit-learn by passing in the true class labels and the model’s predicted labels.

from sklearn.metrics import accuracy_score
# Calculate accuracy
accuracy = accuracy_score(y_train, predicted_labels)
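
As a small worked example, the sketch below checks that `accuracy_score` agrees with the TP/TN/FP/FN formula. The eight labels are made up for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for eight data points
true_labels = [1, 0, 1, 1, 0, 1, 0, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(true_labels, predicted_labels)

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(true_labels, predicted_labels).ravel()
accuracy_by_hand = (tp + tn) / (tp + tn + fp + fn)

print(accuracy, accuracy_by_hand)  # 0.75 0.75
```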

Classification Metrics - Precision, Recall, and F1 Score

Evaluation metrics other than accuracy include precision, recall, and F1 score.

Precision pays attention to false positives whereas recall pays attention to false negatives:

$$\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}} \hspace{1cm} \text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}$$

The F1 Score is the harmonic mean of precision and recall that balances concerns for false positives and false negatives:

$$\text{F1}=\frac{2*\text{Precision}*\text{Recall}}{\text{Precision} + \text{Recall}}$$

The classification report from scikit-learn summarizes the precision, recall, and F1 score for each class. It also reports averages across classes, which are especially useful when there are more than two classes:

  • the macro average gives equal weight to each class
  • the weighted average weights each class by its number of observations (support).

from sklearn.metrics import classification_report
report = classification_report(true_labels, predicted_labels)
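
The formulas above can be checked against scikit-learn on a small example. The labels below are made up so that TP = 2, FP = 1, and FN = 2.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: TP = 2, FN = 2, FP = 1, TN = 3
true_labels = [1, 1, 1, 1, 0, 0, 0, 0]
predicted_labels = [1, 1, 0, 0, 1, 0, 0, 0]

precision = precision_score(true_labels, predicted_labels)  # 2 / (2 + 1)
recall = recall_score(true_labels, predicted_labels)        # 2 / (2 + 2)
f1 = f1_score(true_labels, predicted_labels)

# Harmonic mean computed directly from the formula
f1_by_hand = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

Here precision is 2/3, recall is 1/2, and the F1 score works out to 4/7, which sits between the two.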

Softmax Activation for Multiclass Classification

In multiclass classification tasks, the softmax function takes the output of the neural network and forms a probability distribution. This helps us interpret the output by giving a probability that the input datapoint belongs to each potential class.

Many PyTorch functions already have softmax built-in (like nn.CrossEntropyLoss), so we often won’t “see” softmax applied directly in a multiclass network.

For example, if the network outputs the following [0.9, 0.8, 0.4], we use the normalization factor

$$e^{0.9} + e^{0.8} + e^{0.4}$$

and calculate the softmax output with

$$\left[\frac{e^{0.9}}{e^{0.9} + e^{0.8} + e^{0.4}}, \frac{e^{0.8}}{e^{0.9} + e^{0.8} + e^{0.4}}, \frac{e^{0.4}}{e^{0.9} + e^{0.8} + e^{0.4}}\right] \approx [0.40, 0.36, 0.24]$$

All of the probabilities in the softmax output sum to 1 or 100%. There’s a 40% chance, according to the network, that the input data belongs to the first class.
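
The worked example can be reproduced with `torch.softmax`. This is a sketch for checking the arithmetic; in a real multiclass network the softmax is usually left to the loss function.

```python
import torch

# Raw network outputs from the worked example
outputs = torch.tensor([0.9, 0.8, 0.4])

# dim=0 because this tensor has a single dimension holding the class scores
probabilities = torch.softmax(outputs, dim=0)

print(probabilities)        # roughly [0.40, 0.36, 0.24]
print(probabilities.sum())  # sums to 1 (up to float precision)
```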

Argmax Function for Multiclass Classification

The argmax function is applied to the softmax output to return the index of the class with the highest probability. When the target labels are integers 0,1,2,..., applying argmax essentially just converts the probability output to a label output, completing the multiclass prediction process!

For example, if the softmax output is [0.4, 0.36, 0.24], the argmax function will return the first index, 0, predicting that the corresponding input data belongs to the class labeled 0.

import torch
softmax_output = torch.tensor([0.4, 0.36, 0.24], dtype=torch.float)
# the tensor is 1-D, so the class scores lie along dim=0
argmax_output = torch.argmax(softmax_output, dim=0)
# output: tensor(0)
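
In practice a network usually processes a batch of inputs, giving one row of probabilities per data point. In that case the classes lie along `dim=1`, as in this sketch with two made-up rows.

```python
import torch

# Two made-up softmax outputs: one row per data point, one column per class
batch_output = torch.tensor([
    [0.40, 0.36, 0.24],
    [0.10, 0.20, 0.70],
])

# dim=1 takes the argmax across the class columns of each row
predicted_labels = torch.argmax(batch_output, dim=1)
print(predicted_labels.tolist())  # [0, 2]
```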

Cross Entropy Loss Function for Multiclass Classification

The multiclass version of the cross-entropy loss function can be implemented in PyTorch using nn.CrossEntropyLoss().

PyTorch’s implementation applies the softmax function (or a logarithmic version of it) automatically, which is why we don’t need to apply softmax in a multi-class network directly.

Mathematically, the multiclass version generalizes the binary case: for each data point, the loss is the negative logarithm of the softmax probability that the network assigns to the true class.

import torch.nn as nn
loss = nn.CrossEntropyLoss()
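
The sketch below checks that description on a single made-up data point: passing raw outputs (logits) to `nn.CrossEntropyLoss` gives the same value as taking the negative log of the softmax probability for the true class. The logits and the target class 0 are assumptions for illustration.

```python
import torch
import torch.nn as nn

# One data point with three raw class scores; the true class is index 0
logits = torch.tensor([[0.9, 0.8, 0.4]])
target = torch.tensor([0])

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, target)  # softmax is applied internally

# By hand: negative log of the softmax probability of the true class
by_hand = -torch.log(torch.softmax(logits, dim=1)[0, 0])

print(loss.item(), by_hand.item())
```

Note that the network's raw outputs are passed in directly, with no softmax layer applied beforehand.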
