PyTorch Stochastic Gradient Descent

dakshdeepHERE
Published Jan 29, 2025

Stochastic Gradient Descent (SGD) is one of the most fundamental optimization algorithms for training neural networks. In PyTorch, torch.optim.SGD provides a straightforward way to implement SGD with optional parameters like momentum, weight_decay, and nesterov.

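SGD updates model parameters iteratively. At each step it computes the gradient of the loss with respect to every parameter, usually on a small batch of data rather than the full dataset, and then moves each parameter a small step, scaled by the learning rate, in the direction opposite to its gradient.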
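The update rule described above can be checked by hand. The following is a minimal sketch, assuming plain SGD with no momentum or weight decay; the tensors w, x, and y are made-up values used only for illustration. It computes the update w - lr * grad manually and compares it with the result of optimizer.step():

```python
import torch

torch.manual_seed(0)

# Hypothetical one-example regression problem, for illustration only
w = torch.randn(3, requires_grad=True)   # parameters to optimize
x = torch.tensor([1.0, 2.0, 3.0])        # input features
y = torch.tensor(2.0)                    # target value
lr = 0.1

# Squared-error loss and its gradient with respect to w
loss = (w @ x - y) ** 2
loss.backward()

# Manual SGD update: step against the gradient, scaled by the learning rate
with torch.no_grad():
    expected = w - lr * w.grad

# The same update performed by torch.optim.SGD (modifies w in place)
optimizer = torch.optim.SGD([w], lr=lr)
optimizer.step()

print(torch.allclose(w.detach(), expected))  # True
```

With momentum, dampening, or weight decay enabled, the step is no longer a bare lr * grad, so this equality holds only for the plain configuration shown here.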

Syntax

torch.optim.SGD(
  params,
  lr=0.001,
  momentum=0,
  dampening=0,
  weight_decay=0,
  nesterov=False
)

  • params: Iterable of parameters to optimize (typically model.parameters()).
  • lr: The learning rate (default is 0.001 in recent PyTorch releases; older releases required it to be passed explicitly).
  • momentum: Momentum factor (default is 0, meaning no momentum).
  • dampening: Dampening for momentum (default is 0).
  • weight_decay: L2 penalty coefficient (default is 0).
  • nesterov: Enables Nesterov momentum if set to True (default is False). Requires momentum greater than 0 and dampening equal to 0.
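To make these parameters more concrete, here is a small sketch that applies SGD to a single scalar parameter whose gradient is held constant, so each result can be verified by hand. The helper run_sgd and its arguments are hypothetical names introduced for this illustration; they are not part of PyTorch.

```python
import torch
from torch.optim import SGD

def run_sgd(steps, start, grad, **sgd_kwargs):
    """Hypothetical helper: take `steps` SGD steps on one scalar parameter
    whose gradient is always `grad`, and return the final value."""
    p = torch.tensor([start], requires_grad=True)
    optimizer = SGD([p], **sgd_kwargs)
    for _ in range(steps):
        optimizer.zero_grad()
        (grad * p).sum().backward()  # d(loss)/dp == grad at every step
        optimizer.step()
    return p.item()

# Plain SGD: two steps of -lr * grad, so 0 - 0.1 - 0.1 = -0.2
plain = run_sgd(2, start=0.0, grad=1.0, lr=0.1)

# Momentum: the second step uses a buffer of 0.9 * 1 + 1 = 1.9,
# so 0 - 0.1 - 0.19 = -0.29
with_momentum = run_sgd(2, start=0.0, grad=1.0, lr=0.1, momentum=0.9)

# Dampening scales the new gradient added to the buffer after the first step:
# buffer = 0.9 * 1 + (1 - 0.5) * 1 = 1.4, so 0 - 0.1 - 0.14 = -0.24
with_dampening = run_sgd(2, start=0.0, grad=1.0, lr=0.1, momentum=0.9, dampening=0.5)

# Weight decay adds weight_decay * p to the gradient even when the loss
# gradient is zero: 1 - 0.1 * (0 + 0.5 * 1) = 0.95
with_decay = run_sgd(1, start=1.0, grad=0.0, lr=0.1, weight_decay=0.5)

print(f"{plain:.2f} {with_momentum:.2f} {with_dampening:.2f} {with_decay:.2f}")

# Nesterov momentum without a positive momentum value is rejected
try:
    SGD([torch.zeros(1, requires_grad=True)], lr=0.1, nesterov=True)
    nesterov_error = None
except ValueError as err:
    nesterov_error = err
print(nesterov_error is not None)  # True
```

A constant gradient is unrealistic for a real model, but it keeps the arithmetic visible: momentum makes the second step larger than the first, while weight decay shrinks the parameter toward zero independently of the loss.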

Example

Below is a simple example using torch.optim.SGD to optimize a small neural network:

import torch
import torch.nn as nn
import torch.optim as optim

# Sample model: a small feed-forward network with one hidden layer
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 1),
)

# Loss function
criterion = nn.MSELoss()

# Optimizer: Stochastic Gradient Descent with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Dummy input and target
x = torch.randn(2, 10)  # batch size = 2, input features = 10
target = torch.randn(2, 1)

# Forward pass
output = model(x)
loss = criterion(output, target)

# Backward pass and update
optimizer.zero_grad()  # clear any gradients left over from a previous step
loss.backward()
optimizer.step()

# This loss was computed before the update, so it reflects the initial weights
print(f"Loss for this batch: {loss.item():.4f}")

The above code prints output similar to the following. The exact value differs from run to run because the weights, input, and target are randomly initialized:

Loss for this batch: 0.2851

Here is the step-by-step process used in the above example:

  1. Define the Model: A simple feed-forward network is created with two Linear layers and a ReLU activation.
  2. Set Up Criterion: MSELoss is used in this example, but any suitable loss function can be substituted.
  3. Initialize Optimizer: The optimizer is configured with model.parameters(), a learning rate of 0.01, and a momentum of 0.9.
  4. Forward Pass: Compute the model’s output given the input tensor.
  5. Compute Loss: Compare the model’s predictions with the target using MSE.
  6. Backward Pass: Calculate gradients through a call to loss.backward().
  7. Optimize: Update parameters based on the gradients via optimizer.step().
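One detail of the backward pass is worth illustrating: PyTorch adds new gradients to any that are already stored, so gradients are normally cleared between steps. The sketch below, which uses a small hypothetical Linear layer and random data, shows the accumulation and one way to reset it with optimizer.zero_grad():

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical layer and data, used only to inspect gradients
layer = nn.Linear(3, 1)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.01)
criterion = nn.MSELoss()
x = torch.randn(4, 3)
target = torch.randn(4, 1)

# First backward pass populates .grad
criterion(layer(x), target).backward()
first = layer.weight.grad.clone()

# A second backward pass without clearing adds to the stored gradient
criterion(layer(x), target).backward()
accumulated = layer.weight.grad.clone()
print(torch.allclose(accumulated, 2 * first))  # True

# Clearing gradients before the next step; set_to_none=True drops the tensors
optimizer.zero_grad(set_to_none=True)
print(layer.weight.grad is None)  # True
```

Accumulation is sometimes used deliberately to simulate a larger batch, but in an ordinary training step the gradients are cleared once per update.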

The printed loss is computed during the forward pass, before optimizer.step() runs, so it measures how well the initial weights fit this single batch. In practice, training repeats these steps over many batches and multiple epochs.
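A multi-batch, multi-epoch version might look like the following sketch. The synthetic dataset, the batch size of 16, and the 20 epochs are arbitrary assumptions chosen for illustration; the model and optimizer settings mirror the example above.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data (made up for this sketch): 64 samples, 10 features
inputs = torch.randn(64, 10)
targets = inputs[:, :3].sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(inputs, targets), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

epoch_losses = []
for epoch in range(20):
    running_loss = 0.0
    for batch_x, batch_y in loader:
        optimizer.zero_grad()                      # clear old gradients
        loss = criterion(model(batch_x), batch_y)  # forward pass
        loss.backward()                            # backward pass
        optimizer.step()                           # parameter update
        running_loss += loss.item() * batch_x.size(0)
    # Average loss per sample over the epoch
    epoch_losses.append(running_loss / len(inputs))

print(f"First epoch: {epoch_losses[0]:.4f}, last epoch: {epoch_losses[-1]:.4f}")
```

Each pass through the inner loop is one SGD step on one mini-batch. Tracking the average loss per epoch gives a rough signal of whether the learning rate and momentum are working for the problem at hand.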
