Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is one of the most fundamental optimization algorithms for training neural networks. In PyTorch, torch.optim.SGD provides a straightforward way to implement SGD with optional parameters like momentum, weight_decay, and nesterov.
SGD updates model parameters iteratively by calculating the loss function’s gradient for each parameter and then adjusting those parameters in the opposite direction of the gradient.
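For intuition, the core update is: new parameter = old parameter − learning rate × gradient. The snippet below is a minimal, framework-free sketch of that rule; the toy quadratic loss and variable names are illustrative only and are not part of the PyTorch API.

# Minimal sketch of the plain SGD update rule (illustrative only).
# Minimizes the toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
lr = 0.1        # learning rate
theta = 0.0     # parameter to optimize

for step in range(50):
    grad = 2 * (theta - 3)     # gradient of the loss at the current theta
    theta = theta - lr * grad  # step in the direction opposite the gradient

print(theta)  # approaches 3, the minimizer of the toy loss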
Syntax
torch.optim.SGD(
params,
lr=0.01,
momentum=0,
weight_decay=0,
dampening=0,
nesterov=False
)
- params: Iterable of parameters to optimize (typically model.parameters()).
- lr: The learning rate (required).
- momentum: Value for momentum (default is 0, meaning no momentum).
- weight_decay: L2 penalty (default is 0).
- dampening: Dampening for momentum (default is 0).
- nesterov: Enables Nesterov momentum if set to True (default is False).
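As a quick illustration (the hyperparameter values below are arbitrary), these options can be combined when constructing the optimizer. Note that PyTorch requires a non-zero momentum and zero dampening when nesterov is True:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)

# SGD with Nesterov momentum and L2 regularization (illustrative values)
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,       # must be non-zero when nesterov=True
    weight_decay=1e-4,  # L2 penalty applied to the weights
    nesterov=True
)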
Example
Below is a simple example using torch.optim.SGD to optimize a small neural network:
import torch
import torch.nn as nn
import torch.optim as optim

# Sample model: a small feed-forward neural network
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 1)
)

# Loss function
criterion = nn.MSELoss()

# Optimizer: Stochastic Gradient Descent
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Dummy input and target
x = torch.randn(2, 10)  # batch size = 2, input features = 10
target = torch.randn(2, 1)

# Forward pass
output = model(x)
loss = criterion(output, target)

# Backward pass and update
loss.backward()
optimizer.step()

print(f"Loss after one update: {loss.item():.4f}")
The above code prints output similar to the following (the exact value varies because the inputs and weights are randomly generated):
Loss after one update: 0.2851
Here is the step-by-step process used in the above example:
- Define the Model: A simple feed-forward network is created with two Linear layers and a ReLU activation.
- Set Up Criterion: MSELoss is used in this example, but any suitable loss function can be substituted.
- Initialize Optimizer: The optimizer is configured with model.parameters(), a learning rate of 0.01, and a momentum of 0.9.
- Forward Pass: Compute the model’s output given the input tensor.
- Compute Loss: Compare the model’s predictions with the target using MSE.
- Backward Pass: Calculate gradients through a call to loss.backward().
- Optimize: Update parameters based on the gradients via optimizer.step().
Running the script prints a loss value indicating how well the network performs on this single batch. In practice, multiple batches and epochs are typically used for training.
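As a rough sketch of that idea (the dummy dataset, batch count, and epoch count below are made up for illustration), a typical training loop repeats zero_grad, forward, backward, and step for every batch:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Dummy dataset: 8 batches of random inputs and targets (illustrative only)
batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]

for epoch in range(5):
    for x, target in batches:
        optimizer.zero_grad()              # clear gradients from the previous step
        loss = criterion(model(x), target) # forward pass and loss
        loss.backward()                    # compute gradients
        optimizer.step()                   # update parameters
    print(f"Epoch {epoch + 1}, last batch loss: {loss.item():.4f}")

Calling optimizer.zero_grad() each iteration matters because PyTorch accumulates gradients in the parameters' .grad attributes across backward calls.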