Adagrad

Published Mar 26, 2025

Adagrad is an adaptive learning rate optimization algorithm used for training deep learning models. It adapts the learning rate for each parameter based on historical gradient information, which makes training more efficient, especially on sparse data. Adagrad scales the global learning rate for each parameter by its accumulated squared gradients, giving larger updates to infrequently updated parameters and smaller updates to frequently updated ones.

Adagrad is particularly effective on sparse data, such as in natural language processing and recommendation systems. By automatically adjusting the learning rate for each parameter, it reduces the need for manual tuning of the learning rate hyperparameter, which can be a time-consuming part of model optimization.

How Adagrad Works

Adagrad works by maintaining a sum of squared gradients for each parameter and using this information to adapt the learning rate. Here’s how it works:

1. Initialization: The model parameters are initialized, the squared-gradient accumulator for each parameter is set to zero, and a small constant (ε) is defined to prevent division by zero.

2. Gradient Computation: For each parameter at time step t, compute the gradient of the loss function with respect to that parameter.

3. Accumulate Squared Gradients: Keep track of the sum of squared gradients for each parameter over time.

$$G_t = G_{t-1} + g_t^2$$

Where $G_t$ is the accumulated squared gradient and $g_t$ is the gradient at time step $t$.

4. Update Parameters: Adjust each parameter using its own adaptive learning rate.

$$\theta_{t+1} = \theta_t - \left( \frac{\eta}{\sqrt{G_t + \epsilon}} \right) g_t$$

Where:

  • $\theta_t$ is the parameter value at time step $t$
  • $\eta$ is the initial learning rate
  • $G_t$ is the sum of squared gradients up to time step $t$
  • $\epsilon$ is a small constant to prevent division by zero
  • $g_t$ is the gradient at time step $t$

This approach causes parameters with frequently occurring features to receive smaller updates, while parameters with infrequent features receive larger updates.
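
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration of the math, not PyTorch's implementation; the function name adagrad_update and the toy objective are only for demonstration:

import numpy as np

def adagrad_update(theta, grad, G, lr=0.1, eps=1e-10):
    # Accumulate squared gradients: G_t = G_{t-1} + g_t^2
    G = G + grad ** 2
    # Per-parameter adaptive step: theta_{t+1} = theta_t - lr / sqrt(G_t + eps) * g_t
    theta = theta - lr / np.sqrt(G + eps) * grad
    return theta, G

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta
theta = np.array([5.0])
G = np.zeros_like(theta)
for step in range(200):
    grad = 2 * theta
    theta, G = adagrad_update(theta, grad, G)
print(theta)  # theta has moved toward the minimizer 0, with step sizes shrinking as G grows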

Comparison with Adam and RMSprop

Adagrad adapts learning rates per parameter but suffers from diminishing rates over time; RMSprop prevents diminishing rates by using a moving average of squared gradients; and Adam balances first and second moment estimates for added stability. To explore more about each, refer to their respective entries.
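
For a sense of how similar the interfaces are, the snippet below instantiates all three optimizers in PyTorch with their default hyperparameter values (model is assumed to be an existing torch.nn.Module):

import torch.optim as optim

# Assumes `model` is an existing torch.nn.Module
adagrad = optim.Adagrad(model.parameters(), lr=0.01)                 # accumulates all past squared gradients
rmsprop = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)     # exponential moving average of squared gradients
adam = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))  # first and second moment estimates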

Advantages and Limitations of Adagrad

Advantages

  1. Adaptive Learning Rates: Automatically adjusts the learning rate for each parameter based on historical gradients.
  2. Efficient for Sparse Data: Performs well on sparse datasets by giving larger updates to infrequent parameters.
  3. No Manual Tuning: Reduces the need for manual tuning of learning rates for different parameters.
  4. Feature-Specific Learning: Allows different learning rates for different features based on their importance and frequency.

Limitations

  1. Diminishing Learning Rates: The accumulated squared gradients in the denominator grow monotonically during training, so the effective learning rate shrinks toward zero and can eventually stall learning (see the sketch after this list).
  2. Memory Requirements: Requires storing an additional variable (accumulated gradient) for each parameter, which can be memory-intensive for very large models.
  3. Initial Learning Rate Sensitivity: Performance can still be sensitive to the choice of initial learning rate.
  4. No Momentum: Lacks a momentum term, which can sometimes lead to slower convergence in certain problems.
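
The diminishing-rate effect can be made concrete with a quick numeric sketch, assuming a constant gradient magnitude of 1.0 and an initial learning rate of 0.1 (illustrative values only):

import numpy as np

lr, eps = 0.1, 1e-10
G = 0.0
for t in range(1, 1001):
    g = 1.0                      # assume a constant gradient magnitude
    G += g ** 2                  # the accumulator grows without bound
    effective_lr = lr / np.sqrt(G + eps)
    if t in (1, 10, 100, 1000):
        print(f"step {t}: effective learning rate = {effective_lr:.4f}")
# step 1: 0.1000, step 10: 0.0316, step 100: 0.0100, step 1000: 0.0032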

Syntax

In PyTorch, Adagrad can be implemented using the following syntax:

torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)

Parameters:

  • params: Iterable of parameters to optimize or dicts defining parameter groups
  • lr: Learning rate (default=0.01)
  • lr_decay: Learning rate decay (default=0)
  • weight_decay: Weight decay (L2 penalty) (default=0)
  • initial_accumulator_value: Initial value for the sum of squared gradients (default=0)
  • eps: Term added to the denominator to improve numerical stability (default=1e-10)

Return value:

It returns an optimizer instance that can be used to update model parameters.
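
As a minimal sketch of the constructor options (the one-layer model and the hyperparameter values below are illustrative, not recommendations):

import torch.nn as nn
import torch.optim as optim

# A stand-in model used only to illustrate the constructor
model = nn.Linear(10, 1)

optimizer = optim.Adagrad(
    model.parameters(),
    lr=0.01,                        # initial learning rate
    lr_decay=1e-4,                  # per-step learning rate decay
    weight_decay=1e-5,              # L2 penalty
    initial_accumulator_value=0.1,  # starting value of the squared-gradient sum
    eps=1e-10                       # numerical stability term
)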

Example 1: Basic Adagrad Implementation with Linear Regression

This example demonstrates how to use Adagrad optimizer for a simple linear regression problem:

import torch
import torch.nn as nn
import torch.optim as optim

# Create synthetic data
x = torch.linspace(-10, 10, 100).view(-1, 1)
y = 2 * x + 3 + 0.1 * torch.randn(x.size())  # y = 2x + 3 + noise

# Define a simple linear model
class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# Initialize model, loss function, and optimizer
model = LinearRegression()
criterion = nn.MSELoss()
optimizer = optim.Adagrad(model.parameters(), lr=0.1)

# Training loop
for epoch in range(100):
    # Forward pass
    y_pred = model(x)
    loss = criterion(y_pred, y)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 20 == 0:
        print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

# Print model parameters
w, b = model.linear.weight.item(), model.linear.bias.item()
print(f'Model parameters: w = {w:.4f}, b = {b:.4f}')
print('Ground truth: w = 2, b = 3')

This example shows how Adagrad adjusts learning rates based on gradient history, helping converge to the optimal solution for a linear regression problem.

Example 2: Training a Neural Network for Image Classification

This example demonstrates using Adagrad for training a neural network on the MNIST dataset:

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Load MNIST dataset (code shortened for brevity)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

# Define a simple neural network
class MNISTNet(nn.Module):
    def __init__(self):
        super(MNISTNet, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Initialize model, loss function, and optimizer
model = MNISTNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adagrad(model.parameters(), lr=0.01)

# Training loop (abbreviated)
def train(epochs=3):
    for epoch in range(epochs):
        running_loss = 0.0
        for i, (inputs, labels) in enumerate(train_loader):
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f'Epoch {epoch+1}, Loss: {running_loss/len(train_loader):.4f}')

train()  # Train for 3 epochs

Example 3: Natural Language Processing with Word Embeddings

This example demonstrates using Adagrad for training word embeddings in a simple text classification task:

import torch
import torch.nn as nn
import torch.optim as optim

# Sample data (simplified)
texts = ["I love this movie", "Great film", "Terrible experience", "Waste of time"]
labels = [1, 1, 0, 0]  # 1 for positive, 0 for negative

# Create a simple vocabulary
word_to_idx = {}
for text in texts:
    for word in text.lower().split():
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)

# Simple text classification model
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(TextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, 1)

    def forward(self, x):
        embedded = self.embedding(x)
        # Take average of embeddings across sequence length
        embedded = torch.mean(embedded, dim=1)
        return self.fc(embedded).squeeze()

# Initialize model with Adagrad optimizer
vocab_size = len(word_to_idx)
model = TextClassifier(vocab_size, embedding_dim=50)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adagrad(model.parameters(), lr=0.1)

# Training would continue here with proper batching and data loading
print(f"Model initialized with vocab size: {vocab_size}")
print("Optimizer: Adagrad with learning rate 0.1")

This example shows how Adagrad can be applied to NLP tasks where sparse features are common, as it provides larger updates for rare words and smaller updates for frequent ones.

Explore these optimization algorithms in Codecademy’s Data Scientist: Machine Learning career path, which covers machine learning fundamentals, model training, and advanced techniques.
