Adam Optimization

Adam (Adaptive Moment Estimation) is a widely used optimization algorithm in deep learning. It combines the momentum technique from Stochastic Gradient Descent (SGD) with the adaptive learning rate of RMSprop (Root Mean Square Propagation). Adam can also be viewed as an extension of AdaDelta, refining the learning rate adjustments for each parameter.

It adapts the learning rate for each parameter, making the training process faster and more efficient, especially for large datasets. Adam’s ability to handle sparse gradients and adapt dynamically to different scenarios makes it highly efficient for training deep neural networks.

How Adam Optimization Works

Adam works by computing an adaptive learning rate for each parameter, which adjusts the step size that parameter takes during optimization. It maintains two moving averages per parameter: one of the gradients (the momentum term) and one of the squared gradients (similar to RMSprop). These moving averages smooth out noisy updates and keep progress toward the minimum consistent.
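
As an illustration, the two moving averages can be sketched in a few lines of Python (a simplified outline, not the exact implementation used by any framework), using the commonly cited default decay rates of 0.9 and 0.999:

import numpy as np

beta1, beta2 = 0.9, 0.999  # widely used default decay rates

def update_moments(m, v, grad):
    # Exponential moving average of the gradient (the momentum term)
    m = beta1 * m + (1 - beta1) * grad
    # Exponential moving average of the squared gradient (the RMSprop-style term)
    v = beta2 * v + (1 - beta2) * grad ** 2
    return m, v

# Both averages start at zero and are refreshed after every gradient computation
m = v = np.zeros(2)
grad = np.array([0.5, -1.0])
m, v = update_moments(m, v, grad)  # m ≈ [0.05, -0.1], v ≈ [0.00025, 0.001]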

One of Adam’s key features is bias correction, which compensates for the fact that both moving averages are initialized at zero and are therefore biased toward zero during the first training steps. This adjustment makes Adam more stable early in training. Furthermore, Adam adapts to different problem structures and works well with non-stationary data, such as datasets that change over time or have dynamic properties.
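
A minimal sketch of this correction, reusing the values from the previous snippet (the variable names and numbers are illustrative only):

import numpy as np

beta1, beta2 = 0.9, 0.999
t = 1                           # number of updates performed so far
m = np.array([0.05, -0.1])      # first-moment estimate after one update
v = np.array([0.00025, 0.001])  # second-moment estimate after one update

# Without correction the early averages are close to zero because they start at zero.
# Dividing by (1 - beta ** t) rescales them toward the true gradient statistics.
m_hat = m / (1 - beta1 ** t)  # ≈ [0.5, -1.0], the size of the original gradient
v_hat = v / (1 - beta2 ** t)  # ≈ [0.25, 1.0], the size of the squared gradient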

Finally, because Adam adjusts learning rates individually for each parameter, it excels in problems where gradients vary widely or parameters have differing levels of importance.
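
Putting the pieces together, one full Adam-style update could look like the following sketch (a simplified illustration rather than TensorFlow’s actual code). Note how two parameters with very differently scaled gradients still take steps of roughly the same size:

import numpy as np

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8  # typical default hyperparameters

def adam_step(theta, grad, m, v, t):
    # Update the moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Apply bias correction to both averages
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each parameter is scaled by its own gradient history
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# One parameter sees huge gradients, the other tiny ones
theta = np.array([1.0, 1.0])
m = v = np.zeros(2)
for t in range(1, 4):
    theta, m, v = adam_step(theta, np.array([100.0, 0.001]), m, v, t)
print(theta)  # both entries decrease by roughly lr (0.001) per step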

Key Advantages

  • Adaptive Learning Rates: Adam adapts learning rates for each parameter, leading to more efficient training, especially in complex models.
  • Bias Correction: Correcting the moving averages, which are biased toward zero when training begins, improves stability in the early stages of training.
  • Robustness: Adam performs well with noisy gradients and varying data distributions, making it suitable for real-world applications.

Example

MNIST Classification with Adam Optimizer

This example shows how to train a neural network using the Adam optimizer with TensorFlow/Keras on the MNIST dataset, which consists of 70,000 grayscale images of handwritten digits (0-9) commonly used for benchmarking machine learning algorithms:

import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.utils import to_categorical
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Normalize the images to values between 0 and 1
x_train, x_test = x_train / 255.0, x_test / 255.0
# One-hot encode labels
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)
# Build and compile the model
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
# Train the model
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.2)
# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test Accuracy: {test_acc:.4f}')

The above code produces output similar to the following:

Test Accuracy: 0.9731

Note: The output will change with each run due to random weight initialization, data shuffling, and the stochastic nature of the Adam optimizer.
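
Keras also accepts the optimizer by name, in which case its default hyperparameters (such as learning_rate=0.001) are used:

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])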
