Stochastic Gradient Descent (SGD) Explained With Implementation in R

Learn stochastic gradient descent fundamentals and implement SGD in R with step-by-step code examples, early stopping, and deep learning applications.

In machine learning (ML), optimization algorithms are at the heart of training models. Gradient descent and its variations, such as stochastic gradient descent (SGD) and mini-batch stochastic gradient descent, are some of the most popular optimization algorithms that we use to train ML models due to their simplicity and efficiency.

This article discusses the basics of gradient descent and stochastic gradient descent to help you understand how they work. Along with the basics, we will discuss how to implement stochastic gradient descent in R. We will also discuss the applications of SGD in R and its advantages and disadvantages to help you understand how to use SGD to train models in R.

What is gradient descent?

Imagine someone blindfolds you and drops you at a random height on a hill. Now, you want to climb down the hill quickly, but can’t see anything as you are blindfolded. How would you proceed then?

Since you cannot see your surroundings, you can use your feet to analyze the direction in which the hill has the most downward slope. Then, move a little distance in that direction and analyze the slope again. This way, you can reach the bottom of the hill very quickly. This is what gradient descent does while training ML models. Here,

  • The hill represents the loss function.
  • The hill’s height at any point represents how bad our model is, which is denoted by loss. We calculate the loss using metrics like root mean squared error (RMSE), mean squared error (MSE), etc.
  • The slope of the hill at any point represents the gradient. Here, the gradient is the partial derivative of the loss function with respect to a model parameter.
  • Each step to climb down from the hill represents a change in the model’s parameters.
  • The lowest point of the hill represents the best solution with minimum loss. It is the point where the slope of the loss function, i.e., the gradient, becomes zero or very small.

Gradient descent is an optimization technique that we use while training ML models to minimize a loss function by iteratively moving in the direction of steepest descent. Gradient descent aims to minimize a loss function J(θ), where θ represents the model parameters. To achieve this goal, we use the following steps:

First, we calculate the partial derivative of the loss function J(θ) with respect to the model parameter θi to find the direction of the steepest change in the loss function.

$$\frac{\partial J(\theta)}{\partial \theta_i}$$

Next, we update θi using the following formula:

$$\theta_i = \theta_i - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_i}$$

Here,

  • θi is the ith model parameter that we want to update.
  • α is the learning rate. It decides how big of a step we need to take in the direction of the negative gradient. Common choices for α are values like 0.1, 0.01, or 0.001, but we often need to tune it for a given problem.
  • ∂J(θ)⁄∂θᵢ is the loss function’s gradient with respect to the parameter θi, computed using all the data points in the training set.

We compute the loss function’s gradient and update the model parameters iteratively until the gradient ∂J(θ)⁄∂θᵢ becomes close to zero. At this point, we consider that the model has converged to the optimal solution.

To visualize how gradient descent works, suppose we have the loss function J(θ) = θ². The gradient of the loss function at any θ will be ∂J(θ)⁄∂θ = 2θ.

We randomly start at θ value 10 with α=0.1. At this point, the gradient will be 2θ, i.e., 20. Using these values, we will update θ as follows:

$$\begin{aligned} \theta &= \theta - \alpha \cdot 2\theta \\ \theta &= 10 - 0.1 \times 2 \times 10 \\ \theta &= 8 \end{aligned}$$

Hence, the new value of θ will be 8. At this point, we will again calculate the gradient and update θ, which will become 6.4. We will continue this process until the gradient becomes close to zero. The entire gradient descent process will look as follows:
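The arithmetic above can be reproduced with a few lines of R. This is a minimal sketch for the toy loss J(θ) = θ² only; the stopping tolerance `tol` and the iteration cap `max_iter` are assumptions added for illustration and are not part of the worked example.

```r
# Toy loss J(theta) = theta^2 has the gradient 2 * theta
grad <- function(theta) 2 * theta

theta <- 10       # starting point used in the text
alpha <- 0.1      # learning rate used in the text
tol <- 1e-6       # assumed tolerance: stop when the gradient is this small
max_iter <- 1000  # assumed safety cap on the number of steps

path <- theta     # record every theta value visited
iter <- 0
while (abs(grad(theta)) > tol && iter < max_iter) {
  # Gradient descent update: theta = theta - alpha * gradient
  theta <- theta - alpha * grad(theta)
  path <- c(path, theta)
  iter <- iter + 1
}

print(head(path, 4))  # first values: 10, 8, 6.4, 5.12
print(iter)           # number of steps taken before stopping
print(theta)          # final theta, very close to 0
```

Each step multiplies θ by (1 - 2α) = 0.8, so the values shrink geometrically toward the minimum at θ = 0.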

Gradient descent visualization

In this image, each red dot represents a θ value, and each red line segment represents a step in the gradient descent process, which converges at θ=0.

In gradient descent, the loss is calculated for all the data points in the training data, and every update step requires computing the gradient using the entire training dataset. For large datasets, this process can be inefficient. In such cases, we use stochastic gradient descent to optimize the model parameters.

What is stochastic gradient descent (SGD)?

Stochastic gradient descent is a variant of gradient descent commonly used with large datasets. In SGD, we update the model parameter θi using only one training sample at a single step, which makes parameter updates faster than gradient descent.

To update the model parameter θi, we calculate the gradient of the loss function with respect to the parameter θi, based only on the sample (xk, yk) as follows:

$$\theta_i = \theta_i - \alpha \cdot \frac{\partial J(\theta; x_k, y_k)}{\partial \theta_i}$$

Here,

  • θi is the ith parameter of the model that we want to update.
  • α is the learning rate.
  • ∂J(θ; xₖ, yₖ)⁄∂θᵢ is the gradient of the loss function computed using only (xk, yk).

Updating model parameters using a single data point instead of the entire dataset makes each update much cheaper to compute, which often lets SGD make faster progress than gradient descent on large datasets. Having discussed the theoretical explanation, let’s implement stochastic gradient descent in R.
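To make the difference concrete, the following sketch compares the full-batch gradient with the gradient from one randomly chosen sample for a linear model y = wx + b. The seed, the starting values w = 0 and b = 0, and the helper name `grad_mse` are assumptions made for illustration.

```r
set.seed(1)  # assumed seed, used only to make this sketch reproducible

# Small synthetic dataset around the line y = 3x + 5
n <- 100
x <- runif(n, 0, 10)
y <- 3 * x + 5 + rnorm(n, 0, 2)

# Current parameter guess for the line y = w * x + b
w <- 0
b <- 0

# Hypothetical helper: squared-error gradient averaged over the given points
grad_mse <- function(w, b, x, y) {
  err <- y - (w * x + b)
  c(dw = mean(-2 * err * x), db = mean(-2 * err))
}

# Full-batch gradient: uses all n points, as in gradient descent
full_grad <- grad_mse(w, b, x, y)

# Stochastic gradient: uses one randomly chosen point (x_k, y_k)
k <- sample(n, 1)
sgd_grad <- grad_mse(w, b, x[k], y[k])

# Averaging all n single-point gradients recovers the full-batch gradient
per_point <- sapply(seq_len(n), function(i) grad_mse(w, b, x[i], y[i]))
avg_grad <- rowMeans(per_point)

print(full_grad)
print(sgd_grad)
print(avg_grad)
```

A single-sample gradient is a noisy but cheap estimate of the full gradient: it needs one data point instead of n, and on average it points in the same direction.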

How to implement stochastic gradient descent in R?

We will implement stochastic gradient descent in R to optimize the parameters of a linear regression model with MSE as the loss function. Let’s discuss the step-by-step process for implementing SGD in R.

Step 1: Generate a synthetic dataset

To implement SGD, we will generate sample data points around the line y = 3x + 5 by adding some noise. For this,

  • Using a uniform distribution, we will first generate 100 x values between 0 and 10. For this, we will use the runif() function.
  • Next, we will generate y values around the line y = 3x + 5 by introducing noise using the rnorm() function. The noise is drawn from a normal distribution with mean 0 and standard deviation 2.
# Set random seed
set.seed(42)
# Define the number of samples
n <- 100
# Generate random x values between 0 and 10 from a uniform distribution
x <- runif(n, 0, 10)
# Generate y values around y = 3x + 5 with Gaussian noise (mean 0, standard deviation 2)
y <- 3 * x + 5 + rnorm(n, 0, 2)

After generating the data points, we will initialize the model parameters.

Step 2: Randomly initialize model parameters

We will fit a straight line y = wx + b to the generated data. Hence, we will randomly initialize the model parameters w and b using the runif() function.

# Randomly initialize model parameters for line y = w*x+b
w <- runif(1)
b <- runif(1)
sprintf("w = %.4f", w)
sprintf("b = %.4f", b)

Output:

'w = 0.4838'
'b = 0.4446'

After initializing the model parameters, we will define the learning rate α and epochs for the gradient descent process.

Step 3: Specify the learning rate and number of epochs

We will set the learning rate to 0.01 to update the model parameters. We must also specify the maximum number of iterations (epochs) for parameter updates in the gradient descent process. We will use 20 epochs to implement SGD.

# Define learning rate and number of epochs
lr <- 0.01
epochs <- 20
sprintf("learning rate = %.2f", lr)
sprintf("epochs = %d", epochs)

Output:

'learning rate = 0.01'
'epochs = 20'

After initializing the model parameters and specifying the learning rate and the number of epochs, we will implement the logic to update the model parameters using the stochastic gradient descent algorithm.

Step 4: Implement logic to update the model parameters using SGD

To implement the logic for SGD, we will use the following steps:

  • We will go through each point in the dataset for the given number of epochs. We will predict the y value using xi and the linear equation w xᵢ + b for each point in the dataset.
  • Next, we will compute the gradients of the loss function with respect to the parameters w and b. As we have used mean squared error as the loss function, we can calculate the gradients of the loss function at point (xᵢ,yᵢ) as follows:
$$\begin{aligned} J(w, b) &= (y_i - (w x_i + b))^2 \\ \frac{\partial J}{\partial w} &= -2 x_i \left( y_i - (w x_i + b) \right) \\ \frac{\partial J}{\partial b} &= -2 \left( y_i - (w x_i + b) \right) \end{aligned}$$
  • After calculating the gradients, we will update the model parameters w and b as follows:
$$\begin{aligned} w &= w - \alpha \cdot \frac{\partial J}{\partial w} \\ b &= b - \alpha \cdot \frac{\partial J}{\partial b} \end{aligned}$$
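Before using hand-derived gradient formulas in a training loop, it can help to check them numerically. The sketch below compares the analytic gradients above with central finite differences at one point. The values of w, b, xi, yi, and the step h are arbitrary assumptions chosen for illustration.

```r
# Squared-error loss for a single point (xi, yi)
loss <- function(w, b, xi, yi) (yi - (w * xi + b))^2

# Analytic gradients taken from the formulas above
grad_w <- function(w, b, xi, yi) -2 * xi * (yi - (w * xi + b))
grad_b <- function(w, b, xi, yi) -2 * (yi - (w * xi + b))

# Arbitrary illustrative values (assumptions, not taken from the dataset)
w <- 0.5
b <- 0.4
xi <- 2
yi <- 11
h <- 1e-5  # small step for the central finite difference

# Central finite-difference approximations of the two partial derivatives
num_w <- (loss(w + h, b, xi, yi) - loss(w - h, b, xi, yi)) / (2 * h)
num_b <- (loss(w, b + h, xi, yi) - loss(w, b - h, xi, yi)) / (2 * h)

print(c(analytic = grad_w(w, b, xi, yi), numeric = num_w))
print(c(analytic = grad_b(w, b, xi, yi), numeric = num_b))
```

If the analytic and numeric values agree closely, the derivative formulas are very likely correct.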

We will repeat the update process for the defined number of epochs. This entire process can be implemented as follows:

# Repeat the update process epoch number of times
for (epoch in 1:epochs) {
  # Update the model parameters for each data point
  for (i in 1:n) {
    # Select a data point (xi,yi)
    xi <- x[i]
    yi <- y[i]
    # Predict y value using xi
    y_pred <- w * xi + b
    # Compute gradients using y_pred and yi assuming MSE as the loss function
    dw <- -2 * (yi - y_pred) * xi
    db <- -2 * (yi - y_pred)
    # Update model parameters
    w <- w - lr * dw
    b <- b - lr * db
  }
}

After executing this code, the parameters w and b are updated, as shown in the following output:

sprintf("w = %.4f", w)
sprintf("b = %.4f", b)

Output:

'w = 2.8466'
'b = 4.5495'

As you can see, the stochastic gradient descent process gives w as 2.8466 and b as 4.5495, which gives us the linear equation y = 2.8466x + 4.5495. We generated the dataset around the line y = 3x + 5 by introducing noise, and the model parameters estimated using SGD are close to those of the original line. Hence, we have successfully implemented SGD in R.
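One way to judge the SGD estimates is to compare them with the ordinary least squares fit from R's built-in lm() function, which minimizes the same squared-error loss in closed form. The sketch below recreates the dataset from Step 1 and is meant only as a reference point; the variable names `ols_w` and `ols_b` are introduced here for illustration.

```r
# Recreate the synthetic dataset from Step 1
set.seed(42)
n <- 100
x <- runif(n, 0, 10)
y <- 3 * x + 5 + rnorm(n, 0, 2)

# Ordinary least squares fit of y = b + w * x, used here as a reference
fit <- lm(y ~ x)
ols_b <- unname(coef(fit)["(Intercept)"])
ols_w <- unname(coef(fit)["x"])

print(sprintf("OLS w = %.4f", ols_w))
print(sprintf("OLS b = %.4f", ols_b))
```

The SGD estimates should land near these values rather than exactly on them, since SGD with a constant learning rate keeps fluctuating around the minimum.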

For reusability, you can rewrite the code in a function as follows:

sgd_linear_regression <- function(x, y, lr = 0.01, epochs = 20) {
  # Get the number of data points in the dataset
  n <- length(y)
  # Initialize model parameters
  w <- runif(1)
  b <- runif(1)
  # Repeat for the given number of epochs
  for (epoch in 1:epochs) {
    # Update the model parameters for each data point
    for (i in 1:n) {
      # Select a data point (xi,yi)
      xi <- x[i]
      yi <- y[i]
      # Predict y value using xi
      y_pred <- w * xi + b
      # Compute gradients using y_pred and yi assuming MSE as the loss function
      dw <- -2 * (yi - y_pred) * xi
      db <- -2 * (yi - y_pred)
      # Update model parameters
      w <- w - lr * dw
      b <- b - lr * db
    }
  }
  # Return final weight and bias
  return(list(weight = w, bias = b))
}

After defining the function, you can use it to fit the linear equation y = wx + b to any dataset with MSE as the loss function as follows:

# Set random seed
set.seed(42)
# Define the number of samples
n <- 100
# Generate random x values between 0 and 10 from a uniform distribution
x <- runif(n, 0, 10)
# Generate y values around y = 3x + 5 with Gaussian noise (mean 0, standard deviation 2)
y <- 3 * x + 5 + rnorm(n, 0, 2)
# Use SGD function to optimize model weights for linear regression
model <- sgd_linear_regression(x, y, lr = 0.01, epochs = 20)
# Print the model weights
print(sprintf("w = %.4f", model$weight))
print(sprintf("b = %.4f", model$bias))

Output:

[1] "w = 2.8466"
[1] "b = 4.5495"

We have implemented the stochastic gradient descent algorithm, which uses a fixed number of iterations to optimize the model parameters. However, the model can also converge before executing the defined number of epochs. If we continue running the algorithm after we have found the optimal model parameters, it will only be a waste of time and computational resources. Hence, we can implement an SGD algorithm that stops early if the model converges. Let’s discuss how to do so.

Implement SGD with early stopping in R

To implement SGD with early stopping, we will implement the following changes to the sgd_linear_regression() function defined in the previous section.

  • We will compute the loss function after each epoch while executing the algorithm. For this, we will use mean squared error (MSE).
  • We will define a threshold value of 10⁻⁵. If the absolute difference between the computed loss for an epoch and the computed loss for its previous epoch is less than the threshold value, we will say that the model has converged and stop the algorithm execution.

We can implement these changes as follows:

# Define the MSE function for calculating loss after each epoch
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)
sgd_linear_regression <- function(x, y, lr = 0.01, epochs = 20) {
  # Get the number of data points in the dataset
  n <- length(y)
  # Initialize model parameters
  w <- runif(1)
  b <- runif(1)
  # Assign negative infinity as previous loss as a placeholder
  prev_loss <- -Inf
  # Define the threshold for an early stop (1e-5 is the same value as 10e-6)
  threshold <- 1e-5
  # Repeat for the given number of epochs
  for (epoch in 1:epochs) {
    # Update the model parameters for each data point
    for (i in 1:n) {
      # Select a data point (xi,yi)
      xi <- x[i]
      yi <- y[i]
      # Predict y value using xi
      y_pred <- w * xi + b
      # Compute gradients using y_pred and yi assuming MSE as the loss function
      dw <- -2 * (yi - y_pred) * xi
      db <- -2 * (yi - y_pred)
      # Update model parameters
      w <- w - lr * dw
      b <- b - lr * db
    }
    # Compute loss on the entire dataset after an epoch
    y_preds <- w * x + b
    curr_loss <- mse(y, y_preds)
    # Early stopping condition
    if (abs(prev_loss - curr_loss) < threshold) {
      print(sprintf("Early stopping at epoch %d.", epoch))
      break
    }
    # Assign current loss to previous loss and proceed to the next epoch
    prev_loss <- curr_loss
  }
  # Return final weight and bias
  return(list(weight = w, bias = b))
}

We can use the updated SGD function as shown in the following example:

# Set random seed
set.seed(42)
# Define the number of samples
n <- 100
# Generate random x values between 0 and 10 from a uniform distribution
x <- runif(n, 0, 10)
# Generate y values around y = 3x + 5 with Gaussian noise (mean 0, standard deviation 2)
y <- 3 * x + 5 + rnorm(n, 0, 2)
# Use SGD with early stopping for linear regression
model <- sgd_linear_regression(x, y, lr = 0.01, epochs = 100)
# Print the model weights
print(sprintf("w = %.4f", model$weight))
print(sprintf("b = %.4f", model$bias))

Output:

[1] "Early stopping at epoch 23."
[1] "w = 2.8466"
[1] "b = 4.5497"

In this output, you can see that we allowed the SGD algorithm to run for up to 100 epochs. However, it stops at epoch 23 because the change in loss between consecutive epochs falls below the threshold.

Applications of SGD for deep learning in R

The TensorFlow, h2o, and torch libraries allow us to use stochastic gradient descent (SGD) for training machine learning and deep learning models in R. Let’s discuss how to use SGD while training models using these libraries in R.

Using SGD with TensorFlow in R

The keras package, which is the R interface to Keras running on TensorFlow, provides us with the optimizer_sgd() function that we can use to create SGD optimizers to train models. To train models using the SGD algorithm, we first create an SGD optimizer using the optimizer_sgd() function. Then, we assign this optimizer to the optimizer parameter in the compile() function as follows:

library(keras)
library(tensorflow)
# Define model
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = 784) %>%
  layer_dense(units = 10, activation = "softmax")
# Compile with SGD optimizer
model %>% compile(
  optimizer = optimizer_sgd(learning_rate = 0.01),
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
# Train model
# x_train (784 features per row) and y_train (one-hot labels for 10 classes)
# are assumed to be prepared beforehand
model %>% fit(x_train, y_train, epochs = 5, batch_size = 32, validation_split = 0.2)

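In this code, we have used the ReLU activation function in the hidden layer and the SGD optimizer to train the model. Because batch_size is set to 32, each update uses the gradient averaged over 32 samples, so this is mini-batch SGD rather than the one-sample-at-a-time version we implemented earlier.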

Using SGD with h2o library in R

We can also train deep learning models with SGD using the h2o library. The h2o.deeplearning() function trains neural networks with stochastic gradient descent and, by default, adapts the learning rate using ADADELTA (adaptive_rate = TRUE). To use plain SGD with a fixed learning rate, we can set the adaptive_rate parameter to FALSE and specify the learning rate using the rate parameter, as shown in the following example:

library(h2o)
h2o.init()
# Load the built-in iris dataset into an H2O frame
iris_h2o <- as.h2o(iris)
# Train deep learning model using plain SGD with a fixed learning rate
model <- h2o.deeplearning(x = 1:4,
                          y = 5,
                          training_frame = iris_h2o,
                          activation = "Rectifier",
                          hidden = c(10, 10),
                          epochs = 20,
                          adaptive_rate = FALSE,
                          rate = 0.01)

Using SGD with the torch library in R

We can also use stochastic gradient descent through the optim_sgd() function from the torch library in R. For instance, we can use optim_sgd() to define an optimizer for training a neural network model as follows:

library(torch)
library(dplyr)
# Define a simple feedforward neural network
net <- nn_module(
  initialize = function() {
    self$fc1 <- nn_linear(4, 10)
    self$fc2 <- nn_linear(10, 3)
  },
  forward = function(x) {
    x %>%
      self$fc1() %>%
      nnf_relu() %>%
      self$fc2()
  }
)
model <- net()
# Define loss function
criterion <- nn_cross_entropy_loss()
# Define SGD optimizer
optimizer <- optim_sgd(model$parameters, lr = 0.01)
# Training loop
# x (feature tensor with 4 columns) and y (class-label tensor) are assumed
# to be prepared beforehand; each iteration here uses all of x in one update
for (epoch in 1:100) {
  optimizer$zero_grad()
  output <- model(x)
  loss <- criterion(output, y)
  loss$backward()
  optimizer$step()
}

In this code, we have used the optim_sgd() function to create an SGD optimizer and train a model using the torch library in R.

We have discussed the implementation and applications of the stochastic gradient descent algorithm in R. Now, let’s discuss the advantages and disadvantages of SGD.

Advantages and disadvantages of stochastic gradient descent

Updating the model parameters based on individual data points gives SGD several advantages over gradient descent. However, it also leads to some limitations. Let’s first look at the advantages of stochastic gradient descent.

  • In gradient descent, we traverse the entire training dataset to compute the gradient for every update. SGD updates the model parameters using just one sample at a time, which makes it faster.
  • SGD updates the model parameters more frequently, which leads to faster learning at the start of the model training process and often helps the model reach a good solution quickly.
  • In gradient descent, we load the entire dataset into memory, whereas SGD needs only one data point at a time. This makes SGD ideal for online learning, where data arrives in real-time. It also helps us train ML models using very large datasets.
  • The parameter updates in SGD are noisy. For non-convex problems with multiple local minima, this noise can help the model escape shallow local minima and explore the loss surface, which may lead to a better solution.

Along with its advantages, SGD also has certain disadvantages:

  • As SGD calculates the gradient of the loss function using a single data point, the parameter updates are noisy. Due to this, the model struggles to converge smoothly. Compared to SGD, gradient descent has a smoother convergence as the gradients are calculated on the entire training dataset.
  • Parameter updates in SGD are more sensitive to the learning rate α. The model training process slows down if α is too small. If α is too high, the model parameters may fail to converge. Hence, we must carefully choose α while using stochastic gradient descent as the optimization algorithm for training models.
  • As the name suggests, SGD is a stochastic algorithm. Hence, if we train a model multiple times using SGD, we might not always get the same optimal model parameters.
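The sensitivity to the learning rate can be seen on the synthetic dataset used in this article. The following sketch runs the same per-sample update rule with three learning rates. The function name `sgd_fit`, the fixed starting values w = 0 and b = 0, and the rates 0.0001 and 0.1 are assumptions chosen for illustration; only 0.01 comes from the article.

```r
# Per-sample SGD for y = w * x + b, following the update rule used in this article
sgd_fit <- function(x, y, lr, epochs = 20) {
  # Fixed start (an assumption) so that only the learning rate differs between runs
  w <- 0
  b <- 0
  for (epoch in seq_len(epochs)) {
    for (i in seq_along(y)) {
      err <- y[i] - (w * x[i] + b)
      # Same as w - lr * dw and b - lr * db with dw = -2 * err * x, db = -2 * err
      w <- w + lr * 2 * err * x[i]
      b <- b + lr * 2 * err
    }
  }
  c(w = w, b = b)
}

# Recreate the synthetic dataset from Step 1
set.seed(42)
x <- runif(100, 0, 10)
y <- 3 * x + 5 + rnorm(100, 0, 2)

slow <- sgd_fit(x, y, lr = 0.0001)  # very small steps
mid  <- sgd_fit(x, y, lr = 0.01)    # the rate used in this article
big  <- sgd_fit(x, y, lr = 0.1)     # large steps, likely to diverge on this data

print(slow)
print(mid)
print(big)  # may show very large values, Inf, or NaN
```

With the smallest rate, the intercept b is expected to remain far from 5 after 20 epochs because each step is tiny. With the largest rate, the updates tend to overshoot and grow instead of settling.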

Conclusion

Stochastic gradient descent is one of the cornerstone algorithms in machine learning and deep learning. Due to its simplicity and scalability, it is a go-to algorithm for training complex models with large datasets. In this article, we discussed the basics of gradient descent and stochastic gradient descent. We also discussed the implementation and applications of SGD.

To build upon the concepts discussed in this article, you can take this course on Linear regression with R. You might also like this course on the Basics of Causal Inference with R that discusses how to figure out how different variables in the dataset influence your results.

Frequently asked questions

1. What are the three types of gradient descent?

The three types of gradient descent are batch gradient descent (also called gradient descent or vanilla gradient descent), stochastic gradient descent (SGD), and mini-batch gradient descent.
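The three variants differ only in how many data points contribute to each update. The sketch below makes this explicit with a single hypothetical function, `gd_linear_regression()`, whose `batch_size` argument selects the variant. Shuffling the data each epoch and starting from w = 0 and b = 0 are assumptions of this sketch, not something the article prescribes.

```r
# Gradient descent family for y = w * x + b with MSE loss.
# batch_size = length(y) gives batch gradient descent,
# batch_size = 1 gives SGD, and anything in between gives mini-batch gradient descent.
gd_linear_regression <- function(x, y, batch_size, lr = 0.01, epochs = 20) {
  n <- length(y)
  w <- 0
  b <- 0
  updates <- 0
  for (epoch in seq_len(epochs)) {
    idx <- sample(n)  # shuffle the data each epoch (assumed here)
    for (start in seq(1, n, by = batch_size)) {
      batch <- idx[start:min(start + batch_size - 1, n)]
      err <- y[batch] - (w * x[batch] + b)
      # Average the per-point gradients over the current batch, then update
      w <- w - lr * mean(-2 * err * x[batch])
      b <- b - lr * mean(-2 * err)
      updates <- updates + 1
    }
  }
  list(weight = w, bias = b, updates = updates)
}

# Recreate the synthetic dataset used in this article
set.seed(42)
x <- runif(100, 0, 10)
y <- 3 * x + 5 + rnorm(100, 0, 2)

batch_run <- gd_linear_regression(x, y, batch_size = 100)  # 1 update per epoch
mini_run  <- gd_linear_regression(x, y, batch_size = 10)   # 10 updates per epoch
sgd_run   <- gd_linear_regression(x, y, batch_size = 1)    # 100 updates per epoch

# Total number of parameter updates over 20 epochs for each variant
print(c(batch = batch_run$updates, mini = mini_run$updates, sgd = sgd_run$updates))
```

For the same number of epochs, smaller batches mean more parameter updates, each based on a cheaper and noisier gradient estimate.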

2. What is the difference between stochastic and vanilla gradient descent?

Vanilla gradient descent computes the gradients using the entire training dataset and updates the model parameters once in each epoch. Stochastic gradient descent computes the gradient for each data point and updates the model parameters multiple times in an epoch.

3. Which is faster, SGD or GD?

Stochastic gradient descent (SGD) is usually faster than gradient descent (GD) on large datasets because each update uses a single data point and is therefore much cheaper to compute. The updates are noisier, however, so SGD may need more of them to settle near the minimum.

4. Why is stochastic gradient descent better?

Stochastic gradient descent is memory efficient and updates the model parameters using one data point at a time. This helps in building complex models with large datasets at scale.

5. Is Adam optimizer faster than SGD?

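Often, yes. The Adam optimizer adapts the learning rate for each parameter and typically converges in fewer iterations than plain SGD, which is one reason it is among the most popular optimization algorithms for training deep learning models. However, it is not faster on every problem.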

Codecademy Team
