A Guide to Gradient Descent and Stochastic Gradient Descent

Mahmoud Ayach
Published in Tech Blog
4 min read · Sep 24, 2023

Gradient Descent is often hailed as the go-to algorithm for solving machine learning problems and finding the optimal model. I’ve always heard about its significance in my classes, which piqued my curiosity and led me to dive deep into its mechanics.


Gradient Descent: The Navigational Compass of ML

Gradient Descent is an optimization algorithm that minimizes the error function by iteratively moving towards the minimum. It is the backbone of many ML algorithms, guiding them towards the best set of parameters.

How Does it Work?

Imagine you are on a foggy hill, and your goal is to reach the valley below, the point of lowest elevation. You decide your next step by feeling the slope of the ground beneath you. This is analogous to how Gradient Descent works: it calculates the gradient of the loss function with respect to its parameters and moves in the direction of steepest descent.

In a machine learning context, where l is the loss function and w represents the model parameters, the update rule is given by:

w = w − α ∇ l(w)

Here, α is the learning rate, and ∇ l(w) is the gradient of the loss function l with respect to the parameters.

The model parameters are updated iteratively until convergence to the optimal solution.
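To make this rule concrete for the simple linear regression example below, assume the loss is the mean squared error over n training points (xᵢ, yᵢ) for a model that predicts ŷ = w·x:

l(w) = (1/n) Σᵢ (yᵢ − w·xᵢ)²

Differentiating with respect to w gives

∇ l(w) = (−2/n) Σᵢ xᵢ (yᵢ − w·xᵢ)

which is exactly the gradient computed inside the loop of the code below.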

Python Code for Gradient Descent in Simple Linear Regression

import numpy as np

# Define the dataset
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Initialize parameters
alpha = 0.01 # Learning rate
epochs = 1000 # Number of iterations
w = 0 # Model parameter

# Perform Gradient Descent
for epoch in range(epochs):
    y_pred = w * X  # Predictions for all samples
    gradient = (-2 / len(X)) * np.sum(X * (y - y_pred))  # Gradient of the mean squared error w.r.t. w
    w = w - alpha * gradient  # Update step

print("Optimal parameter is: w =", w)

Choosing the Step Size

The learning rate α is a crucial hyperparameter in Gradient Descent. It determines the size of the steps we take towards the minimum. Here are some guidelines for choosing the learning rate:

  • Too Small: A very small learning rate makes the model learn very slowly, leading to a long training time.
  • Too Large: A large learning rate can cause the model to overshoot the minimum and potentially cause the algorithm to diverge.
  • Adaptive Learning Rate: Some techniques adjust the learning rate during training. For example, reducing the learning rate as the number of iterations increases can help the algorithm converge more reliably.

It is common practice to experiment with different learning rates and observe their effect on the algorithm’s convergence.
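As an illustration of this experimentation (a minimal sketch reusing the toy dataset above; the specific α values are arbitrary choices), the same loop can be run with several learning rates to see which ones converge towards the least-squares value of 1.2, which crawl, and which diverge:

import numpy as np

# Same toy dataset as above
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Compare a small, a moderate, and a large learning rate
for alpha in [0.001, 0.01, 0.1]:
    w = 0.0
    for epoch in range(100):
        y_pred = w * X
        gradient = (-2 / len(X)) * np.sum(X * (y - y_pred))
        w = w - alpha * gradient
    print("alpha =", alpha, "-> w after 100 epochs =", w)

With these values, the smallest rate has not yet reached 1.2 after 100 epochs, the middle one converges to it, and the largest one diverges.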

Stochastic Gradient Descent: A Variant with a Twist

Stochastic Gradient Descent, known as SGD, is a variant of Gradient Descent that updates the model’s parameters using only a single sample at each iteration, making it more efficient and faster, especially for large datasets.

Why Is SGD Better?

  1. Efficiency: SGD is computationally less intensive, using only one sample to perform updates.
  2. Noise: The noise in the updates can help escape local minima and saddle points, potentially leading to better solutions.

The update rule for SGD is similar to that of Gradient Descent but is applied using a single sample selected randomly at each iteration:

w = w − α ∇ l(w;xᵢ)

where ∇ l(w;xᵢ) is the gradient of the loss for the i-th sample xᵢ.

Stochastic Aspect of SGD: Random Sample Selection

The essential characteristic that differentiates Stochastic Gradient Descent (SGD) from the standard Gradient Descent is the selection of samples at each iteration. In SGD, instead of using the entire dataset to compute the gradient of the loss function, a single sample is randomly selected in each iteration to perform the update.

This stochastic or random selection of samples introduces variability in the updates, which can have several effects, such as the potential to escape local minima and faster convergence, especially on large datasets.

Python Code for Stochastic Gradient Descent in Simple Linear Regression

import numpy as np

# Define the dataset
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Initialize parameters
alpha = 0.01 # Learning rate
epochs = 1000 # Number of iterations
w = 0 # Model parameter

# Perform Stochastic Gradient Descent
for epoch in range(epochs):
    for i in np.random.permutation(len(X)):  # Visit the samples in a random order each epoch
        y_pred = w * X[i]
        gradient = -2 * X[i] * (y[i] - y_pred)  # Gradient of the squared error for this single sample
        w = w - alpha * gradient

print("Optimal parameter is: w =", w)

Why does SGD Work? A Look at Variance…

While it might seem counterintuitive to use only one sample for updates, SGD works because, on average, the updates are in the correct direction.

The key is to keep the variance in the updates manageable; otherwise, the path towards the minimum becomes noisy. This is typically controlled by gradually reducing the learning rate, allowing SGD to converge effectively.
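As a rough sketch of this idea (the decay schedule α = α₀ / (1 + 0.01·epoch) used here is just one common choice, not something prescribed above), the SGD loop can be rerun with a learning rate that shrinks over the epochs, which damps the noise in the later updates:

import numpy as np

# Same toy dataset as above
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

alpha0 = 0.01  # Initial learning rate
epochs = 1000
w = 0.0

for epoch in range(epochs):
    alpha = alpha0 / (1 + 0.01 * epoch)  # Gradually reduce the step size
    for i in np.random.permutation(len(X)):  # Random sample order, as in plain SGD
        y_pred = w * X[i]
        gradient = -2 * X[i] * (y[i] - y_pred)
        w = w - alpha * gradient

print("Parameter with decaying learning rate: w =", w)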

Conclusion

Gradient Descent and its variant, Stochastic Gradient Descent, are fundamental algorithms in Machine Learning. They navigate through the parameter space and guide models towards optimal solutions.

Understanding their mechanics, mathematical formulations, and applications is essential for anyone diving into the Machine Learning world.
