Demystifying Gradient Descent in Machine Learning

Dagang Wei

This article is part of the series Demystifying Machine Learning.

Introduction

Imagine you’re training a machine learning model to predict housing prices. How do you know if the model is doing a good job? You need a way to measure the error — the difference between your model’s predictions and the actual house prices. This is where the loss function comes in.

The Loss Function: How Well the Model Fits the Data

A loss function is a mathematical tool that quantifies the error of a machine learning model. It essentially calculates how badly the model is off the mark on a single example or, more commonly, across a set of examples. Common loss functions include mean squared error (for regression problems) and cross-entropy loss (for classification problems). The goal of any machine learning algorithm is to find the model parameters that minimize this loss function.

For example, the most common loss function for linear regression is Mean Squared Error (MSE):

MSE = (1/n) * Σ(y_i - ŷ_i)^2

Where:

  • n is the number of data points
  • y_i is the actual value of the dependent variable for data point i
  • ŷ_i is the predicted value of the dependent variable for data point i

ŷ_i is calculated using the linear regression model:

ŷ_i = w · x_i = w0 * 1 + w1 * x1_i + w2 * x2_i + ... + wm * xm_i

where w0, w1, ..., wm are the model's parameters (w0 is the intercept term, multiplied by a constant feature of 1) and x1_i, ..., xm_i are the features, i.e. the independent variables, of the i-th data point.
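As a quick illustration, here is a minimal sketch that computes ŷ_i and the MSE with NumPy for a tiny single-feature dataset. The numbers and the parameter vector w are made up for illustration, not taken from the example later in the article:

import numpy as np

# Toy data: 3 points, one feature; the first column is the constant 1 that multiplies w0
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])

# Hypothetical parameters w = (w0, w1)
w = np.array([0.5, 2.0])

y_hat = X @ w                    # ŷ_i = w · x_i for every data point
mse = np.mean((y - y_hat) ** 2)  # (1/n) * Σ(y_i - ŷ_i)^2
print(mse)

With these made-up numbers every prediction is off by exactly 0.5, so the printed MSE is 0.25.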

Gradient: The Direction to Lower Error

Imagine the loss function as a mountainous landscape. Your model’s parameters determine your position on that landscape, and the value of the loss function is your altitude. Gradient descent aims to lead you down into the lowest valley, minimizing error. But how do you find the downhill direction? This is where the gradient comes in.

In mathematics, the gradient of a function with multiple variables (like our loss function) is a vector. Each component of this vector represents a partial derivative, that is, the rate of change of the function with respect to one specific variable.

The beauty of the gradient lies in its direction: it always points in the direction of steepest increase of the function. Following the negative of the gradient is therefore the quickest way downhill, toward minimizing the error.
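For a concrete example, take f(w1, w2) = w1^2 + w2^2. Its gradient is ∇f = (2*w1, 2*w2), so at the point (1, 2) the gradient is (2, 4): moving along (2, 4) increases f fastest, while stepping along the negative gradient (-2, -4) decreases it fastest.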

Continuing with the linear regression example, the loss function is

MSE = (1/n) * Σ(y_i - ŷ_i)^2

The gradient with respect to weights (w) is

∇_w MSE = (-2/n) * Σ (y_i - ŷ_i) * x_i

Where:

  • ∇_w denotes the gradient with respect to the weights w
  • n is the number of data points
  • y_i is the true value of the dependent variable for data point i
  • ŷ_i is the predicted value of the dependent variable for data point i
  • x_i is the vector of input features for data point i

The gradient is a vector with the same dimensions as the weight vector w. Each element in the gradient vector represents the partial derivative of the MSE loss with respect to a particular weight.
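As a sanity check, this analytical gradient can be compared against a finite-difference approximation. The sketch below uses randomly generated data and an arbitrary weight vector purely for illustration; it is separate from the full example later in the article:

import numpy as np

np.random.seed(0)
X = np.c_[np.ones(5), np.random.rand(5, 1)]  # 5 points: constant column plus one feature
y = np.random.rand(5)
w = np.array([0.1, -0.3])                    # arbitrary weights for the check

def mse(w):
    return np.mean((y - X @ w) ** 2)

# Analytical gradient: (-2/n) * Σ (y_i - ŷ_i) * x_i
grad = (-2 / len(y)) * X.T @ (y - X @ w)

# Central finite-difference approximation of each partial derivative
eps = 1e-6
numeric = np.array([(mse(w + eps * e) - mse(w - eps * e)) / (2 * eps) for e in np.eye(2)])

print(grad, numeric)  # the two vectors agree to several decimal places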

Gradient Descent: Step by Step Against the Gradient

Gradient descent is an iterative process that optimizes the model's parameters by repeatedly stepping in the direction of the negative gradient. It is like your own intelligent compass for navigating the error landscape defined by the loss function. Here's how it works (a minimal sketch of the loop follows the steps below):

  1. Initialization: You kick things off with random values for the model’s parameters.
  2. Evaluate the Loss: The loss function tells you how “wrong” your model is, providing a sense of how far you are from the “lowest valley.”
  3. Calculate the Gradient: The gradient of the loss function is like a signpost. It points out the direction of steepest increase. Importantly, its negative points the way downhill toward error reduction.
  4. Update Parameters: Inching “downhill,” you update the model parameters a bit in the opposite direction of the gradient. The learning rate controls how big these steps are.
  5. Repeat: You continue calculating gradients and updating parameters until convergence or a suitable stopping point.
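Before applying this to linear regression, here is a tiny self-contained sketch of these five steps on a made-up one-parameter objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3):

w = 0.0                               # step 1: initialize the parameter
learning_rate = 0.1
for _ in range(100):                  # step 5: repeat
    gradient = 2 * (w - 3)            # steps 2-3: the loss is (w - 3)^2, its gradient is 2 * (w - 3)
    w = w - learning_rate * gradient  # step 4: move against the gradient
print(w)                              # converges to roughly 3.0, the minimizer of f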

Challenges Along the Way: Local Minima

Think of the error landscape as a complex terrain with hills and valleys. Gradient descent could get stuck in a local minimum, a “valley” that isn’t the lowest point overall. It thinks it’s found the best spot but hasn’t explored the entire landscape.

Overcoming Obstacles

Various strategies help mitigate the issue of local minima:

  • Momentum: By accumulating a running average of past gradients, momentum helps gradient descent power through small bumps and out of shallow local minima (see the sketch after this list).
  • Stochastic Gradient Descent (SGD): Instead of considering the entire dataset in each step, SGD uses individual examples or small batches of data, introducing noise that can help escape local minima.
  • Adaptive Learning Rates: Methods like Adagrad or Adam adjust the learning rate per parameter over time, which can help navigate tricky terrain.
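As an illustration of the first two ideas, here is a hedged sketch of one mini-batch SGD update with momentum for the linear regression setting used in this article. The function name and the hyperparameter values are assumptions made for illustration, not part of the article's code:

import numpy as np

def sgd_momentum_step(theta, velocity, X, y, batch_size=16, learning_rate=0.01, beta=0.9):
    """One mini-batch SGD update with momentum (illustrative sketch, MSE loss)."""
    # Sample a random mini-batch: the source of the noise that can help escape local minima
    idx = np.random.choice(len(y), size=batch_size, replace=False)
    X_b, y_b = X[idx], y[idx]

    # Gradient of MSE on the mini-batch: (-2/m) * Σ (y_i - ŷ_i) * x_i
    grad = (-2 / batch_size) * X_b.T @ (y_b - X_b @ theta)

    # Momentum: keep an exponentially decaying average of past steps and add it to the parameters
    velocity = beta * velocity - learning_rate * grad
    theta = theta + velocity
    return theta, velocity

Calling this in a loop (starting from velocity = np.zeros_like(theta)) replaces step 4 of plain gradient descent; the accumulated velocity lets updates carry through flat or slightly uphill regions.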

Example: Gradient Descent for Linear Regression

The code is available in this colab notebook:

import numpy as np
import matplotlib.pyplot as plt


def compute_gradient(X, y, theta):
    """Calculates the gradient of the loss function for linear regression.

    Args:
        X: The feature matrix.
        y: The target values.
        theta: The current model parameters.

    Returns:
        The gradient vector.
    """
    m = X.shape[0]
    prediction = np.dot(X, theta)
    error = prediction - y
    gradient = (2/m) * np.dot(X.T, error)
    return gradient

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    """Performs gradient descent to optimize linear regression parameters.

    Args:
        X: The feature matrix.
        y: The target values.
        learning_rate: The step size for updating parameters.
        iterations: The number of iterations to run gradient descent.

    Returns:
        The optimized model parameters (theta).
    """
    theta = np.zeros(X.shape[1])  # one parameter per feature column (intercept and slope here)

    for _ in range(iterations):
        gradient = compute_gradient(X, y, theta)
        theta -= learning_rate * gradient
    return theta


# Generate sample linear regression data
np.random.seed(42) # For reproducibility
n = 100
X = 2 * np.random.rand(n, 1)
# y = mx + b + noise
y = 3 + 5 * X + np.random.randn(n, 1)
# Reshape y from (n, 1) to (n,)
y = y.flatten()

# Add a column of ones for the intercept term
X_b = np.c_[np.ones((n, 1)), X]

# Run gradient descent
theta = gradient_descent(X_b, y)
print(f"Theta (Intercept, Slope): {theta}")

# Evaluation (Using Mean Squared Error)
predictions = np.dot(X_b, theta)
mse = np.mean((predictions - y) ** 2)
print(f"Mean Squared Error: {mse}")

# Visualization
plt.figure(figsize=(10,6))
plt.scatter(X, y, s=30)
plt.plot(X, predictions, color='red', linewidth=2)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression with Gradient Descent")
plt.show()

Output:

Theta (Intercept, Slope): [3.31755474 4.67964802]
Mean Squared Error: 0.8097550788165961

Conclusion

Gradient Descent is a fundamental algorithm in the machine learning toolkit. By iteratively moving towards the minimum of the cost function, it allows models to learn from data. Understanding and implementing Gradient Descent is a critical step for anyone embarking on a journey in machine learning. With its wide applicability and efficiency, Gradient Descent remains a go-to method for optimizing models across a broad spectrum of machine learning tasks.
