An Introduction to Gradient Descent: A Powerful Optimization Algorithm

Fabrizio Ferrari
Feb 12, 2023 · 15 min read



Gradient descent is an essential optimization algorithm used heavily in machine learning. It is a flexible tool that lets us find the parameter values that minimize a model's cost function. Whether you're training a simple linear regression model or a complex neural network, gradient descent is an algorithm you need to understand and master: it is the primary driving force behind many of the most successful machine learning models and a crucial element of their training process.

Table of Contents

1. Introduction:
- Definition of gradient descent and why it is essential in machine learning
- Brief overview of the different types of gradient descent algorithms (e.g., batch, stochastic, mini-batch)

2. The mathematics behind gradient descent:
- Explanation of the cost function
- Introduction to derivatives and how they are used to find the slope of the cost function
- Description of the gradient descent algorithm and how it uses partial derivatives to update the model's parameters

3. Examples of gradient descent in action:
- Use a simple linear regression model to demonstrate how gradient descent is used to optimize the model's parameters
- Linear regression model from scratch in Python

4. Conclusion

Introduction

Definition of gradient descent and why it is essential in machine learning

Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (J). It is a first-order iterative optimization algorithm for finding the minimum of a function.

In machine learning, gradient descent is used to update the parameters of a model to minimize the error between the predictions and the actual output. This matters because what ultimately counts for a machine learning algorithm is making accurate predictions on unseen data. By minimizing the error between the model's inference and the actual output, we improve the model's accuracy.

There are several types of gradient descent algorithms, including batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Every kind of gradient descent algorithm has unique characteristics, and the applicable algorithm will depend on the specific requirements of the machine learning problem.

In summary, gradient descent is an important optimization algorithm widely used in machine learning to improve the accuracy of predictive models. It works by iteratively optimizing the parameters of a model to minimize the error between the expected and the actual output.

A brief overview of the different types of gradient descent algorithms (e.g., batch, stochastic, mini-batch)

Below is a brief overview of the three most common types of gradient descent algorithms:

Batch gradient descent: In batch gradient descent, the gradient of the cost function is calculated over the entire training dataset, and the parameters are updated accordingly. This process is repeated until the parameters converge to a minimum of the cost function. One of the main advantages of batch gradient descent is that it is guaranteed to converge to the global minimum of the cost function, assuming the cost function is convex. However, it can be slow to converge, especially for large datasets.

Stochastic gradient descent: In contrast to batch gradient descent, stochastic gradient descent uses only a single training example to calculate the gradient of the cost function at each step. This means the parameters are updated more frequently, which can lead to faster convergence. However, the updates are noisier, which can make the optimization process more unstable. In addition, stochastic gradient descent can get stuck in local minima, as it doesn't consider the entire dataset when updating the parameters.

Mini-batch gradient descent: Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent. It uses a small "mini-batch" of training samples to calculate the gradient of the cost function at each step, rather than the entire dataset (like batch gradient descent) or a single sample (like stochastic gradient descent). This can lead to faster convergence and more stable updates than stochastic gradient descent. However, it is still prone to getting stuck in local minima, as it only considers part of the dataset when updating the parameters.

The choice of gradient descent algorithm will depend on the specific requirements of the machine learning problem at hand. Batch gradient descent is guaranteed to converge to the global minimum (for a convex cost function) but can be slow. Stochastic gradient descent is fast to converge but can be unstable and prone to local minima. Mini-batch gradient descent is a good compromise between the two but is still prone to local minima.
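To make the differences concrete, here is a minimal sketch of the three update loops, using the gradient of the MSE for a simple linear model y = mx + b as the cost being minimized:

import numpy as np

def compute_gradient(params, X, y):
    # Gradient of the MSE cost for a linear model y ≈ m*X + b,
    # evaluated on whatever batch of data is passed in; params = [m, b]
    m, b = params
    err = m * X + b - y
    return np.array([2 * np.mean(X * err), 2 * np.mean(err)])

def batch_gd(params, X, y, lr, epochs):
    # Batch GD: one update per pass, using the entire dataset
    for _ in range(epochs):
        params = params - lr * compute_gradient(params, X, y)
    return params

def stochastic_gd(params, X, y, lr, epochs):
    # SGD: one update per (shuffled) training example
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            params = params - lr * compute_gradient(params, X[i:i+1], y[i:i+1])
    return params

def minibatch_gd(params, X, y, lr, epochs, batch_size=32):
    # Mini-batch GD: one update per small random batch of examples
    for _ in range(epochs):
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            params = params - lr * compute_gradient(params, X[batch], y[batch])
    return params

The only difference among the three is how much data each gradient computation sees, which is exactly the cost-versus-noise trade-off described above.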

The mathematics behind gradient descent

To understand how gradient descent works, it is essential to first understand the concepts of the cost function and the derivative.

Explanation of the Cost Function in the context of the Gradient Descent algorithm

A cost function (also known as loss function or objective function) is a function that measures the distance between the observed values and the values predicted by the machine learning model. The goal of a model is to minimize the cost function, which means to find the values of the model parameters that result in the overall smallest difference between the predicted outputs and the actual outputs. There are many different cost functions, and choosing the specific form depends on the type of machine learning problem to be solved; for example, in supervised problems, we can use mean squared error for regression and cross-entropy for classification.

We can see a representation of the error between actual and predicted values in the below graph. The blue dots are the observations. The black line is our prediction line (with equation y = mx + b, where m and b are the parameters to be tuned by the model with the support of the cost function) that passes through the observed points and fits them in the best way; it contains the predicted points. The red dot on the black line is our prediction for the observation y_i. The light blue line represents the error between the observation y_i and the prediction ŷ_i.

The cost function works by averaging the losses over all the samples in the data, giving the machine learning model feedback so that it may change the parameters to reduce error through the optimization process.

A cost function used in gradient descent should have the following characteristics:
1. It should be continuous and differentiable: this is necessary in order to calculate the cost function's gradient and use it to update the model's parameters.
2. It should be convex: a convex cost function has a single global minimum, which makes it easier to find the optimal result using gradient descent. If the cost function is not convex, it may have multiple local minima, making it more challenging to find the global minimum. For example, f(x) = x² is convex, with a single minimum at x = 0, while f(x) = x⁴ − x² has two separate local minima.

Overall, a reasonable cost function should be continuous, differentiable, convex, smooth, and well-behaved to ease the convergence of the gradient descent algorithm to the optimal solution.

As we said, there are numerous different cost functions in machine learning. Each has its use cases, e.g., depending on whether it is a regression problem or a classification problem. Still, one of the most popular is Mean Squared Error (MSE). The MSE is ideally suited to optimization via gradient descent precisely because it has all the above characteristics.

Below we can see the plot of a parabola as a graphical representation of the MSE function.

Mean Squared Error (MSE)

Mean Squared Error (also known as Mean Squared Deviation or L2-error) is a commonly used cost function in regression problems. It measures an estimator's quality by averaging the squared difference between the observed and predicted values. The output is a single number ranging from 0 to infinity, with a value of 0 representing a perfect prediction and larger values indicating a less accurate model, i.e., the lower the MSE, the better the model fits the data.

The MSE takes the sum of squared differences between the observed values and the corresponding predicted values and divides that sum by the number of observations, i.e., it takes the sum of squared errors (SSE) and divides it by the sample size.
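Written out, for N observations y_i and corresponding predictions ŷ_i:

MSE = (1/N) * Σ (y_i − ŷ_i)², where the sum runs over i = 1, …, N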

Introduction to derivatives and how they are used to find the slope of the cost function

Derivative

As we said before, a cost function should be differentiable, in the sense that it should have a derivative at every point. This is important because, through the derivative, the gradient descent optimization algorithm can determine in which direction the weights need to be adjusted to take a step towards the minimum of the cost function, thereby yielding a lower error between predicted and actual output.

The derivative tells us the function's rate of change (the slope) at a particular point, the point of tangency. Knowing the slope tells us the direction in which the function is heading at that specific point; for example, the slope of a linear function tells us whether the line rises or falls as x increases.

To find the slope of a function at a particular point, we first find the slope of the secant line between the point of interest and another point on the curve; as that second point moves closer and closer to the first, the slope of the secant line approaches the slope of the tangent line, which is the derivative at that point.
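As a quick numerical illustration, using the same f(x) = x² that appears later in this post, we can watch the slope of the secant line approach the true derivative f'(3) = 6 as the second point moves closer:

def f(x):
    return x**2

x = 3.0  # point of interest; the true slope here is f'(3) = 6
for h in [1.0, 0.1, 0.01, 0.001]:
    secant_slope = (f(x + h) - f(x)) / h  # slope of the secant line
    print(f"h = {h}: secant slope = {secant_slope:.4f}")
# The secant slope works out to 6 + h, so it approaches the tangent slope 6 as h shrinks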

Partial Derivatives and Gradients

A partial derivative is the derivative of a function with respect to one variable holding all other variables constant. In other words, it represents the rate of change of a function with respect to a single variable. Partial derivatives are used to determine how a change in one variable affects the overall function.

The gradient extends this idea to functions of several variables: it is the vector of all the partial derivatives of a function, and it points in the direction of steepest ascent. Overall, the gradient is a key concept in machine learning, as it allows us to optimize the model's parameters and improve the performance of our models.

Description of the gradient descent algorithm and how it uses partial derivatives to update the model’s parameters

Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would “learn” from its experience.
Arthur Samuel — Artificial Intelligence: A Frontier of Automation

Arthur Samuel described machine learning as an automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and a mechanism for altering the weight assignment to maximize performance.

This is precisely what is done with the gradient descent algorithm, where the automatic means of testing is the cost function, and the mechanism for altering the weights is the gradient descent optimization process. Indeed, in GD, we first initialize our model parameters to random values, define our cost function, and calculate its partial derivatives with respect to each parameter. The resulting gradient is then used to update the model's parameters, which involves taking small steps in the direction of the negative gradient. We move against the gradient because the gradient points up the slope rather than down it, so stepping in the opposite direction decreases our error.
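In symbols, each parameter θ is updated by a step proportional to the negative of the corresponding partial derivative of the cost function J, with the step size controlled by the learning rate α:

θ_new = θ_old − α * ∂J/∂θ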

This process is repeated until the parameters converge to a minimum of the cost function. At this point, the model is said to be trained, and the parameters are ready to be used for inference on unseen data.

In summary, these are the steps needed to turn a function into a machine learning classifier using the gradient descent optimization process:

  1. Initialize the weights.
  2. Use these weights to make predictions for each training sample.
  3. Calculate the model's performance based on these predictions using the cost function.
  4. Calculate the cost function gradient to determine how to adjust each weight to minimize loss.
  5. Update all weights based on this calculation.
  6. Repeat steps 2 to 5.
  7. Continue until the training process is stopped (for example, when the model is accurate enough or the desired number of iterations has been reached).

We can see an example of the process in the code snippet below. The example uses a simple function f(x) = x² and its gradient f’(x) = 2x. The initial value of x is set to 5, and the learning rate is set to 0.1. The algorithm performs 20 iterations of gradient descent, and the results are plotted using matplotlib.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Define the function to be minimized
def func(x):
    return x**2

# Define the gradient of the function
def grad(x):
    return 2*x

# Define the initial value of x and the learning rate
x = 5
lr = 0.1

# Create empty lists to store the values of x and of the function
x_list = []
y_list = []

# Perform gradient descent: at each step, move x against the gradient
for i in range(20):
    x_list.append(x)
    y_list.append(func(x))
    x = x - lr*grad(x)

# Plot the results
plt.scatter(x_list, y_list)
plt.text(3, 20, 'GD\noptimization\nprocess')
plt.annotate('', xy=(1, 2.5), xytext=(4., 20),
             arrowprops=dict(arrowstyle='->', connectionstyle="arc3,rad=-.1"))
plt.xlabel("weight")
plt.ylabel("loss")
sns.despine()
plt.title("Gradient Descent")
plt.savefig('GD optimization process')
plt.show()

Examples of gradient descent in action

A simple linear regression model to demonstrate how gradient descent is used to optimize the model’s parameters

Now that we have a basic understanding of gradient descent and its mathematics, let’s look at an example of how it can be used in practice.

Linear regression is a simple and widely used statistical model to predict a continuous outcome variable y given one or more input variables x.

In this section, we will build a simple univariate linear regression model from scratch in Python, using the gradient descent algorithm to optimize the model’s parameters.

Our goal is to minimize the MSE, and minimizing the cost function is exactly the task of the Gradient Descent algorithm. It takes the parameters m and b and tunes them until the minimum is reached, by taking the partial derivatives of the MSE with respect to these parameters.
The parameters are then updated by iterating through our derived functions, gradually minimizing the MSE.

In this process, we need to initialize a critical (hyper)parameter, the learning rate; it defines the length of the steps we take when updating the parameters at each iteration and helps our model converge nicely to the minimum point of the MSE function.

Taking the partial derivative with respect to m means we differentiate with respect to the parameter m and ignore what is going on with b. To take these partial derivatives, we need to use the chain rule. We use the chain rule when we need to take the derivative of a function that contains another function inside. Indeed, in our MSE equation, we have two functions: an outside function ( )² that contains the inside function y − (mx + b).

The chain rule says that we have to take the derivative of the outside function, leaving the inside function as is, and multiply it by the derivative of the inside function:
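Applying this rule to the MSE gives the two partial derivatives we need; these are the same expressions implemented in the gradient function later in this post:

∂MSE/∂m = (2/N) * Σ (m·x_i + b − y_i) · x_i
∂MSE/∂b = (2/N) * Σ (m·x_i + b − y_i)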

Linear regression model from scratch in Python

We now have our cost function and our gradient; we can therefore walk through an empirical example of the Gradient Descent algorithm for a linear regression model in Python code.

Here are the steps we are going through:

1. Define some random data.

2. Define the model:

y = mx + b

3. Initialize the model parameters (weights):
* m as the slope of our linear function
* b as the intercept

4. Initialize the hyperparameters:
* learning rate (alpha)
* iterations, i.e., the number of iterations we want to perform

5. Define the cost function: the function to measure the performance of the model, which for us is MSE.

6. Use the initialized weights to make predictions.

7. Calculate the model's performance based on these predictions using the cost function.

8. Calculate the gradient of the cost function.

9. Optimize the parameters with the gradient descent algorithm: once we have calculated the gradient of the MSE, we can use it to update the values of m and b using gradient descent.

10. Train the model: to train the model, we repeat steps 6 to 9 for a fixed number of iterations or until the parameters converge to a minimum value of the MSE.

With this model in mind, let’s start by defining some functions in Python.

The functions we are going to create are:
- st_scale: Standardizes the input data to have mean 0 and standard deviation 1.
- plot_regression: Plots the linear regression model with a scatter plot of the data points.
- cost_function: Calculates the mean squared error between the actual and predicted values.
- gradient: Calculates the gradient of the mean squared error cost function with respect to the line of best fit’s parameters (m and b).
- gradient_descent: Optimizes the line of best fit for a given set of input feature values X and actual target values y using gradient descent.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Define the function to scale data to mean 0 and std 1
def st_scale(x):
    mean = np.sum(x) / len(x)
    std = np.sqrt(np.sum((x - mean)**2) / len(x))
    return pd.Series([(i - mean)/std for i in x])

# Function to plot data and regression lines
def plot_regression(X, y, y_pred, log=None, title="Linear Regression"):
    plt.figure(figsize=(16,6))
    plt.rcParams['figure.dpi'] = 107
    plt.scatter(X, y, c='#388fd8', s=6, label='Data')
    # Optionally draw the intermediate fit lines recorded during training
    if log is not None:
        for i in range(len(log)):
            plt.plot(X, log[i][0]*X + log[i][1], lw=1, c='#caa727', alpha=0.35)
    plt.plot(X, y_pred, c='#ff7702', lw=3, label='Regression')
    plt.title(title)
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend(frameon=True, loc=0, fontsize=10, borderpad=.6)
    plt.tick_params(direction='out', length=6, color='#a0a0a0', width=1, grid_alpha=.6)
    plt.show()

# Define the cost function MSE
def cost_function(X, y, m, b):
    N = len(y)
    y_pred = m * X + b
    MSE = 1/N * np.sum((y - y_pred)**2)
    return MSE

# Define the gradient of the MSE with respect to each parameter m and b
def gradient(X, y, m, b):
    N = len(y)
    grad_m = 2/N * np.sum(X * (m * X + b - y))
    grad_b = 2/N * np.sum(m * X + b - y)
    return grad_m, grad_b
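As a quick sanity check, the analytic gradient should agree with a finite-difference approximation of the cost function; a minimal check might look like this:

# Sanity check: compare the analytic gradient with finite differences
X_check = np.array([1.0, 2.0, 3.0])
y_check = np.array([2.0, 4.0, 6.0])
m0, b0, eps = 0.5, 0.1, 1e-6

grad_m, grad_b = gradient(X_check, y_check, m0, b0)
num_m = (cost_function(X_check, y_check, m0 + eps, b0)
         - cost_function(X_check, y_check, m0 - eps, b0)) / (2 * eps)
num_b = (cost_function(X_check, y_check, m0, b0 + eps)
         - cost_function(X_check, y_check, m0, b0 - eps)) / (2 * eps)

print(grad_m, num_m)  # the two values should match to several decimal places
print(grad_b, num_b)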

Now that we have defined the cost function and the gradient of the cost function, we can define the gradient descent algorithm:

# Gradient descent: repeatedly step m and b against the gradient,
# logging the parameters and the MSE at each iteration
def gradient_descent(X, y, m, b, lr, num_iters):
    log, mse = [], []
    for i in range(num_iters):
        grad_m, grad_b = gradient(X, y, m, b)
        m -= lr * grad_m
        b -= lr * grad_b
        log.append((m, b))
        mse.append(cost_function(X, y, m, b))
    return m, b, log, mse

Now let’s generate some data and initialize the seed to have reproducible outputs.

# set seed
np.random.seed(42)
# Generate some synthetic data for X and y
X = pd.Series(np.random.randn(10000)*8)
y = pd.Series(np.random.randn(10000)*1.5 + 0.85 * X)

For linear regression models, scaling the data is essential: it ensures that all variables have similar ranges, which can improve the optimization process and prevent some variables from dominating the objective function. Without scaling, variables with large ranges may lead to a suboptimal solution, slow convergence, or even divergence.

In contrast, it is not strictly necessary to scale the target variable. However, if the target variable has a much larger scale than the input variables, as in our case, this can impact the optimization process and slow down convergence. The decision to scale the target variable ultimately depends on the specific problem and the range of the target variable.

Let’s proceed to scale both the independent and the dependent variables.

# Scale the data to mean 0 and std 1
X = st_scale(X)
y = st_scale(y)
# plot the data
plt.figure(figsize=(16, 8))
plt.scatter(X, y, s=1);
Scatter plot of the data

Ok, we have our data; we have defined our model, the cost function to measure performance, and the optimization process to adjust model parameters to fit the data; we are ready to train our model on data.

First, we initialize the model parameters m and b and set the hyperparameters: the learning rate and the number of iterations.

# Initialize the model parameters
m = 0
b = 0

# Set the hyperparameters
lr = 0.05
num_iters = 100

Everything is ready to train the model; let’s get started.

# Train the model
m, b, log, mse = gradient_descent(X, y, m, b, lr, num_iters)

# Print the trained parameters
print(f"m = {m}, b = {b}, mse={mse[-1]}")
m = 0.9765249747764144, b = -2.784661390364819e-17, mse=0.0463483143378596

We can see that the line of best fit has a slope of approximately 0.98 and an intercept of approximately 0 (the value −2.8 × 10⁻¹⁷ is numerically zero, which is expected since both variables were standardized). The MSE for the line is approximately 0.05, which is a good result!

With the plot_regression function, we can visualize the gradient descent optimization process, which, by updating the parameters at each iteration, eventually finds the line of best fit.

# define the prediction with the parameters of best fit
y_pred = m*X + b

# plot data and fit lines
plot_regression(X, y, y_pred, log=log)
GD regressors optimization

Below we can see how the MSE gradually decreases after each iteration. The MSE starts from around 0.8 and progressively drops towards its minimum of approximately 0.05 as the GD process updates the parameters.

# Plot the Gradient Descent Optimization
plt.figure(figsize=(16, 3))
plt.plot(range(len(mse)), mse)
plt.title('Gradient Descent Optimization')
plt.xlabel('Epochs')
plt.ylabel('MSE')
plt.show()

Conclusion

In this blog post, we introduced gradient descent, a powerful optimization algorithm widely used in machine learning. We covered the mathematics behind it (the cost function, derivatives, partial derivatives, and gradients) and used it to train a simple linear regression model from scratch in Python.

I hope you found this article helpful! Let me know what you think, especially if there are suggestions for improvement. You can connect with me on LinkedIn: https://www.linkedin.com/in/fabrizio-ferrari-b474a920/
