Gradient Descent: The Backbone of Machine Learning Optimization

Dishant kharkar
9 min read · Aug 13, 2023


What is Gradient Descent?

• Gradient descent is an iterative optimization algorithm used throughout machine learning and deep learning. In linear regression, it updates the parameters of the linear equation (slope and intercept) based on the gradients of the cost function.

• The cost function measures the discrepancy between the predicted values and the actual target values.

• In each iteration, the parameters are adjusted in the direction that reduces the cost function, aiming to find the line that best fits the given data.

Gradient Descent
  • In the context of linear regression, gradient descent is employed to find the best-fitting line that minimises the difference between predicted and target values.
  • In summary, gradient descent is the process of iteratively fine-tuning the parameters of a linear regression model to minimize the error between predicted and actual values, resulting in a line that accurately describes the relationship between input features and target values.

Parameters for Gradient Descent:

• Parameters refer to the variables or coefficients of a model that are adjusted during the optimization process to minimize a cost function.

• These parameters define the characteristics of the model and determine its behaviour. In the case of linear regression, parameters usually correspond to the slope and intercept of the linear equation.

Let’s consider the parameters in the context of linear regression and gradient descent:

• Slope (m): The slope represents the change in the dependent variable (y) for a unit change in the independent variable (x). In a linear regression model, the slope determines the steepness of the line.

• Intercept (b): The intercept is the point where the line crosses the y-axis when x is 0. It shifts the line up or down on the y-axis. A short sketch of this linear model follows.
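
To make the roles of the slope and intercept concrete, here is a minimal sketch of the linear model that gradient descent will tune (plain Python; the function name is mine, not from any particular library):

```python
# The linear model whose parameters (m, b) gradient descent adjusts.
def predict(x, m, b):
    """Return the prediction m * x + b for a single input x."""
    return m * x + b

# Example: with slope m = 2.0 and intercept b = 1.0, an input of x = 3.0 predicts 7.0.
print(predict(3.0, m=2.0, b=1.0))  # 7.0
```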

What is the Cost Function/Loss Function?

  • In the context of gradient descent, the loss function, also known as the cost function or error function, is a mathematical representation of how well a machine learning model’s predictions match the actual target values.
  • The goal of gradient descent is to minimize this loss function, which indicates how far off the model’s predictions are from the true values.
  • For linear regression, the most common loss function used is the Mean Squared Error (MSE).
  • The MSE calculates the average of the squared differences between the predicted values and the actual target values (a short code sketch follows this list). Mathematically, the MSE is defined as:

MSE = (1/n) * Σ(yᵢ − ŷᵢ)²

Where:

  • n is the number of data points
  • yᵢ is the actual target value for the i-th data point
  • ŷᵢ is the predicted value for the i-th data point
  • The smaller the value of the MSE, the better the model’s predictions align with the actual data.
  • In other words, minimizing the MSE results in finding the best-fitting line that describes the relationship between the input features and the target values.
  • During each iteration of the gradient descent algorithm, the loss function is evaluated using the current set of model parameters (such as the slope and intercept in linear regression).
  • The calculated loss is then used to determine the direction and magnitude of parameter updates that will reduce the loss in the next iteration.
  • The goal is to iteratively adjust the parameters in a way that minimizes the loss function, thus improving the accuracy of the model’s predictions.
  • It’s important to note that while the Mean Squared Error is commonly used for linear regression, different types of machine learning models and tasks might require different loss functions.
  • The choice of the loss function depends on the characteristics of the problem you’re trying to solve and the specific goals of your model.
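
As a quick illustration of the MSE formula above, here is a minimal NumPy sketch (the function name is mine, not taken from any particular library):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared differences
    between actual target values and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Example: predictions off by 1 and 2 give an MSE of (1² + 2²) / 2 = 2.5.
print(mse([3.0, 5.0], [4.0, 3.0]))  # 2.5
```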

How does Gradient Descent work?

  • Steps of gradient descent in the context of linear regression:
  1. Initialize Parameters: Before starting the gradient descent process, you need to initialize the parameters of the linear equation. In simple linear regression (with one input feature), you have two parameters: the slope (m) and the intercept (b). These parameters define the initial line on the scatter plot of your data.
  2. Define a Cost Function: The cost function quantifies how far off your predictions are from the actual target values. In linear regression, the Mean Squared Error (MSE) is a commonly used cost function. It’s defined as the average of the squared differences between the predicted values and the actual values:
  • MSE = (1/n) * Σ(yᵢ − ŷᵢ)²
  • where n is the number of data points, yᵢ is the actual target value for the i-th data point, and ŷᵢ is the predicted value for the i-th data point.

3. Iterative Process: The heart of gradient descent lies in its iterative nature. The algorithm repeatedly adjusts the parameters to minimize the cost function.

4. Gradient Calculation: To update the parameters, you need to calculate the gradients of the cost function with respect to each parameter. The gradient represents the direction of the steepest increase of the cost function. In mathematical terms:

∂MSE/∂m = −(2/n) * Σ xᵢ(yᵢ − ŷᵢ)

∂MSE/∂b = −(2/n) * Σ (yᵢ − ŷᵢ)

Here, xᵢ is the input feature for the i-th data point.

5. Update Parameters: Once you’ve calculated the gradients, you update the parameters using the learning rate. The learning rate determines how big of a step you take in the direction of the negative gradient. If the learning rate is too small, convergence might be slow. If it’s too large, you might overshoot the optimal values.

new_slope = old_slope - learning_rate * ∂MSE/∂m

new_intercept = old_intercept - learning_rate * ∂MSE/∂b

6. Repeat: Repeat steps 4 and 5 for a certain number of iterations or until the parameters converge to a stable point. Convergence is typically determined by checking if the change in the parameters is below a certain threshold or if the cost function no longer decreases significantly.

  • Throughout the iterations, the parameters are adjusted in a way that gradually brings the predicted values closer to the actual target values, minimizing the cost function. As the cost decreases, the line defined by the parameters fits the data more closely, resulting in a better linear regression model (the sketch after this list collects these steps into code).
  • Remember that gradient descent is a foundational optimization technique in machine learning, and while I’ve explained it in the context of simple linear regression, it’s used in more complex models and neural networks as well. The choice of learning rate, the number of iterations, and other hyperparameters can significantly impact the effectiveness and efficiency of the gradient descent process.
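
Putting the steps above together, here is a minimal batch gradient descent sketch for simple linear regression (NumPy; the function name, learning rate, and iteration count are illustrative choices, not prescribed values):

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, n_iterations=5000):
    """Fit y ≈ m * x + b by minimizing the MSE with batch gradient descent."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    m, b = 0.0, 0.0                       # step 1: initialize parameters

    for _ in range(n_iterations):         # step 3: iterate
        y_pred = m * x + b                # current predictions ŷᵢ
        error = y - y_pred                # (yᵢ − ŷᵢ)

        # step 4: gradients of the MSE with respect to m and b
        grad_m = -(2 / n) * np.sum(x * error)
        grad_b = -(2 / n) * np.sum(error)

        # step 5: step against the gradient, scaled by the learning rate
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b

    return m, b

# Example: data generated from y = 2x + 1 should recover m ≈ 2 and b ≈ 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1
print(gradient_descent(x, y))
```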

What is the Learning rate?

  • The learning rate is a hyperparameter in the gradient descent algorithm that determines the step size taken in the direction of the gradient when updating the model parameters.
  • It’s a critical parameter that influences the convergence speed and stability of the optimization process.
  • In the context of gradient descent, as you iteratively update the parameters to minimize the cost function, the learning rate controls how big of a step you should take in each iteration.

Here’s what you need to understand about the learning rate:

1. Large Learning Rate: A high learning rate can lead to faster convergence initially, as the algorithm takes larger steps towards the optimal solution. However, if the learning rate is too high, it can overshoot the minimum of the cost function and cause the optimization to diverge or oscillate around the optimal point.

2. Small Learning Rate: A small learning rate leads to slower convergence because the algorithm takes smaller steps. While this might be safer in terms of avoiding overshooting, it might result in very slow convergence and require more iterations to reach the optimal solution.

3. Choosing the Right Learning Rate: Selecting an appropriate learning rate is crucial. A common approach is to start with a moderate learning rate and then adjust it based on the performance during training. You can also use techniques like learning rate schedules or adaptive methods to dynamically adjust the learning rate during training.

4. Hyperparameter Tuning: The learning rate is a hyperparameter, which means it needs to be chosen before training begins and is not learned from the data. It’s typically determined through experimentation and hyperparameter tuning. Cross-validation or grid search can be used to find the learning rate that provides the best convergence and performance.

  • The learning rate is just one of the hyperparameters that can affect the performance of the gradient descent algorithm. Finding a learning rate that converges efficiently without overshooting is essential for successfully optimizing machine learning models with gradient descent (the sketch below compares a rate that is too small, about right, and too large).
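
As a quick illustration of these regimes, the sketch below runs a fixed number of batch gradient descent steps on toy linear data with three different learning rates (all values are illustrative; the exact thresholds depend on the data):

```python
import numpy as np

def final_mse(x, y, learning_rate, n_iterations=100):
    """Run batch gradient descent for a fixed number of steps and return the final MSE."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m, b = len(x), 0.0, 0.0
    for _ in range(n_iterations):
        error = y - (m * x + b)
        grad_m = -(2 / n) * np.sum(x * error)
        grad_b = -(2 / n) * np.sum(error)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return np.mean((y - (m * x + b)) ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# Too small: the loss barely moves. Moderate: the loss converges towards 0.
# Too large: the loss grows without bound (expect overflow warnings and inf/nan).
for lr in (0.0001, 0.05, 0.5):
    print(lr, final_mse(x, y, lr))
```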

What is Epoch in Gradient Descent?

  • In gradient descent, an “epoch” refers to a single pass through the entire training dataset during the training phase.
  • In other words, one epoch is completed when the algorithm has seen and used each data point in the dataset once to update the model’s parameters. The concept of epochs is commonly used in iterative optimization algorithms, including gradient descent, to improve the model’s performance over multiple iterations.

Here’s how the concept of epochs relates to gradient descent:

  1. Single Epoch: During a single epoch, the algorithm goes through each data point in the training dataset, calculates the gradients based on the model’s current parameters, and updates the parameters to minimize the loss function. This process ensures that the model is updated using information from the entire training dataset.
  2. Multiple Epochs: It’s often beneficial to go through the dataset multiple times. Each additional epoch allows the algorithm to refine the model’s parameters further by repeatedly updating them based on the entire dataset. This is especially important when the dataset is large, as it gives the algorithm more opportunities to learn and adjust.
  3. Convergence and Performance: The number of epochs is a hyperparameter that you need to choose before training begins. The algorithm’s performance on the validation set or test set often guides the decision of how many epochs to use. Too few epochs may result in underfitting, where the model hasn’t learned enough from the data, while too many epochs could lead to overfitting, where the model starts fitting noise in the training data.
  4. Early Stopping: To prevent overfitting and save computation time, you can use a technique called “early stopping.” This involves monitoring the validation error during training and stopping the training process when the validation error starts to increase, indicating that the model’s performance on unseen data is deteriorating.
  • In summary, an epoch in gradient descent corresponds to one complete iteration through the entire training dataset. By training the model over multiple epochs, you give it more opportunities to learn and refine its parameters, ultimately improving its performance on new, unseen data (the sketch below shows an epoch-based training loop with early stopping).
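
Here is a minimal sketch of an epoch-based training loop with early stopping for the same linear model (NumPy; the train/validation split, learning rate, epoch limit, and patience value are all illustrative assumptions):

```python
import numpy as np

def train_with_early_stopping(x_train, y_train, x_val, y_val,
                              learning_rate=0.05, max_epochs=500, patience=10):
    """One epoch = one full pass over the training data.
    Training stops once the validation MSE has not improved for `patience` epochs."""
    m, b = 0.0, 0.0
    best_val, best_params, epochs_without_improvement = np.inf, (m, b), 0
    n = len(x_train)

    for epoch in range(max_epochs):
        # One epoch: every training point contributes to this parameter update.
        error = y_train - (m * x_train + b)
        grad_m = -(2 / n) * np.sum(x_train * error)
        grad_b = -(2 / n) * np.sum(error)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b

        # Monitor performance on held-out data after each epoch.
        val_mse = np.mean((y_val - (m * x_val + b)) ** 2)
        if val_mse < best_val:
            best_val, best_params, epochs_without_improvement = val_mse, (m, b), 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping: validation error stopped improving

    return best_params

# Example: noisy data from y = 2x + 1, split into training and validation sets.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 20)
y = 2 * x + 1 + rng.normal(0.0, 0.1, size=x.shape)
print(train_with_early_stopping(x[:15], y[:15], x[15:], y[15:]))
```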

Types Of Gradient Descent:

  • Gradient Descent can be classified into three main types based on the amount of data used to compute the gradient in each iteration:
  1. Batch Gradient Descent:
  • Uses the entire training dataset in each iteration to calculate the gradient of the cost function.
  • Updates model parameters based on the average gradient across all data points.
  • Provides accurate gradient estimates but can be computationally expensive for large datasets.

2. Stochastic Gradient Descent (SGD):

  • Uses only one randomly chosen training example in each iteration to calculate the gradient.
  • Updates model parameters frequently and with more noise due to the randomness of individual examples.
  • Each update is much cheaper, progress is often faster early in training, and the noise in its updates can help it escape shallow local minima.

3. Mini-Batch Gradient Descent:

  • Strikes a balance between Batch Gradient Descent and Stochastic Gradient Descent.
  • Divides the training dataset into smaller batches (mini-batches) and computes the gradient on one mini-batch in each iteration.
  • Combines the benefits of both other methods: more stable updates than SGD and faster computation than Batch Gradient Descent.
  • The mini-batch size is a hyperparameter that can be adjusted based on available computational resources.

These three types of Gradient Descent algorithms differ in terms of convergence behaviour, computational efficiency, and sensitivity to noise.

The choice of which type to use depends on the characteristics of the problem, the dataset size, and the available computational resources (the sketch below covers all three variants with a single batch-size parameter).
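
To make the distinction concrete, here is a minimal sketch in which one function covers all three variants, selected purely by the batch size (NumPy; the names and default values are my own illustrative choices):

```python
import numpy as np

def fit_linear(x, y, batch_size, learning_rate=0.05, n_epochs=100, seed=0):
    """batch_size = len(x) -> Batch GD; batch_size = 1 -> Stochastic GD;
    anything in between -> Mini-Batch GD."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    m, b = 0.0, 0.0

    for _ in range(n_epochs):
        order = rng.permutation(len(x))            # reshuffle the data every epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]  # indices of the current (mini-)batch
            xb, yb = x[idx], y[idx]
            error = yb - (m * xb + b)
            grad_m = -(2 / len(xb)) * np.sum(xb * error)
            grad_b = -(2 / len(xb)) * np.sum(error)
            m -= learning_rate * grad_m
            b -= learning_rate * grad_b

    return m, b

# Data from y = 2x + 1; all three variants should land near m ≈ 2, b ≈ 1.
x = np.linspace(0.0, 4.0, 40)
y = 2 * x + 1
print(fit_linear(x, y, batch_size=len(x)))  # Batch Gradient Descent
print(fit_linear(x, y, batch_size=1))       # Stochastic Gradient Descent
print(fit_linear(x, y, batch_size=8))       # Mini-Batch Gradient Descent
```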

If you learned something from this blog, make sure you give it a 👏🏼

Will meet you in some other blog, till then Peace ✌🏼.

Thank You!
