BTS Machine Learning: Demystifying Gradient Descent — Part 1
Welcome to Behind The Scenes (BTS) Machine Learning! Today, let's put the spotlight on Gradient Descent, a foundational algorithm that works in the background to optimize a vast number of machine learning models.
Gradient Descent is an iterative method for finding the minimum of a function. In machine learning, this function is typically the cost or loss function, which measures how well a model’s predictions match the actual data. Minimizing the cost function is equivalent to minimizing the prediction errors, and hence, is equivalent to “training” a model.
This is probably a good time to stress that Gradient Descent is not a machine learning model, nor an algorithm that directly makes predictions of any kind. It is an optimization algorithm used to minimize errors or cost functions, and hence, to "train" such models.
Intuitive Understanding — The Mountain Climbing Analogy
Before delving into the mathematics of Gradient Descent, let's first build an intuitive understanding of how the algorithm works.
Imagine you're a hiker lost in the mountains in dense fog. To survive, you aim to reach the lowest (or as close to lowest as feasible) point in the valley as quickly as possible. You can't see the entire landscape, but you can feel the slope beneath your feet. What would you do? You would perhaps feel the slope and take steps downhill, hoping to eventually reach the lowest point!
Gradient Descent works similarly, but, of course, in the mathematical landscape of a model’s cost function. Here, reaching the lowest point in the valley means finding the set of model parameters that result in the lowest cost function value, and hence, the best model performance.
In each iteration, Gradient Descent “feels” the slope of the cost function landscape by calculating something called the gradient of the cost function and then, based on the gradient value, adjusting the model’s parameters (taking a “step”) in the direction that reduces the cost function the most.
The Math Behind the Magic
To understand the mathematics behind Gradient Descent, we must first understand what a gradient is.
In our mountain analogy, the gradient is like an arrow pointing uphill in the steepest direction. The longer the arrow, the steeper the slope. If you were to take a step in that direction, you’d climb up the hill.
For a mathematical function, the gradient tells us the direction in the parameter space that would increase the cost function the most if we moved our model’s parameters in that direction. In Gradient Descent, since our goal is to minimize the cost function, we want to move in the direction opposite to the gradient.
For a function with multiple inputs (like the parameters of a model), the gradient is a vector containing the partial derivatives of the function with respect to each input. Let's say our cost function is J(θ0, θ1, …, θn), where θ0, θ1, …, θn are the model's parameters. The gradient of this function is denoted by ∇J and is calculated as:

∇J = [∂J/∂θ0, ∂J/∂θ1, …, ∂J/∂θn]
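To make this concrete, here is a quick numerical sketch (the toy function J and the helper numerical_gradient below are illustrative, not part of any library): we approximate the gradient with finite differences and recover the vector of partial derivatives.

```python
import numpy as np

def J(theta):
    # Toy cost function: J(theta0, theta1) = theta0^2 + 2 * theta1^2
    return theta[0] ** 2 + 2 * theta[1] ** 2

def numerical_gradient(f, theta, eps=1e-6):
    """Approximate [dJ/dtheta0, dJ/dtheta1, ...] with central differences."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# The analytic gradient is [2*theta0, 4*theta1], so at (1, 1) we expect ~[2, 4]
print(numerical_gradient(J, np.array([1.0, 1.0])))
```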
Now that we understand what a gradient is, let’s get into the workings of the Gradient Descent algorithm:
Step 1. Initialize the parameters: We start with initial guesses for the model parameters (e.g., weights and biases in a linear regression model).
Step 2. Calculate the Gradient: The gradient of the cost function gives us the direction of ascent, i.e., the direction of increasing cost/error. Since we want to move towards lower cost/error, we negate the gradient to get the direction of descent. See Figure 1 for an illustration.
Step 3. Take a Step: We update the model’s parameters by moving a small distance in the direction of the negative gradient. The size of this movement is determined by a hyperparameter called the “learning rate” and the magnitude of the calculated gradient.
Step 4. Repeat Steps 2 and 3: We keep calculating gradients and taking steps until we reach a point where the gradient is nearly zero. This indicates we’ve likely found a minimum of the error function.
Mathematically, the parameter update rule is:

θ := θ − α ∇J(θ)
where:
- θ represents the model parameters
- α is the learning rate
- ∇J(θ) is the gradient of the cost function J(θ)
Note the negative sign in the parameter update rule. This is because we want to move in the direction opposite to the gradient to minimize the cost function. Without the negative, we would end up maximizing the cost function!
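As a minimal sketch of the four steps above (the function gradient_descent and its parameter names are illustrative, not a standard API), the whole algorithm fits in a short loop:

```python
import numpy as np

def gradient_descent(grad_J, theta_init, alpha=0.1, n_iters=1000, tol=1e-8):
    """Generic Gradient Descent loop: theta := theta - alpha * grad_J(theta)."""
    theta = np.asarray(theta_init, dtype=float)  # Step 1: initialize the parameters
    for _ in range(n_iters):
        grad = grad_J(theta)                     # Step 2: calculate the gradient
        theta = theta - alpha * grad             # Step 3: step along the negative gradient
        if np.linalg.norm(grad) < tol:           # Step 4: stop when the gradient is ~0
            break
    return theta

# For J(theta) = theta0^2 + 2*theta1^2, whose gradient is [2*theta0, 4*theta1],
# this converges to the minimum at (0, 0):
print(gradient_descent(lambda t: np.array([2 * t[0], 4 * t[1]]), [1.0, 1.0]))
```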
In the visualization of Gradient Descent in Figure 2, the model parameters are initialized randomly and updated repeatedly to minimize the cost function; the step size is proportional to the slope (gradient) of the cost function, so the steps gradually get smaller as we approach the minimum.
Types of Gradient Descent
Gradient Descent comes in several flavors, each with its own tradeoffs. Let's understand each of them with the help of a simple Linear Regression model with one independent variable and a bias term:

ŷ = θ0 + θ1x

where θ0 is the bias (intercept), θ1 is the weight for the input feature x, and ŷ is the model's prediction.
To keep the explanation simple and easy to understand, we will use MSE, or Mean Squared Error, as the cost function. As the name suggests, MSE is the mean of the squares of all the individual prediction errors.
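In code, MSE is a one-liner (a tiny sketch assuming NumPy arrays of actual and predicted values):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the mean of the squared prediction errors."""
    return np.mean((y_pred - y_true) ** 2)

# Errors are 0.5, 0.0, -1.0, so MSE = (0.25 + 0.0 + 1.0) / 3 ≈ 0.4167
print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.0])))
```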
1. Batch Gradient Descent (BGD)
Batch Gradient Descent computes the gradient of the cost function using the entire training dataset. This means that each parameter update is performed after evaluating all data points.
Pros: More accurate gradient estimation.
Cons: Can be very slow for large datasets since each iteration requires going through the whole dataset for gradient calculation.
Cost Function (MSE):

J(θ0, θ1) = (1/m) Σᵢ (ŷᵢ − yᵢ)²    (sum over i = 1, …, m)

where m is the total number of training examples, yᵢ is the actual value for the i-th example, and ŷᵢ = θ0 + θ1xᵢ is the corresponding prediction.
Gradients:

∇J = [∂J/∂θ0, ∂J/∂θ1], where

∂J/∂θ0 = (2/m) Σᵢ (ŷᵢ − yᵢ)
∂J/∂θ1 = (2/m) Σᵢ (ŷᵢ − yᵢ) xᵢ
Parameter Update:

θ0 := θ0 − α ∂J/∂θ0
θ1 := θ1 − α ∂J/∂θ1
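Putting the BGD equations together, one update step might look like this (a sketch; bgd_step is a hypothetical helper, and the 2/m factor comes from differentiating the MSE above):

```python
import numpy as np

def bgd_step(theta0, theta1, x, y, alpha):
    """One Batch Gradient Descent update using ALL m training examples."""
    m = len(x)
    error = (theta0 + theta1 * x) - y        # prediction errors over the full dataset
    grad0 = (2.0 / m) * np.sum(error)        # dJ/dtheta0
    grad1 = (2.0 / m) * np.sum(error * x)    # dJ/dtheta1
    return theta0 - alpha * grad0, theta1 - alpha * grad1
```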
2. Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent updates the model parameters based on gradients calculated using just one randomly selected training data point rather than the entire dataset.
Pros: Faster for large datasets, can escape local minima.
Cons: More noisy and erratic parameter updates, which can lead to fluctuations in the cost function.
Cost Function:

J(θ0, θ1) = (ŷᵢ − yᵢ)², for a single randomly selected example i

This is a special case of the BGD cost function with m = 1.
Gradients:

∇J = [∂J/∂θ0, ∂J/∂θ1], where

∂J/∂θ0 = 2 (ŷᵢ − yᵢ)
∂J/∂θ1 = 2 (ŷᵢ − yᵢ) xᵢ
Parameter Update:
The parameter update formula is the same as BGD. Only the values of the calculated gradients change.
Note the difference in cost functions and gradients between BGD and SGD. In BGD, we were using all the data points to calculate the cost and gradients in each iteration, therefore we needed to sum all the errors over all the data points. However, in SGD, because we are using just one data point to calculate the cost and gradient in each iteration, there is no need for any summation.
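The corresponding SGD step looks like this (again a sketch with a hypothetical sgd_step helper; note that the summation is gone):

```python
import numpy as np

def sgd_step(theta0, theta1, x, y, alpha, rng):
    """One Stochastic Gradient Descent update using a single random example."""
    i = rng.integers(len(x))                 # pick one training example at random
    error = (theta0 + theta1 * x[i]) - y[i]
    grad0 = 2.0 * error                      # dJ/dtheta0 for the single example
    grad1 = 2.0 * error * x[i]               # dJ/dtheta1 for the single example
    return theta0 - alpha * grad0, theta1 - alpha * grad1

# Usage sketch: rng = np.random.default_rng(42), then call sgd_step once per iteration
```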
3. Mini-Batch Gradient Descent
Mini-Batch Gradient Descent is a compromise between BGD and SGD. It splits the training data into small batches and performs an update for each batch. This method balances the accuracy of BGD with the speed of SGD.
Cost Function:

J(θ0, θ1) = (1/b) Σᵢ∈B (ŷᵢ − yᵢ)²

where B is the mini-batch of training examples and b is the size of B.
Gradients:

∇J = [∂J/∂θ0, ∂J/∂θ1], where

∂J/∂θ0 = (2/b) Σᵢ∈B (ŷᵢ − yᵢ)
∂J/∂θ1 = (2/b) Σᵢ∈B (ŷᵢ − yᵢ) xᵢ
Parameter Update:
The parameter update formula is the same as BGD. Only the values of the calculated gradients change.
Note that the summation in the cost function and gradients is back again! In this case, however, the summation is over the smaller batch B instead of over the complete dataset. This is because we calculate the cost and gradients over the smaller batch size in each iteration in Mini-Batch Gradient Descent.
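One Mini-Batch update might look like this (a sketch with a hypothetical minibatch_step helper; the summation is back, but only over the batch B):

```python
import numpy as np

def minibatch_step(theta0, theta1, x, y, alpha, batch_idx):
    """One Mini-Batch Gradient Descent update over the examples in batch_idx (the batch B)."""
    xb, yb = x[batch_idx], y[batch_idx]      # the mini-batch B
    b = len(batch_idx)                       # batch size b
    error = (theta0 + theta1 * xb) - yb
    grad0 = (2.0 / b) * np.sum(error)        # dJ/dtheta0 over the mini-batch
    grad1 = (2.0 / b) * np.sum(error * xb)   # dJ/dtheta1 over the mini-batch
    return theta0 - alpha * grad0, theta1 - alpha * grad1
```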
Important Considerations
Learning Rate:
In the Gradient Descent algorithm, choosing the right learning rate is crucial. If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time.
On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This can make the algorithm diverge, with larger and larger parameter values, failing to find a good solution.
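Both failure modes are easy to reproduce on the toy function J(θ) = θ², whose gradient is 2θ and whose minimum is at θ = 0 (the specific learning rates below are just for illustration):

```python
theta_small, theta_large = 5.0, 5.0
for _ in range(20):
    theta_small -= 0.01 * (2 * theta_small)  # too small: creeps towards 0 very slowly
    theta_large -= 1.10 * (2 * theta_large)  # too large: overshoots and diverges
print(theta_small)  # ~3.34, still far from the minimum after 20 steps
print(theta_large)  # ~190 in magnitude and growing: the algorithm has diverged
```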
Feature Scaling:
Standardizing or normalizing features can help Gradient Descent converge faster. Feature scaling ensures that all features contribute equally to the model’s training process, preventing dominance by features with larger scales.
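A minimal sketch of one common scaling scheme, standardization (z-score scaling); min-max normalization is another popular choice:

```python
import numpy as np

def standardize(X):
    """Rescale each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma  # keep mu/sigma to scale new data the same way
```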
Figure 5 shows 2-D contour plots of a two-parameter cost function J(θ1, θ2). Regions of the same color in the plots have the same cost function value, with values decreasing as we move towards the center. The path shown in blue is the path taken by the Gradient Descent algorithm to reach the minimum, with each dot representing one iteration of parameter update.
As you can see in the left plot, the Gradient Descent algorithm goes straight toward the minimum, reaching it quickly, whereas in the right plot it first moves in a direction almost perpendicular to that of the global minimum. It eventually reaches the minimum, but takes a long(er) time.
Summary
Gradient Computation:
> Batch Gradient Descent: Uses the entire dataset to calculate the gradient.
> Stochastic Gradient Descent: Uses a single data point to calculate the gradient.
> Mini-Batch Gradient Descent: Uses a subset (mini-batch) of the dataset to calculate the gradient.
Update Frequency:
> Batch Gradient Descent: Updates the model parameters after processing the entire dataset.
> Stochastic Gradient Descent: Updates parameters after processing each data point.
> Mini-Batch Gradient Descent: Updates parameters after processing each mini-batch.
Convergence:
> Batch Gradient Descent: Smooth convergence, but can be slow.
> Stochastic Gradient Descent: Faster convergence but with erratic movements and potential fluctuations.
> Mini-Batch Gradient Descent: Balanced approach with faster and more stable convergence.
In Part 2 of this blog, we will delve into custom and out-of-the-box Python implementations of the three Gradient Descent variants discussed here.