BTS Machine Learning: Demystifying Gradient Descent — Part 2
In Part 1 of this series, we delved into the foundational concepts of the Gradient Descent algorithm, exploring the math behind it and the different types of Gradient Descent algorithms. In Part 2, we will build on that foundation by implementing the three main variants of Gradient Descent in Python: Batch Gradient Descent, Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent. We will also visualize the parameter updates for a better understanding of how these algorithms converge to the optimal solution.
Implementing Gradient Descent
We will use a simple linear regression model with two parameters for our implementation. The model will aim to fit a straight line to the given data points.
The thought of implementing the algorithms from scratch may seem daunting at first. To make our lives a bit easier, let’s first try to break down the problem into smaller parts and summarize what we need to do:
Step 1. Initialize the parameters: We start with initial guesses for the model parameters (e.g., weights and biases in a linear regression model).
Step 2. Calculate the Gradient: Then we calculate the gradient of the cost function with respect to the current parameter values.
Step 3. Take a Step: After calculating the gradient, we update the model’s parameters by moving a small distance in the direction of the negative gradient (discussed in Part 1).
Step 4. Repeat Steps 2 and 3: We keep calculating gradients and taking steps until we reach a point where the gradient is nearly zero. This indicates we’ve likely found a minimum of the cost function.
The good news is that for all three types of Gradient Descent algorithms, most of the steps are the same. The only difference lies in the gradient calculation step!
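In fact, we can capture this shared skeleton in a few lines. Below is a minimal, illustrative sketch (not part of the final implementations), where compute_gradient is a placeholder for whichever gradient routine a variant plugs in:

# Minimal sketch of the skeleton shared by all three variants.
# compute_gradient is a placeholder: BGD, SGD and MBGD differ only in
# how they compute the gradient (full dataset, one point, mini-batch).
def gradient_descent(compute_gradient, theta, learning_rate, n_iterations):
    for _ in range(n_iterations):                  # Step 4: repeat
        gradients = compute_gradient(theta)        # Step 2: gradient of the cost
        theta = theta - learning_rate * gradients  # Step 3: step downhill
    return theta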
Alright then, let’s get into the juicy parts of the task!
Generating Synthetic Data
To implement these algorithms, we will first generate some synthetic data.
import numpy as np
import matplotlib.pyplot as plt
import random
import seaborn as sns
sns.set(color_codes=True)
# dataset size
m = 500
# Generate synthetic data
np.random.seed(0)
X = 5 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
# Add bias term (x0 = 1)
X_b = np.c_[np.ones((m, 1)), X]
Here, we have generated m = 500 data points where X is a random number between 0 and 5. The target variable y is generated using the equation:

y = 4 + 3X + noise
If you are wondering why we need the noise, it’s simply because we do not want y to be exactly 4 + 3X. Even if we don’t use the noise term, we would still be able to fit our model to the data. In fact, in that case it would be easier to do so! But think about it, would you ever find this kind of “perfect” data in the real world?
The noise is generated using np.random.randn(m, 1), which draws random numbers from a "standard normal" distribution (mean = 0, std. dev. = 1).
The bias term has also been added to X. This is necessary because we want to fit the intercept (the constant term 4 in our case) of the linear model as well. Without the bias, the linear model would be a one-parameter model instead of a two-parameter one. Since we already know that the underlying data was generated using a two-parameter equation (the equation above), it's better to include the bias term for optimal model performance.
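To see what this looks like, here is a quick illustrative peek at the first few rows of X_b (the exact numbers depend on the random seed):

# Each row of X_b is [1, x]: the constant 1 multiplies the intercept θ0
print(X_b[:3])
# With seed 0, this prints approximately:
# [[1.    2.744]
#  [1.    3.576]
#  [1.    3.014]]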
plt.plot(X, y, "b.")
plt.xlabel("X")
plt.ylabel("y")
plt.show()
Alright! We have managed to generate our data. Notice the effect of the noise term: it has allowed us to generate data points with an apparent linear relationship that do not fall exactly on a single straight line.
Now that we have the dataset to work with, all that is left is to implement the algorithms. Let’s look at them one by one.
1. Batch Gradient Descent
To implement BGD, we first have to keep in mind the key points about the algorithm:
- BGD uses the entire dataset in each iteration to compute the gradient.
- The model parameters are updated once per iteration based on the calculated gradient.
Recalling from Part 1, the MSE cost function J(θ) for BGD is defined as:

J(θ) = (1/2m) Σᵢ₌₁ᵐ (θᵀx⁽ⁱ⁾ - y⁽ⁱ⁾)²

In matrix form, it can be written as (m = number of all training examples):

J(θ) = (1/2m) (Xθ - y)ᵀ(Xθ - y)

Therefore, the gradient of the cost function J(θ) will be:

∇J(θ) = (1/m) Xᵀ(Xθ - y)
Let’s use this gradient formula to write a function that can calculate the gradient values for BGD:
# function for calculating gradients for BGD
def BGD_gradient(X_b, theta, y):
    m = len(y)  # number of training examples (avoids relying on a global m)
    BGDgradients = 1/m * X_b.T.dot(X_b.dot(theta) - y)
    return BGDgradients
Now, we will use this function in the main driver code for BGD:
# Hyperparameters
learning_rate = 0.01
n_iterations = 1000
m = 500
# Initial parameters
theta = np.random.randn(2,1)
# Batch Gradient Descent
all_thetas_bgd = theta.copy()
for iteration in range(n_iterations):
    # calculating the gradient
    gradients = BGD_gradient(X_b, theta, y)
    # updating the model parameters
    theta -= learning_rate * gradients
    # storing the updated parameters for plotting
    # this is not part of the BGD algo. Just an extra step for visualization
    all_thetas_bgd = np.concatenate((all_thetas_bgd, theta), axis=1)
In the above code, the BGD_gradient() function calculates and returns the gradient. In each iteration, we pass the whole dataset (X_b and y) to it for the gradient calculation.
After gradient calculation, the gradient value is used to update the model parameters (theta):
theta -= learning_rate * gradients
Note the negative sign in the parameter update equation. As discussed in Part 1, the negative sign moves the model parameters in the direction of decreasing cost. With a positive sign instead, the algorithm would march towards higher cost function values!
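Here is a quick, illustrative numeric check of this point, reusing the data and BGD_gradient() from above, plus a small hypothetical helper for the cost: a single update with the minus sign lowers the cost, while a plus sign raises it.

# Illustrative check: the minus sign decreases the cost, a plus sign increases it.
# mse_cost is a hypothetical helper, not part of the main implementation.
def mse_cost(X_b, theta, y):
    errors = X_b.dot(theta) - y
    return (errors.T.dot(errors) / (2 * len(y))).item()

theta_check = np.random.randn(2, 1)
g = BGD_gradient(X_b, theta_check, y)
print(mse_cost(X_b, theta_check, y))             # starting cost
print(mse_cost(X_b, theta_check - 0.01 * g, y))  # minus sign: cost goes down
print(mse_cost(X_b, theta_check + 0.01 * g, y))  # plus sign: cost goes up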
This process is repeated n_iterations times. Learning rate and no. of iterations are two crucial hyperparameters for any Gradient Descent algorithm. Feel free to play around with the values and see how the algorithm reacts!
Learning rate: We briefly discussed in Part 1 how the learning rate affects the training process of a model. In general, it is better to start with a small learning rate value.
No. of iterations: If n_iterations is too small, training stops well before BGD reaches the vicinity of the lowest cost function value. If it is too large, the code can take a long time to run for negligible gain, and for more flexible models (unlike our simple linear regression, which has a single global minimum), training for too long can also overfit the training data and hurt generalization to unseen data.
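If you would like a starting point for such experiments, here is an illustrative sweep over a few learning rates, reusing BGD_gradient() from above; the specific values are arbitrary choices:

# Illustrative sweep: too small a learning rate crawls along, while too
# large a one overshoots the minimum and blows up on this dataset.
for lr in (0.001, 0.01, 0.1, 0.3):
    theta_try = np.zeros((2, 1))
    for _ in range(200):
        theta_try -= lr * BGD_gradient(X_b, theta_try, y)
    cost = ((X_b.dot(theta_try) - y) ** 2).mean() / 2
    print(f"lr = {lr}: cost after 200 iterations = {cost:.4g}")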
Now, let’s look at what our training process has produced:
# printing the fitted model
print("Batch Gradient Descent: ")
print("y = {θ0} + {θ1} x".format(θ0=all_thetas_bgd[0,-1].round(2), θ1=all_thetas_bgd[1,-1].round(2)))
# plotting the parameter updates
plt.plot(all_thetas_bgd[0,:], all_thetas_bgd[1,:], "b.")
plt.xlabel("θ0")
plt.ylabel("θ1")
plt.title("Batch Gradient Descent")
plt.show()
Output:
BGD has given us the linear model y = 3.87 + 3.01x. Not exactly the same as the actual model we used to generate the data, but close enough!
The θ0 vs θ1 plot (fig 2) shows the parameter update progression of the BGD algorithm. Each dot in the plot represents the model parameters after one iteration of the algorithm. The bottom-left point represents the initial parameter values, and the top-right point the final ones.
Notice how the dots keep getting closer to each other as the algorithm progresses towards the final parameter values. This is because as the algorithm approaches the cost function minimum (lowest cost function value), the gradients become smaller in magnitude, and hence the change in parameters becomes smaller (theta -= learning_rate * gradients).
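To see the shrinking gradients directly, here is an illustrative variant of the loop that logs the gradient magnitude every 200 iterations:

# Illustrative: track the gradient magnitude during BGD training.
# It shrinks as theta approaches the minimum, so the steps shrink too.
theta_demo = np.random.randn(2, 1)
for iteration in range(1000):
    gradients = BGD_gradient(X_b, theta_demo, y)
    theta_demo -= 0.01 * gradients
    if iteration % 200 == 0:
        print(f"iteration {iteration}: |gradient| = {np.linalg.norm(gradients):.4f}")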
2. Stochastic Gradient Descent
SGD uses only one randomly selected data point in each iteration to compute the gradient, unlike BGD, which uses the complete dataset in each iteration.
Recalling from Part 1, the MSE cost function J(θ) for SGD for a single data point (xᵢ, yᵢ) is defined as:

J(θ) = (1/2) (θᵀxᵢ - yᵢ)²

Or, in matrix form (with xᵢ as a row vector):

J(θ) = (1/2) (xᵢθ - yᵢ)ᵀ(xᵢθ - yᵢ)
Wait a minute. Why did the summation that we had in the BGD cost function disappear from the SGD cost function? Because in SGD we use only one data point in each iteration to calculate the cost and gradient. In BGD, by contrast, we use the complete dataset, so the summation is needed to add up the contribution of every data point.
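A quick, illustrative shape check makes this concrete: with a single data point, the matrix products already produce a single scalar error, so there is nothing left to sum.

# Illustrative shape check for the single-point case
xi = X_b[0:1]                       # shape (1, 2): one row, with bias term
yi = y[0:1]                         # shape (1, 1)
theta_demo = np.random.randn(2, 1)  # shape (2, 1)
error = xi.dot(theta_demo) - yi     # shape (1, 1): one residual, no summation
grad = xi.T.dot(error)              # shape (2, 1): one gradient entry per parameter
print(error.shape, grad.shape)      # (1, 1) (2, 1)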
This gives us the gradient function in matrix form as:

∇J(θ) = xᵢᵀ(xᵢθ - yᵢ)
We will use this gradient function formula to calculate the gradients in SGD code:
# function for calculating gradients for SGD
def SGD_gradient(xi, theta, yi):
    SGDgradients = xi.T.dot(xi.dot(theta) - yi)
    return SGDgradients
For SGD, in each iteration we choose one random data point from the dataset and use that single point to calculate the gradient. The rest remains the same as BGD:
# Hyperparameters
learning_rate = 0.01
n_iterations = 1000
# Initial parameters
theta = np.random.randn(2,1)
# Stochastic Gradient Descent
all_thetas_sgd = theta.copy()
for iteration in range(n_iterations):
    # choosing one random data point to calculate the gradient
    random_index = np.random.randint(m)
    xi = X_b[random_index:random_index+1]
    yi = y[random_index:random_index+1]
    # calculating the gradient based on the one data point
    gradients = SGD_gradient(xi, theta, yi)
    # updating the model parameters
    theta -= learning_rate * gradients
    # storing the updated parameters for plotting
    # this is not part of the SGD algo. Just an extra step for visualization
    all_thetas_sgd = np.concatenate((all_thetas_sgd, theta), axis=1)
Now let’s look at the results:
# printing the fitted model
print("Stochastic Gradient Descent: ")
print("y = {θ0} + {θ1} x".format(θ0=all_thetas_sgd[0,-1].round(2), θ1=all_thetas_sgd[1,-1].round(2)))
# calculating predicted y using the final (last iteration) parameter values
y_predicted = X_b.dot(all_thetas_sgd[:,-1])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,5))
# plotting the fitted model
ax1.plot(X, y, 'b.', label="y_actual")
ax1.plot(X, y_predicted, 'r-', label="y_predicted")
ax1.set(xlabel="X", ylabel="y", title="Final Model Fit")
ax1.legend()
# plotting the parameter updates
ax2.plot(all_thetas_sgd[0,:], all_thetas_sgd[1,:], "b.")
ax2.set(xlabel="θ0", ylabel="θ1", title="Parameter update progression")
plt.show()
Output:
SGD produces the linear model y = 3.81 + 3.17x. Again, close enough to the actual model used to generate the data! (You might get a slightly different result on running the same code because of the random nature of the algorithm)
In the θ0 vs θ1 plot (fig 3), note the jittery, noisy nature of the SGD parameter update progression compared to BGD, which was much smoother. This randomness arises because in each iteration we use only one random data point to calculate the gradients.
Using only one data point per iteration makes each SGD iteration much cheaper than a BGD iteration, but it also makes the training process prone to fluctuations and randomness.
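A common remedy, sketched below for illustration, is a learning schedule that gradually shrinks the learning rate: early iterations move fast, later ones settle down. The simple_schedule() helper and its constants are hypothetical choices, not part of the implementation above.

# Illustrative sketch: SGD with a decaying learning rate.
# simple_schedule and its constants are hypothetical choices; the decay
# makes later updates smaller, damping the jitter near the minimum.
def simple_schedule(t, t0=5, t1=50):
    return t0 / (t + t1)

theta_sched = np.random.randn(2, 1)
for iteration in range(1000):
    random_index = np.random.randint(m)
    xi = X_b[random_index:random_index + 1]
    yi = y[random_index:random_index + 1]
    gradients = SGD_gradient(xi, theta_sched, yi)
    theta_sched -= simple_schedule(iteration) * gradients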
3. Mini-Batch Gradient Descent
For implementing the Mini-Batch Gradient Descent algorithm, it's crucial to remember that in MBGD we use a mini-batch (in other words, a subset) of the dataset to calculate the gradients in each iteration. MBGD sits mid-way between BGD, which uses the complete dataset, and SGD, which uses only one data point per iteration to calculate the gradient.
For MBGD, with a mini-batch B of b data points, the cost function is:

J(θ) = (1/2b) Σᵢ∈B (θᵀx⁽ⁱ⁾ - y⁽ⁱ⁾)²

Hence, the gradient is:

∇J(θ) = (1/b) X_Bᵀ(X_Bθ - y_B)

where X_B and y_B contain only the rows belonging to the mini-batch.
Using the gradient formula, we write a function to calculate the gradients for MBGD:
# function for calculating gradients for MBGD
def MBGD_gradient(xi, theta, yi):
    # xi.shape[0] is the mini-batch size b
    gradient = 1/xi.shape[0] * xi.T.dot(xi.dot(theta) - yi)
    return gradient
For MBGD, we first define a new hyperparameter batch_size, which is the number of data points in each mini-batch used to calculate the gradient.
In each iteration, we first shuffle the training data and then select batch_size data points to calculate the gradient on. In our case, shuffling might not make much difference because we know the data points in the training data were randomly generated and stored. However, it is best practice to shuffle because in real-world data, consecutive training examples might be highly correlated and there might be local patterns. This can lead to poor generalization in the trained model.
# Hyperparameters
learning_rate = 0.01
n_iterations = 1000
m = 500
batch_size = 10
# Initial parameters
theta = np.random.randn(2,1)
# Mini-Batch Gradient Descent
all_thetas_mbgd = theta.copy()
for iteration in range(n_iterations):
    # shuffling the training data
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    # choosing the mini-batch: the first batch_size rows of the shuffled data
    xi = X_b_shuffled[:batch_size]
    yi = y_shuffled[:batch_size]
    # calculating gradients
    gradients = MBGD_gradient(xi, theta, yi)
    # updating the model parameters
    theta -= learning_rate * gradients
    # storing the updated parameters for plotting (not part of the MBGD algo)
    all_thetas_mbgd = np.concatenate((all_thetas_mbgd, theta), axis=1)
# printing the fitted model
print("Mini-Batch Gradient Descent: ")
print("y = {θ0} + {θ1} x".format(θ0=all_thetas_mbgd[0,-1].round(2), θ1=all_thetas_mbgd[1,-1].round(2)))
# calculating predicted y using the final (last iteration) parameter values
y_predicted = X_b.dot(all_thetas_mbgd[:,-1])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,5))
# plotting the fitted model
ax1.plot(X, y, 'b.', label="y_actual")
ax1.plot(X, y_predicted, 'r-', label="y_predicted")
ax1.set(xlabel="X", ylabel="y", title="Final Model Fit")
ax1.legend()
# plotting the parameter updates
ax2.plot(all_thetas_mbgd[0,:], all_thetas_mbgd[1,:], "b.")
ax2.set(xlabel="θ0", ylabel="θ1", title="Parameter update progression")
plt.show()
Output:
MBGD produces the linear model y = 3.89 + 3.01x. Again, close enough! (You might get a slightly different result on running the same code because of the random nature of the algorithm)
In the θ0 vs θ1 plot (fig 4), note that the parameter update progression for MBGD is sort of in between BGD and SGD. It is smoother than SGD, but noisier than BGD.
MBGD is a compromise between the speed of SGD and the stability of BGD.
Conclusion
plt.scatter(all_thetas_sgd[0,:], all_thetas_sgd[1,:], color="r", s=10, label="SGD")
plt.scatter(all_thetas_mbgd[0,:], all_thetas_mbgd[1,:], color="g", s=10, label="MBGD")
plt.scatter(all_thetas_bgd[0,:], all_thetas_bgd[1,:], color="y", s=2, label="BGD")
plt.xlabel("θ0")
plt.ylabel("θ1")
plt.legend()
plt.show()
We have looked into what happens in the background in Gradient Descent and implemented our own versions of the three types of Gradient Descent algorithms. The plot in fig 5 summarizes this discussion best.
Batch Gradient Descent: Smooth convergence, but can be slow. Uses the entire dataset to calculate the gradient in each iteration.
Stochastic Gradient Descent: Faster convergence but with erratic movements. Uses a single data point to calculate the gradient in each iteration.
Mini-Batch Gradient Descent: Faster than BGD and smoother than SGD. Uses a subset (mini-batch) of the dataset to calculate the gradient in each iteration.
The Python implementations shown here are by no means production-ready; however, hopefully they give you a good understanding of what happens behind the scenes in Gradient Descent.
In machine learning, there is no one-size-fits-all solution, and Gradient Descent algorithms are no exception. It all comes down to which variant suits the particular task you are working on.