A Comprehensive Guide to Gradient Descent

A clever and powerful optimization algorithm

Adarsh Srivastava
SRM MIC
10 min read · Mar 29, 2021


We perform many optimizations in our day-to-day lives based on our personal decisions, often without realizing that what we are doing is optimization itself: taking the shortest path to avoid traffic, or buying something that offers minimum cost and maximum benefit, and so on.

Gradient Descent for ML

So whether we are dealing with actual events in real life or with technology-based products, optimization is our ultimate goal. Optimization can be considered one of the most essential components of the Machine Learning and Deep Learning techniques widely used in Data Science. One of the most popular optimization techniques is “Gradient Descent”.

So, without further ado let’s talk about what gradient descent is and how it works.

What is Gradient Descent?

Gradient descent is a general and popular optimization algorithm in machine learning and deep learning used to find the minimum of a function.

In these fields, it is used to find the values of the parameters for which the function becomes minimal. The term parameter varies across algorithms: coefficients in Linear Regression, weights in Neural Networks. Gradient descent works best on convex functions and iteratively alters the parameters to drive the given function towards its local minimum.

Now let’s analyze what gradient descent actually is and relate it to real-life examples to understand the concept well. Assume it is foggy and a man at the top of a mountain wants to descend. He takes a step and checks whether the slope is upward or downward; once he is sure of a downward slope, he follows it, taking the steepest step he can see near him. He then looks for the next step and repeats this process until he has descended completely from the mountain (or reached the minimum).

Similarly, let’s take another real-life example to understand the concept more clearly.

Suppose you have a bowl (which represents the plot of the cost function). Initially, you place a ball at position A (the cost for the current values of the parameters). We have to reach the bottom point B of the bowl (the cost for the best set of parameters, which minimizes the function).

Basically, this is what happens in the Gradient Descent algorithm. Gradient Descent provides the direction and the step size (learning rate) of the movement needed to attain the minimum of the function, i.e., the point where the cost is minimum.

How does Gradient Descent Work?

Suppose we have a cost function J(θ0, θ1) which depends on the parameters θ0 and θ1. Assume that initially we are at point A. Our goal is to minimize the cost function and reach its local minimum (i.e., point B) by tweaking its parameters (i.e., θ0 and θ1).

Outline

We have a function J(θ0, θ1), and we want to minimize it by tweaking θ0 and θ1.

● Start with some θ0 and θ1.

● Keep changing θ0 and θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum.

In mathematical terms, the update rule is:

repeat until convergence {
θj := θj - α · ∂J(θ0, θ1)/∂θj   (for j = 0 and j = 1)
}

We have to update the parameters simultaneously, in the given order:

temp0 := θ0 - α · ∂J(θ0, θ1)/∂θ0
temp1 := θ1 - α · ∂J(θ0, θ1)/∂θ1
θ0 := temp0
θ1 := temp1

Here, α is the learning rate.

To understand the concept well, let’s take the case of Linear Regression. Consider a model built on housing data that includes the sale price of each house, its size, and so on. Suppose we want to predict the price of a house based on its size.

The hypothesis is usually represented as

hθ(x) = θ0 + θ1·x

and the cost function as

J(θ0, θ1) = (1 / 2m) · Σᵢ (hθ(x(i)) - y(i))²

where m is the number of training examples and the sum runs over i = 1, …, m.

Now, to find the θ0 and θ1 at which the function reaches its minimum, we need the partial derivatives of the cost function:

∂J(θ0, θ1)/∂θ0 = (1 / m) · Σᵢ (hθ(x(i)) - y(i))
∂J(θ0, θ1)/∂θ1 = (1 / m) · Σᵢ (hθ(x(i)) - y(i)) · x(i)

Plugging these into the update rule, our gradient descent algorithm looks like this:

repeat until convergence {
θ0 := θ0 - α · (1 / m) · Σᵢ (hθ(x(i)) - y(i))
θ1 := θ1 - α · (1 / m) · Σᵢ (hθ(x(i)) - y(i)) · x(i)
}
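To make this concrete, here is a minimal Python sketch of the algorithm above for the single-feature housing example; the toy `x` (sizes) and `y` (prices) arrays are made up for illustration and are not from the original data.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for the hypothesis h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        predictions = theta0 + theta1 * x      # h_theta(x(i)) for every example
        errors = predictions - y               # h_theta(x(i)) - y(i)
        # Compute both gradients before touching theta0/theta1,
        # so the parameters are updated simultaneously.
        grad0 = np.sum(errors) / m
        grad1 = np.sum(errors * x) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Hypothetical housing data: size (in 1000s of sq. ft.) vs. price (in $1000s)
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([150.0, 200.0, 250.0, 300.0, 350.0])
print(gradient_descent(x, y, alpha=0.1, num_iters=2000))   # approaches theta0 = 50, theta1 = 100
```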

The questions that might now come to your mind are: what is the learning rate? What is an appropriate value for the learning rate alpha (α)? How can we set it? Well, let’s see.

Learning Rate (α)

In the equations above, the learning rate (α) is a tuning parameter that determines the step size at each iteration while moving towards the minimum of the function.

For gradient descent to reach the local minimum of the function, we must set the learning rate (α) to an appropriate value, neither too high nor too low. This is important: if the learning rate is too large, the algorithm may overshoot and bounce back and forth across the valley of the convex function without ever reaching the local minimum; if the learning rate is too small, gradient descent will eventually reach the local minimum, but it may take a very long time.
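To see this trade-off in action, consider gradient descent on the simple convex function f(θ) = θ², whose gradient is 2θ and whose minimum is at θ = 0; this toy example is added here purely for illustration.

```python
def descend(alpha, steps=20, theta=5.0):
    """Run gradient descent on f(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # theta_new = theta - alpha * f'(theta)
    return theta

print(descend(alpha=0.01))   # too small: still around 3.3 after 20 steps
print(descend(alpha=0.1))    # reasonable: close to the minimum at 0
print(descend(alpha=1.1))    # too large: overshoots and diverges (|theta| keeps growing)
```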

Comparing a large and small Learning-rate

Now how can we make sure that our gradient descent is working correctly? How can we check that the value of the learning rate chosen is right? For this purpose, we are going to learn how to debug.

Debugging

A good way to check whether gradient descent is working properly is to plot the cost function against the number of iterations, with the number of iterations on the x-axis and the cost function on the y-axis. This lets us see the value of the cost function after every iteration and provides an easy way to check how appropriate the learning rate is. We can simply try different values and plot the resulting curves together; such a plot also makes the difference between good and bad learning rates easy to see.

If gradient descent is working properly, the cost function should decrease after every iteration, and once it decreases by less than 10⁻³ between iterations, we may declare convergence. The number of iterations gradient descent needs to converge can vary a lot: it might take 100 iterations, 1,000, or even a million, which makes the number of iterations required for convergence difficult to estimate in advance.

Illustration of gradient descent with plot between no. of iterations and cost function

The advantage of monitoring gradient descent via such plots is that it makes problems easy to spot, e.g., when the cost function is increasing. Most of the time, an increasing cost function is caused by a learning rate that is too high. If the plot shows the curve just going up and down without reaching a lower point, try decreasing the learning rate. Also, when starting with gradient descent on a given problem, simply try 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, etc., as learning rates and see which one performs best.
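The debugging recipe above can be sketched in a few lines of Python; the data and candidate learning rates below are illustrative placeholders, and matplotlib is assumed to be available.

```python
import numpy as np
import matplotlib.pyplot as plt

def cost(theta0, theta1, x, y):
    """Cost J(theta0, theta1) for single-feature linear regression."""
    m = len(y)
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)

def cost_history(alpha, x, y, num_iters=100):
    """Run gradient descent and record the cost after every iteration."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    history = []
    for _ in range(num_iters):
        errors = theta0 + theta1 * x - y
        theta0 -= alpha * np.sum(errors) / m
        theta1 -= alpha * np.sum(errors * x) / m
        history.append(cost(theta0, theta1, x, y))
    return history

x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([150.0, 200.0, 250.0, 300.0, 350.0])

# One curve per candidate learning rate; a good rate drops quickly and then flattens out.
for alpha in [0.001, 0.01, 0.1, 0.3]:
    plt.plot(cost_history(alpha, x, y), label=f"alpha = {alpha}")
plt.xlabel("no. of iterations")
plt.ylabel("cost J")
plt.legend()
plt.show()
```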

Types of Gradient Descent

There are three popular types of gradient descent, which differ in how much data they use to compute the gradient for each update:

● Batch Gradient Descent

● Stochastic Gradient Descent

● Mini-Batch Gradient Descent


1) Batch Gradient Descent:

In batch gradient descent, all the training data is processed for each iteration of gradient descent, which makes it very slow on large datasets. Because it becomes computationally expensive on such datasets, we often use stochastic gradient descent or mini-batch gradient descent instead. Batch gradient descent produces a stable error gradient and a stable convergence, although the point it converges to is sometimes not the best fit for the model.
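For contrast with the variants below, here is a sketch of a single batch gradient descent update for the linear regression example; note that every update touches all m training examples.

```python
import numpy as np

def batch_update(theta0, theta1, x, y, alpha):
    """One batch gradient descent update: the gradient is averaged over ALL m examples."""
    m = len(y)
    errors = theta0 + theta1 * x - y          # uses the entire training set for one update
    theta0 -= alpha * np.sum(errors) / m
    theta1 -= alpha * np.sum(errors * x) / m
    return theta0, theta1
```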

2) Stochastic Gradient Descent:

Modifying basic gradient descent so that it can handle very large datasets leads to Stochastic Gradient Descent (SGD). In SGD, one sample is selected at random for each iteration instead of the entire dataset, i.e., each update uses a batch size of one. The parameters are therefore updated after every single example has been processed, so SGD often optimizes faster than batch gradient descent. When using SGD, we shuffle the dataset randomly so that the parameters are trained evenly across all types of data.
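A rough sketch of one SGD pass over the data for the same linear regression setting; the shuffling step implements the random selection described above.

```python
import numpy as np

def sgd_epoch(theta0, theta1, x, y, alpha):
    """One stochastic gradient descent pass: the parameters are updated per example."""
    for i in np.random.permutation(len(y)):    # shuffle so examples are visited in random order
        error = theta0 + theta1 * x[i] - y[i]  # gradient from a single example (batch size of one)
        theta0 -= alpha * error
        theta1 -= alpha * error * x[i]
    return theta0, theta1
```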

3) Mini-Batch Gradient Descent:

Mini-batch gradient descent is the go-to method because it combines the ideas of Batch Gradient Descent and Stochastic Gradient Descent: it splits the training dataset into small batches and performs an update for each of those batches. In this way, it strikes a balance between the robustness of Stochastic Gradient Descent (SGD) and the efficiency of batch gradient descent.

Algorithm for mini-batch gradient descent

Let m be the total number of training examples and b the number of examples in one batch, where b < m. For simplicity, assume b = 10 and m = 1000. The batch size can be adjusted and is generally kept as a power of 2, because some hardware, such as GPUs, achieves better run times with common batch sizes that are powers of 2.
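Under the same assumptions as before, a minimal sketch of one mini-batch pass with b = 10 looks like this; the batch size parameter is simply the b from the paragraph above.

```python
import numpy as np

def minibatch_epoch(theta0, theta1, x, y, alpha, b=10):
    """One mini-batch gradient descent pass: one parameter update per batch of b examples."""
    m = len(y)
    order = np.random.permutation(m)                  # shuffle before splitting into batches
    for start in range(0, m, b):
        batch = order[start:start + b]
        errors = theta0 + theta1 * x[batch] - y[batch]
        theta0 -= alpha * np.mean(errors)             # gradient averaged over the b examples
        theta1 -= alpha * np.mean(errors * x[batch])
    return theta0, theta1
```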

Convergence in different variants of Gradient Descent

In the case of Batch Gradient Descent, if the cost function is convex, the algorithm converges to the global minimum; if the cost function is not convex, it converges to a local minimum. The learning rate is typically held constant here.

Convergence to local and global minima

In the case of stochastic gradient descent and mini-batch gradient descent, however, the algorithm does not settle at the global minimum but keeps fluctuating around it. To make it converge, the learning rate needs to be decreased slowly over time.
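One common way to decrease the learning rate slowly is a decay schedule that shrinks α as training progresses; the 1/t-style schedule below is a standard choice, shown here only as an illustration.

```python
def decayed_alpha(alpha0, iteration, decay=0.01):
    """1/t-style decay: the step size shrinks over time, so the fluctuations of
    SGD / mini-batch gradient descent around the minimum get smaller."""
    return alpha0 / (1.0 + decay * iteration)

print(decayed_alpha(0.1, 0))      # 0.1 at the start of training
print(decayed_alpha(0.1, 1000))   # roughly 0.009 after 1000 updates
```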

Tips for Gradient Descent

Here, we shall learn about some of the tips and tricks for getting the most out of the gradient descent algorithm.

Feature Scaling: If the input variables have very different ranges, scale them all to a range such as [0, 1] or [-1, 1]. Gradient descent reaches the minimum cost faster when the shape of the cost function is not skewed.
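As a small illustration, here is a min-max scaling helper that maps a feature to the [0, 1] range; the house sizes used are made-up values.

```python
import numpy as np

def minmax_scale(feature):
    """Rescale a feature vector to the [0, 1] range so all inputs share a similar scale."""
    return (feature - feature.min()) / (feature.max() - feature.min())

sizes = np.array([650.0, 800.0, 1200.0, 2400.0, 3000.0])   # house sizes in sq. ft.
print(minmax_scale(sizes))   # approx. [0.0, 0.064, 0.234, 0.745, 1.0]
```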

Plot Cost versus Time: It is a good idea to collect the cost values calculated by the algorithm and plot them against the number of iterations. This helps keep track of gradient descent. Ideally, the cost decreases in every iteration; if it does not, the learning rate should be tweaked.

Learning Rate: When starting with gradient descent on a given problem, simply try 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, etc., as learning rates and see which one performs best.

Plot Mean Cost: With stochastic gradient descent, the update for each training instance can result in a noisy plot of cost over time. Taking the average over the last 10, 100, or 1000 updates gives a better picture of the learning trend of the algorithm, as sketched below.
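A simple way to do this is a running average of the per-update costs over a window of, say, 100 updates; the noisy cost curve below is simulated just for illustration.

```python
import numpy as np

def running_mean(costs, window=100):
    """Smooth the noisy per-update SGD costs by averaging over a sliding window."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(costs, dtype=float), kernel, mode="valid")

# Simulated noisy, decreasing cost values from stochastic updates
noisy_costs = 1.0 / np.arange(1, 2001) + 0.05 * np.random.randn(2000)
smoothed = running_mean(noisy_costs, window=100)
print(smoothed[:5])   # the smoothed trend is much easier to read than the raw values
```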

Conclusion

In this blog, we have learned about Gradient Descent, a simple and very popular optimization technique used all over machine learning. We have seen what Gradient Descent is and how it works, what the learning rate is and why its value is of utmost importance, the mathematics behind gradient descent, and how we can debug the algorithm using a visual method. We have also explored the various types of gradient descent, their convergence behaviour, and some tips and tricks for getting the best results.

With optimization being the heart and soul of Machine Learning, Gradient Descent, the first optimizer, has been the precursor and foundation of almost all the optimizers we see today. I hope this blog helps build intuition for this famous optimization algorithm!
