A visual understanding of Gradient Descent
There are a lot of blogs and posts discussing Machine Learning basics and ideas mathematically. So, in this series of posts, I’ll try to explain a wide variety of Data Science and Machine Learning concepts succinctly, with minimal math.
The animation code can be found here : GitHub Link
Gradient Descent is a widely used optimization algorithm in Machine Learning. The main idea behind Gradient Descent is to move in the direction that serves your objective: reaching a minimum or a maximum. Consider the analogy of climbing to the top of a mountain versus walking down to its base. We would use Gradient Ascent in the first case and Gradient Descent in the second. Note that the only difference between the two is the sign of the update: Gradient Descent subtracts the gradient, while Gradient Ascent adds it.
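To make that sign difference concrete, here is a minimal Python sketch; the function f(x) = x², the starting point, and the learning rate are all illustrative assumptions, not something from this post:

```python
import numpy as np

# Minimizing f(x) = x^2, whose gradient is f'(x) = 2x.
# All values below are assumed for illustration.
def gradient(x):
    return 2 * x

learning_rate = 0.1
x = 5.0  # arbitrary starting point

for _ in range(50):
    x = x - learning_rate * gradient(x)  # Gradient Descent: step downhill

print(x)  # very close to 0, the minimum of x^2
# Gradient Ascent would instead use: x = x + learning_rate * gradient(x)
```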

Imagine you are at the top of a mountain, blindfolded, wanting to reach the pond at the bottom. What would be your obvious approach? Feel the ground and step in whichever direction slopes downward. That’s exactly what Gradient Descent does.
Where is Gradient Descent used?
Let’s take a simple example : Linear Regression.

In the figure above, while building a Regression model, the only freedom you have is to tweak Beta0 and Beta1 (the intercept and the slope of the line). How well you choose these determines how good your model is going to be.
So how to choose them?
Here’s where Gradient Descent comes in. You start with random values of Beta0 and Beta1 and calculate how good or bad your model is, i.e., its Loss. Then, with incremental updates that reduce this Loss, you reach a solution which is good enough.
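Here’s a minimal sketch of that procedure for Linear Regression. The synthetic data, learning rate, and iteration count are my own illustrative assumptions, not the post’s animation code:

```python
import numpy as np

# Hypothetical data: y is roughly 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2 + 3 * x + rng.normal(0, 0.1, 100)

beta0, beta1 = rng.normal(size=2)  # random starting values
learning_rate = 0.1

for _ in range(1000):
    y_hat = beta0 + beta1 * x            # current predictions
    error = y_hat - y
    # Gradients of the Mean Squared Error loss w.r.t. each parameter
    grad_beta0 = 2 * error.mean()
    grad_beta1 = 2 * (error * x).mean()
    beta0 -= learning_rate * grad_beta0  # step against the gradient
    beta1 -= learning_rate * grad_beta1

print(beta0, beta1)  # approaches the true values (2, 3)
```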
Choosing the Learning Rate :
Let’s go back to our example of descending from a hill. You can choose how fast you want to descend, depending on how quickly you want to reach the pond. It’s the same with Gradient Descent, and the Learning Rate is how we represent that step size mathematically.
A word of caution, though: if you set a Learning Rate that is too high, you might overshoot the pond and end up far from it. But if you set a Learning Rate that is too low, it might take a long time to reach the pond. Hence it becomes important to choose optimal hyperparameters depending on the data.
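A tiny sketch of this trade-off, again on the illustrative f(x) = x² example; the learning rates and step count are assumptions chosen to expose the failure modes:

```python
# Minimizing f(x) = x^2 with three assumed learning rates.
def run(learning_rate, steps=20, x=5.0):
    for _ in range(steps):
        x = x - learning_rate * 2 * x  # gradient of x^2 is 2x
    return x

print(run(1.1))    # too high: each step overshoots, |x| keeps growing
print(run(0.001))  # too low: after 20 steps x has barely moved from 5.0
print(run(0.3))    # reasonable: x is already very close to the minimum 0
```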
Parameters like the Learning Rate, which you choose rather than learn, are called Hyperparameters in Machine Learning. You will come across them in most algorithms. So how do you choose hyperparameters? You use Cross Validation! More on Cross Validation in future posts!
To summarize Gradient Descent,
Step-1 : Choose an optimal Learning Rate.
Step-2 : Choose random starting parameters.
Step-3 : Keep updating those parameters in the direction opposite to the gradient, reducing the overall Loss Function until you wind up at a minimum.
Note : Since the overall idea is to improve on a randomly chosen starting point, it’s better to repeat this process several times with different random initializations, so that the probability of finding the global minimum increases (see the restart sketch after these notes).
Note : In the animation, notice how the regression fit improves with more iterations. Thanks to Gradient Descent!
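Here is a sketch of that restart strategy; the two-minima function, learning rate, and restart count are illustrative assumptions:

```python
import numpy as np

# Run Gradient Descent from several random starting points on a
# function with two minima, and keep the best result found.
def f(x):
    return x**4 - 3 * x**2 + x       # has a local and a global minimum

def grad(x):
    return 4 * x**3 - 6 * x + 1

rng = np.random.default_rng(0)
best_x, best_f = None, float("inf")

for _ in range(10):                   # 10 random restarts
    x = rng.uniform(-2, 2)            # random initialization
    for _ in range(200):
        x -= 0.01 * grad(x)           # plain Gradient Descent
    if f(x) < best_f:                 # keep the lowest loss found
        best_x, best_f = x, f(x)

print(best_x, best_f)  # with several restarts, likely the global minimum
```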
Variants of Gradient Descent :
The algorithm described above is the vanilla version, also called Batch Gradient Descent, where we use the complete training data to make each update. There are other variants of Gradient Descent which are generally used in the real world.
- Stochastic Gradient Descent : Instead of using the complete training data to calculate gradients and make an update, Stochastic Gradient Descent uses a single example per update. Though it may not seem logical to update from a single example, in practice the accuracy falls in a similar range, and the updates are blazing fast.
- Mini-batch Gradient Descent : Mini-batch Gradient Descent is a hybrid of Batch Gradient Descent and Stochastic Gradient Descent, using a small, randomized subset of the training data for each update (see the sketch below).
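To make the contrast concrete, here’s a sketch of Mini-batch Gradient Descent on the earlier linear model; the data, batch size, and learning rate are assumptions. Setting the batch size to 1 gives Stochastic Gradient Descent, and setting it to the full dataset gives Batch Gradient Descent:

```python
import numpy as np

# Assumed data: y is roughly 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1000)
y = 2 + 3 * x + rng.normal(0, 0.1, 1000)

def gradients(beta0, beta1, xb, yb):
    """MSE gradients computed on whatever subset (xb, yb) is passed in."""
    error = beta0 + beta1 * xb - yb
    return 2 * error.mean(), 2 * (error * xb).mean()

beta0, beta1, lr = 0.0, 0.0, 0.1
for epoch in range(20):
    order = rng.permutation(len(x))                    # shuffle each epoch
    for batch in np.array_split(order, len(x) // 32):  # mini-batches of ~32
        g0, g1 = gradients(beta0, beta1, x[batch], y[batch])
        beta0 -= lr * g0
        beta1 -= lr * g1

print(beta0, beta1)  # approaches (2, 3)
# Batch GD: one update per epoch using all 1000 points.
# Stochastic GD: batch size 1, i.e. one update per single example.
```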
Other Optimization algorithms :
Gradient Descent and its variants are just a start. There are many other optimization algorithms that often converge much faster than vanilla Gradient Descent. A few of them are as follows,
- Gradient Descent with Momentum (see the sketch after this list)
- Adagrad
- Adadelta
- RMSProp
- Adam
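As a taste of the list above, here’s a minimal sketch of Gradient Descent with Momentum on the illustrative f(x) = x² example; the momentum coefficient and learning rate are common defaults, assumed here for illustration:

```python
# Gradient Descent with Momentum: the velocity term accumulates past
# gradients, smoothing the path taken toward the minimum.
def grad(x):
    return 2 * x  # gradient of f(x) = x^2

x, velocity = 5.0, 0.0
learning_rate, momentum = 0.05, 0.9

for _ in range(200):
    velocity = momentum * velocity - learning_rate * grad(x)
    x = x + velocity  # move by the accumulated velocity, not the raw gradient

print(x)  # very close to 0, the minimum of x^2
```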
If you are interested in understanding the math behind all these algorithms, I recommend you take a look at this : Link
Let me know if you have any questions; inputs are appreciated. Do visit my LinkedIn page, here.
