Gradient Descent with Momentum, RMSprop and Adam Optimizers

Harsh Khandelwal · Analytics Vidhya · Aug 4, 2020

An optimizer is a technique we use to minimize the loss (and thereby improve accuracy). It does this by driving the parameters towards a minimum of the cost function.

Our parameters are updated like this:
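For weights 'w' and bias 'b', with gradients dw and db of the cost with respect to each, a minimal sketch of the plain gradient descent update looks like this (the function and variable names here are illustrative, not from any particular library):

```python
def gradient_descent_step(w, b, dw, db, lr=0.01):
    """One plain gradient descent update with learning rate lr."""
    w = w - lr * dw   # step against the gradient of the cost w.r.t. w
    b = b - lr * db   # step against the gradient of the cost w.r.t. b
    return w, b
```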

When our cost function is convex, it has only one minimum, which is its global minimum. In that case we can simply use the gradient descent optimization technique, and it will converge to the global minimum after a little hyper-parameter tuning.

But in real-world problems the cost function usually has many local minima. Plain gradient descent can fail here: we may end up in a local minimum instead of the global minimum.

So, to keep our model from getting stuck in a local minimum, we use an advanced version of gradient descent that adds momentum.

Imagine a ball: it starts from some point and then rolls downhill. If the ball has sufficient momentum, it will escape from the well, i.e. the local minimum, in our cost-function landscape.

Gradient descent with momentum considers past gradients to smooth out the update. It computes an exponentially weighted average of the gradients, and then uses that average to update the weights.

Let our bias parameter be 'b' and the weights be 'w'. When using gradient descent with momentum, the parameter updates look like this:
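Here is a minimal sketch of that update, assuming a decay hyper-parameter β (commonly around 0.9) and illustrative variable names:

```python
def momentum_step(w, b, dw, db, v_dw, v_db, lr=0.01, beta=0.9):
    """One gradient-descent-with-momentum update.

    v_dw and v_db are the exponentially weighted averages of past gradients.
    """
    # Smooth the gradients with an exponentially weighted average
    v_dw = beta * v_dw + (1 - beta) * dw
    v_db = beta * v_db + (1 - beta) * db
    # Update the parameters using the smoothed gradients
    w = w - lr * v_dw
    b = b - lr * v_db
    return w, b, v_dw, v_db
```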

RMSprop Optimizer:

A 2D contour plot of the cost function is a helpful way to visualize how the RMSprop algorithm works; in reality the parameter space has many more dimensions.

When we are dealing with large datasets, we use a batch of data at a time. Since each batch contains a lot of variety and noise, gradient descent oscillates heavily along its path and takes a long time and many iterations to converge. In the RMSprop optimization technique we update our parameters in such a way that the movement in the direction of the weights 'w' is larger than the movement in the direction of the bias 'b'.
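A minimal sketch of the RMSprop update, assuming a decay hyper-parameter β and a small ε added to avoid division by zero (names are illustrative):

```python
def rmsprop_step(w, b, dw, db, s_dw, s_db, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update.

    s_dw and s_db are exponentially weighted averages of the squared gradients.
    """
    # Keep a running average of the squared gradients
    s_dw = beta * s_dw + (1 - beta) * dw ** 2
    s_db = beta * s_db + (1 - beta) * db ** 2
    # Scale each gradient by the root of its running average
    w = w - lr * dw / (s_dw ** 0.5 + eps)
    b = b - lr * db / (s_db ** 0.5 + eps)
    return w, b, s_dw, s_db
```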

Here 'S' is the exponentially weighted average of the squared gradients. We assume that dw² is smaller than db², and hence the exponential average for dw is smaller than that for db.

So, when we divide the gradients by the square roots of their respective exponential averages, the update to 'w' will be larger than the update to 'b'. This lets us take larger steps in the horizontal direction and converge faster, and it also decreases the number of iterations needed to reach the optimal value. This is what gives the algorithm its name: Root Mean Squared Propagation.

Adam Optimizer:

It turns out that when we combine momentum and RMSprop, we end up with an even better optimization algorithm called Adam (Adaptive Moment Estimation).
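A minimal sketch of the Adam update, combining the momentum and RMSprop running averages (names are illustrative; the bias-correction step from the original Adam paper is included):

```python
def adam_step(w, b, dw, db, v_dw, v_db, s_dw, s_db, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1).

    v_* are the momentum averages, s_* the RMSprop averages.
    """
    # Momentum: exponentially weighted average of the gradients
    v_dw = beta1 * v_dw + (1 - beta1) * dw
    v_db = beta1 * v_db + (1 - beta1) * db
    # RMSprop: exponentially weighted average of the squared gradients
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2
    s_db = beta2 * s_db + (1 - beta2) * db ** 2
    # Bias correction so the averages are not too small in early iterations
    v_dw_hat = v_dw / (1 - beta1 ** t)
    v_db_hat = v_db / (1 - beta1 ** t)
    s_dw_hat = s_dw / (1 - beta2 ** t)
    s_db_hat = s_db / (1 - beta2 ** t)
    # Combined update: momentum direction, RMSprop scaling
    w = w - lr * v_dw_hat / (s_dw_hat ** 0.5 + eps)
    b = b - lr * v_db_hat / (s_db_hat ** 0.5 + eps)
    return w, b, v_dw, v_db, s_dw, s_db
```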

Here the first running average accounts for momentum, which we saw above, and the second comes from the RMSprop algorithm. We also introduce a new hyper-parameter ε (epsilon), which makes sure we never divide by an exponentially weighted average that is very close to zero. By default the hyper-parameters are:
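The commonly used defaults, taken from the original Adam paper, are β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸.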

The value of the learning rate still has to be tuned.

The Adam optimization algorithm performs very well across many different deep learning problems.

Conclusion:

There are lots of optimizers to choose from; knowing how they work will help you pick the right optimization technique for your application. I hope you found this article beneficial ;)


Hi! My name is Harsh Khandelwal. I am a computer science student at NIT Tiruchirappalli, India. I have good experience in data science.