Momentum Optimizer!!!

vinodhkumar baskaran
6 min read · Apr 22, 2020


This article is a continuation of my previous article introducing optimizers.

Let's have a quick refresher!!!

What is an optimizer?

An optimizer helps us minimize an error function (loss function) or maximize some measure of performance. It is simply a mathematical function of the model’s learnable parameters [W (weights/coefficients), b (bias)], which are used in computing the target values [yi] from the set of inputs [xi].

How does the optimizer minimize the loss?

It minimizes the cost function (loss function) by finding optimized values for the learnable parameters. This helps the model generalize as well.

Gradient Descent

Gradient descent is an iterative optimization algorithm used to reduce the cost function.
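In standard notation, the vanilla gradient descent update with learning rate η is:

w_(t+1) = w_t − η · ∇w_t

where ∇w_t denotes the gradient of the loss with respect to the weights at step t.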

Types of Gradient Descent

The different types of gradient descent are:

  • Gradient Descent or Vanilla Gradient Descent
  • Stochastic Gradient Descent
  • Mini-batch Gradient Descent
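As a minimal sketch (illustrative, not from the original article), the three variants differ only in how many examples are used per update. Here `grad(w, X, y)` is an assumed helper returning the gradient of the loss on the given NumPy arrays:

```python
import numpy as np

def gd_step(w, X, y, grad, eta=0.01, batch_size=None):
    """One gradient descent step.
    batch_size=None          -> vanilla (full-batch) gradient descent
    batch_size=1             -> stochastic gradient descent (SGD)
    1 < batch_size < len(X)  -> mini-batch gradient descent
    """
    if batch_size is None:
        Xb, yb = X, y                              # use every example
    else:
        idx = np.random.choice(len(X), batch_size, replace=False)
        Xb, yb = X[idx], y[idx]                    # sample a (mini-)batch
    return w - eta * grad(w, Xb, yb)               # step against the gradient
```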
That was the quick refresher!!!

Now let us look at the more advanced optimizers in practice.

  1. Gradient descent with Momentum

The problem with all the variants of gradient descent is that they take a lot of time to pass through gentle slopes. This is because on a gentle slope the gradient is very small, so the updates become slow.

To solve this problem, we incorporate the idea of momentum into gradient descent.

What is momentum?

Momentum is like a ball rolling downhill. The ball will gain momentum as it rolls down the hill.

In simple terms: you want to reach a destination that is entirely new to you. What will you do? You will ask a person nearby for directions. That person points you in some direction, so you move that way slowly. After covering a certain distance, you ask another person for directions, and that person points you in the same direction; now you move in that direction with a bit more acceleration. Your acceleration increases because the information from the later person and the earlier person agree. This phenomenon is called momentum.

The same idea is used in gradient descent as well.

Update for GD with Momentum

In addition to the current gradient, it also looks at the history of updates.
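In standard notation, the momentum update can be written as:

update_t = γ · update_(t−1) + η · ∇w_t
w_(t+1) = w_t − update_t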

γ → hyperparameter in [0–1]. It is used to give more importance to recent updates than to earlier ones (in the equation above, older updates are multiplied by higher and higher powers of γ).

Takeaway: even in regions having gentle slopes, momentum-based gradient descent is able to take large steps because the momentum carries it along.
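A minimal sketch of momentum-based gradient descent, reusing the illustrative `grad` helper and NumPy import from the earlier sketch:

```python
def momentum_gd(w, grad, X, y, eta=0.01, gamma=0.9, steps=100):
    """Gradient descent with momentum: each step blends the current
    gradient with an exponentially decaying history of past updates."""
    update = np.zeros_like(w)                 # history of updates, zero at t=0
    for _ in range(steps):
        update = gamma * update + eta * grad(w, X, y)
        w = w - update                        # momentum allows large steps on gentle slopes
    return w
```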

Observation:

  1. With SGD without momentum, the updates take longer vertical steps [slower learning] than horizontal steps [faster learning].
  2. With SGD with momentum, the updates take longer horizontal steps [faster learning] than vertical steps [slower learning].

Problem:

Momentum-based gradient descent oscillates in and out of the minima valley, as the momentum carries it past the valley floor. It takes a lot of U-turns before finally converging. Despite these U-turns, it still converges faster than vanilla gradient descent.

Think of the problem as rolling a ball from a hilltop: it will surely come to rest in the valley, but it will oscillate to and fro around the valley before resting.

2. Nesterov Accelerated Gradient (NAG):

Can we do something to reduce the oscillations? The answer is “YES”.

Recall that we use past updates to gain/incorporate momentum in gradient descent/SGD.

Here is a simple hack to reduce this oscillation:

  1. First, move only by the history of updates [γ · update@(t−1)] accumulated from time t = 0 to t − 1.
  2. Now calculate the gradient at that moved point [the “look-ahead” gradient] instead of at the previous point (the point before moving by the history-update distance).
  3. Now update the weights, as shown in the equations below.
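In the same notation as the momentum update above, the NAG update is:

w_lookahead = w_t − γ · update_(t−1)
update_t = γ · update_(t−1) + η · ∇w_lookahead
w_(t+1) = w_t − update_t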

Observation:

  • The oscillations are smaller, and the chances of escaping the minima valley are also smaller.

3. AdaGrad (Adaptive gradient):

Motivation behind this algorithm: let us consider a dataset which has both dense and sparse features. During training, if the learning rate (η) is fixed to some value (say η = 0.001), then training happens with the same η across the dataset. That seems fine, but there is a problem: if a feature is dense [e.g., Word2Vec], its updates will be frequent, and if a feature is sparse [e.g., BOW], its updates will be rare.

Hence this results in slower convergence for the sparse features. To avoid this problem, we use a technique called Adaptive Gradient (AdaGrad).

The idea of AdaGrad is to use a different learning rate (η) for each feature at each iteration (time step).

To apply this idea: if a feature has undergone more updates, make its η smaller; otherwise, make its η larger.

It adapts the learning rate to the parameters, performing smaller updates (i.e., low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e., high learning rates) for parameters associated with infrequent features. This solves the update problem for sparse features.
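In standard notation, the AdaGrad update for parameter i at time t is (ε is a small constant that avoids division by zero):

G_(i,t) = G_(i,t−1) + g²_(i,t)
w_(i,t+1) = w_(i,t) − (η / √(G_(i,t) + ε)) · g_(i,t)

where g_(i,t) is the gradient of the loss with respect to parameter i at step t.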

Don’t be scared after looking at the equation; it is the same as the usual update equation, with a small change to η.

The above idea is achieved through this intuition: decay the learning rate for each parameter in proportion to its update history.

So we accumulate the squared gradients in the denominator to achieve the decaying learning rate.

The problem with this approach is that when the denominator G_(i,t) becomes very large, η becomes very, very small, resulting in “very slow convergence” (the effective learning rate, and hence the updates, all but vanish).
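A minimal AdaGrad sketch in the same style as the earlier sketches (`eps` is the usual small stability constant):

```python
def adagrad(w, grad, X, y, eta=0.01, eps=1e-8, steps=100):
    """AdaGrad: per-parameter learning rates that shrink as the
    accumulated squared gradients grow."""
    G = np.zeros_like(w)                       # running sum of squared gradients
    for _ in range(steps):
        g = grad(w, X, y)
        G = G + g ** 2                         # grows monotonically, so eta decays
        w = w - (eta / np.sqrt(G + eps)) * g   # frequently updated params slow down
    return w
```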

4. AdaDelta and RMSprop:

AdaDelta and RMSprop are both small enhancements over AdaGrad.

They address AdaGrad’s problem of ever-slowing convergence.

The idea of AdaDelta and RMSprop is: instead of simply using the (square-rooted) sum of all squared past gradients, why not use an exponentially decaying average (EDA) / exponentially moving average / running average of the squared gradients?

By applying the running average, we control the growth of the denominator.

In simpler words, we use a running mean of the squared past gradients, rather than their ever-growing sum, to decay the learning rate.
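In standard notation, RMSprop replaces AdaGrad’s accumulator with an exponentially decaying average (β is typically around 0.9):

E[g²]_t = β · E[g²]_(t−1) + (1 − β) · g²_t
w_(t+1) = w_t − (η / √(E[g²]_t + ε)) · g_t

In the AdaGrad sketch above, this amounts to replacing `G = G + g ** 2` with `G = beta * G + (1 - beta) * g ** 2`.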

5. Adam (Adaptive Moment Estimation):

The idea of the Adam optimizer is: instead of using only the exponentially weighted average of the squares of the past gradients (as in RMSprop), why not also use the exponentially weighted average of the past gradients themselves (as in momentum)?

The reason we do this is that we don’t want to rely too much on the current gradient; instead, we rely on the overall behaviour of the gradients over many time steps.
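In standard notation, Adam maintains both exponentially weighted averages and uses their bias-corrected versions in the update:

m_t = β1 · m_(t−1) + (1 − β1) · g_t
v_t = β2 · v_(t−1) + (1 − β2) · g²_t
m̂_t = m_t / (1 − β1^t),  v̂_t = v_t / (1 − β2^t)
w_(t+1) = w_t − (η / (√v̂_t + ε)) · m̂_t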

mt → 1st order moment of history of gradients.

vt → 2nd order moment of history of gradients.

In statistics, mt is analogous to the mean of the past gradients, and vt is analogous to their (uncentered) variance.

m̂t and v̂t are the bias-corrected versions of this mean and variance.

Why do we need bias correction?

Mean of the past gradients ≈ (approximately equals) the exponentially weighted average (EWA) of the past gradients (mt) → Eqn 1

Variance of the past gradients ≈ (approximately equals) the exponentially weighted average of the squared past gradients (vt) → Eqn 2

Proving the bias term with Eqn 1:

Note: expectation is another word for mean.
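A sketch of the standard argument: unrolling the recurrence for m_t gives

m_t = (1 − β1) · [g_t + β1 · g_(t−1) + β1² · g_(t−2) + … + β1^(t−1) · g_1]

Taking expectations on both sides, and assuming E[g_i] ≈ E[g_t] over the window,

E[m_t] ≈ E[g_t] · (1 − β1) · (1 + β1 + … + β1^(t−1)) = E[g_t] · (1 − β1^t)

So E[m_t] underestimates E[g_t] by the factor (1 − β1^t), which is far from 1 when t is small; dividing m_t by (1 − β1^t) removes this bias and gives m̂t.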

The same derivation can be carried out for the variance term as well.
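Putting it all together, a minimal Adam sketch in the same style as the earlier sketches (β1 = 0.9 and β2 = 0.999 are the commonly used defaults):

```python
def adam(w, grad, X, y, eta=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=100):
    """Adam: momentum-style first moment + RMSprop-style second moment,
    both bias-corrected before the update."""
    m = np.zeros_like(w)                       # 1st moment (mean of gradients)
    v = np.zeros_like(w)                       # 2nd moment (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad(w, X, y)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)           # bias correction (see derivation above)
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w
```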

[Figure: the star marks the minimum.]

Which optimizer to use?

In practice, Adam is often a good default choice, since it combines the benefits of momentum (the first moment) with per-parameter adaptive learning rates (the second moment).

References:

NPTEL notes: CS7015 (Deep Learning)

Sebastian Ruder, “An overview of gradient descent optimization algorithms”: https://arxiv.org/pdf/1609.04747.pdf
