Simplifying Optimization : A Study of AdaGrad

Let us begin to grasp the concept of AdaGrad with a brief overview of one of the most important optimization algorithms in deep learning — Gradient Descent. It will be in regulating the hyper-parameters of Gradient Descent that we will utilize the concept of AdaGrad or Adaptive Gradient Method.

A plot of Cost Function J( ϴ) vs ϴ

Gradient Descent, in short, is the procedure of finding the right set of values of ϴ (i.e ϴ optimal) to minimize the cost function J(ϴ) .

θt+1 = θt — α(∂L∕∂θt)

Thus as is seen in the above equation, based on α and the gradients (∂L∕∂θt) we take steps in updating our parameter θt+1 to gently reach the global optimum θoptimal .

In these gifs it is evident that the learning rate plays a huge role in determining how well the algorithm shall converge, when it will converge and whether it will diverge on reaching the minima.

It will be based on this learning rate that we will base our understanding of AdaGrad and it’s place within Gradient Descent.

Without using any optimization algorithm, like AdaGrad or Adam or RMSProp, the learning rate stays the same through every iteration. But looking at the two learning rates up top, it seems that while α = 0.05 is diverging, α = 0.005 takes a bit of time as the steps it takes while beginning the descent are small. Our ideal situation would be one in which we take big steps first, like with α = 0.05, probably the first three or four steps taken by it and then slow our pace and converge like α = 0.005, and this is the exact intuition behind using Momentum to tune the hyperparameter α during training.

Another Intuition behind using Momentum :

Suppose we have a single layer perceptron with four inputs x1,x2,x3,x4 and a bias term b.

And now suppose that out of these 4 input features, one of them, say x3, is sparse. That is, out of 1000 data points we have for training, for 950 of them x3 is 0. This can cause severe problems if x3 is an important parameter on which the output is heavily dependent.

Equations governing Forward Propagation

As the updates to the weights are governed by the gradients, and the gradients depend directly on the input value, we can see that for 950 out of 1000, ∇W3 will be 0 and thus, W3 will not receive updates as frequently as W4.

One approach towards solving this problem can be to take different learning rates for each parameter which takes care of the frequency of features but this seems a bit too menial and can simply not scale well for large neural nets where we may have thousands of input features.

Using Momentum in Tuning the Learning Rate :

Momentum consists of taking in consideration the previous steps that we have taken using gradient descent. The intuition is that we must decay the learning rate for a parameter in proportion to their update history.

By this approach, what we are essentially doing is to make updates inversely proportional to their update history. Therefore, after taking huge steps as we have in α = 0.05, the updates will naturally slow down so that we have smaller and smaller steps being taken until we reach convergence.

By the same intuition, in case of α = 0.005 since the steps taken are small initially, the updates are not penalized to that extent and left free( but it may run into a vanishing gradient problem, as we shall see later)

AdaGrad :

AdaGrad uses the following two equations to tune the learning rate.

As shown in the second equation, here, the learning rate becomes :

The Vt in the first equation is a history vector that accumulates the gradients over the training iterations. Note that we square ∇(wt)̂2 as we are only concerned with the magnitude of the gradients and not the directions themselves.

And as we divide the learning rate α or η by Vt, we will be lowering the learning rate, the longer and bigger steps we take previously and vice versa.

ε is a term often called the fudge factor which is just a small number added to combat numerical instability.

As seen here, AdaGrad can offer stability in Gradient Descent but it can also be outperformed by other algorithms like AdaDelta and RMSProp.(Source : Sebastian Ruder )

Shortcomings of AdaGrad :

AdaGrad was introduced in 2011 by Duchi et. al. and many other optimization techniques have been introduced after that, including RMSProp, AdaDelta, Adam etc.

RMSProp was introduced to address a specific shortcoming of AdaGrad which is, it may easily run into a situation where the accumulated gradients in the denominator is so high that it nullifies the entire update term and Gradient Descent itself stops.

RMSProp works by the introducing a discount factor β (= 0. 95 or so) which tells us how much the new gradients should be used to update the history vector Vt itself. This keeps the term Vt in check and it does not explode, causing Gradient Descent to slow down.

Thus, though AdaGrad is a useful algorithm and can help us control our hyperparameters well, it more often than enough fails to deliver proper results due to gradients getting killed because of an exploding history of gradients. Various methods exist to ease this procedure.

In closing, here’s a nice quote about Gradient Decent. May you be able to optimize your network well and reach convergence!

Life is like Gradient Descent, it can sometimes get stuck at local optima but only hard work will take to you to the most successful point.
- Aditya Thakker