An Introduction to AdaGrad

Roan Gylberth
Konvergen.AI
May 3, 2018

We have discussed several algorithms in the last two posts, and there is one hyper-parameter that is used in all of them, i.e., the learning rate (η). To refresh, a hyper-parameter is a parameter that has to be chosen manually before training. The prefix ‘hyper-’ differentiates it from the parameters that are changed automatically by the optimization algorithm during training. The learning rate reflects how much we allow the parameter (θ) to follow the opposite direction of the gradient estimate (g).

Unfortunately, this hyper-parameter can be very difficult to set. If we set it too small, the parameter updates will be very slow and it will take a very long time to reach an acceptable loss. Conversely, if we set it too large, the parameter will jump all over the function and may never reach an acceptable loss at all. To make things worse, the high-dimensional, non-convex nature of neural network optimization can lead to a different sensitivity in each dimension: the learning rate could be too small in some dimensions and too large in others.

One obvious way to mitigate this problem is to choose a different learning rate for each dimension, but when we have thousands or millions of dimensions, which is normal for deep neural networks, that is not practical. In practice, one of the earlier algorithms that has been used to mitigate this problem for deep neural networks is AdaGrad (Duchi et al., 2011), which adaptively scales the learning rate for each dimension. Before we explore how the algorithm works, let’s look at the parameter update equation that is used in practice,
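
θ_{t+1} = θ_t − η (ε I + diag(G_t))^{−1/2} g_t    (1)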

where θ is the parameter to be updated, η is the initial learning rate, ε is some small quantity used to avoid division by zero, I is the identity matrix, and g_t is the gradient estimate at time-step t, which we can get with the equation
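
g_t = (1/m) ∇_θ Σ_{i=1}^{m} L(f(x^{(i)}; θ_t), y^{(i)})    (2)

where L is the per-example loss, f is the model, and (x^{(i)}, y^{(i)}) are the m examples in the mini-batch.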

The key to this algorithm is the matrix G_t, the sum of the outer products of the gradients up to time-step t, which is defined by
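
G_t = Σ_{τ=1}^{t} g_τ g_τᵀ    (3)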

Actually, we can use the full matrix G_t in the parameter update, but computing the square root of the full matrix is impractical, especially in very high dimensions. On the other hand, the square root and the inverse of only the diagonal diag(G_t) can be computed easily.

If we only look at equation (1), it can be unclear how it mitigates the sensitivity problem. To add more clarity, we can expand equation (1) into this form,
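
θ_{t+1} = θ_t − diag( η/√(ε + Σ_{τ=1}^{t} g²_{τ,1}), η/√(ε + Σ_{τ=1}^{t} g²_{τ,2}), …, η/√(ε + Σ_{τ=1}^{t} g²_{τ,d}) ) g_t    (4)

where d is the number of parameters and g_{τ,i} is the i-th component of g_τ.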

Well, you can see that it becomes clearer, but it is too long, so we first simplify the effective learning rate part,
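
η_{t,i} = η / √(ε + Σ_{τ=1}^{t} g²_{τ,i})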

so that we can rewrite equation 4 as
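
θ_{t+1} = θ_t − diag(η_{t,1}, η_{t,2}, …, η_{t,d}) g_t    (5)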

Multiplying the effective learning rate matrix with the gradient estimate vector then yields the update rule of AdaGrad,
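
θ_{t+1,i} = θ_{t,i} − η_{t,i} g_{t,i},   for i = 1, …, d    (6)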

If we compare this with previously discussed algorithms, for example Stochastic Gradient Descent, whose update takes the form
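
θ_{t+1} = θ_t − η g_t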

we can see that Stochastic Gradient Descent uses the same learning rate for every dimension at each iteration. AdaGrad, on the other hand, adaptively scales the learning rate in each dimension with respect to the accumulated squared gradients at each iteration.

The form of AdaGrad in equation 6 can be rewritten into another form that we can find, e.g., in (Goodfellow et al., 2016). Since
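
diag(G_t)_{ii} = Σ_{τ=1}^{t} g²_{τ,i}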

with a slight abuse of notation we can write
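
θ_{t+1} = θ_t − ( η / (ε + √(Σg²_t)) ) ⊙ g_t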

where Σg²_t is the sum of the squared gradient estimates over the course of training, ⊙ denotes element-wise multiplication, the division and square root are also taken element-wise, and ε is a vector of small numbers to avoid dividing by zero. This equation is slightly easier to understand than equation 1, but it doesn’t tell the full story, since Σg²_t represents only diag(G_t), which is just a special case of the more general formulation with the full matrix G_t.
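
To make the element-wise update concrete, here is a minimal NumPy sketch of an AdaGrad step on a toy quadratic problem with very different curvature per dimension. The function name adagrad_update, the toy objective, and the hyper-parameter values are illustrative choices of ours, not taken from the paper.

import numpy as np

def adagrad_update(theta, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then scale each
    dimension's step by 1 / (eps + sqrt(accumulated squared gradient))."""
    accum += grad ** 2                      # running sum of squared gradients, i.e., diag(G_t)
    theta -= lr * grad / (eps + np.sqrt(accum))
    return theta, accum

# Toy usage: minimize f(theta) = 0.5 * theta^T A theta, where the two
# dimensions have very different curvature (100 vs. 1).
A = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
accum = np.zeros_like(theta)                # sum of squared gradients starts at zero
for t in range(100):
    grad = A @ theta                        # gradient of the quadratic objective
    theta, accum = adagrad_update(theta, grad, accum, lr=0.5)
print(theta)                                # both coordinates approach 0 despite the mismatched scales

Even though the two coordinates have very different gradient magnitudes, the per-dimension scaling lets both make progress with a single initial learning rate.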

This algorithm performs best for sparse data because it decreases the learning rate faster for frequently updated parameters and more slowly for infrequently updated ones. Unfortunately, in some cases the effective learning rate decreases very fast, because we accumulate the squared gradients from the very beginning of training. At some point the model may stop learning altogether because the effective learning rate is almost zero. This issue is mitigated by several algorithms that extend AdaGrad, and those algorithms will be the subject of the next post. We need to understand how and why AdaGrad works to really understand and appreciate them. In this post we only explored how AdaGrad works, without looking at the regret bound of the algorithm, which you can read about in the very comprehensive journal paper.

References

  1. J. Duchi, E. Hazan, Y. Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization” (2011)
  2. I. Goodfellow, Y. Bengio, A. Courville, “Deep Learning” (2016)
