Accelerating the Adaptive Methods: RMSProp+Momentum and Adam

Roan Gylberth
Published in Konvergen.AI
May 21, 2018

Our last post discussed two more adaptive algorithms that extend the Adagrad algorithm, namely Adadelta and RMSProp. As we discussed there, the accumulated gradients in Adadelta can be viewed as an acceleration factor, much like the Momentum method for SGD. Unfortunately, RMSProp by itself has no such acceleration factor.

One approach to incorporating momentum acceleration into RMSProp was taken by Alex Graves in his paper Generating Sequences With Recurrent Neural Networks. First, recall the RMSProp update.
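Writing the update with a small smoothing constant ε in the denominator (ε is added for numerical stability, as most implementations do), the steps are

v_t = \rho\, v_{t-1} + (1 - \rho)\, g_t^2
\Delta\theta_t = -\frac{\eta}{\sqrt{v_t + \epsilon}}\, g_t
\theta_{t+1} = \theta_t + \Delta\theta_t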

Here 𝑣 is the locally accumulated squared gradient, 𝜌 is the decay constant, 𝜂 is the learning rate, 𝑔 is the gradient, and 𝜃 is the parameter vector. Now, if we add a momentum term to 𝛥𝜃𝑡 and let the denominator scale only the gradient term, the update becomes
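Keeping the same accumulator 𝑣𝑡 and smoothing constant ε as above, this reads

v_t = \rho\, v_{t-1} + (1 - \rho)\, g_t^2
\Delta\theta_t = \alpha\, \Delta\theta_{t-1} - \frac{\eta}{\sqrt{v_t + \epsilon}}\, g_t
\theta_{t+1} = \theta_t + \Delta\theta_t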

where 𝛼 is the momentum hyperparameter. If you read Graves’s paper, you will notice that his denominator is slightly different: there, the square of the accumulated (unsquared) gradients is subtracted from the accumulated squared gradients before taking the square root.
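Concretely, Graves also keeps a moving average of the (unsquared) gradients, written here as \bar{m}_t (a symbol chosen for this post, not taken from the paper), and divides by the square root of the difference:

\bar{m}_t = \rho\, \bar{m}_{t-1} + (1 - \rho)\, g_t
\Delta\theta_t = \alpha\, \Delta\theta_{t-1} - \frac{\eta}{\sqrt{v_t - \bar{m}_t^2 + \epsilon}}\, g_t

Since v_t - \bar{m}_t^2 estimates the centered variance of the gradient, this variant scales by an estimate of the gradient’s standard deviation rather than its root mean square.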

The approach above is the most straightforward way to incorporate the Momentum method into RMSProp. Another algorithm that can be viewed as a combination of RMSProp and Momentum is Adam, whose name derives from adaptive moment estimation. The name hints at what the algorithm does: it exploits estimates of the moments of the gradients, in particular the first-order moment and the second-order raw moment. To estimate the first-order moment, i.e., the mean, it uses a moving average of the gradients; to estimate the second-order raw moment, i.e., the uncentered variance, it uses a moving average of the squared gradients.

To see this more clearly, let us write down the moment estimates.
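With 𝑚𝑡 as the estimate of the mean and 𝑣𝑡 as the estimate of the uncentered variance,

m_t = \rho_1\, m_{t-1} + (1 - \rho_1)\, g_t
v_t = \rho_2\, v_{t-1} + (1 - \rho_2)\, g_t^2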

where 𝜌₁ and 𝜌₂ are hyperparameters that control the exponential decay rates. We could simply plug these estimated moments into the parameter update, but Kingma & Ba note that in the initial steps the estimates are biased towards zero, especially when the moving averages decay slowly (𝜌₁ and 𝜌₂ close to 1). This happens because the estimates are initialized as vectors of zeros, i.e., 𝑚₀ = 0 and 𝑣₀ = 0.
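For example, with 𝑣₀ = 0 and the common default 𝜌₂ = 0.999, the first step gives 𝑣₁ = (1 − 𝜌₂)𝑔₁² = 0.001·𝑔₁², roughly a thousand times smaller than the squared gradient it is supposed to estimate.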

To counteract this bias, we use bias-corrected estimates of the moments. To derive the correction we take the second-order raw moment as the example; the same argument applies to the first-order moment. If we initialize 𝑣₀ = 0, we can write 𝑣𝑡 as a function of the gradients at all previous timesteps up to 𝑡.
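Unrolling the recursion from 𝑣₀ = 0 gives

v_t = (1 - \rho_2) \sum_{i=1}^{t} \rho_2^{\,t-i}\, g_i^2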

We want to know the expected value of the second-order moment estimate at timestep 𝑡, so that we can correct the discrepancy between it and the true second-order moment of the gradients, 𝔼[g²]. We compute the expected value as follows.
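Substituting the unrolled expression for 𝑣𝑡,

\mathbb{E}[v_t] = \mathbb{E}\!\left[(1 - \rho_2) \sum_{i=1}^{t} \rho_2^{\,t-i}\, g_i^2\right]
= \mathbb{E}[g_t^2]\,(1 - \rho_2) \sum_{i=1}^{t} \rho_2^{\,t-i} + \zeta \qquad (6)
= \mathbb{E}[g_t^2]\,(1 - \rho_2^{\,t}) + \zeta \qquad (7)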

We get (6) by using the linearity of expectation and approximating each 𝔼[𝑔ᵢ²] by 𝔼[𝑔𝑡²]; summing the resulting geometric series gives (7). Notice the new residual value 𝜁, which collects the difference between 𝔼[𝑔ᵢ²] and 𝔼[𝑔𝑡²]; it equals zero if the second moment of the gradients is stationary (does not change from time to time). It can also be kept small by choosing the exponential decay rate so that the moving average assigns only small weights to gradients far in the past, so the authors decided to ignore this value in the algorithm. Now, to get a better estimate of 𝔼[g²], we divide the accumulated squared gradients 𝑣𝑡 by the term (1 − 𝜌₂ᵗ), yielding the bias-corrected estimate
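Denoting the corrected estimate with a hat,

\hat{v}_t = \frac{v_t}{1 - \rho_2^{\,t}}

In the one-step example above, this divides 𝑣₁ = 0.001·𝑔₁² by (1 − 𝜌₂¹) = 0.001, recovering 𝑔₁² exactly.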

Then we do the same for the first-order moment estimate.
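In the same hat notation,

\hat{m}_t = \frac{m_t}{1 - \rho_1^{\,t}}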

Finally, using these bias-corrected moments, we can compute the parameter update.
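With the same learning rate 𝜂 and smoothing constant ε as before,

\theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}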

This yields the final update rule of the algorithm. Now we can see the differences between RMSProp+Momentum and Adam. In RMSProp+Momentum, the momentum term is applied to the gradient after it has been scaled by the accumulated squared gradients. In Adam, by contrast, the momentum term (the first-moment estimate) is applied directly to the raw gradient. Another difference is that, unlike Adam, RMSProp has no bias correction for the second-order moment estimate, so its second-order moments can be biased in the early stages of training. Adam is a popular algorithm in deep learning and is used in many tutorials. Despite that popularity, there are several extensions to Adam, which will be discussed in the next post.
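To make these differences concrete, here is a minimal NumPy sketch of a single update step of each method. The function names, default hyperparameters, and the exact placement of ε are illustrative choices, not taken verbatim from the papers.

import numpy as np

def rmsprop_momentum_step(theta, g, v, delta, lr=1e-3, rho=0.9, alpha=0.9, eps=1e-8):
    # Momentum is applied to the gradient *after* it is scaled by the
    # accumulated squared gradients.
    v = rho * v + (1 - rho) * g ** 2
    delta = alpha * delta - lr * g / np.sqrt(v + eps)
    theta = theta + delta
    return theta, v, delta

def adam_step(theta, g, m, v, t, lr=1e-3, rho1=0.9, rho2=0.999, eps=1e-8):
    # Momentum (the first-moment estimate) is applied to the raw gradient,
    # and both moment estimates are bias-corrected before the update.
    m = rho1 * m + (1 - rho1) * g            # first moment (mean) estimate
    v = rho2 * v + (1 - rho2) * g ** 2       # second raw moment estimate
    m_hat = m / (1 - rho1 ** t)              # bias correction, with t starting at 1
    v_hat = v / (1 - rho2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

Reading the two functions side by side shows exactly the two differences discussed above: where the momentum enters, and that only Adam applies the bias correction.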

References

  1. A. Graves, “Generating Sequences With Recurrent Neural Networks” (2013)
  2. D. P. Kingma, J. Ba, “Adam: A Method for Stochastic Optimization” (2014)
  3. I. Goodfellow, Y. Bengio, A. Courville, “Deep Learning” (2016)
