Modifying Adam to use Nesterov Accelerated Gradients: Nesterov-accelerated Adaptive Moment Estimation (Nadam)

Roan Gylberth
Published in Konvergen.AI
Jun 24, 2018 · 4 min read


In the previous post, we discussed the Adam algorithm, which can be seen as a way of combining the advantages of the Momentum method and RMSProp, with several distinctions. As we have also discussed earlier, the Nesterov Accelerated Gradient (NAG) method is a variation of the Momentum method with a “peeking” attribute, i.e., it applies the acceleration to the parameters before computing the gradients, then performs the update with the gradients computed at the interim parameters. This attribute really helps to avoid first-order optimization problems like exploding gradients, which have been observed in Recurrent Neural Networks. This motivated Dozat (2016) to incorporate Nesterov Accelerated Gradients into Adam, yielding Nesterov-accelerated Adaptive Moment Estimation (Nadam).

To get the update rule of Nadam, we first rewrite the NAG update rule in a more straightforward form. Recall that in the NAG update rule we have to compute the interim parameters θ̃, then compute the gradients at these interim parameters.
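(A sketch of this step, following Dozat (2016), with 𝜌₁ the momentum decay rate and 𝑚 the momentum vector as before; η for the learning rate and f for the objective are symbols I am assuming here.)

```latex
\tilde{\theta} = \theta_{t-1} - \eta \rho_1 m_{t-1},
\qquad
g_t = \nabla_{\theta} f(\tilde{\theta})
\tag{1}
```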

After that, we compute the update using the gradients at the interim parameters.
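(In the same notation, roughly)

```latex
m_t = \rho_1 m_{t-1} + g_t,
\qquad
\theta_t = \theta_{t-1} - \eta\, m_t
\tag{2}
```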

We can see that we apply the momentum to the parameters twice: once to get the interim parameters and once in the update rule. To incorporate NAG into Adam, we have to rewrite this into
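(A sketch of (3), where the gradient is now taken at the current parameters and the look-ahead is folded into the update)

```latex
g_t = \nabla_{\theta} f(\theta_{t-1}),
\qquad
m_t = \rho_1 m_{t-1} + g_t,
\qquad
\theta_t = \theta_{t-1} - \eta\,(\rho_1 m_t + g_t)
\tag{3}
```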

In (3), we rewrite the notation to match the Adam notation in the previous post. As we can see, the update rule for the current time-step incorporates not only the current gradient 𝑔𝑡 but also the momentum vector of the next time-step through 𝑚𝑡. With this rewritten NAG update rule, we apply the momentum only once, in the update rule itself. To keep things simpler, we also do not use the warming schedule for 𝜌₁ that the author recommends, so 𝜌₁ stays the same throughout training.

To see the connection between this rewritten NAG and the Classical Momentum (CM) method, recall that in CM the update rule is
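(Expanded in the same notation, roughly)

```latex
\theta_t = \theta_{t-1} - \eta\, m_t
         = \theta_{t-1} - \eta \rho_1 m_{t-1} - \eta\, g_t
\tag{4}
```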

Since the second term on the right-hand side does not depend on the current gradient, we can modify this term using the Nesterov trick, yielding the update rule for the rewritten NAG
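(That is, replacing 𝑚𝑡₋₁ with 𝑚𝑡 in the momentum term, roughly)

```latex
\theta_t = \theta_{t-1} - \eta \rho_1 m_t - \eta\, g_t
         = \theta_{t-1} - \eta\,(\rho_1 m_t + g_t)
\tag{5}
```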

Now, we recall the update rule of Adam without the bias correction.
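(A sketch in the same notation, with ε an assumed symbol for the small stability constant)

```latex
m_t = \rho_1 m_{t-1} + (1 - \rho_1)\, g_t,
\qquad
v_t = \rho_2 v_{t-1} + (1 - \rho_2)\, g_t^2,
\qquad
\theta_t = \theta_{t-1} - \eta\, \frac{m_t}{\sqrt{v_t} + \epsilon}
\tag{6}
```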

We can expand 𝑚𝑡 in the update rule, as we did for CM, to
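(Expanding 𝑚𝑡 term by term, roughly)

```latex
\theta_t = \theta_{t-1}
         - \eta\, \frac{\rho_1 m_{t-1}}{\sqrt{v_t} + \epsilon}
         - \eta\, \frac{(1 - \rho_1)\, g_t}{\sqrt{v_t} + \epsilon}
\tag{7}
```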

We cannot directly apply the Nesterov trick to the second term of the right-hand side because 𝑣𝑡 depends on 𝑔𝑡. But if we recall, the term 𝑣𝑡 is computed by
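(This is the usual Adam second-moment recursion)

```latex
v_t = \rho_2 v_{t-1} + (1 - \rho_2)\, g_t^2
\tag{8}
```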

Since 𝜌₂ is generally initialized to be very close to 1 (>0.9; 0.999 is the suggested default in the original paper), the difference between 𝑣𝑡 and 𝑣𝑡₋₁ will be small. So, without losing too much accuracy, we can write (7) as
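(That is, replacing √𝑣𝑡 with √𝑣𝑡₋₁ in the momentum term only, roughly)

```latex
\theta_t \approx \theta_{t-1}
         - \eta\, \frac{\rho_1 m_{t-1}}{\sqrt{v_{t-1}} + \epsilon}
         - \eta\, \frac{(1 - \rho_1)\, g_t}{\sqrt{v_t} + \epsilon}
\tag{9}
```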

Now we can apply the Nesterov trick to (9), yielding
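(Replacing 𝑚𝑡₋₁/√𝑣𝑡₋₁ with 𝑚𝑡/√𝑣𝑡, a sketch of (10))

```latex
\theta_t = \theta_{t-1}
         - \eta\, \frac{\rho_1 m_t + (1 - \rho_1)\, g_t}{\sqrt{v_t} + \epsilon}
\tag{10}
```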

This yields the update rule of Nadam without bias correction. Finally, after we apply the bias correction, the algorithm can be written as

Nesterov-accelerated adaptive moment estimation (Nadam) Algorithm
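To make the algorithm concrete, here is a minimal NumPy sketch of one Nadam step with bias correction and a fixed 𝜌₁ (no warming schedule). The function name, default hyperparameters, and the lr/eps symbols are my own choices for illustration, not taken from Dozat's paper or code.

```python
import numpy as np

def nadam_step(theta, grad, m, v, t, lr=0.002, rho1=0.9, rho2=0.999, eps=1e-8):
    """One Nadam update with bias correction and a fixed rho1 (no warming schedule)."""
    # Exponential moving averages of the gradient and the squared gradient, as in Adam.
    m = rho1 * m + (1.0 - rho1) * grad
    v = rho2 * v + (1.0 - rho2) * grad ** 2

    # Bias-corrected estimates (t starts at 1).
    m_hat = m / (1.0 - rho1 ** t)
    v_hat = v / (1.0 - rho2 ** t)

    # Nadam direction: mix the bias-corrected momentum with the bias-corrected
    # current gradient, i.e. the Nesterov "peek ahead" folded into Adam.
    direction = rho1 * m_hat + (1.0 - rho1) * grad / (1.0 - rho1 ** t)
    theta = theta - lr * direction / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = x1^2 + x2^2.
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2.0 * theta                 # gradient of the quadratic
    theta, m, v = nadam_step(theta, grad, m, v, t)
print(theta)                           # should end up close to [0, 0]
```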

Dozat graciously provides the code in TensorFlow and explains how it works intuitively. I recommend his GitHub page: https://github.com/tdozat/Optimization.

References

  1. T. Dozat, “Incorporating Nesterov Momentum into Adam” (2016)
  2. I. Sutskever, J. Martens, G. Dahl, G. Hinton, “On the importance of initialization and momentum in deep learning” (2013)
  3. D. P. Kingma, J. Ba, “Adam: A Method for Stochastic Optimization” (2014)
