Adaptive Methods Based on Exponential Moving Averages with Guaranteed Convergence: AMSGrad and AdamNC

Roan Gylberth
Konvergen.AI
Jul 21, 2018

Of all the optimization algorithms we have covered in this series, Adam and RMSProp are the most popular among Deep Learning practitioners. One reason for this popularity is the adaptive learning rate, which frees us from worrying about the learning rate in every dimension. Adaptive methods were first introduced with Adagrad, but as we have discussed, the accumulation of squared gradients from the beginning of training can shrink the effective learning rate so quickly that training stops before reaching the desired solution. This problem was solved by adaptive methods based on exponential moving averages, like RMSProp and Adam, which accumulate the squared gradients in a window instead of from the beginning of training.
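
To make the contrast concrete, here is a quick sketch in our notation, with 𝑔𝑡 the gradient and 𝑣𝑡 the accumulated squared gradient. Adagrad keeps adding from the beginning of training,

$$v_t = v_{t-1} + g_t^2,$$

while RMSProp and Adam use an exponential moving average, which effectively keeps only a window of recent gradients,

$$v_t = \rho_2\, v_{t-1} + (1-\rho_2)\, g_t^2.$$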

However, a recent paper by Reddi et al. shows that these methods have a drawback: they can fail to converge in some cases. One intuitive reason is that mini-batches that appear rarely may yield large, informative gradients, but with the exponential averaging of these algorithms, that information is phased out quickly. To overcome this problem, they propose methods that introduce a “long-term memory” of past gradients.
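
To see how quickly that happens, note that with an exponential moving average a squared gradient observed $k$ steps ago only contributes with weight $(1-\rho_2)\rho_2^{k}$ to the current $v_t$. With a typical $\rho_2 = 0.99$, for example, a single large gradient already has its weight shrunk below $0.01 \times 0.99^{100} \approx 0.0037$ after 100 steps, so the large-but-rare gradients quickly stop influencing the effective learning rate.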

Before we get into the algorithm, let us look deeper into the convergence problem of Adam. To help our analysis, consider the generic adaptive method setup in Algorithm 1 of the paper.
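
Written out (and omitting the projection step the paper uses to keep the iterates in the feasible set), each iteration of this generic framework roughly takes the form

$$g_t = \nabla f_t(x_t), \qquad m_t = \phi_t(g_1, \ldots, g_t), \qquad V_t = \psi_t(g_1, \ldots, g_t),$$

$$x_{t+1} = x_t - \alpha_t \frac{m_t}{\sqrt{V_t}},$$

where $\phi_t$ and $\psi_t$ are the averaging functions that each algorithm is free to choose, and $\alpha_t$ is the step size.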

Other algorithms that we have discussed fall into this framework. For instance, setting

$$\phi_t(g_1, \ldots, g_t) = g_t, \qquad \psi_t(g_1, \ldots, g_t) = \mathbb{I}$$

yields the Stochastic Gradient Descent algorithm, while setting

$$\phi_t(g_1, \ldots, g_t) = g_t, \qquad \psi_t(g_1, \ldots, g_t) = \frac{\mathrm{diag}\left(\sum_{i=1}^{t} g_i^2\right)}{t}$$

yields the Adagrad algorithm (with step size $\alpha_t = \alpha/\sqrt{t}$ in both cases). Now, if we recall from our Adam article, we can set 𝜙 and 𝜓 accordingly, that is

$$\phi_t(g_1, \ldots, g_t) = (1-\rho_1)\sum_{i=1}^{t} \rho_1^{t-i} g_i, \qquad \psi_t(g_1, \ldots, g_t) = (1-\rho_2)\,\mathrm{diag}\!\left(\sum_{i=1}^{t} \rho_2^{t-i} g_i^2\right)$$

From this framework, the fundamental flaw of Adam and other exponential moving average methods lies in how the inverse of the effective learning rate changes with respect to time, which is captured in the quantity

$$\Gamma_{t+1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t}.$$

We see that for SGD and Adagrad, this quantity is positive semi-definite. For instance, in SGD with $\alpha_t = \alpha/\sqrt{t}$,

$$\Gamma_{t+1} = \frac{\sqrt{t+1} - \sqrt{t}}{\alpha}\,\mathbb{I} \succeq 0,$$

so 𝛤 is greater than or equal to 0 at every 𝑡 and the effective learning rate never increases. However, in Adam and other exponential moving average methods, this quantity can potentially be indefinite: when large gradients are phased out of the average, V𝑡₊₁ can become much smaller than V𝑡, and the resulting 𝛤 becomes negative.
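
As a small numerical illustration (our own numbers, not from the paper): Adam is typically run with a constant step size $\alpha_t = \alpha$, so $\Gamma_{t+1} = (\sqrt{V_{t+1}} - \sqrt{V_t})/\alpha$. If in some dimension $v_{t,i} = 1.0$, $\rho_2 = 0.99$, and the next gradient in that dimension is 0, then $v_{t+1,i} = 0.99$ and the corresponding entry of $\Gamma_{t+1}$ is $(\sqrt{0.99} - 1)/\alpha \approx -0.005/\alpha < 0$.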

Formally, the authors show that in several settings Adam fails to converge to an optimal solution. In Theorem 1, they show that there is an online convex optimization problem for which Adam has non-zero average regret. This result is strengthened by Theorem 2, which shows that with constant 𝜌₁ and 𝜌₂, neither momentum nor regularization via 𝜀 will help the algorithm converge to an optimal solution. Theorem 3 shows that there is a stochastic convex optimization problem for which Adam does not converge to the optimal solution. The consequence of these results is that we would have to pick 𝜀, 𝜌₁, and 𝜌₂ specifically for each high dimensional problem to avoid bad convergence. This defeats the purpose of the adaptive methods that we have discussed earlier. They also explain why this result only seemingly contradicts the convergence analysis in the Adam paper: that analysis erroneously assumes that 𝛤 is positive semi-definite. To mitigate this problem, Reddi et al. modify the algorithm to satisfy the assumption, thus introducing AMSGrad.

The step-by-step update rule of AMSGrad is quite similar to Adam's. First, we compute the gradient,

$$g_t = \nabla f_t(x_t),$$

then we compute the moment estimates the same way as in Adam,

$$m_t = \rho_1\, m_{t-1} + (1-\rho_1)\, g_t, \qquad v_t = \rho_2\, v_{t-1} + (1-\rho_2)\, g_t^2.$$

Now, the main difference from Adam is in the last two steps,

$$\hat{v}_t = \max(\hat{v}_{t-1}, v_t), \qquad x_{t+1} = x_t - \frac{\alpha_t}{\sqrt{\hat{v}_t}}\, m_t.$$

Notice that we keep the maximum of all 𝑣𝑡 up to the current time step and use it in the update rule, instead of just the current 𝑣𝑡 as in the Adam algorithm. This allows the previously discussed informative gradients to stay relevant instead of being phased out as they are in Adam. It also prevents the step size from increasing, because the denominator is always at least as large as its previous value, avoiding the pitfall of Adam and other exponential moving average methods since the quantity 𝛤 is now positive semi-definite. In this step-by-step update rule we use a constant 𝜌₁, compared to the decreasing 𝜌₁ in Algorithm 2 of the paper that is needed for proving convergence, because in practice we typically use a constant 𝜌₁. We also omit the projection operation for simplicity.
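
As a rough illustration, here is a minimal NumPy sketch of this simplified step-by-step rule (constant 𝜌₁ and 𝜌₂, no bias correction, no projection); the function and variable names are ours, not from the paper, and eps is only added for numerical stability.

import numpy as np

def amsgrad_update(x, m, v, v_hat, grad, alpha=0.001, rho1=0.9, rho2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and squared gradient, as in Adam
    m = rho1 * m + (1 - rho1) * grad
    v = rho2 * v + (1 - rho2) * grad ** 2
    # Long-term memory: element-wise maximum of all v_t seen so far
    v_hat = np.maximum(v_hat, v)
    # Divide by sqrt(v_hat), not sqrt(v), so the effective learning rate never increases
    x = x - alpha * m / (np.sqrt(v_hat) + eps)
    return x, m, v, v_hat

# Toy usage: minimize f(x) = x^2 in one dimension
x, m, v, v_hat = np.array([5.0]), np.zeros(1), np.zeros(1), np.zeros(1)
for t in range(2000):
    grad = 2 * x
    x, m, v, v_hat = amsgrad_update(x, m, v, v_hat, grad, alpha=0.1)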

We can compare AMSGrad to Adam and Adagrad to get more intuition about the update. Suppose that at a particular time step 𝑡 and dimension 𝑖 we have

$$v_{t-1,i} > g_{t,i}^2 > 0.$$

In AMSGrad, $\hat{v}_{t,i}$ is not affected by this, so the learning rate does not change. In contrast, by the Adam update rule,

$$v_{t,i} = \rho_2\, v_{t-1,i} + (1-\rho_2)\, g_{t,i}^2 < v_{t-1,i}.$$

Because the learning rate is inversely proportional to $\sqrt{v_{t,i}}$, the learning rate will increase, and as we have seen, this can be detrimental to the overall performance of the algorithm. Now compare this to Adagrad, where

$$v_{t,i} = v_{t-1,i} + g_{t,i}^2,$$

which slightly decreases the learning rate. This often leads to poor performance in practice, because over a long period of time it can decrease the learning rate significantly.
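
As a concrete, made-up illustration, take $v_{t-1,i} = 1.0$, $g_{t,i}^2 = 0.01$, and $\rho_2 = 0.99$. Adam gives $v_{t,i} = 0.99 \times 1.0 + 0.01 \times 0.01 = 0.9901 < 1.0$, so its learning rate goes up; Adagrad gives $v_{t,i} = 1.0 + 0.01 = 1.01$, so its learning rate goes down slightly; and AMSGrad keeps $\hat{v}_{t,i} = \max(\hat{v}_{t-1,i}, 0.9901) = \hat{v}_{t-1,i}$, so its learning rate stays the same.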

In this paper, the authors carefully construct examples where Adam fails to converge to an optimal solution. However, they argue that it is not unrealistic to imagine real-world scenarios where such an issue can, at the very least, slow down convergence. They also provide another algorithm, called AdamNC, that can achieve good convergence rates. This algorithm modifies the update rule to use an increasing schedule of 𝜌₂ and a decreasing schedule of 𝜌₁. The difference can be seen in the moment estimation phase of the algorithm.
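
For example, one schedule analyzed in the paper is $\rho_{2,t} = 1 - 1/t$ (together with a decaying $\rho_{1,t}$), which turns the second moment estimate into

$$v_t = \left(1 - \frac{1}{t}\right) v_{t-1} + \frac{1}{t}\, g_t^2 = \frac{1}{t} \sum_{i=1}^{t} g_i^2,$$

i.e., an Adagrad-style average over all past squared gradients, so no gradient is ever completely phased out.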

This modification relies on an assumption about the selection of the learning rate and 𝜌₂ that makes the quantity 𝛤 positive semi-definite. The authors argue that the result can be generalized to handle cases where this assumption is violated, as long as the violation is not too large or too frequent.

References:

  1. S. J. Reddi, S. Kale, S. Kumar, “On the Convergence of Adam and Beyond” (2018)
  2. D. P. Kingma, J. Ba, “Adam: A Method for Stochastic Optimization” (2014)
