Continuing on Adaptive Method: ADADELTA and RMSProp

Roan Gylberth
Konvergen.AI
May 4, 2018


In our last post, we discussed the difficulty of setting the learning rate hyper-parameter, which can be mitigated by using the Adagrad algorithm. However, Adagrad has a serious flaw: its effective learning rate decreases very quickly, because it accumulates the squared gradients from the very beginning of training. This is a plus when applying Adagrad to convex functions, since it allows the algorithm to converge rapidly. On non-convex functions, however, Adagrad may shrink the effective learning rate too fast, so the iterate gets stuck before it reaches a good locally convex structure and ends up in a poor local minimum.

Another problem with Adagrad is that if the initial gradients are large, the effective learning rate will be low for the rest of training. This can be mitigated by setting the learning rate hyper-parameter large enough that the initial gradients do not affect the effective learning rate too much, but that makes Adagrad sensitive to the choice of learning rate, which defeats the purpose of the algorithm.

One algorithm created to combat these problems is Adadelta, which requires no manual learning rate setting and is insensitive to its hyper-parameters. The algorithm is built on two main ideas: accumulating the squared gradients over a moving window, and correcting the units of the parameter updates with a Hessian approximation. The first idea is quite intuitive, since we know the problem lies in accumulating the squared gradients from the very beginning of training, which lets the accumulation grow without bound. If we accumulate only a portion of the previous squared gradients, the accumulation cannot become too big and training continues to make progress even after many update iterations.

The naive way to do the windowed accumulation of squared gradients is simply to store the last w squared gradients. However, storing and updating w previous squared gradients is not efficient, especially when the number of parameters is very large, which in deep learning can reach millions. Instead, the author of Adadelta implements the accumulation as an exponentially decaying average of the squared gradients, denoted 𝔼[g²]. This local accumulation at timestep 𝑡 is computed by
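
𝔼[g²]𝑡 = ρ 𝔼[g²]𝑡₋₁ + (1 − ρ) g𝑡²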

where ρ is the decaying constant, and g𝑡 is the gradient at timestep 𝑡, computed in the same way as in the other algorithms, i.e.,
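
g𝑡 = ∇θ 𝑓(θ𝑡)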

The decaying constant has the same effect as α in the Momentum method: it controls how much the previously accumulated squared gradients affect the effective learning rate. The decaying constant can also be viewed as a hyper-parameter that controls the size of the moving window.

Now that we have the locally accumulated squared gradients, we take its square root, just as in the Adagrad algorithm, and denote it RMS[g], which at timestep 𝑡 is computed by
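
RMS[g]𝑡 = √(𝔼[g²]𝑡 + ε)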

where ε is the same as in Adagrad: a small value added to avoid division by zero, since RMS[g] ends up in the denominator of the parameter update. We then replace the accumulated squared gradients in the Adagrad update rule with RMS[g], yielding
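
θ𝑡₊₁ = θ𝑡 − (η / RMS[g]𝑡) g𝑡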

Surprisingly, if we stop here and use this as the update rule, we get another algorithm called RMSProp, which is also popular in deep learning. However, with this algorithm we still need to choose the learning rate (η), and the unit mismatch still hasn't been solved, which leads us to the second idea, i.e., correcting the units of the parameter updates with a Hessian approximation.
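
To make this concrete, here is a rough sketch of a single RMSProp step in NumPy; the names rho, eta, and eps are illustrative stand-ins for ρ, η, and ε above, not the API of any particular library.

    import numpy as np

    def rmsprop_step(theta, grad, acc_sq_grad, eta=0.001, rho=0.9, eps=1e-8):
        # E[g²]_t = ρ E[g²]_{t-1} + (1 − ρ) g_t²
        acc_sq_grad = rho * acc_sq_grad + (1 - rho) * grad ** 2
        # θ_{t+1} = θ_t − (η / RMS[g]_t) g_t
        theta = theta - eta * grad / np.sqrt(acc_sq_grad + eps)
        return theta, acc_sq_grad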

To have a clearer view of the unit mismatch, we can separate an update rule into two parts. For example, with the rule above, we first compute the update parameter (Δθ)
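
Δθ𝑡 = − (η / RMS[g]𝑡) g𝑡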

then apply the update parameter to the previous parameter
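
θ𝑡₊₁ = θ𝑡 + Δθ𝑡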

Do note that the update parameter Δθ varies depending on the algorithm. For example, in SGD the update parameter is
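
Δθ𝑡 = − η g𝑡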

If the parameters θ have some hypothetical units, then the change in the parameters, Δθ, should have the same units too. This is not the case for any of the algorithms we have discussed so far, i.e., SGD, Momentum, Nesterov, and Adagrad. In SGD, for instance, the units of Δθ come from the gradient, which (assuming 𝑓 is unitless) is proportional to the inverse of the units of θ:
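
Δθ ∝ g ∝ ∂𝑓/∂θ ∝ 1 / (units of θ)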

To handle this issue, we can look at second-order methods, e.g., Newton's method, which uses the Hessian (or an approximation of it). In Newton's method, the update parameter is
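
Δθ𝑡 = − H𝑡⁻¹ g𝑡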

Since the Hessian is computed from the second derivatives of 𝑓 with respect to θ, we can see that the change in the parameters now has the same units as the parameters:
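
Δθ ∝ H⁻¹ g ∝ (∂𝑓/∂θ) / (∂²𝑓/∂θ²) ∝ units of θ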

To get the same units into our update, we can rearrange the Newton's method update parameter (assuming a diagonal Hessian, so that it can be written element-wise) from
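
Δθ = (∂𝑓/∂θ) / (∂²𝑓/∂θ²)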

to
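
1 / (∂²𝑓/∂θ²) = Δθ / (∂𝑓/∂θ)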

Then we substitute this expression for H⁻¹ into the Newton's method update parameter, yielding
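
Δθ𝑡 = − (Δθ𝑡 / (∂𝑓/∂θ𝑡)) g𝑡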

We already know the square root of the windowed accumulation of the gradients, which is RMS[g], so we only need to compute the square root of the windowed accumulation of Δθ. This is done in the same way as for g. First, we find the local accumulation 𝔼[Δθ²] at timestep 𝑡,
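
𝔼[Δθ²]𝑡 = ρ 𝔼[Δθ²]𝑡₋₁ + (1 − ρ) Δθ𝑡²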

Then we compute the square root of it
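
RMS[Δθ]𝑡 = √(𝔼[Δθ²]𝑡 + ε)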

Now, plugging RMS[g] and RMS[Δθ] into the update parameter above yields the update rule
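
Δθ𝑡 = − (RMS[Δθ]𝑡 / RMS[g]𝑡) g𝑡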

Unfortunately, the quantity Δθ𝑡 is not yet known when we need it, so we assume that the curvature is locally smooth and approximate RMS[Δθ]𝑡 with the RMS of the windowed accumulation up to the last update, i.e., RMS[Δθ]𝑡₋₁. So, the update parameter of Adadelta is
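
Δθ𝑡 = − (RMS[Δθ]𝑡₋₁ / RMS[g]𝑡) g𝑡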

and the final update rule of Adadelta becomes
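
θ𝑡₊₁ = θ𝑡 − (RMS[Δθ]𝑡₋₁ / RMS[g]𝑡) g𝑡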

This final update rule is much cheaper to compute than a second-order method, since it avoids inverting the Hessian, which has O(n³) complexity (assuming Gaussian elimination) at every iteration. The numerator (RMS[Δθ]) acts as an acceleration term, accumulating the previous updates over a window of time, similar to the Momentum method. The denominator (RMS[g]) is the Adagrad-like term that helps even out the update in each dimension, but without the problem of diminishing updates, since the squared gradients are accumulated over a moving window.
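
Putting both accumulators together, a single Adadelta step can be sketched in NumPy as follows (again with illustrative names, not any library's API; note that no learning rate appears anywhere):

    import numpy as np

    def adadelta_step(theta, grad, acc_sq_grad, acc_sq_update, rho=0.95, eps=1e-6):
        # E[g²]_t: decaying average of the squared gradients
        acc_sq_grad = rho * acc_sq_grad + (1 - rho) * grad ** 2
        # Δθ_t = −(RMS[Δθ]_{t-1} / RMS[g]_t) g_t
        delta = -np.sqrt(acc_sq_update + eps) / np.sqrt(acc_sq_grad + eps) * grad
        # E[Δθ²]_t: decaying average of the squared updates, used at the next step
        acc_sq_update = rho * acc_sq_update + (1 - rho) * delta ** 2
        # θ_{t+1} = θ_t + Δθ_t
        theta = theta + delta
        return theta, acc_sq_grad, acc_sq_update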

References

  1. M. D. Zeiler, “Adadelta: An Adaptive Learning Rate Method” (2012)
  2. G. Hinton, N. Srivastava, K. Swersky, "Lecture 6a: Overview of mini-batch gradient descent" (2012)
  3. I. Goodfellow, Y. Bengio, A. Courville, “Deep Learning” (2016)
