Adam: The Birthchild of AdaGrad and RMSProp

Kaivalya Tota
8 min read · Apr 22, 2020


Mat Ruiz and Kaivalya Tota

Adam is an optimization algorithm that has been rising in popularity in the deep learning field because of its ability to achieve good results quickly and effectively, serving as an alternative to stochastic gradient descent. Standing for Adaptive Moment Estimation, Adam computes individual adaptive learning rates for different parameters based on estimates of the first and second moments of the gradients. Adam was presented in 2015 by Diederik Kingma and Jimmy Ba, who described it as a combination of the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp) with Momentum. Like RMSProp, Adam uses squared gradients to scale the learning rate, and like Momentum, it uses a moving average of the gradient rather than the gradient itself.

Always keep in mind that Adam continuously adapts the learning rate for each weight of the neural network! Now let’s dive into the workings of this algorithm dubbed gradient descent on “steroids”!

Refresher

Okay, so before we get into more detail with RMSProp and Momentum, let’s remind ourselves of the general goal of gradient descent.

Gradient descent is the process by which we take downhill steps in hopes of finding the global minimum of a function. A minimum of a function is anywhere the gradient (slope) is equal to 0, or horizontally flat. Remember that in our case, we are trying to find the absolute minimum of our algorithm’s loss function; that is, we want to reach the global minimum of the loss function so that our ML algorithm has as little error (loss) as possible.
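In symbols, one step of plain gradient descent looks like this (a quick sketch, where w holds the weights, L is the loss function, and η is the learning rate):

    w_{t+1} = w_t - \eta \nabla L(w_t)

Everything that follows (Momentum, RMSProp, Adam) is a tweak to the size and direction of that step.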

So what does this process look like? Imagine a ball rolling down a series of hills as shown below:

The ball rolling is our model in the process of “training” and finding the weights and biases that give us minimum error. Minimum error is achieved when the ball settles at the global minimum.

But wait! Gradient descent itself isn’t always perfect. Recall that one of gradient descent’s biggest problems is accidentally getting stuck in local minima, where our loss can still be HUGE:

This is where Momentum comes into play.

Momentum

Momentum adds onto gradient descent by considering previous gradients (the slope of the hill before the ball’s current position). So in the previous case, instead of stopping when the gradient is 0 at the first local minimum, momentum will continue to move the ball forward because it takes into consideration how steep the slope leading up to it was. It works just like you think it would:

Momentum is all about speeding up and smoothing the process of gradient descent. Notice how the ball “speeds up” after steeper slopes. That’s momentum taking into consideration previous steep gradients and convincing itself to continue moving, regardless of the local minimum. Momentum is a good way to prevent getting stuck in local minima.

Now, to get more mathy, let’s look at these two equations to get a better understanding of momentum. Since momentum constantly considers previous gradients, we can say that momentum calculates moving averages. That is, it takes a weighted average of all past gradients. It’s called “moving” because the weighting shifts at every step: it puts more value on recent gradients (the more immediately previous slopes) than on gradients from long ago (a.k.a. the starting point atop the hill).
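Written out in the standard form from the Adam paper (with g_t standing for the gradient at timestep t), the two moving averages are:

    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2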

In this case, m and v are both calculating moving averages for momentum, where β1 and β2 are “decaying” hyperparameters (numbers) that YOU get to decide. However, experiments done by the ML pros show that setting β1 to 0.9 and β2 to 0.999 provides the best results. Notice how both equations consider the previous timestep (m_t-1 and v_t-1); that’s momentum taking past gradients g into account. We can visualize this by expanding the m equation over several timesteps (epochs) t, going all the way to t = 3:
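Assuming the average starts at m_0 = 0, a quick sketch of that expansion is:

    m_1 = (1 - \beta_1) g_1
    m_2 = \beta_1 m_1 + (1 - \beta_1) g_2 = (1 - \beta_1)(\beta_1 g_1 + g_2)
    m_3 = \beta_1 m_2 + (1 - \beta_1) g_3 = (1 - \beta_1)(\beta_1^2 g_1 + \beta_1 g_2 + g_3)

Each older gradient picks up one more factor of β1, which is exactly the “decay” described next.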

Notice how each iteration of m considers all past gradients up to time t. And with β1 being 0.9, each successive power β1^n (as n grows) “decays” the value/importance of an aging gradient (e.g. 0.9 * 0.9 = 0.81).

So momentum fixes all of gradient descent’s problems, right?! What more can we do? Let’s now consider RMSProp.

RMSProp (derived from AdaGrad)

Recall our vanilla version of gradient descent:

Notice how the ball oscillates in the valley of the global min for a painstakingly long time. Annoying, right? RMSProp helps us with this. RMSProp is derived from AdaGrad, a stochastic gradient descent algorithm that has a different learning rate for each of its parameters (variables). The major difference between RMSProp and AdaGrad is that RMSProp tracks the squared gradients with an exponentially decaying average instead of accumulating their sum. In a sense, RMSProp basically “slows” down the ball near the global minimum by adjusting and adapting the learning rate accordingly. So how exactly does it do this?
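For reference, here’s a rough sketch of AdaGrad’s usual form, where every squared gradient gets added into the denominator of the step:

    G_t = G_{t-1} + g_t^2
    w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t} + \epsilon} g_t

Because G_t only ever grows, AdaGrad’s steps keep shrinking. RMSProp swaps that running sum for a decaying average, which we’ll write out in a moment. But first, the intuition.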

Let’s start by looking at this picture of climbers descending Mount Everest.

Photo of climbers descending Mount Everest. PC: Extreme Summit Team

In the case of descending Mount Everest, our function is made up of variables x and y. X is going forward or backwards along the ridge. Y is either going left or right (off the cliff!).

If our machine learning model is one of these climbers, then we want to AVOID going left or right along the Y axis to prevent a fateful death (our ML model exploding). So we adapt and set our learning rate (step size) αy for the Y variable very low. On the other hand, we want to set the learning rate αx for X to a reasonable number, so that we keep making progress along the ridge towards the global minimum of Mount Everest. We don’t want to take too big a step in the wrong direction, or else our algorithm will explode! (This is known as exploding gradients.)

This is what RMSProp is all about. It automatically adapts a separate learning rate for each variable so the model doesn’t step off the edge or bounce all over the place:

Notice how, once the ball gets near the global minimum, instead of oscillating all over the place, its learning rate is adjusted and it calmly settles at the global min. In more technical terms, if the gradient at time t would make the ball oscillate and bounce a ton, then its learning rate is scaled down accordingly. RMSProp helps gradient descent settle at the global minimum quickly, without thinking twice.

And just like momentum, RMSProp keeps a moving average, this time of the squares of past gradients, and uses that moving average to help update the weights.
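In equation form (a rough sketch in the same notation as before, with β2 as the decay rate and g_t = dE/dW the gradient), the two updates look like:

    m_t = \beta_2 m_{t-1} + (1 - \beta_2) g_t^2
    w_{t+1} = w_t - \frac{\eta}{\sqrt{m_t} + \epsilon} g_t

The square root of the moving average sits in the denominator, so a weight whose recent gradients have been large automatically takes a smaller step.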

Just like with momentum, m describes a moving average, β2 is a decaying hyperparameter, g represents the calculated gradient at time t, and dE/dW is the derivative of the error with respect to the weights. The second equation uses w_t to update our algorithm’s weights in preparation for the next time step (epoch). The funny-looking η (eta) serves as our learning rate, and the funny-looking ε (epsilon) is a teeny-tiny number (10^-8) that helps us avoid dividing by 0 if m happens to be 0 (keeps our computer happy :)).

Now that we know two new things:

  1. Momentum helps skip local minima and speeds up gradient descent
  2. RMSProp lets us snuggle into the global minimum quickly

What now?

Adam

Now for our guest celebrity performer tonight, welcome Adam! Adam is the best of both worlds; he’s the combination of Momentum and RMSProp. He’s fast, and is quick to settle down. He doesn’t dwell on his losses; instead, he gets over them quickly and keeps moving until he knows it’s finally time to settle down:

What a sight. It’s what gradient descent has been waiting for all this time.

So how does Adam do this? We know Adam combines Momentum and RMSProp, but how exactly? Let’s look at the algorithm; you should notice some familiar equations:

Let’s break it down (a code sketch of these same steps follows the list):

  • α (alpha): The learning rate/step size, i.e., the proportion by which the weights are updated.
  • β1: The exponential decay rate for the first moment estimates.
  • β2: The exponential decay rate for the second moment estimates.
  • Within the while-loop:
    • Computation of the gradient. (Line 2)
    • Computation of the running average of the gradient. (Line 3)
    • Computation of the running average of the squared gradient. (Line 4)
    • Correcting the bias for the two moments. Why do we have to do this? Since the moments were initialized at 0, they are biased towards 0. (Lines 5 and 6)
    • Lastly, the parameters are updated! (Line 7)
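And here’s a minimal NumPy sketch of that loop. The hyperparameter defaults are the values suggested in the paper; grad_fn (a function returning the gradient dE/dW at the current weights) and the starting weights w are hypothetical placeholders you’d supply yourself.

    import numpy as np

    def adam(grad_fn, w, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
        # grad_fn and w are placeholders: your gradient function and initial weights
        m = np.zeros_like(w)  # first moment: moving average of the gradient
        v = np.zeros_like(w)  # second moment: moving average of the squared gradient
        for t in range(1, steps + 1):
            g = grad_fn(w)                                  # Line 2: compute the gradient
            m = beta1 * m + (1 - beta1) * g                 # Line 3: running average of the gradient
            v = beta2 * v + (1 - beta2) * g ** 2            # Line 4: running average of the squared gradient
            m_hat = m / (1 - beta1 ** t)                    # Line 5: bias-correct the first moment
            v_hat = v / (1 - beta2 ** t)                    # Line 6: bias-correct the second moment
            w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)  # Line 7: update the parameters
        return w

Notice that Lines 3 and 4 are exactly the Momentum and RMSProp moving averages from earlier; the only genuinely new ingredient is the bias correction in Lines 5 and 6.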

Conclusion

Adam is a gradient-based optimization algorithm that makes use of the stochastic gradient descent extensions AdaGrad and RMSProp to deal with machine learning problems involving large datasets and high-dimensional parameter spaces. By combining the advantages of these extensions, Adam avoids issues such as AdaGrad’s drastically diminishing learning rate, which comes from the ever-growing sum of squared gradients in its denominator. Additionally, authors Kingma and Ba have identified the benefits of Adam as:

  • Straightforward to implement
  • Computationally efficient
  • Little memory requirements
  • Appropriate for problems with sparse gradients (ability of AdaGrad)
  • Appropriate for non-stationary objectives (ability of RMSProp)

Although Adam offers effective and quick optimization, it does have its drawbacks! For example, it does not always converge to an optimal solution, in which case switching to stochastic gradient descent with momentum might be better. Nonetheless, Adam is one of the best optimization algorithms in deep learning and continues to grow in popularity!

You can also find Kingma and Ba’s original proposal here! For more machine learning and deep learning resources, check out UCLA ACM AI’s blog!
