Less Wright
Aug 15 · 5 min read

A new paper by Liu, Jiang, He et al. introduces RAdam, or “Rectified Adam”. It’s a new variation of the classic Adam optimizer that provides an automated, dynamic adjustment to the adaptive learning rate, based on their detailed study of the effects of variance and momentum during training. As a result, RAdam holds the promise of an immediate improvement over vanilla Adam for nearly every AI architecture:

RAdam is robust to various learning rates while still converging rapidly and achieving greater accuracy (CIFAR dataset)

I have tested RAdam myself inside the FastAI framework, and quickly achieved new high accuracy records versus two of the hard-to-beat FastAI leaderboard scores on Imagenette. Unlike many papers I have tested this year, where things only seem to work well on the specific datasets used in the paper and not so well on new datasets, RAdam appears to be a true improvement and, in my opinion, the likely permanent successor to vanilla Adam.

RAdam and XResNet50, 86% in 5 epochs
Imagenette Leaderboard — current high = 84.6%

Thus, let’s delve into RAdam and understand what it does internally, and why it holds the promise of delivering improved convergence, better training stability (much less sensitive to chosen learning rates) and better accuracy and generalization for nearly all AI applications.

Not just for CNNs: RAdam outperforming with language modeling LSTM on Billion Word Dataset

The goal for all AI researchers — a Fast and Stable Optimization algorithm…

The authors note that while everyone is working towards the goal of fast and stable optimization algorithms, adaptive learning rate optimizers such as Adam and RMSProp all run the risk of converging into poor local optima if no warmup method is used. Thus, nearly everyone uses some form of warmup (FastAI has a built-in warmup in its fit_one_cycle)… but why is a warmup needed?

Because the AI community currently has only a limited understanding of why the warmup heuristic works, or even of best practices for applying it, the authors sought to uncover the underpinnings of the issue. They find that the root cause is that adaptive learning rate optimizers have too large a variance, especially in the early stages of training, and make excessive jumps based on limited training data… and can thus settle into poor local optima.

Hence, warmup (an initial period of training with a much lower learning rate) is a requirement for adaptive optimizers to offset excessive variance when the optimizer has only worked with limited training data.
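To make that concrete, here is roughly what a manual warmup looks like in plain PyTorch. This is a minimal sketch: the linear ramp shape and the 1,000-step warmup length are illustrative choices on my part, not values from the paper.

```python
import torch

model = torch.nn.Linear(10, 2)                       # stand-in model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 1000                                  # has to be guessed per dataset/architecture

# Linearly ramp the learning rate from near zero up to its full value over
# warmup_steps, then hold it constant. This is the manual heuristic that
# RAdam aims to make unnecessary.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step in range(2000):                             # training loop sketch with a dummy loss
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```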

Here is a visual to show what happens initially to Adam with no warmup — within 10 iterations, the gradient distribution is rapidly perturbed:

Note how the initial normal distribution is rapidly distorted without a warmup…

In short, vanilla Adam and other adaptive learning rate optimizers make bad decisions based on too little data early on in training. Thus, without some form of warmup, they are likely to fall into bad local optima at the outset, making training longer and harder due to the bad start.

The authors then tested running Adam with no warmup, but using the first 2000 iterations only to estimate the adaptive learning rate, with no parameter updates (Adam-2k). They found that this achieved results similar to Adam plus warmup, verifying that warmup functions as a ‘variance reducer’ during initial training and keeps Adam from jumping into bad optima at the start, when it does not yet have enough data to work with.

The ‘rectifier’ in RAdam:

Given that warmup serves as a variance reducer, but the amount of warmup required is unknown and varies from dataset to dataset, the authors then set out to determine a mathematical algorithm that could serve as a dynamic variance reducer. They built a rectifier term that allows the adaptive momentum to slowly but steadily work its way up to full expression as a function of the underlying variance. Their full model is this:

RAdam at its core. The blue box highlights the final application of the rectifier r(t) to the step size.

The authors note that in some cases RAdam can degenerate to the equivalent of SGD with momentum, driven by the decay rate and the underlying variance.
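To make the rectifier concrete, here is a minimal sketch of a single RAdam update following the formulas in the paper. The variable names and default hyperparameter values are my own choices for illustration; this is not the authors’ official code.

```python
import math
import torch

def radam_step(p, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One RAdam update for parameter tensor p; `step` is the 1-based iteration count."""
    # Standard Adam moving averages of the gradient and squared gradient.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    m_hat = m / (1 - beta1 ** step)                  # bias-corrected momentum

    # Length of the approximated SMA of the adaptive learning rate.
    rho_inf = 2 / (1 - beta2) - 1
    rho_t = rho_inf - 2 * step * beta2 ** step / (1 - beta2 ** step)

    if rho_t > 4:
        # Variance of the adaptive lr is tractable: apply it, scaled by the rectifier r_t.
        v_hat = (v / (1 - beta2 ** step)).sqrt()
        r_t = math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf) /
                        ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        p.add_(m_hat / (v_hat + eps), alpha=-lr * r_t)
    else:
        # Early steps: the variance is too large, so fall back to plain SGD with momentum.
        p.add_(m_hat, alpha=-lr)
    return p, m, v
```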

The summary, though, is that RAdam dynamically turns the adaptive learning rate on or off depending on the divergence of the underlying variance. In effect, it provides a dynamic warmup with no tunable parameters needed.

The authors confirm that RAdam outperforms traditional manual warmup tuning, where the number of warmup steps required has to be surmised or guessed at:

RAdam automatically provides variance reduction, outperforming manual warmup under a variety of warmup lengths and learning rates.

Summary: RAdam is arguably the new state of the art optimizer for AI

As you can see, RAdam provides a dynamic heuristic for automated variance reduction, and thus removes the need for, and the manual tuning involved with, a warmup during training.

In addition, RAdam is shown to be more robust to learning rate variations (the most important hyperparameter) and provides better training accuracy and generalization on a variety of datasets and within a variety of AI architectures.

In short, I’d highly recommend you drop RAdam into your AI architecture and see if you don’t get an immediate benefit. I’d offer a money back guarantee but since the cost for it is $0.00… :)

RAdam is available for PyTorch at the authors’ official GitHub here.

FastAI users can easily plug RAdam in as follows:

Import RAdam, declare as a partial function (add params as desired)
Override the default AdamW optimizer with RAdam via the partial declared earlier, as in the sketch below.
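Something along these lines works with the fastai v1 API. It’s a minimal sketch: it assumes the radam.py file from the repo is on your path, and `data` / `model` stand in for your own DataBunch and architecture.

```python
from functools import partial
from fastai.vision import *       # fastai v1: Learner, accuracy, fit_one_cycle, ...
from radam import RAdam           # assumes radam.py from the RAdam repo is on your path

# Step 1: declare RAdam as a partial function (betas/eps here are example values, not tuned).
opt_func = partial(RAdam, betas=(0.95, 0.999), eps=1e-5)

# Step 2: pass opt_func to the Learner to override fastai's default AdamW optimizer.
# `data` is the DataBunch you already train with (e.g. built from Imagenette),
# and `model` is your architecture (I used an xresnet50 for the results above).
learn = Learner(data, model, opt_func=opt_func, metrics=accuracy)
learn.fit_one_cycle(5, max_lr=1e-3)
```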

The full paper is linked below and goes into greater mathematical detail than I have covered here. However, this is one case where I’d recommend skipping the paper and using the time to simply drop RAdam in and test it out, as I think the odds of an immediate improvement in your models are very high.

“On the Variance of the Adaptive Learning Rate and Beyond” (link to RAdam paper)

It’s not often I get to read a paper and test it out and see immediate improvements, so I’d greatly encourage you to put RAdam to use and please leave any comments about your results and/or questions below!

Update — I find that by combining RAdam + LookAhead (Zhang, Lucas, Hinton, and Ba, July 2019), an even better optimizer (Ranger) is created, as each addresses a different aspect of training optimization: RAdam stabilizes early training, while LookAhead stabilizes later training, exploration, and convergence.

Details here: https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d

