New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam + LookAhead for the best of both.
The Ranger optimizer combines two very recent developments (RAdam + LookAhead) into a single optimizer for deep learning. As proof of its efficacy, our team recently used the Ranger optimizer to capture 12 leaderboard records on the FastAI global leaderboards (details here).
Lookahead, one half of the Ranger optimizer, was introduced in a new paper co-authored by the famed deep learning researcher Geoffrey Hinton (“Lookahead Optimizer: k steps forward, 1 step back”, July 2019). Lookahead was inspired by recent advances in the understanding of neural network loss surfaces and presents a whole new way of stabilizing deep learning training and improving speed of convergence. Building on the breakthrough in variance management for deep learning achieved by RAdam (Rectified Adam), I find that combining RAdam and LookAhead (Ranger) produces a dynamic dream team and an even better optimizer than RAdam alone.
The Ranger optimizer is a single codebase for ease of use and efficiency (load/save and one loop handling for all param updates) and for integration into FastAI. The Ranger source code is available for your immediate use: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
Why RAdam and LookAhead are complementary:
RAdam arguably provides the best base for an optimizer to build on at the start of training. RAdam uses a dynamic rectifier to adjust the adaptive learning rate of Adam based on its variance, which effectively provides an automated warm-up tailored to the current dataset and ensures a solid start to training.
LookAhead was inspired by recent advances in the understanding of loss surfaces of deep neural networks, and provides a breakthrough in robust and stable exploration during the entirety of training.
To quote the LookAhead team — LookAhead “lessens the need for extensive hyperparameter tuning” while achieving “faster convergence across different deep learning tasks with minimal computational overhead”.
Hence, both provide breakthroughs in different aspects of deep learning optimization, and the combination is highly synergistic, possibly providing the best of both improvements for your deep learning results. The quest for ever more stable and robust optimization methods thus continues, and by combining two of the latest breakthroughs (RAdam + LookAhead), Ranger hopefully provides another step forward for deep learning.
Zhang, Lucas, Hinton and Ba: “We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.”
This article thus builds on the previous RAdam introduction to explain what LookAhead is, and how, by combining both RAdam and LookAhead into a single optimizer, Ranger, new high accuracy was obtained. On literally the first 20-epoch run I tested, I achieved the highest accuracy I’ve personally seen, 1% above the current FastAI leaderboard mark.
More importantly, the source and usage info is available for anyone to put Ranger to work and see if it doesn’t improve your deep learning results in terms of both stability and accuracy!
Let’s thus delve into the two components that drive Ranger — RAdam and LookAhead:
1 — What is RAdam (Rectified Adam):
I’ll refer you to my previous article that goes into much more detail on RAdam here. However, the short summary is that the researchers who developed RAdam investigated why adaptive learning rate optimizers (Adam, RMSProp, etc.) all require a warmup, or else they tend to shoot into bad/suspect local optima near the start of training.
The cause was found to be excessive variance at the start of training, when the optimizer simply has not seen enough data to make accurate adaptive learning rate decisions. Warmup thus serves to reduce variance at the start of training, but even deciding how much warmup is enough requires hand tuning and varies from dataset to dataset.
Thus, Rectified Adam was built by determining a ‘warmup heuristic’ using a rectifier function that is based on the actual variance encountered. The rectifier dynamically turns off and/or ‘tamps down’ the adaptive learning rate so that it’s not jumping at full speed until the variance of the data settles down.
By doing so, the need for a manual warmup is avoided and training is stabilized automatically.
Once the variance has settled down, RAdam basically becomes equivalent to Adam or even SGD for the rest of training. Thus, RAdam’s contribution is at the start of training.
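To make the rectifier concrete, here is a minimal sketch of the rectification term as I understand it from the RAdam paper. The function name radam_rectifier is my own, the beta2 default of 0.999 is simply Adam's usual value, and the cutoff of 4 follows the paper (implementations may use a slightly different threshold). It is an illustration of the idea, not the official code.

```python
import math


def radam_rectifier(t, beta2=0.999):
    """Return the rectification factor r_t for step t (1-based).

    Returns None while the variance of the adaptive learning rate is
    treated as intractable; RAdam then skips the adaptive term and
    takes a plain momentum step instead.
    """
    # Maximum length of the approximated simple moving average.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    # Length of the approximated simple moving average at step t.
    rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:  # assumed cutoff, taken from the paper
        return None
    numerator = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    denominator = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(numerator / denominator)


# The factor is absent for the first few steps, small early on,
# and approaches 1 (plain Adam behaviour) as training proceeds.
for step in (1, 3, 10, 1000, 100000):
    print(step, radam_rectifier(step))
```

The factor scales the adaptive step size, so a value far below 1 early in training acts like an automatic warmup, and a value near 1 later on leaves the Adam update essentially untouched.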
Readers noted that in the results section, while RAdam outperforms Adam, over the very long run SGD can eventually catch up and surpass the final accuracy of both RAdam and Adam.
That’s where we now turn to LookAhead, to integrate a new exploration mechanism that can outperform SGD even after 1000 epochs.
2 — Lookahead — the buddy system for exploring the loss surface = faster, more stable exploration and convergence.
As the researchers for LookAhead note, currently, most successful optimizers build on an SGD base by adding either
1 — adaptive learning rates (Adam, AdaGrad) or
2 — a form of acceleration (Nesterov momentum or Polyak Heavy Ball)
to improve the exploration and training process, and ultimately convergence.
LookAhead, however, is a new development that maintains two sets of weights and then interpolates between them. In effect, it allows a faster set of weights to ‘look ahead’ or explore while the slower weights stay behind to provide longer-term stability.
The result is reduced variance during training, much less sensitivity to sub-optimal hyper-parameters, and a reduced need for extensive hyper-parameter tuning. This is done while achieving faster convergence on a variety of deep learning tasks. In other words, it’s an impressive breakthrough.
By way of simple analogy, LookAhead can be thought of as follows. Imagine you are at the top of a mountain range, with various drop-offs all around. One of them leads to the bottom and success, but the others are simply crevasses with no good ending.
To explore by yourself would be hard because you’d have to drop down each one, and assuming it was a dead end, find your way back out.
But, if you had a buddy who would stay at or near the top and help pull you back up if things didn’t look good, you’d probably make a lot more progress towards finding the best way down because exploring the full terrain would proceed much more quickly and with far less likelihood of being stuck in a bad crevasse.
That’s basically what LookAhead does. It keeps a single extra copy of the weights, then lets the internalized ‘faster’ optimizer (for Ranger, that’s RAdam) explore for 5 or 6 batches. The batch interval is specified via the k parameter.
Once the k interval is hit, LookAhead takes the difference between its saved weights and RAdam’s latest weights, multiplies it by an alpha parameter (0.5 by default), and adds the result to its saved weights. RAdam’s weights are then reset to these updated weights, and the cycle repeats every k batches.
The result is in effect a fast moving average from the internal optimizer (in this case RAdam) and a slower exponential moving average via LookAhead. The fast one explores but the slow one serves as the pull-back or stability mechanism — normally staying behind while the faster average explores, but in some cases pushing the faster one down a more promising slope while the faster one continues to explore.
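The update cycle described above can be sketched in a few lines. This is a toy illustration under stated assumptions: plain SGD stands in for RAdam as the fast optimizer, the loss is a simple quadratic with a known minimum, and the names quadratic_grad and lookahead_sgd are hypothetical helpers, not part of Ranger.

```python
def quadratic_grad(weights, target=(3.0, -2.0)):
    """Gradient of the toy loss sum((w - target)^2)."""
    return [2.0 * (w - t) for w, t in zip(weights, target)]


def lookahead_sgd(grad_fn, w0, lr=0.1, k=5, alpha=0.5, steps=100):
    """Run plain SGD as the fast optimizer inside a LookAhead loop."""
    slow = list(w0)  # LookAhead's saved copy of the weights
    fast = list(w0)  # weights the inner optimizer updates every batch
    for step in range(1, steps + 1):
        grads = grad_fn(fast)
        fast = [w - lr * g for w, g in zip(fast, grads)]
        if step % k == 0:
            # Move the slow weights a fraction alpha toward the fast
            # weights, then restart the fast weights from that point.
            slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
            fast = list(slow)
    return slow


final_weights = lookahead_sgd(quadratic_grad, [0.0, 0.0])
print(final_weights)  # ends up close to the minimum at (3, -2)
```

With alpha at 0.5 the slow weights move halfway toward wherever the fast weights ended up after k batches, which is the 'pull-back' behaviour: a bad excursion by the fast weights is only half adopted, while consistent progress still accumulates.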
By virtue of having the safety of LookAhead, the optimizer can more fully explore the landscape without as much concern over becoming stuck.
This approach is completely different from the two main methods currently in use: adaptive learning rates or ‘heavy ball’/Nesterov type momentum.
Thus, LookAhead ends up out-exploring and finding ‘the way down’ faster and more robustly by enhancing training stability, thus outperforming even SGD.
3 — Ranger — an integrated codebase for one optimizer using RAdam and LookAhead together.
Lookahead can be run with any optimizer for the ‘fast’ weights — the paper used vanilla Adam, as RAdam was not even available a month ago.
However, for ease of code integration with FastAI and general simplicity of usage, I went ahead and merged both into one single optimizer, named Ranger (the RA in Ranger for homage to Rectified Adam, and Ranger as a name overall since LookAhead is outstanding at exploring the loss terrain, just like a real Ranger.)
4 — Put Ranger to use today!
There are several implementations of LookAhead on GitHub; I started with the one by LonePatient, as I liked its code succinctness, and then built on that. RAdam is, of course, from the official RAdam GitHub codebase.
The source file for Ranger (ranger.py) is in the GitHub repository: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
1 — Copy ranger.py to your working directory.
2 — import ranger:
3 — Create a partial to prep Ranger for use in FastAI, and point the learner’s opt_func to it.
4 — Start testing!
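The 'partial' in step 3 can be sketched as follows. To keep the example runnable without ranger.py, PyTorch or FastAI installed, it uses a hypothetical StandInOptimizer class in place of Ranger. The keyword names k and alpha follow the parameters discussed in this article, but the exact constructor signature is an assumption and should be checked against ranger.py.

```python
from functools import partial


class StandInOptimizer:
    """Hypothetical stand-in for an optimizer class such as Ranger.

    It only records its arguments, so nothing here trains a model.
    """

    def __init__(self, params, lr=1e-3, k=6, alpha=0.5):
        self.params = list(params)
        self.lr = lr
        self.k = k
        self.alpha = alpha


# Pre-bind the LookAhead settings now; the training framework supplies
# the parameters (and usually the learning rate) later on.
opt_func = partial(StandInOptimizer, k=6, alpha=0.5)

# Roughly what a learner does internally when it builds its optimizer.
optimizer = opt_func([0.0, 1.0], lr=3e-3)
print(optimizer.k, optimizer.alpha, optimizer.lr)
```

In real use, the stand-in class would be replaced by the Ranger class imported from ranger.py, and the resulting partial would be passed as the learner's opt_func.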
k parameter: this controls how many batches to run before merging with the LookAhead weights. 5 or 6 are common defaults; I believe up to 20 was used in the paper.
alpha parameter: this controls the percentage of the LookAhead difference to update with. 0.5 is the default. The LookAhead authors make a strong case that 0.5 is probably ideal, but it may be worth brief experimentation.
One future idea the paper mentioned is to put k and/or alpha on a schedule based on how far training has proceeded.
Notebook coming: Based on feedback from the RAdam article, I’m planning to add a notebook shortly that will let you quickly use Ranger with ImageNette or another dataset, and thus make it easy to play around with Ranger/RAdam/LookAhead.
(I planned to release it today but have been getting constantly pre-empted by my current GPU provider the past two days and didn’t want to further delay this article.)
Two separate research teams have produced new breakthroughs towards the goal of a fast and stable optimization algorithm for deep learning. I find that combining both of these, RAdam + LookAhead, produces a synergistic optimizer (Ranger), and I validated this with a new single-run high for the ImageNette 20-epoch score.
Further testing will be needed to optimize the k param and learning rates for RAdam with LookAhead, but LookAhead and RAdam both reduce the amount of manual hyper-parameter tuning previously needed to achieve state of the art results and should help you on the path to new state of the art results in your training.
Ranger is available for you to immediately test out here, and see if this dynamic duo of RAdam + LookAhead inside Ranger doesn’t further improve your deep learning results!
Update: Further testing shows that using Ranger plus the new Mish activation function (instead of ReLU) yields even better results. Details on Mish here:
Credit and links:
Lookahead implementation used — credit to LonePatient:
RAdam code (PyTorch):
Links to the respective papers:
LookAhead (Zhang, Lucas, Hinton, Ba) — “Lookahead Optimizer: k steps forward, 1 step back”
RAdam (Liu, Jiang, He, Chen, Liu, Gao, Han) — “On the Variance of the Adaptive Learning Rate and Beyond”