How to achieve Super-Convergence and exploit the One-Cycle policy: a simple guide

Michela Sessi · Kirey Group · Jan 12, 2021

A journey through Leslie N. Smith’s works to understand the intuition behind the One-Cycle policy, how to set the hyper-parameters of a Deep Neural Network, and how to achieve Super-Convergence.

This post provides a guide and an intuition for choosing appropriate values when training a Deep Neural Network. In particular, it gives a detailed overview of the phenomenon called Super-Convergence, where a Deep Neural Network can be trained an order of magnitude faster than with conventional training methods. The key elements are the One-Cycle policy and Leslie N. Smith’s hyper-parameter guidelines.

In this post, the papers that most interest us are:
- “Cyclical Learning Rates for Training Neural Networks” [Leslie N. Smith]
- “Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates” [Leslie N. Smith, Nicholay Topin]
- “A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay” [Leslie N. Smith]
For a deeper dive, I suggest reading them.

Let’s take a step back and review the basics

The Learning Rate (LR) indicates how much the network weights change at each iteration; in other words, it is the step size of the weight updates in a Deep Neural Network. It is one of the most delicate and important hyper-parameters to tune in order to achieve excellent performance.

To understand the importance of the learning rate, remember that a neural network is represented by a set of parameters called Weights, whose values, initialized randomly, must be properly adjusted in order to minimize a Cost Function (also called Objective Function). In other words, to minimize the error.

As is well known, the weights cannot be calculated analytically. Instead, they must be determined through an optimization procedure.
This hides some pitfalls. The optimization problem that the chosen procedure tries to solve has a space of solutions (sets of weights) whose points can be:

  • excellent: the global minimum points (global optima)
  • bad: easier to find but of lesser quality, such as local minima (local optima)

Therefore, the Learning Rate determines how much to update the network weights in order to obtain the best performance.
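To make this concrete, here is a minimal sketch of a single gradient-descent weight update in plain PyTorch; the model, the data, and the LR value are placeholders chosen only for illustration:

```python
import torch

# Toy setup: a single linear layer and random data, for illustration only
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

lr = 0.01  # the Learning Rate: the step size of each weight update

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()  # gradients of the loss with respect to the weights

with torch.no_grad():
    for w in model.parameters():
        w -= lr * w.grad  # move each weight one LR-sized step downhill
        w.grad = None     # clear the gradient for the next iteration
```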

Figure 1 — Different Learning Rates in the optimization problem

In general, as shown in Figure 1:

  • an LR that is too low will take a long time to converge. This is especially true if there are many saddle points in the loss landscape.
Figure 2 — Example of saddle point

At a saddle point, the surface is a maximum along one plane and a minimum along another. In Figure 2, the plane containing the yellow curve shows the stationary point as a minimum, while the same red point is the maximum of the green curve.
It is difficult to move away from a saddle point, and if the LR is very low, this can slow down learning considerably.

  • on the other hand, with a high LR there is a risk of “jumping” from one side to the other without ever settling into the global minimum and the best configurations
  • in the worst case, a learning rate that is too high can even lead to divergence.

Cyclical LR & LR Range Test: how to set the LR?

The common practice is to set the LR to a constant value and decrease it by an order of magnitude once the accuracy has plateaued.
If the initial constant learning rate is greater or smaller than the optimal LR, model performance suffers.

Traditionally, the only way to find an optimal LR was a grid search, which is nothing more than trying out many candidate learning rates. But this is a tedious and time-consuming process.

Smith’s thought was to vary the LR between a minimum and a maximum value during the training, thus obtaining the Cyclical Learning Rate process.

Indeed, it is certainly easier to choose an interval around the optimal LR than to find the optimal value directly. Since the learning rate varies between these limits, part of the training is spent close to the optimal value.

However, it remains to be decided which values to assign to the limits: the minimum and maximum LR. Other studies by Smith show that a single run, in which the LR is increased from a small value to a large value, provides valuable information for choosing the minimum, maximum, and optimal LR.
This method has been called the Learning Rate Range Test.

Figure 3 — LR Range Test, plot of Loss vs. LR

It’s relatively simple: in a single test run, you start with a very low LR, run the model, and compute the loss on the validation data. The LR is then increased exponentially at each iteration. As in Figure 3, you can plot the results in a diagram of loss versus Learning Rate. The x-value corresponding to the lowest y-value, i.e., the lowest loss, indicates the optimal learning rate for the training data.

This gives an overview of how well we can train the neural network over a range of learning rates. With a low LR, the network begins to converge; as the LR increases, it eventually becomes too large and causes the test loss to diverge.
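Here is a minimal sketch of the LR Range Test in PyTorch, assuming a standard setup in which model, train_loader, and criterion already exist; the exponential growth factor and the stopping rule are illustrative choices:

```python
import torch

def lr_range_test(model, train_loader, criterion,
                  min_lr=1e-7, max_lr=10.0, num_iters=100):
    """Exponentially increase the LR from min_lr to max_lr, recording the loss."""
    optimizer = torch.optim.SGD(model.parameters(), lr=min_lr)
    gamma = (max_lr / min_lr) ** (1.0 / num_iters)  # per-iteration multiplier
    lrs, losses = [], []
    data_iter = iter(train_loader)
    for _ in range(num_iters):
        try:
            inputs, targets = next(data_iter)
        except StopIteration:          # restart the loader if it runs out
            data_iter = iter(train_loader)
            inputs, targets = next(data_iter)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        lrs.append(optimizer.param_groups[0]["lr"])
        losses.append(loss.item())
        if loss.item() > 4 * min(losses):  # stop once the loss diverges
            break
        for group in optimizer.param_groups:
            group["lr"] *= gamma           # exponential LR increase
    return lrs, losses  # plot losses vs. lrs to pick the limits

# usage sketch: lrs, losses = lr_range_test(model, train_loader, criterion)
```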

Super Convergence: what is it & how to achieve it?

So, let’s get to the point and dig into the phenomenon that Smith called Super-Convergence, where a neural network can be trained an order of magnitude faster than with conventional training methods. One of the key elements is the One-Cycle policy with the maximum possible learning rate.
The intuition is easily described as follows:

An insight that allows “Super Convergence” in training is the use of large learning rates that regularizes the network, hence requiring a reduction of all other forms of regularization to preserve a balance between underfitting and overfitting.

So, which is the suggested technique?

To achieve Super-Convergence, we use the One-Cycle LR policy, which can be seen as a special case of the Cyclical LR. It requires you to specify the minimum and maximum LR.

The LR Range Test provides the maximum learning rate, and the minimum is generally set to 1/10 of the maximum value.

The suggestion is to cycle the LR between the lower and upper limits; conventionally, by contrast, the learning rate simply decreases over time as training converges.

Figure 4 — Cyclical Learning Rate

As shown in Figure 4, a cycle corresponds to a blue triangle: it is the number of iterations needed to go from the lower limit (base_lr) to the upper limit (max_lr) and back to the lower limit. The end of a cycle may not coincide with the end of an epoch, although it usually does. The stepsize is half of the cycle, i.e., the number of iterations in which the LR goes from one limit to the other.
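PyTorch exposes this policy as torch.optim.lr_scheduler.CyclicLR. A minimal sketch, where the model and the concrete base_lr, max_lr, and step_size_up values are only illustrative:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# base_lr and max_lr come from the LR Range Test; step_size_up is the
# number of iterations to climb from base_lr to max_lr (half a cycle).
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=0.001, max_lr=0.01,
    step_size_up=2000, mode="triangular")

# inside the training loop, step the scheduler once per batch:
# optimizer.step(); scheduler.step()
```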

One-Cycle policy: What are the differences?

Figure 5 — One-Cycle LR

As the name intuitively suggests, with the One-Cycle policy we perform a single cycle.
Smith recommends an LR cycle of 2 steps of equal length. We choose the maximum LR using the Range Test, then set the lower LR to 1/5 or 1/10 of the maximum.
In this case, the cycle length is slightly less than the total number of training epochs. The remaining iterations form the annihilation phase, in which the LR is reduced even further below the minimum LR (to 1/10–1/100 of it).
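PyTorch implements this schedule as torch.optim.lr_scheduler.OneCycleLR. A minimal sketch, assuming a recent PyTorch; the model and the concrete values are illustrative, with the annihilation phase controlled by final_div_factor:

```python
import torch

model = torch.nn.Linear(10, 1)     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

epochs, steps_per_epoch = 10, 100  # illustrative values
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,                # found with the LR Range Test
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    pct_start=0.45,            # two roughly equal steps up and down...
    anneal_strategy="linear",
    div_factor=10.0,           # initial LR = max_lr / 10
    final_div_factor=100.0,    # ...then annihilation far below the minimum
    three_phase=True)          # keep the annihilation as a separate phase

# step once per batch, right after optimizer.step():
# optimizer.step(); scheduler.step()
```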

What is the motivation behind this cyclical choice?

The reason behind this can be explained intuitively. In the middle of training, when the LR is higher, it acts as a regularization method and keeps the network from overfitting. This helps the model avoid the steep areas of the loss function and prefer flatter ones instead.
For clarity, we define as regularization any modification made to a learning algorithm that is intended to reduce its generalization error.
Intuitively, the LR starts small to allow convergence to begin; when the model crosses a flat area, the LR becomes large, allowing faster progress and skipping difficult stationary points. In the final stages of training, when the model must settle into a local minimum (ideally the global minimum), the LR is reduced again to an ever smaller value, reaching the minimum LR and then entering the annihilation zone.

Other Hyper-parameters: how to set the rest?

Batch Size

The Batch Size defines the number of samples that are propagated through the neural network at each step.
How to choose this parameter? There are advantages to using a batch size smaller than the total number of available samples:

  • It requires less memory. Since you train the network on fewer samples at a time, the overall training routine requires less memory. This is especially important if your machine’s memory is limited.
  • Neural networks are commonly thought to train faster with mini-batches, because the weights are updated after each propagation.

On the other hand, the disadvantages of using a batch size smaller than the total number of available samples are:

  • The smaller the batch, the less accurate each update will be, because it is estimated from only a part of the data (a sample).
  • For the same reason, the update can be more influenced by outlying cases.

It is commonly thought that small batch sizes induce a regularization effect when paired with adequate parameters (in particular, a learning rate that is not too high). Contrary to those earlier studies, however, Smith suggests using a larger batch size when the One-Cycle policy is applied: the batch size should be limited only by memory constraints, since a larger batch size allows a higher learning rate. The benefits of larger batch sizes nevertheless diminish after a certain point. A traditional grid search can be used, as sketched below.
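A sketch of that grid search, assuming a hypothetical train_and_evaluate helper that trains the model with the One-Cycle policy at a given batch size and returns validation accuracy:

```python
# Hypothetical grid search over batch sizes; train_and_evaluate is a
# placeholder that trains with the One-Cycle policy at the given batch
# size and returns validation accuracy.
candidate_batch_sizes = [128, 256, 512, 1024]  # bounded by GPU memory

results = {bs: train_and_evaluate(batch_size=bs) for bs in candidate_batch_sizes}
best = max(results, key=results.get)
print(f"Best batch size: {best} (accuracy {results[best]:.3f})")
```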

Momentum

The effects of Momentum and Learning Rate on training dynamics are closely related, as one depends on the other. Momentum is designed to accelerate training, and its effect on the weight updates is of the same order of magnitude as that of the learning rate.

Figure 6 — One-Cycle LR and Momentum

To achieve Super-Convergence, the optimal training procedure combines an increasing cyclical LR with a decreasing cyclical Momentum. The maximum value of the cyclical Momentum can be chosen by running a short grid search over a few values (such as 0.9, 0.95, 0.97, 0.99) and picking the one that gives the best test accuracy. Smith also observed that the final results are generally almost independent of the minimum value, so a minimum Momentum of 0.85 works perfectly well (see the code sketch after the list below).

Decreasing the momentum while increasing the learning rate offers three benefits:

  • A lower test loss
  • A faster initial convergence
  • Increased convergence stability over a wider range of LR
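In PyTorch, this decreasing cyclical momentum is built into OneCycleLR; a minimal sketch with the values Smith suggests (the model and step count are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, total_steps=1000,  # illustrative step count
    cycle_momentum=True,    # momentum decreases while the LR increases
    max_momentum=0.95,      # picked via a small grid search (0.9-0.99)
    base_momentum=0.85)     # Smith: the minimum value matters little
```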

“Read, experiment, and think constantly” — Leslie N. Smith
