The 1 Cycle Policy: an experiment that takes the struggle out of training Neural Nets.

Discover the black box of training with the one-cycle policy.

Shubhajit Das
Data Science Network (DSNet)
10 min read · Jul 19, 2019

--

Introduction:

Finding a good learning rate for your data has long been a crucial problem in training neural nets. Traditional practice says: “you can’t train a neural network with higher learning rates, otherwise the loss will diverge”. In fact, not only can you train your network with higher learning rates, you can also reach a very good minimum confidently and quickly (super-convergence). A series of papers by Leslie N. Smith of the US Naval Research Laboratory, [Cyclical Learning Rates for Training Neural Networks], [Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates] and [A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay], illustrates this.

In this post, I will shed some light on finding good learning rate(s) using cyclical learning rates, along with the various experiments described in the one-cycle-policy paper. We will focus mostly on the implementation details.

Revealing the Black-Box:

Training with higher learning rates not only reduces training time (and thus the chances of over-fitting) but also improves model performance (the model generalizes instead of memorizing the data). But there is a catch: too large a learning rate can cause training to diverge, while too low a learning rate makes training slow and prone to over-fitting. So, what’s the solution? Find the optimal learning rate(s) for your data.

img-1. Effects of Learning Rates on Training

I. Finding the optimal Learning Rate: (using CLR)

According to the CLR paper, training goes through a number of cycles during a complete run, in which the learning rate oscillates between a lower and an upper bound.

Motivation for using Cyclical Learning Rates: experiments have shown that cyclically varying the learning rate between reasonable bounds can actually increase the accuracy of the model in fewer steps. The reason is:

higher learning rates help in coming out of saddle points, and lower learning rates prevent the training from diverging.

What is a saddle point? A saddle point is a critical point at which some dimensions observe a local minimum while other dimensions observe a local maximum (for example, the origin of f(x, y) = x² − y²).

img-2. Increasing the learning rate helps in “more rapid traversal of saddle points”

What is a Cycle? The number of iterations in which the training goes from the lower bound of the learning rate to its upper bound and back to the lower bound.

What is a stepsize? The stepsize is half of a cycle (see the figure below).

img-3. Cyclic Learning Rates

We have to find the optimal learning rate range (see below), where the loss is still decreasing, using an idea from the CLR paper known as the LR range test:

img-4. Plot of the LR range test

The idea of the LR range test (as the CLR paper suggests):

Start with a small learning rate (like 1e-4 or 1e-3) and increase the lr after each mini-batch until the loss starts exploding. Once the loss starts exploding, stop the range test run. Then plot the learning rate vs. loss and choose a learning rate at least one order of magnitude lower than the one where the loss is minimal (e.g. if the loss is lowest at 0.1, a good value to start with is at most 0.01). This is a value where the loss is still decreasing. [Credit]

We will update the learning rate after each mini-batch as follows:

Let q be the factor by which we increase the learning rate after every mini-batch.
The image below shows the equation that gives the learning rate after the i-th mini-batch:

img-5. Learning Rate update after every mini batch

n = number of iterations
init_lr = lower learning rate; we start the range test from this value (usually 1e-3 or 1e-4)
max_lr = maximum learning rate to be used (usually 10 or 100; note that we may not reach this value during the range test)

“Oh please! every time I see this math stuff, it goes over my head!!”

Don’t worry! Math will never be a problem if you look at the formula carefully (many times if needed) and try writing the same thing, not with a pen, but in code.

Coding tip: use the exact English names of the symbols as your variable names; this removes any confusion about the context without losing the simplicity of the code.

Okay, enough lecturing! Let’s write it in code (feel free to look at the comment lines; some explanations are embedded there):

Cyclical Learning Rates

Explanation:

  • We first define a class named CLR(), which is initialized with train_dataloader, base_lr, and max_lr.
  • Next, we declare a method calc_lr(), which accepts a loss, calculates a few things, and returns the modified lr. It keeps track of the iteration number (iteration), calculates the multiplication factor mult for that iteration, then the learning rate lr for that iteration, and finally appends the lr to the learning-rate list lrs.
  • calc_lr() also compares the current iteration’s loss with best_loss (default = 1e9), replaces best_loss with the loss value if it is lower and the iteration number is greater than 1, and finally appends the loss to the losses list.
  • The lr calculation stops when “the loss is greater than 4 times best_loss, or the loss is nan.”
img-6. Stopping criteria
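
Since the embedded gist may not show up here, below is a minimal sketch of what the CLR() class described above could look like. It follows the bullets directly; the default values, the assumption that the test runs for one epoch of train_dataloader, and the convention of returning -1 to signal the stopping criterion are my assumptions, not the original code.

import math

class CLR:
    # Sketch of the LR range test bookkeeping described above (not the original gist).
    def __init__(self, train_dataloader, base_lr=1e-5, max_lr=100.0):
        self.base_lr = base_lr                         # lower bound of the range test
        self.max_lr = max_lr                           # upper bound (may never be reached)
        self.total_iterations = len(train_dataloader) - 1
        # factor q by which the lr grows, so that it would hit max_lr on the last batch
        self.mult = (max_lr / base_lr) ** (1 / self.total_iterations)
        self.best_loss = 1e9
        self.iteration = 0
        self.lrs, self.losses = [], []

    def calc_lr(self, loss):
        self.iteration += 1
        # stopping criteria (img-6): loss is nan or has exploded past 4 * best_loss
        if math.isnan(loss) or loss > 4 * self.best_loss:
            return -1                                  # assumed convention: -1 means "stop"
        if loss < self.best_loss and self.iteration > 1:
            self.best_loss = loss
        lr = self.base_lr * (self.mult ** self.iteration)   # lr for this iteration
        self.lrs.append(lr)
        self.losses.append(loss)
        return lr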

If we do the LR range test with the raw loss and plot raw_loss vs. learning rate, we end up with a zigzag figure, something like this:

img-7. Raw Loss vs Learning Rate

To get a smoother loss value (so the visualization is clearer), we calculate a moving average of the loss via linear interpolation (an exponentially weighted average, the same idea as the momentum concept in SGD):

img-8. Average Loss calculation

where β is a parameter we pick between 0 and 1 (typically 0.98). Next, we calculate the smoothed loss for each mini-batch. For the i-th mini-batch, the smoothed loss is given by,

img-9. Smoothed Loss for i-th mini-batch
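
In plain text (a reconstruction of what img-8 and img-9 show, assuming the usual bias-corrected moving average with the mini-batch counter i starting at 1):

avg_loss_i = β * avg_loss_(i-1) + (1 - β) * loss_i
smoothed_loss_i = avg_loss_i / (1 - β^i)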

Now implement this loss smoothing:

LR Find
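
Here is a minimal sketch of the kind of loop the LR Find gist implements, assuming a standard PyTorch model, optimizer, and loss criterion; the function name lr_find and its argument list are illustrative, not taken from the original repo.

import torch

def lr_find(model, train_dataloader, optimizer, criterion, clr, beta=0.98, device="cpu"):
    # One pass over the data with an exponentially increasing learning rate,
    # recording the smoothed loss at every step (see img-8 and img-9).
    model.train()
    avg_loss, batch_num = 0.0, 0
    for inputs, targets in train_dataloader:
        batch_num += 1
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)

        # moving average of the raw loss, plus bias correction -> smoothed loss
        avg_loss = beta * avg_loss + (1 - beta) * loss.item()
        smoothed_loss = avg_loss / (1 - beta ** batch_num)

        # ask CLR for the next learning rate; -1 means the loss has exploded
        lr = clr.calc_lr(smoothed_loss)
        if lr == -1:
            break
        for group in optimizer.param_groups:
            group["lr"] = lr

        loss.backward()
        optimizer.step()
    return clr.lrs, clr.losses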

When the LR range test is done, we can plot smoothed_loss vs. learning rate (the plot is now much clearer and more intuitive):

img-10. Learning Rate vs Smoothed Loss (the optimal LR range lies between 2e-4 and 2e-2)

II. Training using One Cycle Policy:

In the paper, “A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay”, Leslie Smith describes an approach to setting hyper-parameters (namely learning rate, momentum, and weight decay) and batch size. In particular, he suggests the 1 Cycle Policy for scheduling learning rates. [Credit]

  1. Using High Learning Rates:

Unlike the previous paper (where training goes through a number of cycles, see img-3 above), this paper suggests:

  • doing only one cycle (note the ‘one’) of 2 steps of equal length (see the image below), where the maximum learning rate is chosen from the LR range test (found previously) and the minimum learning rate is 1/5th or 1/10th of the maximum;
  • picking the cycle length slightly less than the total number of epochs to be trained;
  • and, in the last remaining iterations, annihilating the learning rate to way below the lower learning rate value (1/10th or 1/100th of it). [Credit]

Didn’t get it? Let’s try to understand it from the plot (lr vs. iterations) below.

img-11. Learning rate schedule in 1-cycle-policy

As you can see above, the entire training goes through only 1 cycle: from the lower learning-rate boundary (min_lr) to the higher one (max_lr) in step-1 (warm-up), and back to the lower boundary (min_lr) in step-2 (cool-down). In the last part of training (the annihilation phase), we decrease the learning rate to a value well below min_lr (about 1/10th of it).

The motivation behind this is:

during the middle of training (when the learning rate is higher), the learning rate works as a regularization method and keeps the model from landing in a steep area of the loss function, preferring to find a minimum that is flatter, thus keeping the network away from overfitting. The last part of the training, with descending learning rates up until annihilation, allows us to go into a steeper local minimum inside that smoother part. [Credit]

2. Using Cyclical Momentum:

Leslie, in his experiment, found that:

decreasing momentum while increasing learning rates leads to better results.

This supports the intuition that in that part of the training, we want the SGD to quickly go in new directions to find a flatter area, so the new gradients need to be given more weight. In practice, he recommends:

pick two values for momentum like 0.85 and 0.95, and decrease from the higher one to the lower one when we increase the learning rate, then go back to the higher momentum as the learning rate goes down. [Credit]

img-12. Momentum Schedule

Here is how our OneCycle() class looks (a code sketch following this explanation is given after the case-by-case walkthrough below):

Explanation: as you can see, it has 3 methods: calc_lr (for scheduling the learning rate), calc_mom (for scheduling the momentum) and calc (which computes the learning rate and momentum for each batch).

calc_lr (see img-11 above):

We have 3 cases here:

  • a-b (warm-up)
  • b-c (cooldown)
  • c-d (annihilation)

i. case a-b (iterations between 0 and step_len): the learning rate increases from high_lr/div to high_lr. Here,

ratio = (current_iteration/step_len)

As we increase the learning_rate linearly,

learning_rate = high_lr/div + ratio*(high_lr - high_lr/div)
<=> learning_rate = high_lr * (1 + ratio*(div-1)) / div

ii. case b-c (iterations between step_len and 2*step_len): here the learning rate decreases from high_lr to high_lr/div. We need to subtract the step-1 iterations when calculating the ratio (so it counts only the iterations within this step), so

ratio = (current_iteration - step_len) /step_len

Here we decrease the lr value linearly, so

learning_rate = high_lr - ratio*(high_lr - high_lr/div)
<=> learning_rate = high_lr * (div - ratio*(div-1)) / div

iii. case c-d (iterations between 2*step_len and the total number of iterations): the learning rate annihilates from high_lr/div down to (high_lr/div)/div.

ratio = (current_iteration - 2*step_len) / (total_iterations - 2*step_len)

Here we decrease the learning rate below min_lr, so (1-ratio) is used instead of the ratio. Hence,

learning_rate = (high_lr/div)/div + (1 - ratio)*(high_lr/div - (high_lr/div)/div)
<=> learning_rate = (high_lr/div) * (1 - ratio*(1 - 1/div))

calc_mom: (see img-12 above)

The momentum decreases as the learning rate increases and vice-versa (between high_mom and low_mom), so for the 3 cases (doing the opposite of what we did in calc_lr):

i. case a-b (iterations between 0 and step_len): momentum decreases from high_mom to low_mom. Here,

ratio = (current_iteration/step_len). 

As we decrease the momentum linearly,

momentum = high_mom - ratio*(high_mom - low_mom)

ii. case b-c (iterations between step_len and 2*step_len): here the momentum increases from low_mom to high_mom. Again we subtract the step-1 iterations when calculating the ratio, so

ratio = (iteration - step_len) / step_len 

Momentum is increased linearly, so

mom = low_mom + ratio*(high_mom - low_mom)

iii. case c-d (iterations between 2*step_len and the total number of iterations):

Momentum is kept constant (i.e. high_mom):

momentum = high_mom
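
As promised, here is a minimal sketch of an OneCycle() class that follows the case-by-case derivation above. The constructor arguments (in particular annihilation_frac, which reserves the tail of training for the annihilation phase) and the exact boundary handling are my assumptions, not the original gist.

class OneCycle:
    # Sketch of the 1-cycle schedule: warm-up (a-b), cool-down (b-c), annihilation (c-d).
    def __init__(self, total_iterations, high_lr, div=10,
                 high_mom=0.95, low_mom=0.85, annihilation_frac=0.1):
        self.total_iterations = total_iterations       # epochs * batches per epoch
        self.high_lr = high_lr                         # max lr found by the LR range test
        self.div = div                                 # cycle starts and ends at high_lr/div
        self.high_mom, self.low_mom = high_mom, low_mom
        # two steps of equal length; the remaining iterations form the annihilation phase
        self.step_len = int(total_iterations * (1 - annihilation_frac) / 2)
        self.iteration = 0
        self.lrs, self.moms = [], []

    def calc(self):
        # learning rate and momentum for the current mini-batch
        lr, mom = self.calc_lr(), self.calc_mom()
        self.iteration += 1
        return lr, mom

    def calc_lr(self):
        if self.iteration <= self.step_len:              # case a-b: warm-up
            ratio = self.iteration / self.step_len
            lr = self.high_lr * (1 + ratio * (self.div - 1)) / self.div
        elif self.iteration <= 2 * self.step_len:        # case b-c: cool-down
            ratio = (self.iteration - self.step_len) / self.step_len
            lr = self.high_lr - ratio * (self.high_lr - self.high_lr / self.div)
        else:                                            # case c-d: annihilation
            ratio = ((self.iteration - 2 * self.step_len)
                     / (self.total_iterations - 2 * self.step_len))
            lr = (self.high_lr / self.div) * (1 - ratio * (1 - 1 / self.div))
        self.lrs.append(lr)
        return lr

    def calc_mom(self):
        if self.iteration <= self.step_len:              # momentum goes down as lr goes up
            ratio = self.iteration / self.step_len
            mom = self.high_mom - ratio * (self.high_mom - self.low_mom)
        elif self.iteration <= 2 * self.step_len:        # momentum climbs back up
            ratio = (self.iteration - self.step_len) / self.step_len
            mom = self.low_mom + ratio * (self.high_mom - self.low_mom)
        else:                                            # constant during annihilation
            mom = self.high_mom
        self.moms.append(mom)
        return mom

In a training loop, one would call lr, mom = one_cycle.calc() for every mini-batch and write the two values into the optimizer’s param_groups (the "momentum" key for SGD), exactly as the learning rate was set in the LR-find sketch earlier.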

3. Weight Decay: the author suggests starting with values like 1e-3, 1e-4, 1e-5 or 0 if you have no idea what the correct weight-decay value is.

4. Batch Size: the paper suggests using the largest batch size that fits into memory.
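
As a rough illustration (the model, dataset and values below are placeholders, not from the notebooks), this is where these last two hyper-parameters plug into a standard PyTorch setup:

import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 2)                 # stand-in model
train_dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# batch size: the largest value that fits into memory
train_dl = DataLoader(train_dataset, batch_size=512, shuffle=True)

# weight decay: start from one of 1e-3 / 1e-4 / 1e-5 / 0 if unsure
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.95, weight_decay=1e-4)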

Notebooks:

You can follow any of the notebooks that accompany this post.

  1. Notebook on Food-101 dataset.
  2. Notebook on CIFAR-10 dataset.
  3. Interactive notebook on Food-101. ( hosted with ❤️ by jovian)

Resources:

This post is a sincere effort to decode the training process with the one-cycle policy by putting all the required pieces together. I would like to thank nachiket tanksale for his awesome blog post (some parts of this post are his words) and his GitHub repo on the same topic (the code snippets used in this post belong to his repo). Credit also goes to the fastai team: without Jeremy Howard and Sylvain Gugger’s course and amazing library, and Rachel Thomas’s encouragement to write a post, this would have been a nightmare. I am very grateful to Kartik Godawat for his help in formatting this article.

Feel free to connect with me on LinkedIn, or follow me on Github or Twitter. To stay updated with my posts, do follow me on Medium. You can follow our publication on Medium here.

Don’t forget to give your 👏 and follow us!
