A Methodology for Hyper-parameter Tuning (1): Learning Rate

LP Cheung · Published in Deep Learning HK · Feb 28, 2019

Although tf.keras and PyTorch make it easier to build a deep learning model, you still need to make various decisions, such as learning rate, regularization, and batch size. Among them, the learning rate is the first hyper-parameter you need to determine properly, as a careless choice can ruin your model. Some interesting techniques can ease the burden of choosing a proper learning rate:

  • Learning Rate Range Test
  • Cyclical Learning Rate Policy
  • 1-Cycle Policy and Super-Convergence

If you want to be safe, adopting the Adam optimizer with a learning rate of 3e-4 is the first thing to try, just like Andrej Karpathy.
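In code, that default is a one-liner; a minimal sketch in PyTorch (the tiny linear model is just a stand-in for your own network):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for your actual network
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # Karpathy's go-to default
```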

But what if it doesn’t work? You may then slightly tune the learning rate around 3e-4 and pray. Luckily, you can be more scientific by using the Learning Rate Range Test by Leslie N. Smith.

Learning Rate Range Test

The idea is very simple:

  1. Select a validation set
  2. Start training from a very small learning rate, monitoring the validation loss
  3. Increase the learning rate after each iteration
  4. Plot the validation loss against the learning rate and see when training starts to diverge
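Here is a minimal sketch of the test in PyTorch. The name `lr_range_test` and its defaults are my own; for simplicity it records the training-batch loss rather than a full validation-set loss, a common shortcut in practice:

```python
import torch

def lr_range_test(model, optimizer, loss_fn, loader,
                  lr_min=1e-7, lr_max=10.0, num_iters=100):
    """Increase the LR exponentially each iteration and record the loss."""
    gamma = (lr_max / lr_min) ** (1.0 / num_iters)   # per-iteration multiplier
    for group in optimizer.param_groups:
        group["lr"] = lr_min
    lrs, losses = [], []
    data_iter = iter(loader)
    for _ in range(num_iters):
        try:
            x, y = next(data_iter)
        except StopIteration:                        # restart the loader if exhausted
            data_iter = iter(loader)
            x, y = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(optimizer.param_groups[0]["lr"])
        losses.append(loss.item())
        for group in optimizer.param_groups:         # grow the LR for the next step
            group["lr"] *= gamma
    return lrs, losses
```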

The typical LR range test diagram looks like this:

LR range test for ResNet-56. (source)

Since you start training from a very small learning rate, the loss plot is initially flat. As you increase the value each iteration, the learning rate soon becomes reasonably large: a steadily decreasing loss is the signal of a reasonable learning rate. Eventually, the learning rate reaches a region where it is so large that training diverges.
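With the `lrs` and `losses` returned by the sketch above, a log-scale plot makes these three regions easy to spot:

```python
import matplotlib.pyplot as plt

# `lrs` and `losses` come from the lr_range_test sketch above.
plt.plot(lrs, losses)
plt.xscale("log")              # the learning rate spans several orders of magnitude
plt.xlabel("learning rate")
plt.ylabel("loss")
plt.title("LR range test")
plt.show()
```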

So, we can now determine the reasonable range for the learning rate. The next question is: what exact value should we choose?

The good news is you don’t need to make a decision. If you cannot choose one, choose them all.

Cyclical Learning Rate (CLR) policy

You can simply adopt a range of learning rates rather than an exact value. The procedure is:

  1. Run the LR range test to determine the minimum and maximum bounds of the learning rate
  2. Vary the learning rate cyclically between these bounds

Again, by L. N. Smith. (source)

There is only one hyper-parameter left to determine: the stepsize, i.e. the number of iterations in half a cycle. A simple rule of thumb is to set the stepsize to 2–10 times the number of iterations in one epoch.
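PyTorch ships a scheduler implementing this triangular policy. A minimal sketch, assuming the range test gave bounds of 1e-4 and 1e-2 and that one epoch is 500 iterations (both numbers are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)            # stand-in for your actual network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

# Bounds taken from the LR range test; with 500 iterations per epoch,
# step_size_up = 4 * 500 follows the "2-10x iterations per epoch" rule of thumb.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2,
    step_size_up=4 * 500, mode="triangular")

# In the training loop, call scheduler.step() after each optimizer.step().
```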

But why does it work? An intuitive explanation is that cyclically increasing the learning rate enables the optimizer to escape from saddle points. Since the gradients near saddle points are small, keeping the learning rate fixed at a small value would slow down training.

Another explanation is that periodically increasing the learning rate allows the network to generalize better. Since we want the network to generalize to unseen data, we prefer a flat minimum, which is more robust to shifts between the training and testing distributions, over a sharp minimum.

The black line is the training distribution and the red line is the testing distribution. The y-axis is the loss. (source)

As we can see in the diagram above, if the optimizer stays in a sharp minimum, a small shift in the testing distribution can lead to a huge drop in accuracy. Optimizers following the CLR policy can “jump out” of sharp minima because the learning rate occasionally becomes large, allowing the optimizer to take a big enough step to escape.

The idea that “a large learning rate enables the network to generalize better” is developed further by Smith (again…), who proposes an even more radical technique called the 1-cycle policy and a phenomenon called “super-convergence”.

1-cycle Policy and Super-Convergence

Super-convergence is just a fancy name coined by Smith to describe training that converges very rapidly. In particular, with the 1-cycle policy the optimizer can achieve comparable performance in significantly fewer iterations than usual.

The 1-cycle policy is very similar to the CLR policy. The only difference is that it involves just one cycle.

  1. Run the LR range test to determine the maximum bound; set the minimum bound to 1/10 of the maximum bound
  2. Set the stepsize so that the single cycle ends before the total number of iterations used by other policies
  3. For the remaining iterations, allow the learning rate to decay several orders of magnitude below the minimum bound

The learning rate over the 1-cycle policy. After finishing one cycle, the learning rate continues to decrease for several more orders of magnitude.
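PyTorch also implements this schedule. A minimal sketch, again with placeholder bounds and iteration counts: `div_factor=10` encodes the “minimum = maximum / 10” rule above, `final_div_factor` drives the learning rate several orders of magnitude below the minimum bound at the end, and `three_phase=True` gives one full up/down cycle followed by that final decay:

```python
import torch

model = torch.nn.Linear(10, 2)            # stand-in for your actual network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# max_lr comes from the LR range test. With div_factor=10 the cycle starts at
# max_lr / 10; final_div_factor=1e3 pushes the final LR three orders lower.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=10_000,
    div_factor=10, final_div_factor=1e3,
    anneal_strategy="linear", three_phase=True)

# In the training loop, call scheduler.step() after each optimizer.step().
```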

The philosophy of the CLR policy still applies to the 1-cycle policy. A large learning rate pushes the optimizer to search for a flat minimum; after reaching the plateau, the optimizer needs to settle into the local minimum, so the learning rate should become small.

Conclusion

Besides blindly applying 3e-4 with the Adam optimizer, a more scientific methodology is possible. Combining the LR range test with the CLR policy or the 1-cycle policy eases the burden of choosing a proper learning rate.

Another insight from Smith is that a large learning rate is itself a kind of regularization. Since a large learning rate introduces larger gradient noise, it leads to better generalization. This insight raises the further question of how to balance different kinds of regularization (batch size, momentum, weight decay), which is the topic of the next post.

Reference

[1]: Keskar, Nitish Shirish, et al. “On large-batch training for deep learning: Generalization gap and sharp minima.” arXiv preprint arXiv:1609.04836 (2016).

[2]: Smith, Leslie N. “Cyclical learning rates for training neural networks.” 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017.

[3]: Smith, Leslie N., and Nicholay Topin. “Super-convergence: Very fast training of neural networks using large learning rates.” arXiv preprint arXiv:1708.07120 (2017).

[4]: Smith, Leslie N. “A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay.” arXiv preprint arXiv:1803.09820 (2018).
