Improving the way we work with learning rate.

Vitaly Bushaev · techburst · Nov 16, 2017

I. Introduction

Most optimization algorithms (such as SGD, RMSprop, and Adam) require setting the learning rate, the most important hyper-parameter for training deep neural networks. A naive way to choose a learning rate is to try out a bunch of values and use the one that seems to work best, manually decreasing it over time when training no longer improves the loss.

In this post I address several problems that emerge when using this (or a similar) method and describe possible solutions, which I learned from Jeremy Howard in the new in-person version of the fast.ai course [1] (which will be available online later this year at course.fast.ai).

II. So what’s the problem?

As you start training your neural nets, you can (and probably will) encounter some issues:

  1. Picking the right value for your learning rate can be quite a cumbersome process, sometimes more of an art than a science.
  2. Even when you manage to pick the right values for your hyper-parameters, you will find that training deep neural networks takes a very long time. This is a common problem in deep learning and is not directly caused by the learning rate, but I will show how a better learning rate policy can improve training time by reducing the number of iterations the optimizer needs to converge to a good local minimum.

Let’s focus on one issue at a time.

III. Picking the right value for learning rate

While there are some good guidelines for estimating a reasonable initial learning rate, they don’t amount to a general algorithm for finding one: most are situation-specific or have other limitations that keep them from applying to every use case.

Leslie N. Smith [2], in Section 3.3 of his paper “Cyclical Learning Rates for Training Neural Networks”, proposes a systematic way to estimate a good learning rate: run training starting from a very low learning rate and increase it linearly (or exponentially) every iteration, recording the learning rate and loss (or accuracy) at each step, and stop once the loss starts to increase drastically. (A code sketch follows the list below.)

Once this is done, plot the learning rate against the loss (or accuracy). You’ll probably see plots that look like these:

Left: loss plotted against learning rate. Right: accuracy plotted against learning rate.
  1. If you plot the loss, you should see the following: while the learning rate is too small, the loss barely changes, but as the learning rate grows, the loss decreases faster and faster until it reaches a point where it stops decreasing and eventually starts to increase. It might be intuitive to pick the learning rate that corresponds to the lowest loss (0.1 on the left plot), but at that point the loss has already stopped decreasing, so we would most likely see no improvement at this value. We want the one a little to the left, where the loss is still decreasing, the faster the better (somewhere around 0.01 on the left plot).
  2. If you plot the accuracy instead, note the learning rate at which the accuracy stops improving, becomes ragged, or starts to fall. That’s the value to go for: around 0.005–0.006 on the right plot.
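
Here is a minimal, framework-agnostic sketch of this range test. The train_step callable is a hypothetical stand-in for whatever runs one mini-batch update at a given learning rate and returns the loss; the exponential schedule and the stop-once-loss-exceeds-4×-the-best threshold are common implementation choices, not prescriptions from the paper.

```python
def lr_range_test(train_step, min_lr=1e-7, max_lr=10.0, num_iters=100):
    """Increase the learning rate exponentially each iteration, recording
    (lr, loss) pairs, and stop early once the loss starts to blow up.

    train_step(lr) is assumed to run one mini-batch update at the given
    learning rate and return the resulting loss.
    """
    factor = (max_lr / min_lr) ** (1.0 / num_iters)  # per-step multiplier
    lr, best_loss, history = min_lr, float("inf"), []
    for _ in range(num_iters):
        loss = train_step(lr)
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > 4 * best_loss:  # loss is exploding -- we've gone far enough
            break
        lr *= factor
    return history  # plot loss (y) against lr (x, log scale) and inspect
```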

IV. Cyclic learning rate schedule

To understand how the learning rate can help accelerate training, we need to understand some of the reasons training takes so long. In order to do that, we need to know what a saddle point is.

A saddle point is a point where the derivatives of the function are zero but which is not a local extremum along every axis. The classic example is f(x, y) = x² − y²: the gradient vanishes at the origin, which is a minimum along the x axis but a maximum along the y axis. In 3D it looks like this:

Saddle point at (0, 0, 0): a local minimum along one axis and a local maximum along the other.

Dauphin et al. [3] argue that the difficulty in minimizing the loss arises from saddle points rather than from poor local minima. Ian Goodfellow et al. [4], in their book “Deep Learning”, give mathematical arguments for why gradient-based optimization algorithms are able to escape saddle points; even so, saddle points slow training down, since the surface around such a point is much flatter and the gradients there tend to be close to zero.

The general idea is this: instead of using a fixed learning rate and decreasing it over time once training stops improving the loss, we change the learning rate every iteration according to some cyclic function f. Each cycle has a fixed length in terms of the number of iterations, so the learning rate varies cyclically between reasonable boundary values. This helps because, if we get stuck near a saddle point, increasing the learning rate allows more rapid traversal of the saddle point’s plateau.

Leslie N. Smith [2] proposes a ‘triangular’ schedule in which the learning rate increases linearly from a minimum to a maximum and then decreases linearly back to the minimum within each cycle. In the ‘triangular2’ variant, the difference between the minimum and the maximum is cut in half at the end of each cycle.

‘Triangular’ and ‘triangular2’ schedules proposed by Leslie N. Smith. Left: the min and max learning rates are kept the same across cycles. Right: the difference is cut in half after each cycle.
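
Both variants can be computed in closed form per iteration. The sketch below follows the formula from Smith’s paper; step_size is half a cycle, measured in iterations.

```python
import math

def cyclical_lr(iteration, step_size, base_lr, max_lr, mode="triangular"):
    """Smith's cyclical learning rate: one full cycle spans 2 * step_size
    iterations, going base_lr -> max_lr -> base_lr linearly."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)  # goes 1 -> 0 -> 1 over a cycle
    scale = 1.0 if mode == "triangular" else 1.0 / (2 ** (cycle - 1))  # 'triangular2'
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale
```

Evaluating cyclical_lr at every iteration with fixed base_lr and max_lr reproduces the left plot; switching mode to 'triangular2' reproduces the right one.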

While this might seem counter-intuitive, the research shows that increasing the learning rate can have a short-term negative effect on the loss and yet a longer-term beneficial one. Smith shows that his method speeds up training, reducing the number of iterations the optimizer needs to converge to a good local minimum.

Another method, proposed by Loshchilov & Hutter [5] in their paper “SGDR: Stochastic Gradient Descent with Restarts”, is ‘cosine annealing’, in which the learning rate decreases from its maximum value following the cosine function and then ‘restarts’ at the maximum at the beginning of the next cycle. The authors also suggest making each new cycle longer than the previous one by a constant factor T_mul.

SGDR schedules: learning rate plotted against iterations. Left: T_mul = 1. Right: T_mul = 2.
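
A minimal sketch of the annealing formula from the paper, plus a small helper (my own, not from the paper) that strings cycles together; the T_mul above corresponds to the paper’s T_mult.

```python
import math

def sgdr_lr(step_in_cycle, cycle_len, min_lr, max_lr):
    """Cosine annealing within one cycle: max_lr at the start of the cycle,
    decaying to min_lr by its end."""
    t = step_in_cycle / cycle_len
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

def sgdr_schedule(num_cycles, first_cycle_len, t_mul, min_lr, max_lr):
    """Full schedule: after each restart the next cycle is t_mul times longer."""
    lrs, cycle_len = [], first_cycle_len
    for _ in range(num_cycles):
        lrs.extend(sgdr_lr(i, cycle_len, min_lr, max_lr) for i in range(cycle_len))
        cycle_len = int(cycle_len * t_mul)  # grow the cycle for the next restart
    return lrs  # one learning rate per iteration
```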

Even though I’ve been talking only about training time, there’s actually one more big advantage to using these techniques.

The research [2] shows that training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune, and often in fewer iterations. So why does it improve accuracy?

Although deep neural networks don’t usually converge to a global minimum, there is a notion of ‘good’ and ‘bad’ local minima in terms of generalization. Keskar et al. [6] argue that local minima with flat basins tend to generalize better. It should be intuitive that sharp minima are not the best, because slight changes to the weights tend to change the model’s predictions dramatically. If the learning rate is large enough, the intrinsic random motion across gradient steps prevents the optimizer from settling into any of the sharp basins along its optimization path. If the learning rate is small, however, the model tends to converge into the closest local minimum. Increasing the learning rate from time to time therefore helps the optimization algorithm escape sharp minima, so it ends up converging to a ‘good’ set of weights.

In their paper, I. Loshchilov and F. Hutter show that their SGDR (Stochastic Gradient Descent with Restarts) method improves the error rates of state-of-the-art models on popular datasets.

V. And that’s not even the cherry on top.

As if it didn’t provide enough advantages already, SGDR gives you one more. Inspired by SGDR, Gao Huang, Yixuan Li, et al. [7] wrote a follow-up paper, “Snapshot Ensembles: Train 1, Get M for Free”, in which they show how to get even better results when using warm restarts with gradient descent.

It is known that the number of local minima grows exponentially with the number of parameters, and modern deep neural nets can have millions of them. The authors show that while most of these minima yield similar error rates, the corresponding networks tend to make different mistakes. This diversity can be exploited through ensembling: train several neural networks with different initializations, and, unsurprisingly, they will converge to different solutions. Averaging the predictions of these models leads to drastic reductions in error rates.

Gao Huang et al. were able to get an ensemble of networks for the cost of training a single one. They exploit the fact that at the end of each cycle (at least the later ones) the network has converged to some local minimum, or is close to one; upon ‘restarting’, the model will most likely jump out and start converging towards a different optimum. The authors trained their models with SGDR and saved the weights at the end of each cycle, then built the ensemble from the networks of the last M cycles. Their experiments showed that the local minima the model converges to are diverse enough that the snapshots do not overlap in the examples they misclassify. Using this method improved the error rates of state-of-the-art models even further.
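
A sketch of the procedure, under the same assumptions as the earlier snippets: train_step is a hypothetical helper that performs one mini-batch update, the model object is copyable and exposes an assumed predict method, and sgdr_lr is the cosine-annealing helper sketched above.

```python
import copy

def train_snapshot_ensemble(model, train_step, num_cycles, cycle_len,
                            min_lr, max_lr, keep_last_m):
    """Train with warm restarts, snapshotting the weights at each cycle's end."""
    snapshots = []
    for _ in range(num_cycles):
        for step in range(cycle_len):
            lr = sgdr_lr(step, cycle_len, min_lr, max_lr)  # anneal within the cycle
            train_step(model, lr)
        snapshots.append(copy.deepcopy(model))  # model now sits near a local minimum
    return snapshots[-keep_last_m:]  # the last M snapshots form the ensemble

def ensemble_predict(snapshots, x):
    """Average the soft predictions of all saved snapshots."""
    preds = [m.predict(x) for m in snapshots]
    return sum(preds) / len(preds)
```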

VI. Conclusion

Using the techniques described in this post, we can almost fully automate the way we work with the learning rate and actually get better results. Although these techniques have been around for a while, Jeremy Howard mentioned that not many researchers actually use them, which is a pity considering how advantageous they are.

I want to say thank you to the fast.ai staff for creating these amazing courses and giving so many people the opportunity to learn. The community they have created is amazingly helpful.

References

[1] fast.ai

[2] Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. arXiv preprint arXiv:1506.01186, 2015.

[3] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. RMSProp and Equilibrated Adaptive Learning Rates for Non-Convex Optimization. arXiv preprint arXiv:1502.04390, 2015.

[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org/

[5] I. Loshchilov and F. Hutter. SGDR: Stochastic Gradient Descent with Restarts. arXiv preprint arXiv:1608.03983, 2016.

[6] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv preprint arXiv:1609.04836, 2016.

[7] Gao Huang, Yixuan Li, et al. Snapshot Ensembles: Train 1, Get M for Free. arXiv preprint arXiv:1704.00109, 2017.
