14. Introduction to Deep Learning with Computer Vision — Learning Rates & Mathematics — Part 2 (SuperConvergence)

Inside AI
Deep-Learning-For-Computer-Vision
11 min read · Sep 20, 2020

Written by: Praveen Kumar & Nilesh Singh!

Learning Rates & Mathematics: Part 1

In the previous blog, we gained an understanding of "Learning rates, Local and Global Minima, & Saddle points." In this blog, we will look into how the learning rate can be varied to avoid certain pitfalls that a deep learning model can fall into, which decrease the overall efficiency of learning.

Finding an Optimal Starting Learning Rate

Learning rates [Source]

We discussed in the last blog that choosing a learning rate is a tricky part of tuning your model's hyper-parameters. In the above image, we see how the choice of learning rate changes the way the model learns features.

When we design a model, we have the option to choose any of the following types of learning rate:

  1. Constant learning rate: This refers to setting the learning rate to an initial value and using that same value throughout the training phase, without changing it over time.
  2. Dynamically changing learning rate: This type of learning rate varies with time and the number of epochs, and can adapt to high or low values based on the learning process. This may sound just right, but it is certainly tricky to implement in reality. We shall see it in greater depth as we move ahead in this article.

Regarding the image shown above, look at the middle plot: you can see that initially the steps (shown in red arrows) are larger and slowly decrease as we move towards the global minimum.

A systematic approach to choosing a learning rate?

There is no universally optimal learning rate or method (we know it's sad to hear 😒). Ideally, we want to start with a learning rate that yields significant decreases in the loss function. A systematic approach to finding such a learning rate is to "observe the magnitude of the loss change with different learning rates".

First, we need to gradually increase the learning rate either linearly (suggested by Leslie Smith) or exponentially (suggested by Jeremy Howard) as shown below,

Learning rate over iterations [Source]

and after each mini-batch, record the loss at each increment as shown below.

Learning rate Vs Loss [Source]

The learning rate should be set within the range where the loss decreases most drastically. In the above image, this lies somewhere between 10^-2 and 10^-1.

This technique was proposed by Leslie Smith in Cyclical Learning Rates for Training Neural Networks and evangelized by Jeremy Howard in fast.ai’s course.

If you wish to implement it for your model: Reference
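For a rough sense of what such an LR range test looks like in code, here is a minimal sketch in PyTorch on a toy regression problem; the model, synthetic data, LR bounds, and step count below are placeholder choices for illustration, not part of the original method.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1000, 10)            # toy inputs (placeholder data)
y = X.sum(dim=1, keepdim=True)       # toy targets

min_lr, max_lr, num_steps, batch_size = 1e-7, 10.0, 100, 32
gamma = (max_lr / min_lr) ** (1.0 / num_steps)    # exponential growth per mini-batch

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=min_lr)

lrs, losses, lr = [], [], min_lr
for step in range(num_steps):
    idx = torch.randint(0, X.size(0), (batch_size,))
    loss = loss_fn(model(X[idx]), y[idx])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    lrs.append(lr)
    losses.append(loss.item())

    lr *= gamma                                   # grow the LR after every mini-batch
    for group in optimizer.param_groups:
        group["lr"] = lr

# Plot losses against lrs (log-scaled x-axis) and pick a starting LR from the
# region where the loss falls fastest.
```

The idea is simply to run a short training pass while growing the learning rate geometrically every mini-batch, then inspect the recorded (learning rate, loss) pairs.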

Learning Rate Annealing, another systematic approach

Selecting a good starting learning rate is merely the first step. To efficiently train a robust model, we will need to gradually decrease the learning rate during training. If the learning rate remains unchanged during training, it might be too large to converge and cause the loss function to fluctuate around the local minimum. The approach is to use a higher learning rate to quickly reach the regions of (local) minima during the initial training stage and set a smaller learning rate as training progresses to explore “deeper and more thoroughly” in the region to find the minimum.

The most popular form of learning rate annealing is a step decay where the learning rate is reduced by some percentage after a set number of training epochs.

There is an array of methods for learning rate annealing: step-wise annealing, exponential decay, cosine annealing (strongly suggested by Jeremy Howard), etc. More details on annealing the learning rate can be found in Stanford's CS231n notes.

More generally, we can establish that it is useful to define a learning rate schedule, in which the learning rate is updated during training according to some specified rule.
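As an illustration of such schedules, here is a small sketch of three common annealing rules; the specific constants (drop factor, decay rate, epoch counts) are arbitrary example values.

```python
import math

def step_decay(epoch, base_lr=0.1, drop=0.5, epochs_per_drop=10):
    """Reduce the LR by a fixed factor every `epochs_per_drop` epochs."""
    return base_lr * (drop ** (epoch // epochs_per_drop))

def exponential_decay(epoch, base_lr=0.1, k=0.05):
    """Smoothly decay the LR as training progresses."""
    return base_lr * math.exp(-k * epoch)

def cosine_annealing(epoch, total_epochs=100, lr_min=1e-4, lr_max=0.1):
    """Follow half a cosine wave from lr_max down to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

for epoch in (0, 10, 50, 99):
    print(epoch, step_decay(epoch), exponential_decay(epoch), cosine_annealing(epoch))
```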

Cyclic Learning Rate (CLR)

1. Intuition behind CLR

In the previous approaches, we discussed that decreasing the learning rate over time would give us better learning performance. We always assumed that the loss surface would be a nicely curved bowl. However, in real life, this is almost never the case. The learning rate annealing methods discussed previously will not work well if the loss curve has many minima and maxima.

In the above image, the loss curve on the left is perfectly shaped, and annealing methods would be a great choice. However, if you consider the curve on the right, will annealing methods still be useful? Give it a thought! 😉

Consider the following 3 scenarios, which show loss curves in 2D:

Assume you are standing at the red line in the above image. In all 3 cases, you need a push (the learning rate should be increased) that could help you get out of the local minimum and continue the search for the global minimum. If you are stuck at some point and the learning rate is slowly decreased (as in annealing methods), you may never come out of any of those 3 situations. In that case, the model's accuracy would jitter around a very small range and neither decrease nor increase. So you require a learning rate that not only decreases with time (when you are near the global minimum) but also increases in case you get stuck along the path (as in the above image). Surely one can never know where we might stand at any instant of time, but one can wisely monitor the loss and accuracy curves and try out different strategies. Ultimately, our goal is to avoid plateaus or valleys on the path, so that we perform better and find the global minimum.

2. Working of CLR

The cyclic learning rate, proposed by L. Smith, makes use of varying learning rates over time. The above image shows the behavior of the triangular policy in CLR (named so simply because the patterns are triangular). There are other policies that show different behaviors. These behavior patterns are mathematically constructed. You can even construct your own function and behavior pattern for the learning rate if you have a good mathematical background.

One of the key points in understanding CLR functions is the difference between iterations and epochs. People often confuse these two terms. WE ARE HERE FOR YOU. DON'T WORRY. 👐

1 Epoch: One epoch refers to one complete pass of your entire training data through the model (both forward & backward). For example, if you have 1000 images in your training set, then 1 epoch refers to allowing all these 1000 images to pass through your model and backpropagating through all the weights, no matter what batch size you choose.

1 Iteration: This refers to one batch of data completing one forward and backward pass. It depends on the batch size. For example, if your batch size is 10, then to complete 1000 images, we need 100 iterations. Why? Because we are passing 10 images through our model at a time, so we need to pass a batch of 10 images 100 times to complete 1000 images, which then completes one epoch.
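The same arithmetic, as a tiny snippet (the numbers simply mirror the example above):

```python
num_images = 1000       # size of the training set from the example
batch_size = 10
iterations_per_epoch = num_images // batch_size   # round up if the last batch is smaller
print(iterations_per_epoch)                        # 100 iterations make up 1 epoch
```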

Let’s now understand how triangular CLR works. We have a few parameters to keep track of while using CLR.

  • Base learning rate: This is the minimum learning rate we are going to use.
  • Max learning rate: This is the maximum learning rate you want to use for your model. This can be set to a fairly large value, such as 0.5, or sometimes even 1.0 or 2.0.
  • Cycle: A cycle is a complete phase of the learning rate, where it goes from minimum to maximum and then from maximum back to minimum. We will understand more about the cycle in just a bit.
  • Step size: Step size refers to half of a cycle.

Now, to understand CLR, we have a set of formulas to follow (fear not, we will make it ultra simple for you, promise 👌). We will explain these formulas with more intuitive examples. Let's define a few values which we will use in the following steps.

Let Step Size = 1000

So, One cycle = 2 * Step Size = 2*1000 = 2000

The total number of iterations = 10010 (why couldn't it be 10000? Because it's a mad mad mad world; things are not ideal 😜. Just kidding though, we simply picked a random value).

So, total number of cycles = ceil(10010/2000) = ceil(5.005) = 6 cycles.

[NOTE: Here we have used ceil because using floor for the total number of cycles would eliminate the remaining iterations. You will find the floor function in the upcoming equations; do not confuse the two. Floor is used to get the n-th cycle number, not the total number of cycles.]

This means that throughout our model training, the CLR will have a total of 6 cycles, or simply put, 6 triangles in the graph we saw above.
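As a quick check, the cycle count above can be computed directly (the values mirror the running example):

```python
import math

step_size = 1000
total_iterations = 10010
cycle_length = 2 * step_size                        # one full cycle = 2000 iterations
total_cycles = math.ceil(total_iterations / cycle_length)
print(total_cycles)                                  # 6
```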

Let’s now understand how all the equations work in CLR.

Let’s look at the first formula:

cycle = floor(1 + iterations / (2 * step size))

This tells us which cycle we are currently in. So, in our example, it's

Cycle = floor(1 + 10010/(2*1000)) = floor(1 + 5.005) = floor(6.005) = 6, i.e., the 6th cycle.

This means we complete 5 cycles in the first 10000 iterations, and for the remaining 10 iterations, we move into the 6th cycle. If we are on the 5000th iteration, then we would be in,

cycle = floor(1 + 5000/2000) = floor(3.5) = 3, i.e., the 3rd cycle. Similarly, based on the current iteration number, we can find out which cycle we are in.
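A one-line helper reproduces this first formula (step_size defaults to the example's 1000):

```python
import math

def current_cycle(iteration, step_size=1000):
    """First CLR formula: which cycle a given iteration falls in."""
    return math.floor(1 + iteration / (2 * step_size))

print(current_cycle(10010))   # 6
print(current_cycle(5000))    # 3
```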

Let’s look at our second equation.

X = 1 - |iterations / step size - 2 * cycle + 1|

The above equation helps in (it does not calculate the final learning rate, but only helps in the process 😵) calculating the learning rate. In the equation, we have 3 variables, namely, iterations, step size & cycle (this is not the length of the cycle, but the n-th cycle in which we currently are, derived from the first equation). To understand this equation further, let's look at the image below.

We have zoomed into one cycle. Each cycle has an uphill and a downhill line. On the Y-axis, we have the learning rate. So, at each value of X on the x-axis, we have a different learning rate, which oscillates between the minimum and maximum learning rate. The second equation helps us obtain the values on the x-axis. Let's understand how.

With the help of the cycle parameter, we know which cycle we are currently in. After that, we need to find the ratio iterations/step size. This ratio will help us identify which side of the cycle we are on. Simply put, the cycle is divided into 2 parts: the left part, where the learning rate goes from minimum to maximum, and the right part, where the learning rate goes from maximum to minimum. The step size is half of the cycle, and based on the current iteration number, we will know which side of the current cycle we are on. For example, previously we found out that if we were on the 5000th iteration, we would be in the 3rd cycle. So, now we need to find where exactly in this 3rd cycle we are. That is what X will help us calculate, with the help of the step size.

So, for the 5000th iteration, we get

X = 1 - |(5000/1000) - 2(3) + 1| = 1 - |5 - 6 + 1| = 1 - 0 = 1

X = 1 indicates that we are exactly at the middle of the cycle, i.e., at the peak learning rate. If the iteration were 5500, then X would be 0.5, which means we are halfway down the descending side of the cycle; and if the iteration were slightly less than 6000, then X would be close to 0, meaning we are near the end of the current cycle, in which case the learning rate should be low (why? Just look at the above image and guess! Write down in the comments what you think the reason is). For any other iteration, we would similarly get a value in the range 0 to 1.
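A small helper reproduces these values of X (again using the example's step size of 1000):

```python
def clr_x(iteration, step_size=1000):
    """Second CLR formula: position X within the current cycle (0 = ends, 1 = peak)."""
    cycle = 1 + iteration // (2 * step_size)             # same cycle index as before
    return 1 - abs(iteration / step_size - 2 * cycle + 1)

print(clr_x(5000))   # 1.0   -> middle of the cycle, peak learning rate
print(clr_x(5500))   # 0.5   -> halfway down the descending side
print(clr_x(5990))   # ~0.01 -> almost at the end of the cycle
```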

Since we have X now, we simply have to map the value of X onto the y-axis, that is, find the corresponding learning rate.

Let’s look at the third formula, which will help us to do the same.

ηt = ηmin + (ηmax - ηmin) * max(0, X)

In the equation,

ηt = learning rate at time t

ηmin = minimum learning rate

ηmax = maximum learning rate

To find the learning rate at time t, we have 3 steps.

  1. Calculate the difference between ηmax & ηmin.
  2. Find max(0, X).
  3. Multiply the results of steps 1 and 2, and add ηmin to the product.

Thus, at each iteration, we get the corresponding learning rate, and the model trains very fast. Smith writes the main assumption behind the rationale for a cyclical learning rate (as opposed to one which only decreases) is “that increasing the learning rate might have a short term negative effect and yet achieve a long term beneficial effect.” Additionally, increasing the learning rate can also allow for “more rapid traversal of saddle point plateaus.”
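Putting the three formulas together, here is a compact sketch of the full triangular CLR schedule; base_lr = 0.001, max_lr = 0.006, and step_size = 1000 are just example values.

```python
import math

def triangular_clr(iteration, base_lr=0.001, max_lr=0.006, step_size=1000):
    """Triangular CLR: combine the three formulas into one schedule."""
    cycle = math.floor(1 + iteration / (2 * step_size))     # formula 1
    x = 1 - abs(iteration / step_size - 2 * cycle + 1)      # formula 2
    return base_lr + (max_lr - base_lr) * max(0.0, x)       # formula 3

# The learning rate climbs from base_lr to max_lr and back once every 2*step_size iterations.
for it in (0, 500, 1000, 1500, 2000, 5000):
    print(it, round(triangular_clr(it), 5))
```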

Till now, we have only seen the triangular policy of CLR. There are another two that are generally used, but we won't discuss them in much greater depth right now. Let's look at the curves they form.

The first figure in the above image shows a fixed decay value set before the start of model training; this value decreases the maximum learning rate by a constant amount with each iteration. The second figure shows an exponential decay: the max learning rate does not decrease much during the initial training phase and decreases exponentially towards the end of training. Both these policies can be mathematically tweaked. So, yes, if you can find a better tweaked version, then congratulate yourself 👏, publish a new research paper &, most importantly, do let us know in the comments. 😝
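For reference, here are sketches of the two decay variants most commonly paired with the triangular policy in CLR implementations (triangular2, which halves the amplitude every cycle, and exp_range, which decays it exponentially with the iteration count); the hyperparameters are example values, and the exact curves may differ slightly from the figures above.

```python
import math

def _cycle_and_x(iteration, step_size):
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = 1 - abs(iteration / step_size - 2 * cycle + 1)
    return cycle, x

def triangular2_clr(iteration, base_lr=0.001, max_lr=0.006, step_size=1000):
    """Amplitude is halved at the start of every new cycle."""
    cycle, x = _cycle_and_x(iteration, step_size)
    return base_lr + (max_lr - base_lr) * max(0.0, x) / (2 ** (cycle - 1))

def exp_range_clr(iteration, base_lr=0.001, max_lr=0.006, step_size=1000, gamma=0.9994):
    """Amplitude decays exponentially with the iteration count."""
    cycle, x = _cycle_and_x(iteration, step_size)
    return base_lr + (max_lr - base_lr) * max(0.0, x) * (gamma ** iteration)
```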

In the next article, we will learn about SGD and the warm-start step. We will then see how One Cycle LR works (not CLR, but One Cycle LR), which is another great learning rate policy for better, faster, and more accurate learning. We will also discuss the maths behind each of these, and cover the most popular algorithms such as SGD, Adam, and RMSProp in great detail, along with the mathematics behind them. Stay tuned!

Hope you enjoyed it. See you soon in the next article.
