Exploring Stochastic Gradient Descent with Restarts (SGDR)

Figure: an example of cyclic learning rates, from the paper "Snapshot Ensembles: Train 1, Get M for Free"

This is my first deep learning blog post. I started my deep learning journey around January of 2017 after I heard about fast.ai from a presentation at a ChiPy meetup I attended (the Chicago Python user group). It was my first introduction to the topic; the only thing I knew about neural networks before were simple multilayer perceptron (MLP) models that acted as decent non-linear approximators for various prediction problems. Once I started my first couple of lessons, I was completely hooked by the amount of knowledge exploding out of this subfield, especially because I had recently received my master's degree in analytics and was trying to stay up to date and become the best data science practitioner I could be. I am currently going through the third fast.ai course, part1_v2, which uses a framework built on top of PyTorch. One of the most important skills I have been picking up from these courses is the ability to break down research papers coming out in the field and to understand and implement them. One of the interesting concepts I have seen come out of this is the idea of Stochastic Gradient Descent with Restarts, or SGDR.

Before we get too far into SGDR, let's first briefly cover the idea behind normal stochastic gradient descent and why we use it.

In deep learning, you may have an architecture with millions of parameters, and those millions of parameters are connected via functions that take some input and try to predict the output. When training a model, we know what the output should be, since we pre-label our data. This leads to the idea of a "loss function": a function that measures the deviation of our predictions from the actual values we know to be correct. Common examples of loss functions include Mean Squared Error and Log Loss.
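As a quick illustration, here is a minimal sketch of Mean Squared Error in plain Python (using NumPy for the array math); the toy predictions and labels are made up just for this example:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Toy example: the closer the predictions are to the labels, the lower the loss.
print(mean_squared_error([1.0, 0.0, 1.0], [0.9, 0.2, 0.8]))  # 0.03
```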

You can think of the entire idea behind deep learning as setting up some loss function and updating the many parameters the model has, in some non-random way, to minimize that loss function.

Visualized, you can see in the example to the left that with 2 parameters we have some non-linear loss function. If our parameters happen to be initialized at the blue dot, then we need a way to move around parameter space to the lowest point. This is where gradient descent comes in, as our method of updating the parameters and "searching" for the lowest point. There are many different methods of applying gradient descent to this search, such as Momentum, Nesterov, Adagrad, AdaDelta, Adam, and more. Each has slight nuances in how you take your next "blind" step around this hilly terrain. I won't get too far into the math behind gradient descent and those various optimizers, but if you are interested, you can find a superb explanation of the math here. What all of these optimizers do have in common, though, is a technique called "annealing" of the learning rate (how small or large a step to take), which essentially says that we can shrink our learning rate the longer we train.
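In code, the core update they all share looks roughly like the sketch below: step against the gradient, scaled by a learning rate that is annealed over time. The gradient function and the simple 1/t decay schedule here are stand-ins I chose for illustration, not any particular optimizer's rule:

```python
def sgd_with_annealing(theta, grad_fn, lr0=0.1, decay=0.01, steps=100):
    """Vanilla gradient descent with a simple 1/t learning rate anneal."""
    for t in range(steps):
        lr = lr0 / (1 + decay * t)           # learning rate shrinks over time
        theta = theta - lr * grad_fn(theta)  # step against the gradient
    return theta

# Toy quadratic loss L(theta) = theta**2, whose gradient is 2 * theta.
print(sgd_with_annealing(3.0, lambda th: 2 * th))  # converges toward 0
```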

Take for example this picture, where the red line is the loss function (think of f(x) = x², which passes through the points below). If we start at the point (3, 9), find the direction of decreasing loss, and take a step of size 2 in that direction, we end up at (1, 1). That's great, but if we take another step following the same rule, we end up at (-1, 1). Apply the rule once more and we are right back at (1, 1). If instead we shrink our step size as we get closer to the minimum, we can eventually end up near (0, 0), the actual minimum.
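Here is a quick numeric sketch of that behavior on f(x) = x² (the function and step schedules are mine, chosen to match the numbers above): a fixed step of 2 oscillates between 1 and -1 forever, while a shrinking step closes in on the minimum at 0:

```python
def descend(x, step_fn, n_steps=8):
    """Step in the direction of decreasing loss for f(x) = x**2."""
    path = [x]
    for t in range(n_steps):
        direction = -1 if x > 0 else 1  # downhill direction for x**2
        x = x + step_fn(t) * direction
        path.append(x)
    return path

print(descend(3, lambda t: 2))            # [3, 1, -1, 1, -1, ...] stuck oscillating
print(descend(3, lambda t: 2 / (t + 1)))  # steps shrink, path closes in on 0
```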

These optimizers all follow their own pattern for finding this minimum loss by giving rules on how to adjust the step size, but what if we get stuck in a local minimum of the function? Even though some optimizers use ideas like momentum to gain the ability to swing out of a local minimum, they are not always successful. So, a possible solution? SGDR.

The idea behind SGDR, as shown in this image, is that instead of only adding various forms of learning rate decay, we reset our learning rate every so many iterations so that we can more easily pop out of a local minimum if we appear stuck. This has seemed to be quite an improvement in various situations compared to normal mini-batch SGD.
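A minimal sketch of that schedule, following the cosine annealing formula from the SGDR paper (Loshchilov & Hutter); the specific learning-rate bounds and cycle length below are values I picked for illustration:

```python
import math

def sgdr_lr(iteration, cycle_len, lr_max=0.1, lr_min=0.001):
    """Cosine-annealed learning rate that restarts every cycle_len iterations."""
    t_cur = iteration % cycle_len  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / cycle_len))

# The rate decays smoothly, then jumps back up to lr_max at each restart.
for i in range(0, 300, 50):
    print(i, round(sgdr_lr(i, cycle_len=100), 4))
```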

An idea that takes this further: since we will most likely get closer to a global minimum the more passes we make through our dataset, we want a way to lengthen the time spent decreasing the learning rate. Instead of restarting every epoch, we can lengthen the cycles multiplicatively, so that the first cycle decreases over the span of 1 epoch, the second over the span of 2 epochs, the third over 4 epochs, and so on. You can visualize this by taking the above graph and stretching each decrease cycle out to twice the iterations of the previous one. This tends to perform even better than the plain SGDR above, based on a couple of experiments I've run.
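Continuing the sketch above, the only change needed is to grow the cycle length after each restart; the doubling factor and cycle sizes here are again just illustrative choices:

```python
import math

def sgdr_schedule(total_iters, first_cycle=100, mult=2, lr_max=0.1, lr_min=0.001):
    """Full learning-rate schedule where each cycle is mult times longer than the last."""
    lrs, cycle_len, t_cur = [], first_cycle, 0
    for _ in range(total_iters):
        lrs.append(lr_min + 0.5 * (lr_max - lr_min)
                   * (1 + math.cos(math.pi * t_cur / cycle_len)))
        t_cur += 1
        if t_cur == cycle_len:               # restart: reset position, double length
            t_cur, cycle_len = 0, cycle_len * mult
    return lrs

lrs = sgdr_schedule(700)  # three cycles of 100, 200, and 400 iterations
```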

Hopefully this helps show that many of the cutting-edge ideas making deep learning models perform better are relatively small modifications of the base methodologies, are easy to understand, and are fairly easy to implement.