Exploring the Learning Rate Survey
This post will showcase a method to gather important information about how a deep learning model behaves under different learning rates. After a little bit of background, we’ll explore
* run-to-run variability of learning rate surveys.
* the effect of survey duration.
* how batch size and learning rate affect model training.
We end by showing the results of a grid search of learning rates, confirming the utility of the learning rate survey. The code used to generate the figures in this post is available on GitHub here.
Background
In deep learning, the learning rate is widely considered one of the most important hyperparameters to optimize. Many guides and tutorials suggest performing a grid search for the optimum learning rate: choose n learning rate values between some minimum and maximum and fully train the model at each of them. But this raises the question, “What are good values for the minimum and maximum learning rate?”
In the paper Cyclical Learning Rates for Training Neural Networks, Leslie Smith suggests a strategy.
> It is a “LR range test”; run your model for several epochs while letting the learning rate increase linearly between low and high LR values. […] Next, plot the accuracy versus learning rate. Note the learning rate value when the accuracy starts to increase and when the accuracy slows, becomes ragged, or starts to fall. These two learning rates are good choices for bounds.
Let’s take this strategy and modify it very slightly to arrive at what we’ll call the “learning rate survey”. First, we increase the learning rate exponentially from extremely small values instead of linearly, which lets us better sample the lower range of learning rates. Second, we plot the training loss instead of the accuracy, since accuracy isn’t always available whereas we always compute the training loss (at least for any supervised learning application). Full disclosure: these modifications are not mine; I originally saw them in the fantastic online course from fast.ai.
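To make the exponential ramp concrete, here is a minimal sketch of the per-step schedule; the function name and the 1000-step count are illustrative, not taken from the post’s code:

```python
def survey_lr(step, total_steps, lr_start=1e-4, lr_end=10.0):
    """Learning rate at a given survey step, growing exponentially
    (i.e., linearly in log-space) from lr_start to lr_end."""
    return lr_start * (lr_end / lr_start) ** (step / total_steps)

# Unlike a linear ramp, this schedule spends as many steps between
# 1e-4 and 1e-3 as it does between 1 and 10.
lrs = [survey_lr(step, total_steps=1000) for step in range(1000)]
```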
Example Learning Rate Survey
Let’s carry out a learning rate survey, training the model for 1 epoch with a learning rate that rises exponentially from 1e-4 up to 10. If the training loss rises above 3× its lowest value, we stop early.
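A survey along these lines could be implemented roughly as follows. This is a sketch assuming a PyTorch model, DataLoader, loss function, and plain SGD; the actual code linked above may differ:

```python
import math
import torch

def lr_survey(model, train_loader, criterion, num_steps=None,
              lr_start=1e-4, lr_end=10.0, divergence_factor=3.0,
              device="cuda"):
    """One-pass learning rate survey: the LR grows exponentially from
    lr_start to lr_end, and the run stops early once the training loss
    exceeds divergence_factor times its lowest value so far."""
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start)
    total_steps = num_steps or len(train_loader)
    lr_mult = (lr_end / lr_start) ** (1.0 / total_steps)

    lr, best_loss = lr_start, math.inf
    lrs, losses = [], []
    for step, (x, y) in enumerate(train_loader):
        optimizer.param_groups[0]["lr"] = lr
        optimizer.zero_grad()
        loss = criterion(model(x.to(device)), y.to(device))
        loss.backward()
        optimizer.step()

        lrs.append(lr)
        losses.append(loss.item())
        best_loss = min(best_loss, loss.item())
        if loss.item() > divergence_factor * best_loss or step + 1 >= total_steps:
            break  # diverging, or we've used up the step budget
        lr *= lr_mult
    return lrs, losses
```

Plotting the recorded loss against the learning rate (with the learning rate on a log scale, and usually with some smoothing of the loss) produces the survey curves discussed in the rest of this post.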
Here we can see that the model isn’t improving much for learning rates below about 1e-2. The loss improves for higher learning rates until it begins to get unstable and diverge at around 2 or 3.
Run-to-Run Variability
A number of sources of randomness will change the results of a survey: random model parameter initialization, random data sampling, and subtler sources such as the non-deterministic convolution algorithms in the cuDNN library. So let’s take a look at how much the learning rate survey changes from run to run.
Variability grows as the learning rate grows. The learning rate at which the model begins to diverge also varies somewhat, but generally by less than a factor of 2 or 3.
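If you instead wanted to suppress most of this run-to-run variability, the controllable sources of randomness can be pinned down. Here is a sketch assuming PyTorch; note that even with these settings, some GPU operations remain non-deterministic:

```python
import random
import numpy as np
import torch

def seed_everything(seed=0):
    """Fix the controllable sources of randomness between survey runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)            # CPU RNG (and seeds new GPU RNGs)
    torch.cuda.manual_seed_all(seed)   # all current GPU RNGs
    # Trade some speed for reproducible cuDNN convolution algorithms.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```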
Learning Rate Survey Duration
In the cyclical learning rate paper, they suggest running the LR range test for 8 epochs or so. Let’s see how the survey changes as we adjust the duration.
As expected, the loss begins to drop earlier when the survey includes more training steps (longer duration). The minimum value of the loss is also lower for longer surveys, again because the model parameters receive more training updates. Importantly, though, it appears that divergence occurs consistently above LR=2 regardless of the survey duration. It’s hard to say whether the location of the minimum shifts significantly; perhaps a little, but not by much.
Learning Rate and Batch Size
Various papers such as Large Batch Training of Convolutional Networks and Don’t Decay the Learning Rate, Increase the Batch Size suggest that the batch size is another important hyperparameter to explore. How does the learning rate survey change with increasing batch size?
The top plot shows that some interesting changes occur as the batch size increases, but it’s hard to disentangle them from the differences due to having more training steps in each survey. In the plot below it, we fix the number of training steps per survey, and the effects of increased batch size become clearer. Consistent with the conclusions of the papers mentioned above, the survey reveals that higher learning rates can be used when the batch size is larger. It also shows that below a certain batch size, the normalizing effect of small batches overwhelms training even at low learning rates and the model becomes untrainable.
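A fixed-step comparison like the one in the lower plot could be set up roughly as shown below, reusing the lr_survey sketch from earlier; the batch sizes and step count are illustrative, and make_model, train_dataset, and criterion stand in for the post’s model, dataset, and loss:

```python
from torch.utils.data import DataLoader

# Rerun the survey at several batch sizes while holding the number of
# optimizer steps fixed, so the resulting curves are directly comparable.
fixed_steps = 500
results = {}
for batch_size in [4, 16, 64, 256]:
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    results[batch_size] = lr_survey(make_model(), loader, criterion,
                                    num_steps=fixed_steps)
```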
Comparison to Grid Search
Let’s put it all together and see the results of training a simple model on the FashionMNIST dataset. All of the learning rate surveys shown so far have been on the MNIST dataset, so let’s first compare and contrast training the model on these two datasets.
It is clear that training a classifier on the FashionMNIST dataset will be different from training the same model on the MNIST dataset. The survey indicates that the loss will not fall as fast (FashionMNIST is known to be harder to classify than MNIST). Perhaps surprisingly, smaller learning rates appear to lower the loss more effectively for FashionMNIST relative to regular MNIST. Finally, the survey shows that for high learning rates FashionMNIST is much more likely to diverge during training than MNIST.
Based on this learning rate survey, let’s look at a grid search covering three orders of magnitude, from 0.001 up through 1.
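The grid search itself amounts to something like the following sketch, where the ten log-spaced values are illustrative and make_model, train_to_convergence, and evaluate_accuracy stand in for the post’s training and evaluation code:

```python
import numpy as np

# Fully train a freshly initialized model at each candidate learning rate
# spanning three orders of magnitude, then compare final test accuracy.
learning_rates = np.logspace(-3, 0, num=10)   # 0.001 ... 1.0
accuracy = {}
for lr in learning_rates:
    model = make_model()                      # fresh initialization per run
    train_to_convergence(model, lr=lr)
    accuracy[lr] = evaluate_accuracy(model)
```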
The grid search shows that learning rates from about 0.02 through 1.0 all reach about the same final classification accuracy. Consistent with the learning rate survey, values below 0.02 show reduced efficacy.
The number of training steps needed to fully train the model decreases as the learning rate increases. Therefore for maximum training efficiency, choose the highest learning rate that is stable.
Conclusion
We’ve seen that the learning rate survey can help to identify the behavior of a deep learning model under varying learning rates. It is particularly useful to determine the maximum stable learning rate, since training is most efficient with high learning rates.
Next Steps
The training in this post used a learning rate that was held constant until the training loss stopped falling, and was then dropped by a factor of 10. Training continued like this, letting the learning rate drop three times before ending. In future posts we’ll examine how this popular strategy compares to other learning rate schedules, such as cyclical learning rates and exponential learning rate decay.
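As a rough sketch of that schedule, assuming PyTorch, the built-in ReduceLROnPlateau scheduler can serve as a stand-in for the plateau detection described above; the patience and stopping threshold are illustrative, and model, train_one_epoch, and initial_lr are placeholders:

```python
import torch

# Hold the LR constant until the training loss plateaus, then drop it by
# 10x, stopping the run after three such drops.
optimizer = torch.optim.SGD(model.parameters(), lr=initial_lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3)

# Stop once the LR has been reduced by 10x three times (1000x overall).
while optimizer.param_groups[0]["lr"] > initial_lr * 1e-3 * 1.01:
    train_loss = train_one_epoch(model, optimizer)
    scheduler.step(train_loss)   # drops the LR when the loss stops falling
```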
We may look at more modern architectures such as ResNets, and at more difficult (but still small) datasets like CIFAR-100 and SVHN.
We could also examine the relationship between learning rate and batch size more closely.
Please leave a comment if you have any suggestions for what we should investigate next.