Training Neural Networks up to 10x Faster

This is an excerpt from the object detection software wiki. I am currently implementing some of the most cutting-edge object detection algorithms here and will blog about my experience implementing them.

Today we talk about how to train neural networks much faster than the norm using two techniques: cyclic learning rates and superconvergence.

The training methodology for the project aims to make training easy and fast. Recently, techniques like superconvergence and cyclic learning rates have greatly reduced the time a model takes to converge.

The object detection software aims to use state-of-the-art training techniques. During the search for the best methods, I chanced upon this technique, which is used by the fastai library.

Learning Rate Finder

First introduced by Leslie N. Smith (link) of CLR (link) fame, the learning rate finder seeks to find the optimal learning rate to start training with for any dataset/architecture.

The technique is quite simple. For one epoch,

  1. Start with a very small learning rate (around 1e-8) and increase it linearly on a log scale, that is, multiply it by a constant factor after every mini-batch.
  2. Record and plot the loss at each LR step.
  3. Stop the learning rate finder when the loss stops going down and starts increasing.

Some questions come to mind:

  1. How do you increase the LR linearly?
  2. When do you stop this process if the loss doesn’t stop decreasing?

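Ans 1: The LR follows a multiplicative policy, which is linear on a log scale. We have an initial learning rate (~1e-8) and a final learning rate (something large, such as 8), and the LR is multiplied by the same constant factor after every mini-batch. Some sample code is given as follows: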

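init_value, final_value = 1e-8, 8.0 # initial and final learning rates
num = 100 # the number of mini batches in an epoch
mult = (final_value / init_value) ** (1/num) # the lr multiplier
old_lr = init_value # learning rate used for the first mini batch
new_lr = old_lr*mult # repeat this update after every mini batch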
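
To check that the multiplier above really takes the learning rate from the initial value to the final value in num steps, the whole schedule can be written out. This is only a small sketch; the helper name lr_range is made up for illustration.

```python
def lr_range(init_value, final_value, num):
    """Learning rate before each of the num mini-batches, plus the end value."""
    mult = (final_value / init_value) ** (1 / num)  # constant per-step multiplier
    return [init_value * mult ** i for i in range(num + 1)]


schedule = lr_range(1e-8, 8.0, 100)
# Ratio between consecutive learning rates; it is the same at every step.
ratios = [b / a for a, b in zip(schedule, schedule[1:])]
```

With these values every ratio is roughly 1.23, so the learning rate grows by about 23% per mini-batch, and the last entry of schedule lands on the final value of 8 up to floating point error.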

Ans 2: This is a very rare scenario. If it does happen, run the LR finder again with final_value increased by some factor. Generally, a good final learning rate is around 8.

The learning rate finder is great for finding a good learning rate to start with, and it also gives more insight into how the model converges on the dataset.

We use the information from this learning rate finder to implement our 1 Cycle Policy, which is described in the next section.

The code for the learning rate finder is given as follows:
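
A minimal sketch of such a finder in plain Python, under stated assumptions: train_step is a hypothetical callback that runs one mini-batch update at the given learning rate and returns the loss, the one-weight toy model only stands in for a real network, and the loss smoothing and the "4x the best loss" stopping rule are assumed conventions, not something this project prescribes.

```python
import math
import random


def lr_find(train_step, init_value=1e-8, final_value=8.0, num=100, beta=0.98):
    """Run a learning rate range test and return (lrs, smoothed_losses).

    train_step(lr) is a hypothetical callback: it performs one mini-batch
    update at learning rate lr and returns that batch's loss.
    """
    mult = (final_value / init_value) ** (1 / num)  # constant per-step multiplier
    lr, avg_loss, best_loss = init_value, 0.0, math.inf
    lrs, losses = [], []
    for step in range(1, num + 1):
        loss = train_step(lr)
        # Exponentially weighted average with bias correction, to smooth the curve.
        avg_loss = beta * avg_loss + (1 - beta) * loss
        smoothed = avg_loss / (1 - beta ** step)
        # Assumed stopping rule: the loss is no longer finite or exceeds 4x the best so far.
        if not math.isfinite(smoothed) or smoothed > 4 * best_loss:
            break
        best_loss = min(best_loss, smoothed)
        lrs.append(lr)
        losses.append(smoothed)
        lr *= mult
    return lrs, losses


def make_toy_train_step(seed=0):
    """Toy stand-in for a real model: fit y = 3x with one weight and plain SGD."""
    rng = random.Random(seed)
    w = [0.0]

    def train_step(lr):
        x = rng.uniform(-1.0, 1.0)
        error = w[0] * x - 3.0 * x
        w[0] -= lr * 2.0 * error * x  # gradient of error squared with respect to w
        return error * error

    return train_step


lrs, losses = lr_find(make_toy_train_step())
```

Plotting losses against lrs with a log-scaled x axis gives the kind of curve the finder is read from. For a real network, train_step would wrap a forward pass, a backward pass and an optimiser step on the next mini-batch.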

The graph produced by running this function is given below.

From this graph, a learning rate of 0.01 seems to be a good value for training the network. The maximum learning rate we could choose would be around 0.1.
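
Reading the values off the graph can also be roughed out in code. The sketch below assumes one common heuristic, which is not something stated by this project: treat the learning rate at the lowest recorded loss as the maximum, and one tenth of it as the safe value. The function name suggest_lr and the numbers are made up purely for illustration and are not measurements.

```python
def suggest_lr(lrs, losses, safety_factor=10.0):
    """Return (good_lr, max_lr) from a learning rate finder run.

    Assumed heuristic: max_lr is the learning rate at the lowest loss,
    and good_lr is max_lr divided by safety_factor.
    """
    best_index = min(range(len(losses)), key=losses.__getitem__)
    max_lr = lrs[best_index]
    return max_lr / safety_factor, max_lr


# Made-up finder output, for illustration only.
example_lrs = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
example_losses = [2.30, 2.25, 1.90, 1.60, 4.80]
good_lr, max_lr = suggest_lr(example_lrs, example_losses)
```

A heuristic like this is only a starting point; the plotted curve should still be inspected by eye.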

1 Cycle Policy

The 1 Cycle Policy states that one can converge a model in far less time than the norm using cyclic learning rates. Using this methodology, we aim to achieve something called superconvergence.

Superconvergence is the training of a model in far less time than the norm (up to an order of magnitude less) on the same hardware.

An example of superconvergence can be seen when training on the CIFAR-10 dataset.

Using the 1 Cycle Policy, the network converges to 93% accuracy in 20 epochs with a batch size of 150.

This is vastly better than the 500 epochs needed with manual learning rate steps (0.1 for epochs 1–150, 0.01 for epochs 150–300, and 0.001 for epochs 300–500), also with a batch size of 150.
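
For comparison, the manual step schedule can be written as a small function. This is a sketch that assumes epochs are numbered from 1 and that each boundary epoch (150 and 300) still belongs to the earlier, higher learning rate, since the ranges quoted above overlap at those points.

```python
def step_lr(epoch):
    """Learning rate for a 1-indexed epoch under the manual step schedule."""
    if epoch <= 150:
        return 0.1
    if epoch <= 300:
        return 0.01
    return 0.001


# Learning rate for each of the 500 epochs of the baseline run.
baseline_schedule = [step_lr(epoch) for epoch in range(1, 501)]
```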

All results were obtained in a Colab notebook with a single GPU (Nvidia Tesla K80, ~11 GB of memory).

The one cycle policy is given as follows:

Note: Let's call the learning rate found by the LR finder L.

  1. Start with a learning rate of L/10, one tenth of the learning rate found by the learning rate finder.
  2. Train the model, increasing the LR linearly after each epoch until it reaches L.
  3. After reaching L, decrease the LR back down to L/10.
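
The three steps above can be sketched as a per-epoch schedule. This is an illustration under assumptions, not the project's implementation: the function name one_cycle_lrs is hypothetical, the ramp down is assumed to be linear and to mirror the ramp up, and the value 0.01 for L is only an example.

```python
def one_cycle_lrs(found_lr, num_epochs, div_factor=10.0):
    """Per-epoch learning rates: up from found_lr/div_factor to found_lr, then back."""
    if num_epochs < 3:
        raise ValueError("need at least 3 epochs for a ramp up and a ramp down")
    min_lr = found_lr / div_factor
    peak = (num_epochs - 1) / 2  # position of the peak, possibly between two epochs
    lrs = []
    for epoch in range(num_epochs):
        distance = abs(epoch - peak) / peak  # 1.0 at both ends, 0.0 at the peak
        lrs.append(found_lr - (found_lr - min_lr) * distance)
    return lrs


# Example: L = 0.01 spread over 21 epochs, so the peak falls on the middle epoch.
schedule = one_cycle_lrs(found_lr=0.01, num_epochs=21)
```

The resulting list starts at L/10, rises in equal steps to L at the middle epoch, and falls back to L/10 at the last epoch. Before each epoch, the optimiser's learning rate would be set to the matching entry.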

The graph of the learning rate over the course of training looks as follows:

Spread this learning rate schedule across roughly 20–30 epochs and check the validation accuracy. If the model doesn’t reach the required accuracy, increase the number of epochs.

Some results achieved on the CIFAR-10 dataset while experimenting with this approach are given here.

However, this approach leaves many questions open. They are listed as follows:

  1. What optimiser should one use?
  2. What batch size should one select?
  3. Does this work for both big and small datasets?
  4. Which model architectures don’t support the 1 Cycle Policy?
  5. Should you test cyclic momentum?

These questions need solid answers. We shall research them properly and arrive at a solid set of rules for training networks.