Deep learning convolutional networks have had many improvements not directly related to architecture. This paper examines a collection of tricks that clearly improve performance at almost no complexity cost.
Many of these tricks have been added to fastai 😍.
The paper presents four techniques for training networks effectively with large batch sizes.
Linear scaling learning rate
Since larger batch sizes mean a lower variance (lower noise) in the gradient of SGD, we can be more confident that the gradient points in a promising direction. Thus, it makes sense to increase the learning rate along with the batch size. Increasing the learning rate linearly with the batch size has been shown empirically to work for ResNet-50 training.
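As a rough sketch of the rule, the snippet below scales a reference learning rate in proportion to the batch size. The reference point of 0.1 at batch size 256 is the usual ResNet-50 starting point; the function name and defaults are only illustrative.

```python
def scaled_learning_rate(batch_size, base_lr=0.1, base_batch_size=256):
    """Scale the learning rate linearly with the batch size.

    base_lr and base_batch_size are an assumed reference point.
    """
    return base_lr * batch_size / base_batch_size


for batch_size in (256, 512, 1024):
    print(batch_size, scaled_learning_rate(batch_size))
```

Doubling the batch size doubles the learning rate, so a batch size of 1024 would start at four times the reference rate.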
Learning rate warmup
At the beginning of training the weights typically have random values and are far away from the final solution. Using a learning rate that is too high may result in numerical instability. The trick here is to use a low learning rate initially and increase it once the training is stable.
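One simple way to do this is a linear warmup, where the learning rate grows from near zero to its target value over the first few batches. The sketch below assumes a linear ramp; the function and parameter names are made up for illustration.

```python
def warmup_learning_rate(step, warmup_steps, target_lr):
    """Linear warmup: ramp up to target_lr over the first warmup_steps batches.

    step is the zero-based index of the current batch.
    """
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps


for step in range(7):
    print(step, warmup_learning_rate(step, warmup_steps=5, target_lr=0.4))
```

After the warmup period the function simply returns the target rate, so it can be combined with any decay schedule that takes over from there.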
The residual blocks in ResNet have an output to which the input of the block is added:
x + block(x)
Sometimes the last layer in the block is batch normalization which normalizes the value and then performs a scale transformation. If the normalized value is x_hat the output of the batch normalization layer is:
γ · x_hat + β
where γ and β are initialized at 1 and 0. If we instead initialize γ as 0, the residual blocks start by just returning the input, effectively reducing the number of layers and making the network easier to train. Also, the network will only modify the value of γ if the transformation in the residual block is worth it (i.e. improves performance), and this avoids unnecessary computation.
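To see why a zero scale turns the block into an identity at the start of training, here is a toy sketch in plain Python. The body of the block is a made-up stand-in (any transformation would do), and `gamma` and `beta` are the scale and shift of the final batch normalization layer.

```python
def batch_norm_affine(x_hat, gamma, beta):
    """Scale-and-shift step applied after normalization."""
    return [gamma * v + beta for v in x_hat]


def residual_block(x, gamma, beta=0.0):
    """x + block(x), where block ends in a batch norm scale and shift."""
    # Hypothetical stand-in for the normalized output of the block's layers.
    x_hat = [3.0 * v - 1.0 for v in x]
    branch = batch_norm_affine(x_hat, gamma, beta)
    return [xi + bi for xi, bi in zip(x, branch)]


x = [0.5, -1.0, 2.0]
print(residual_block(x, gamma=0.0))  # the block passes the input through
print(residual_block(x, gamma=1.0))  # the block adds its transformation
```

With `gamma=0.0` the branch contributes nothing, so the output equals the input regardless of what the layers inside the block compute.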
No bias decay
It is recommended not to apply any regularization (or weight decay) to the bias or batch normalization parameters.
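In practice this means splitting the parameters into two groups and giving only one of them a weight decay. The sketch below works on (name, shape) pairs rather than on a real model, and it assumes that every one-dimensional parameter is a bias or a batch normalization scale/shift. That heuristic and the parameter names are assumptions for illustration.

```python
def split_for_weight_decay(named_shapes):
    """Return (decay, no_decay) lists of parameter names.

    Assumption: parameters with at most one dimension are biases or
    batch normalization scale/shift vectors and get no weight decay.
    """
    decay, no_decay = [], []
    for name, shape in named_shapes:
        if len(shape) <= 1:
            no_decay.append(name)
        else:
            decay.append(name)
    return decay, no_decay


# Hypothetical parameter list for a small network.
params = [
    ("conv1.weight", (64, 3, 7, 7)),
    ("bn1.weight", (64,)),
    ("bn1.bias", (64,)),
    ("fc.weight", (1000, 2048)),
    ("fc.bias", (1000,)),
]
decay, no_decay = split_for_weight_decay(params)
print("decay:", decay)
print("no decay:", no_decay)
```

The two lists can then be passed to the optimizer as separate parameter groups, with the weight decay set to zero for the second one.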
New hardware offers serious improvements in speed when using FP16 rather than FP32 (on an Nvidia V100, training in FP16 is 2 to 3 times faster). However, FP16 has a narrow numeric range, so values may overflow and disrupt the training process.
The suggestion to overcome this is to store parameters and activations in FP16 and use FP16 to compute gradients. All parameters have a copy in FP32 for parameter updates. For a detailed explanation see.
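One reason for keeping a higher-precision copy of the parameters is that small updates can be rounded away entirely in FP16. The toy sketch below illustrates only that effect: it uses the standard library's half-precision format to round values to FP16, and a regular Python float stands in for the FP32 master copy. The update size and step count are arbitrary.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest FP16 (IEEE 754 half) value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def train(steps, update, keep_master_copy):
    """Apply the same small update repeatedly and return the FP16 weight."""
    weight_fp16 = to_fp16(1.0)
    master = 1.0  # higher-precision copy (stand-in for FP32)
    for _ in range(steps):
        if keep_master_copy:
            master -= update                 # update the precise copy
            weight_fp16 = to_fp16(master)    # then cast down to FP16
        else:
            weight_fp16 = to_fp16(weight_fp16 - update)
    return weight_fp16


print(train(1000, 1e-4, keep_master_copy=False))
print(train(1000, 1e-4, keep_master_copy=True))
```

Without the master copy, each update of 0.0001 is smaller than the gap between neighbouring FP16 values near 1, so the weight never moves. With the master copy, the updates accumulate and the FP16 weight ends up close to 0.9.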
These tweaks help increase validation accuracy in ResNet-50 without a significant computational cost (~3% longer to train).
The first improvement consists of changing the stride in the convolutional layers. The first layer in Path A is a 1x1 convolution with a stride of 2, which means that it discards 3/4 of the input’s pixels. To avoid this, the stride of this layer can be changed from 2 to 1 and the stride of the next layer from 1 to 2, which compensates and conserves the output dimensions.
Since the next layer has a kernel size of 3x3, even with a stride of 2 the layer takes advantage of all the input information.
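A quick way to check both claims is to compute the output size and the share of input positions each layer actually reads. The sketch below does this along one axis and squares the result for the two-dimensional case. The feature-map size of 56 and a padding of 1 for the 3x3 layer are assumptions for illustration.

```python
def conv_output_size(size, kernel, stride, padding):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def fraction_of_input_read(size, kernel, stride, padding):
    """Fraction of input positions (one axis) covered by at least one window."""
    read = set()
    for out in range(conv_output_size(size, kernel, stride, padding)):
        start = out * stride - padding
        read.update(p for p in range(start, start + kernel) if 0 <= p < size)
    return len(read) / size


size = 56  # assumed feature-map width, for illustration only

# Original Path A: 1x1 conv with stride 2, then 3x3 conv with stride 1.
original_out = conv_output_size(conv_output_size(size, 1, 2, 0), 3, 1, 1)
original_read = fraction_of_input_read(size, 1, 2, 0) ** 2  # both axes

# Tweaked Path A: 1x1 conv with stride 1, then 3x3 conv with stride 2.
tweaked_out = conv_output_size(conv_output_size(size, 1, 1, 0), 3, 2, 1)
tweaked_read = fraction_of_input_read(size, 3, 2, 1) ** 2

print(original_out, original_read)
print(tweaked_out, tweaked_read)
```

Both variants produce a 28-wide output, but the strided 1x1 layer reads only a quarter of its input while the strided 3x3 layer reads all of it.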
The computational cost of a convolution is quadratic in the kernel width or height. A 7 × 7 convolution is about 5.4 times as expensive as a 3 × 3 convolution (49 / 9 ≈ 5.4).
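The ratio follows directly from the number of weights each kernel applies per output position. This small sketch assumes everything else (channels, output size) is held equal; the function name is illustrative.

```python
def relative_cost(kernel_a, kernel_b):
    """Cost of a kernel_a x kernel_a conv relative to kernel_b x kernel_b."""
    return kernel_a ** 2 / kernel_b ** 2


print(round(relative_cost(7, 3), 2))  # 49 / 9
```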
This tweak consists of replacing the 7x7 convolutional layer in the input stem with three 3x3 layers, which should also make the model easier to train.
ResNet-D is an improvement similar to ResNet-B but applied to the other path. The authors replaced the stride-2 1x1 convolution in Path B with a 2x2 average pooling layer with a stride of 2 followed by a stride-1 convolution (this keeps the output dimensions intact). The authors report that this tweak does not affect speed noticeably.
Cosine Learning Rate Decay
Typically, after the learning rate warm-up described earlier, we decrease the learning rate as the training progresses (the intuition being that as you get closer to the optimum, high learning rates might move you away from it). A smooth function to describe this schedule is the cosine function which we can see above.
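The schedule itself is a one-liner. The sketch below ignores the warmup phase and assumes `step` runs from 0 to `total_steps`; the names are illustrative.

```python
import math


def cosine_learning_rate(step, total_steps, initial_lr):
    """Cosine decay from initial_lr at step 0 down to 0 at total_steps."""
    return 0.5 * (1 + math.cos(math.pi * step / total_steps)) * initial_lr


for step in (0, 25, 50, 75, 100):
    print(step, cosine_learning_rate(step, total_steps=100, initial_lr=0.1))
```

The rate falls slowly at the start, roughly linearly in the middle, and slowly again near the end of training.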
Typically the last layer of a neural network is a fully-connected layer with output dimension equal to the number of categories and a softmax activation function. If the loss is cross-entropy, the loss keeps decreasing as the prediction for the true category grows relative to the others, so the network has an incentive to make that prediction very large and the others very small, and this leads to over-fitting. Label smoothing consists of changing the target from [1, 0, 0, …] to [1-e, e/(k-1), e/(k-1), …], where e is a small constant and k is the number of categories, to reduce the polarity in the target.
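Building the smoothed target is straightforward. In the sketch below, `epsilon` plays the role of the small constant and its value of 0.1 is only an illustrative choice.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Smoothed target: 1 - epsilon for the true class, the rest shared equally."""
    off_value = epsilon / (num_classes - 1)
    target = [off_value] * num_classes
    target[true_class] = 1.0 - epsilon
    return target


print(smooth_labels(true_class=0, num_classes=5))
```

The entries still sum to 1, so the smoothed target remains a valid probability distribution.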
It is clear that with label smoothing the distribution centers at the theoretical value and has fewer extreme values.
In knowledge distillation, we use a teacher model to help train the current model, which is called the student model.
One example is using a ResNet-152 as the teacher model to help train a ResNet-50.
Knowledge distillation entails adding a term to the loss function which accounts for the difference between the student model and the teacher model to ensure that the student model does not differ too much from the teacher model.
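A minimal sketch of such a loss is shown below, in plain Python. It adds to the usual cross-entropy a second cross-entropy between the teacher's and the student's softened outputs, scaled by the squared temperature. The temperature value and all names are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def distillation_loss(true_probs, student_logits, teacher_logits, temperature=20.0):
    """Standard loss plus a term penalizing disagreement with the teacher."""
    hard = cross_entropy(true_probs, softmax(student_logits))
    soft = cross_entropy(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return hard + temperature ** 2 * soft


true_probs = [1.0, 0.0, 0.0]
teacher = [4.0, 1.0, 0.0]
print(distillation_loss(true_probs, [4.0, 1.0, 0.0], teacher))  # student agrees
print(distillation_loss(true_probs, [0.0, 1.0, 4.0], teacher))  # student disagrees
```

A student whose outputs match the teacher's gets the lower loss of the two, which is what pulls the student towards the teacher during training.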
Mixup means linearly interpolating two training examples and creating a new one.
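Here is a small sketch of the idea on plain Python lists. The mixing weight is drawn from a Beta(alpha, alpha) distribution, and the same weight is applied to the inputs and to the one-hot labels. The value of `alpha` and the function signature are illustrative assumptions.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two (input, one-hot label) pairs with a random weight lam."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


rng = random.Random(0)  # seeded only to make the example repeatable
x, y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
print(lam, x, y)
```

With a small `alpha`, most draws of `lam` land close to 0 or 1, so the new example usually stays close to one of the two originals.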
The authors trained a Fully Convolutional Network (FCN) on ADE20K and concluded that only cosine learning rate decay improved the performance in this task (2).
(1) Knowledge distillation hampers performance in two of the three architectures. According to the authors:
Our interpretation is that the teacher model is not from the same family of the student, therefore has different distribution in the prediction, and brings negative impact to the model.
(2) Why did the other improvements not improve performance?
While models trained with label smoothing, distillation and mixup favor soften labels, blurred pixel-level information may be blurred and degrade overall pixel-level accuracy.