Paper Review: High-Performance Large-Scale Image Recognition Without Normalization
Training very deep networks is hard because they are prone to vanishing or exploding gradients. Residual blocks and batch normalization largely addressed this problem, making it possible to train deeper networks that reach higher accuracy on both the training and test sets.
High-Performance Large-Scale Image Recognition Without Normalization was authored by Andrew Brock, Soham De, Samuel L. Smith, and Karen Simonyan in 2021.
The paper explains how deep networks can be trained without Batch Normalization layers. It introduces Normalizer-Free Networks (NFNets), which maintain high accuracy on the training and test sets while training faster than the previous state-of-the-art architectures.
“Batch normalization (also known as batch norm) is a method used to make artificial neural networks faster and more stable through normalization of the layers’ inputs by re-centering and re-scaling.” — Wikipedia
Batch Normalization was proposed by Sergey Ioffe and Christian Szegedy in their paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. The core idea, as explained by Ioffe and Szegedy, is to reduce the internal covariate shift of each layer's inputs, i.e., to stabilize training by normalizing the distribution of those inputs.
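To make the "re-centering and re-scaling" concrete, here is a minimal sketch of the training-time computation in plain Python. The function name `batch_norm`, the scalar `gamma` and `beta`, and the value of `eps` are illustrative assumptions. Real implementations learn per-feature scale and shift parameters and track running statistics for inference, which this sketch omits.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch: list of examples, each a list of feature values.
    gamma, beta: scale and shift (learned in practice; fixed scalars here).
    eps: small constant to avoid division by zero.
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [example[j] for example in batch]
        mean = sum(column) / n                            # re-centering statistic
        var = sum((x - mean) ** 2 for x in column) / n    # re-scaling statistic
        for i in range(n):
            normalized = (column[i] - mean) / math.sqrt(var + eps)
            out[i][j] = gamma * normalized + beta
    return out


normalized_batch = batch_norm([[1.0, 10.0], [3.0, 30.0]])
```

Note that the output for one example depends on the statistics of the whole mini-batch, so the examples in a batch are no longer processed independently.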
What does the Batch Normalization layer do?
The authors of the NFNet paper outline four positive effects of the Batch Normalization layer on a network:
- Batch normalization downscales the residual branch
- Batch normalization eliminates mean-shift
- Batch normalization has a regularizing effect
- Batch normalization allows efficient large-batch training
The authors explore how these effects can be achieved by other means while avoiding the disadvantages of the Batch Normalization layer. The major disadvantages they outline include:
- Expensive computational overhead
- Breaks independence between training examples in the mini-batch
- Difficulty in replication on different hardware
Networks without Batch Normalization Layer
The paper reviews previous work on training ResNets without normalization layers. These approaches aimed for competitive accuracies while recovering only two of the benefits described above. One of them suppresses the scale of the activations on the residual branch (by introducing learnable scalars) and adds regularization to the unnormalized network. Recovering these two benefits alone, however, is not enough to achieve competitive test accuracies.
The paper builds on earlier work on Normalizer-Free ResNets (NF-ResNets) by introducing a gradient clipping technique that constrains the gradient norm. Gradient clipping is particularly useful for poorly conditioned loss landscapes or when training with large batch sizes.
Gradient clipping limits the magnitude of the gradient. If the gradient exceeds a predetermined threshold, it is scaled back to that threshold (in value-based variants, each component is instead forced to the minimum or maximum allowed value). Because it bounds the size of every update, it is commonly used to prevent exploding gradients in deep networks.
Consider the gradient vector G = ∂L/∂θ, where L represents the loss and θ represents the vector of all model parameters. The clipping algorithm clips the gradient before updating θ as:

$$
G \rightarrow
\begin{cases}
\lambda \dfrac{G}{\lVert G \rVert} & \text{if } \lVert G \rVert > \lambda, \\
G & \text{otherwise,}
\end{cases}
$$

where λ is the clipping threshold, a tunable hyper-parameter. If the gradient norm is greater than the clipping threshold, the gradient is replaced by the unit vector in its direction multiplied by the threshold.
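The norm-based clipping rule described above can be sketched in a few lines of plain Python. The function name `clip_by_norm` and the use of a flat list for the gradient are assumptions made for illustration, not code from the paper.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold.

    grad: flat list of gradient components.
    threshold: the clipping threshold (lambda in the text), must be > 0.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        # Keep the direction, shrink the length to exactly `threshold`.
        scale = threshold / norm
        return [g * scale for g in grad]
    # Within the threshold: the gradient is left unchanged.
    return list(grad)


large = clip_by_norm([3.0, 4.0], threshold=1.0)   # norm 5.0, gets rescaled
small = clip_by_norm([0.3, 0.4], threshold=1.0)   # norm 0.5, left as is
```

The direction of the update is preserved in both cases; only its length is capped.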
With the gradient clipping technique explained above, the authors observed that training stability is very sensitive to the choice of the clipping threshold. In practice, this means the threshold must be re-tuned whenever the model depth, batch size, or learning rate changes. To address this, the paper introduces another gradient clipping technique called Adaptive Gradient Clipping (AGC).
“The AGC algorithm is motivated by the observation that the ratio of the norm of the gradients G^ℓ to the norm of the weights W^ℓ of layer ℓ, ‖G^ℓ‖/‖W^ℓ‖, provides a simple measure of how much a single gradient descent step will change the original weights W^ℓ.”
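Adaptive Gradient Clipping (AGC) clips the gradient based on the unit-wise ratio of the gradient norm to the parameter norm. If, for the i-th unit of layer ℓ, this ratio is greater than the clipping threshold λ, the gradient is replaced by the product of the clipping threshold, the weight norm, and the unit vector of the gradient:

$$
G_i^{\ell} \rightarrow
\begin{cases}
\lambda \dfrac{\lVert W_i^{\ell} \rVert_F^{\star}}{\lVert G_i^{\ell} \rVert_F} \, G_i^{\ell} & \text{if } \dfrac{\lVert G_i^{\ell} \rVert_F}{\lVert W_i^{\ell} \rVert_F^{\star}} > \lambda, \\
G_i^{\ell} & \text{otherwise.}
\end{cases}
$$

To prevent zero-initialized parameters from always having their gradients clipped to zero, the weight norm is bounded from below:

$$
\lVert W_i \rVert_F^{\star} = \max\left(\lVert W_i \rVert_F, \epsilon\right),
$$

where ε is a small constant (10⁻³ by default in the paper).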
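As a rough illustration of unit-wise AGC, the sketch below treats each row of a weight matrix as one unit and clips the matching row of the gradient. The function name `adaptive_grad_clip`, the row-per-unit layout, and the `clip` value used in the example are assumptions for illustration. This is not the authors' implementation.

```python
import math


def _l2_norm(row):
    return math.sqrt(sum(x * x for x in row))


def adaptive_grad_clip(weights, grads, clip=0.01, eps=1e-3):
    """Clip each unit's gradient relative to the norm of that unit's weights.

    weights, grads: lists of rows with matching shapes; one row per unit.
    clip: the clipping threshold (lambda in the text), must be > 0.
    eps: lower bound on the weight norm, so zero-initialized units
         can still receive a non-zero update.
    """
    clipped = []
    for w_row, g_row in zip(weights, grads):
        w_norm = max(_l2_norm(w_row), eps)
        g_norm = _l2_norm(g_row)
        max_norm = clip * w_norm
        if g_norm > max_norm:
            # Ratio g_norm / w_norm exceeds clip: rescale to length clip * w_norm.
            scale = max_norm / g_norm
            clipped.append([g * scale for g in g_row])
        else:
            clipped.append(list(g_row))
    return clipped


weights = [[3.0, 4.0], [0.0, 0.0]]   # second unit is zero-initialized
grads = [[0.3, 0.4], [1.0, 0.0]]
new_grads = adaptive_grad_clip(weights, grads, clip=0.05)
```

Unlike a single global threshold, the allowed gradient norm here scales with each unit's weight norm, so large-weight units may take proportionally larger steps.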
Using AGC, NF-ResNets were trained stably with batch sizes of up to 4096, whereas NF-ResNets without AGC failed to train at comparably large batch sizes. The optimal clipping parameter λ may depend on the choice of optimizer, learning rate, and batch size.
NFNets employ residual blocks of the form:

$$
h_{i+1} = h_i + \alpha f_i\left(h_i / \beta_i\right),
$$

where h_i represents the input to the i-th residual block and f_i represents the function computed by the i-th residual branch. The scalar α specifies the rate at which the variance of the activations increases after each residual block, and β_i is the standard deviation of the inputs to the i-th residual block.
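To see how α and β interact, the sketch below computes the β values for a stack of blocks under two simplifying assumptions that are not stated in this review: the input to the first block has unit variance, and each residual branch preserves variance at initialization, so that the variance grows by α² per block. The function name `expected_input_stds` and the default α = 0.2 are illustrative, and any per-stage variance resets a real architecture might apply are ignored.

```python
import math


def expected_input_stds(num_blocks, alpha=0.2, initial_var=1.0):
    """Return beta_i, the expected input standard deviation of each block.

    Assumes Var(h_{i+1}) = Var(h_i) + alpha**2 at initialization, which
    holds only if every residual branch f_i is variance preserving.
    """
    var = initial_var
    betas = []
    for _ in range(num_blocks):
        betas.append(math.sqrt(var))   # beta_i = std of the input to block i
        var += alpha ** 2              # skip path + alpha-scaled branch
    return betas


betas = expected_input_stds(num_blocks=4, alpha=0.2)
```

Dividing the branch input by β_i means each branch sees roughly unit-variance activations, which is the role batch normalization would otherwise play on the residual branch.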
On where to apply AGC, the authors report that:
- Clipping the final linear layer is not advisable.
- The gradients of all four stages should be clipped to achieve stability when training with large batch sizes.
The paper explores models trained without normalization layers. These models achieve higher classification accuracies than previous models while training quickly. To make this possible, the authors introduce Adaptive Gradient Clipping (AGC), a gradient clipping technique that stabilizes large-batch training and enables the optimization of unnormalized networks with strong data augmentations.
- Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan. High-Performance Large-Scale Image Recognition Without Normalization, 2021.
- Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition, 2016.
- Batch Normalization, https://en.wikipedia.org/wiki/Batch_normalization
- How to Avoid Exploding Gradients With Gradient Clipping, https://machinelearningmastery.com/how-to-avoid-exploding-gradients-in-neural-networks-with-gradient-clipping/
- Understanding Gradient Clipping (and How It Can Fix Exploding Gradients Problem), https://neptune.ai/blog/understanding-gradient-clipping-and-how-it-can-fix-exploding-gradients-problem