Is Batch Normalization harmful? Improving Normalizer-Free ResNets

Sieun Park · Published in Geek Culture · Jul 30, 2021

Batch normalization is used in almost all recent deep learning architectures to speed up convergence and improve performance. Yet few works seriously examine BN's drawbacks; it is usually treated as a bit of magic that simply benefits the model. In this post, we will try to understand the true dynamics of BN, examine its challenges and limitations, and discuss an alternative way to train deep networks. We will also look at a recent technique that achieves state-of-the-art classification performance after removing batch normalization.

The papers are available at the links below.

Batch Normalization: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift"

NF-ResNets: "Characterizing Signal Propagation to Close the Performance Gap in Unnormalized ResNets"

Background + Adaptive Gradient Clipping: "High-Performance Large-Scale Image Recognition Without Normalization"

Batch normalization & Benefits [paper]

What does batch normalization do? In short, batch normalization normalizes a layer's input by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. Precisely, the process is described by the function below.
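
The original post shows this as an image; written out, the standard BN transform for a mini-batch B = {x₁, …, x_m}, with learnable scale γ and shift β, is:

$$
\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2, \qquad
y_i = \gamma\,\frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}} + \beta
$$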

The empirical benefits of batch normalization are faster convergence and improved accuracy. Diving deeper into the dynamics behind these improvements, batch normalization:

  • Downscales the residual branch: particularly in residual networks, BN constrains the scale of the residual branch and stabilizes early training.
  • Eliminates mean-shift: the output of ReLU lies in [0, ∞), so activations naturally have non-zero means. This shift compounds with depth and can become problematic, causing deep networks to predict the same label for all inputs.
  • Has a regularizing effect: because the batch statistics are computed on a random subset of the training data, BN inherently acts as noise that strengthens the network's generalization.
  • Allows efficient large-batch training: BN increases the largest stable learning rate, a crucial component of large-batch training.

The benefits listed above are conclusions of other research, further elaborated in the original paper.

Problems of BN

Wow, batch normalization does seem to provide large benefits to any deep neural network. Well, maybe not, because batch normalization also:

  • Is surprisingly computationally expensive: BN is memory-intensive because the batch statistics must be stored for every normalized layer, and the compute overhead is also significant (around 20%, depending on the network).
  • Behaves differently during training and at inference: outputs are normalized with mini-batch statistics during training but with running statistics at inference, and BN introduces extra hyper-parameters that have to be tuned.
  • Breaks the independence between training examples in the mini-batch: each sample is normalized using statistics computed from the other samples in its batch. This causes problems in tasks such as sequence modeling and contrastive learning, where sharing information across the batch raises concerns of cheating (see the short sketch after this list).
  • Is sensitive to the batch size: BN traditionally performs poorly when the batch size is too small.
  • Can limit model capability: this is not discussed in the paper, but works such as StyleGAN2 and ESRGAN report artifacts in generated images when normalization is used. Because normalization forces a zero mean, the generator learns to produce spikes in the image in order to control the magnitude of the rest of the image.

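To make the independence point concrete, here is a minimal PyTorch sketch (my own illustration, not code from any of the papers) showing that in training mode the same sample is normalized differently depending on which examples happen to share its mini-batch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(num_features=4).train()  # training mode: use mini-batch statistics

x = torch.randn(1, 4)                        # one fixed sample
batch_a = torch.cat([x, torch.randn(7, 4)])  # the sample plus 7 random batch-mates
batch_b = torch.cat([x, torch.randn(7, 4)])  # the same sample with different batch-mates

out_a = bn(batch_a)[0]
out_b = bn(batch_b)[0]
print(torch.allclose(out_a, out_b))  # False: the output for x depends on its batch-mates
```

Switching the module to eval() would make the output deterministic again, but it would then be computed with running statistics rather than mini-batch statistics, which is exactly the training/inference discrepancy listed above.
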
Several alternatives to batch normalization, such as instance normalization and layer normalization, have been proposed to resolve these issues. Other works modify the initialization scheme or the network architecture to recover the benefits of BN one by one. Some of these works even manage to outperform batch-normalized ResNets (with additional regularization).

However, these methods typically outperform their batch-normalized counterparts only when the batch size is very small, and perform worse at large batch sizes. Most of them also share some of BN's limitations or introduce problems of their own, and their results do not compete with state-of-the-art networks.

Normalizer-Free ResNets (NF-ResNets) are a network architecture without normalization that can be trained to accuracies competitive with batch-normalized ResNets. The residual block of NF-ResNets is scaled as in the following function. We will review this paper in depth in another post, but the techniques proposed in the paper apply generally to other network architectures.
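
The scaling function is also shown as an image in the original post. Roughly, following the NF-ResNet paper, the residual blocks take the form below, where fᵢ is the residual branch (parameterized to be variance-preserving at initialization), α controls how quickly the variance grows, and βᵢ is the predicted standard deviation of the block's input:

$$
h_{i+1} = h_i + \alpha\, f_i\!\left(\frac{h_i}{\beta_i}\right)
$$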

Adaptive Gradient Clipping [paper]

This paper proposes Adaptive Gradient Clipping (AGC), a technique to improve normalizer-free networks. Gradient clipping stabilizes training by constraining the gradients according to the formula below. However, gradient clipping turns out to be sensitive to the choice of clipping threshold λ.

Gradient clipping
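
Since the equation appears as an image in the original post, standard gradient clipping can be written as rescaling the gradient G whenever its norm exceeds the threshold λ:

$$
G \rightarrow
\begin{cases}
\lambda\,\dfrac{G}{\|G\|} & \text{if } \|G\| > \lambda \\
G & \text{otherwise}
\end{cases}
$$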

Let Wˡ and Gˡ be the weight matrix and the gradient of the l-th layer. The Frobenius norm ||Wˡ||_F of the weights is the square root of the sum of the squared weight entries.

Frobenius norm
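
Written out (the original shows this as an image):

$$
\|W^{l}\|_F = \sqrt{\sum_{i}\sum_{j}\bigl(W^{l}_{i,j}\bigr)^{2}}
$$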

The ratio of the gradient norm to the weight norm, ||Gˡ||_F / ||Wˡ||_F, provides a relative measure of how large a gradient step is compared to the weights it updates. To reduce the sensitivity to the clipping threshold λ, AGC clips gradients based on unit-wise ratios of gradient norms to parameter norms, as in the formula below.
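
Restating the AGC rule from the paper (the original shows it as an image), each unit i of layer l (e.g., each row of a weight matrix) is clipped as:

$$
G^{l}_{i} \rightarrow
\begin{cases}
\lambda\,\dfrac{\|W^{l}_{i}\|^{\star}_F}{\|G^{l}_{i}\|_F}\, G^{l}_{i} & \text{if } \dfrac{\|G^{l}_{i}\|_F}{\|W^{l}_{i}\|^{\star}_F} > \lambda \\[6pt]
G^{l}_{i} & \text{otherwise}
\end{cases}
\qquad \text{where } \|W_{i}\|^{\star}_F = \max\bigl(\|W_{i}\|_F,\ \epsilon\bigr)
$$

The small constant ε (10⁻³ in the paper) prevents zero-initialized parameters from always having their gradients clipped to zero.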

The authors suggest that

AGC is a relaxation of some normalized optimizers, which imposes a maximum update size based on the parameter norm but does not simultaneously impose a lower-bound on the update size or ignore the gradient magnitude.
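
Below is a minimal PyTorch sketch of unit-wise AGC under the definitions above. The helper names (unitwise_norm, adaptive_grad_clip_) and default values are illustrative choices of mine, not the authors' released code:

```python
import torch


def unitwise_norm(x: torch.Tensor) -> torch.Tensor:
    """Frobenius norm per output unit (first dimension) of a tensor."""
    if x.ndim <= 1:  # biases and gains: a single norm
        return x.norm(p=2)
    # Linear / conv weights: reduce over every dim except dim 0 (the output units).
    dims = tuple(range(1, x.ndim))
    return x.norm(p=2, dim=dims, keepdim=True)


def adaptive_grad_clip_(parameters, clipping=0.01, eps=1e-3):
    """In-place AGC: rescale G_i so that ||G_i||_F / max(||W_i||_F, eps) <= clipping."""
    for p in parameters:
        if p.grad is None:
            continue
        w_norm = unitwise_norm(p.detach()).clamp(min=eps)
        g_norm = unitwise_norm(p.grad.detach()).clamp(min=1e-6)
        # Scale factor is 1 where the ratio is within the threshold, < 1 otherwise.
        scale = (clipping * w_norm / g_norm).clamp(max=1.0)
        p.grad.mul_(scale)
```

It would be called between loss.backward() and optimizer.step(), e.g. adaptive_grad_clip_(model.parameters(), clipping=0.01), ideally skipping the parameters of the final linear layer, as the experiments below suggest.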

Experiments

The left figure compares batch-normalized ResNets, NF-ResNets, and NF-ResNets with AGC. AGC helps extend NF-ResNets to larger batch sizes while maintaining performance similar to or better than batch-normalized networks. Training at larger batch sizes is more unstable because the learning rate is scaled linearly with the batch size; for example, at batch size 4096 the learning rate is an overwhelming 1.6. The right figure compares results for different clipping thresholds and batch sizes: smaller clipping thresholds are required to keep training stable at large batch sizes.

Another experiment studies whether AGC is beneficial for all layers by applying it only to certain layers of the network. The results show that it is always better not to clip the final linear layer, and that the initial convolution can be trained stably without clipping, but all four residual stages must be clipped when training at batch size 4096 with the default learning rate of 1.6. In the rest of the paper, AGC is therefore applied to every layer except the final linear layer.

Seeking state-of-the-art

The paper applies various regularization techniques and architectural modifications to improve performance and compete with EfficientNet. With these modifications, the modified ResNeXt-D models surpass EfficientNet by a significant margin. The effect of removing batch normalization alone may seem disappointing, since the modifications from NF-ResNet and AGC did not show accuracy gains, as described in the table below. However, they did provide significant gains in latency, which can be traded for improved accuracy via model scaling.

Conclusion

There have been many works analyzing the limitations of normalization layers, including batch normalization, and proposing alternatives. Most alternatives to BN show poor performance when trained with large batch sizes. This paper proposes an adaptive gradient clipping strategy that enables training NF-ResNets with large batch sizes and large learning rates.

However, the NF-ResNet backbone used in this paper seems to have several limitations. First, the modifications are specific to ResNet-style architectures with skip connections, so they will not carry over to many other applications of deep learning. Second, while many of the performance-related problems of BN were raised, the solution did not show significant improvements over batch-normalized networks apart from latency. This could suggest that the solution does not actually resolve the problems of BN but is just a more efficient implementation of BN.

But despite everything, I was very impressed with the results and insights this paper provides. To me, removing batch normalization while keeping training stable seems like an important milestone for deep learning.
