Overview of Normalization Techniques in Deep Learning

A simple guide to understanding the different normalization methods in Deep Learning.

Maciej Balawejder
Nerd For Tech


Photo by Anne Nygård on Unsplash

Training Deep Neural Networks is a challenging quest. Over the years, researchers have come up with different methods to accelerate and stabilize the learning process. Normalization is one technique that proved to be very effective in doing this.

Different types of normalizations

In this blog post, I will review each of these methods using analogies and visualizations, which will help you understand the motivation and thought process behind them.

Why Normalization?

Imagine that we have two features and a simple neural network. One is age, with a range between 0 and 65, and the other is salary, ranging from 0 to 10,000. We feed those features to the model and calculate gradients.


Different input scales cause different weight updates and optimizer steps towards the minimum. They also make the shape of the loss function disproportionate and elongated. In that case, we need to use a lower learning rate to avoid overshooting, which means a slower learning process.

The solution is input normalization. It rescales each feature by subtracting the mean (centering) and dividing by the standard deviation.


This process is also called ‘whitening’: the values end up with zero mean and unit variance. It provides faster convergence and more stable training.
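Here is a minimal NumPy sketch of that idea (the function name, the example values, and the small eps constant are illustrative choices, not taken from any specific library):

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Column-wise standardization: zero mean, unit variance per feature."""
    mean = x.mean(axis=0)            # per-feature mean
    std = x.std(axis=0)              # per-feature standard deviation
    return (x - mean) / (std + eps)  # eps guards against zero variance

# Two features on very different scales: age and salary
x = np.array([[25.0, 3000.0],
              [47.0, 8500.0],
              [65.0,  400.0]])
x_norm = standardize(x)  # each column now has ~0 mean and unit variance
```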

It’s such a neat solution, so why don’t we normalize the activations of every layer in the network as well?

Activations

Batch Normalization

N — batch, C — channels, H, W — spatial height and width. Source

In 2015, Sergey Ioffe and Christian Szegedy[3] picked up that idea to solve the internal covariate shift issue. In plain English, it means that the distribution of each layer’s inputs is constantly changing due to weight updates, so the following layer always needs to adapt to a new distribution. That causes slower convergence and unstable training.

Batch Normalization presents a way to control and optimize the distribution after each layer. The process is identical to input normalization, but we add two learnable parameters, γ and β.

Instead of writing out all the maths equations, I will use code snippets, which I find more readable and intuitive.
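Below is a minimal NumPy sketch of batch normalization for an (N, C, H, W) activation tensor during training (the function name and the eps constant are illustrative choices, not code from the paper):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization over an (N, C, H, W) tensor during training.

    Statistics are computed per channel, across the batch and the
    spatial dimensions; gamma and beta have shape (1, C, 1, 1).
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    x_hat = (x - mean) / np.sqrt(var + eps)       # zero mean, unit variance
    return gamma * x_hat + beta                   # learnable scale and shift
```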

These two parameters are learned along with the network’s weights using backpropagation. They adjust the distribution by scaling (γ) and shifting (β) the normalized activations.

Effect of gamma and beta

Since the distributions are now under control, we can increase the learning rate and speed up convergence. Besides the computational boost, BN also serves as a regularization technique: the noise introduced by approximating the dataset’s statistics with batch statistics reduces, and sometimes removes, the need for Dropout.

But it’s a double-edged sword. This estimation is only reliable for larger batches. When the number of examples in a batch is small, the performance decreases dramatically.

ResNet-50 validation error. Source

Another downside of BN shows up at test time. Let’s use a self-driving car as an example. During driving, you pass a single frame recorded by the camera rather than a batch of images. In this case, the network has to use the mean and variance pre-computed during training, which might lead to different results.
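A rough sketch of what inference looks like, assuming running statistics were accumulated during training with an exponential moving average (the variable names are illustrative):

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Batch normalization at test time.

    Instead of batch statistics, we reuse the running mean and variance
    estimated during training, e.g. with an exponential moving average:
        running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    (the momentum value is a training hyperparameter, not fixed here).
    This works even for a single frame, i.e. a batch of one image.
    """
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```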

The significance of this problem pushed the community to create alternative methods to avoid dependency on the batch.

Layer Normalization

Source

It was the first attempt, made by Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton in 2016[4], to reduce the batch size constraint. It was mainly motivated by recurrent neural networks, where it was unclear how to apply BN.

RNN architecture

In feed-forward deep neural networks, it’s easy to store statistics for each BN layer since the number of layers is fixed. However, in RNNs, the input and output sequences vary in length. So, in this case, it’s better to normalize using the statistics of a single example at each timestep rather than the statistics of the whole batch.

In this method, every example in the batch (N) is normalized across the [C, H, W] dimensions. Like BN, it speeds up and stabilizes training, but without the constraint on the batch size. Additionally, this method can be used in online learning tasks where the batch size is 1.
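Following the same conventions as the batch-norm sketch above (again, my own naming rather than the paper’s code), only the axes used for the statistics change:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer Normalization over an (N, C, H, W) tensor.

    Each example is normalized with its own statistics, computed across
    the channel and spatial dimensions, so no batch statistics are needed.
    gamma and beta are broadcastable, e.g. of shape (1, C, 1, 1).
    """
    mean = x.mean(axis=(1, 2, 3), keepdims=True)  # shape (N, 1, 1, 1)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```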

Instance Normalization

Source

Instance Normalization was introduced in a 2016 paper[5] by Dmitry Ulyanov et al. It was another attempt to reduce dependency on the batch, this time to improve the results of style transfer networks.

Each example and each channel is normalized independently, using only the spatial dimensions. This removes instance-specific contrast information from the image, which helps with generalization in style transfer.
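A minimal sketch in the same style, with purely illustrative naming:

```python
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    """Instance Normalization over an (N, C, H, W) tensor.

    Every example and every channel gets its own statistics, computed
    only across the spatial dimensions H and W.
    gamma and beta have shape (1, C, 1, 1).
    """
    mean = x.mean(axis=(2, 3), keepdims=True)  # shape (N, C, 1, 1)
    var = x.var(axis=(2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```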

This method gained popularity among generative models like Pix2Pix or CycleGAN and became a precursor to the Adaptive Instance Normalization used in the famous StyleGAN.

Group Normalization

Source

Group Normalization was introduced in a 2018 paper[1], and it directly addresses the BN limitations for CNNs. The main concern is distributed learning, where the batch is split across many machines. Each of them trains on a small number of examples, like 6–8, and in some cases even 1–2.

Distributed Learning

To fix it, they introduced a hybrid of layer and instance normalization. GN divides the channels into groups and normalizes each example within each group. This scheme makes the computation independent of the batch size.
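A minimal sketch in the same convention (the default of 32 groups follows the paper; the rest of the naming is illustrative):

```python
import numpy as np

def group_norm(x, gamma, beta, groups=32, eps=1e-5):
    """Group Normalization over an (N, C, H, W) tensor.

    Channels are split into groups; every example normalizes each group
    with its own statistics, independently of the batch size.
    gamma and beta are still per-channel, with shape (1, C, 1, 1).
    """
    n, c, h, w = x.shape
    assert c % groups == 0, "number of channels must be divisible by groups"
    xg = x.reshape(n, groups, c // groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)  # shape (n, groups, 1, 1, 1)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    return gamma * x_hat + beta
```

With groups=1 this reduces to layer normalization, and with groups equal to the number of channels it becomes instance normalization.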

GN outperforms BN trained on smaller batches but can’t beat its large-batch results. Nevertheless, it was a good starting point that led to another method which, combined with GN, exceeds BN’s results.

ResNet-50’s Validation Error on ImageNet. Source

Weights

Weight Standardization

Source

We have already normalized the inputs and the layer outputs. The only thing left is the weights. They can grow large without any control, especially since we normalize the outputs anyway. By standardizing the weights, we achieve a smoother loss landscape and more stable training.
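A minimal sketch of the idea for convolution weights of shape (C_out, C_in, K, K); the naming is illustrative:

```python
import numpy as np

def weight_standardization(w, eps=1e-5):
    """Standardize convolution weights of shape (C_out, C_in, K, K).

    Each output filter is rescaled to zero mean and unit variance
    before it is used in the convolution; there are no learnable
    parameters in this step.
    """
    mean = w.mean(axis=(1, 2, 3), keepdims=True)  # shape (C_out, 1, 1, 1)
    var = w.var(axis=(1, 2, 3), keepdims=True)
    return (w - mean) / np.sqrt(var + eps)
```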

As I mentioned before, weight standardization[2] is an excellent accompaniment to group normalization. Combining those methods produces better results than BN (with large batches), using only one sample per machine.

Red — ImageNet, Blue — COCO dataset. Source

They also present a method called Batch-Channel Normalization (BCN). In a nutshell, it applies BN and GN together in each layer.
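A simplified sketch of that composition, reusing the illustrative helpers defined above (the actual method in the paper is more involved than this two-line version):

```python
def batch_channel_norm(x, gamma_bn, beta_bn, gamma_gn, beta_gn, groups=32):
    """Simplified Batch-Channel Normalization: batch statistics first,
    then group statistics, within the same layer."""
    x = batch_norm(x, gamma_bn, beta_bn)             # sketch defined earlier
    return group_norm(x, gamma_gn, beta_gn, groups)  # sketch defined earlier
```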

Conclusions

Normalization is an essential concept in Deep Learning. It speeds up convergence and stabilizes training. Plenty of different techniques have evolved over the years. Hopefully, you now understand the underlying ideas behind them and know why and where to use them in your projects.

Check my Medium and GitHub profiles if you want to see my other projects.

References

[1] Group Normalization

[2] Micro-Batch Training with Batch-Channel Normalization and Weight Standardization

[3] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

[4] Layer Normalization

[5] Instance Normalization: The Missing Ingredient for Fast Stylization

[6] Deep Residual Learning for Image Recognition
