Deep Learning Course — Lesson 10.5: Batch Normalization and Layer Normalization

Machine Learning in Plain English
3 min read · Jun 12, 2023


Batch Normalization

Batch Normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. This stabilizes the learning process and can dramatically reduce the number of training epochs required. It was introduced by Sergey Ioffe and Christian Szegedy in their 2015 paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”.

Batch Normalization works by calculating the mean and variance of a layer’s inputs over the current mini-batch and normalizing the inputs with those statistics. It then applies a scaling and shifting operation, where two new parameters (a scale and a shift) are learned during training. These parameters let the network decide how much of the normalization to keep; in the extreme, they can even learn to undo it entirely.
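In symbols, for a mini-batch of values x₁, …, x_m, the transform described in the 2015 paper is:

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad
y_i = \gamma\,\hat{x}_i + \beta
```

where ε is a small constant added for numerical stability, and γ (gamma) and β (beta) are the learned scale and shift parameters.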

Here are the key steps (a short code sketch follows the list):

  1. First, calculate the mean and variance of the input batch.
  2. Normalize the batch by subtracting the batch mean and dividing by the batch standard deviation (the square root of the variance, with a small epsilon added for numerical stability). The result is a batch with approximately zero mean and unit standard deviation.
  3. Multiply the normalized batch by a learned parameter (gamma) and add another learned parameter (beta). Gamma and beta are learned just like the network’s weights, through backpropagation.
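
To make the three steps concrete, here is a minimal NumPy sketch of the training-time forward pass. It assumes a 2-D input of shape (N, D); the function name, the epsilon value, and the example shapes are illustrative choices, not anything prescribed by the article or a particular framework.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization forward pass for a mini-batch x of shape (N, D).

    gamma, beta: learned scale and shift parameters, each of shape (D,).
    """
    # Step 1: per-feature mean and variance, computed across the batch dimension.
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)

    # Step 2: normalize to roughly zero mean and unit variance
    # (eps avoids division by zero when a feature has tiny variance).
    x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)

    # Step 3: scale and shift with the learned parameters.
    return gamma * x_hat + beta

# Example: a mini-batch of 4 samples with 3 features each.
x = np.random.randn(4, 3)
gamma = np.ones(3)   # initialized so the transform starts close to identity
beta = np.zeros(3)
out = batch_norm_forward(x, gamma, beta)
print(out.mean(axis=0), out.std(axis=0))  # ≈ 0 and ≈ 1 per feature
```

At test time, frameworks typically replace the batch statistics with running averages accumulated during training; that bookkeeping is omitted here to keep the sketch short.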

By normalizing in this way, Batch Normalization helps address the internal covariate shift problem, where the distribution of each layer’s inputs changes during training as the parameters of the preceding layers change. This shift can slow down training and make deep networks harder to train.

Batch Normalization also has a slight regularization effect, similar to Dropout: because the mean and variance are estimated from a different mini-batch at every step, it adds a small amount of noise to each layer’s activations. As a result, networks that use Batch Normalization often need less Dropout, or none at all.

Layer Normalization

Layer Normalization is another type of normalization technique that is quite similar to Batch Normalization. It was introduced by Jimmy Lei Ba, Jamie Ryan Kiros and Geoffrey E. Hinton in their 2016 paper “Layer Normalization”.

While Batch Normalization normalizes across the batch dimension, Layer Normalization normalizes across the feature dimension. In other words, Layer Normalization computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Unlike Batch Normalization, Layer Normalization performs exactly the same computation at training and test time, since it does not depend on batch statistics.
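
The difference is easiest to see on a 2-D activation matrix of shape (batch_size, num_features): Batch Normalization computes statistics over the batch axis (one mean and variance per feature), while Layer Normalization computes them over the feature axis (one mean and variance per example). A tiny NumPy illustration, with shapes chosen purely for demonstration:

```python
import numpy as np

x = np.random.randn(4, 3)  # (batch_size, num_features)

# Batch Normalization: statistics per feature, computed across the batch.
bn_mean, bn_var = x.mean(axis=0), x.var(axis=0)                              # shape (3,)

# Layer Normalization: statistics per sample, computed across the features.
ln_mean, ln_var = x.mean(axis=1, keepdims=True), x.var(axis=1, keepdims=True)  # shape (4, 1)
```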

This makes Layer Normalization particularly useful for recurrent neural networks, where Batch Normalization is tricky to apply because sequences vary in length and batch statistics would have to be maintained separately for every time step.

Here are the key steps (a code sketch follows the list):

  1. Calculate the mean and variance of the input of a single layer across its feature dimension.
  2. Normalize the layer inputs by subtracting the mean and dividing by the standard deviation (again with a small epsilon for numerical stability).
  3. Multiply the normalized inputs by a learned parameter (gamma) and add another learned parameter (beta).
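
Here is a minimal NumPy sketch under the same assumptions as the Batch Normalization example above (2-D input of shape (N, D); names and epsilon are illustrative):

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """Layer Normalization for inputs x of shape (N, D).

    Statistics are computed per sample, so the result does not depend
    on the other examples in the batch.
    """
    # Step 1: mean and variance across the feature dimension of each sample.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)

    # Step 2: normalize each sample independently.
    x_hat = (x - mean) / np.sqrt(var + eps)

    # Step 3: learned elementwise scale and shift, each of shape (D,).
    return gamma * x_hat + beta

x = np.random.randn(2, 5)  # batch size is irrelevant to the statistics
out = layer_norm_forward(x, np.ones(5), np.zeros(5))
print(out.mean(axis=-1), out.std(axis=-1))  # ≈ 0 and ≈ 1 per sample
```

Because nothing here depends on the other samples in the batch, the same function serves for both training and inference, which is exactly the property noted above.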

By normalizing across the feature dimension, Layer Normalization keeps each layer’s inputs on a consistent scale independently of the batch, which helps improve the stability of neural network training.

Both Batch Normalization and Layer Normalization are commonly used in deep learning models, and they can significantly improve both training speed and final model accuracy. The choice between them usually depends on the architecture and the use case: Batch Normalization is a common default for convolutional networks trained with reasonably large batches, while Layer Normalization is the usual choice for recurrent and other sequence models.
