BatchNorm and LayerNorm

Florian June
Nov 9, 2023


Why do we need normalization?

Normalization is essentially the process of transforming data with inconsistent scales into a common, standardized format.

On one hand, as the depth of the network increases, the distribution of feature values in each layer gradually drifts toward the upper and lower ends of the activation function’s output range, causing the activation function to saturate. If this continues, gradients vanish. Normalization recenters the feature values to a standard normal distribution, keeping them in the range where the activation function is most sensitive to its inputs, thereby avoiding vanishing gradients and speeding up convergence.
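As a minimal sketch of that standardization step (omitting the learnable scale and shift that real BatchNorm/LayerNorm layers also apply), the values are shifted to zero mean and unit variance:

```python
import numpy as np

def standardize(x, eps=1e-5):
    # Zero mean, unit variance: keeps values in the sensitive,
    # near-linear region of saturating activations such as sigmoid/tanh.
    # eps is a small constant for numerical stability.
    return (x - x.mean()) / np.sqrt(x.var() + eps)
```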

On the other hand, machine learning relies on an important assumption, that the data are Independent and Identically Distributed (IID), meaning the training data and test data follow the same distribution. This assumption is a fundamental guarantee that a model trained on the training data can perform well on the test set. Normalizing the training and test data prevents differences between their distributions from affecting the model.

Difference between Batch Normalization and Layer Normalization

BatchNorm normalizes each feature within a batch of samples, while LayerNorm normalizes all features within each sample.

Let’s assume we have a two-dimensional input matrix, where the rows represent the batch and the columns represent the sample features. The target of Batch Normalization is a batch of samples, while the target of Layer Normalization is a single sample. Figure 1 illustrates this concept:

Figure 1: BatchNorm normalizes each feature (column) across the batch, while LayerNorm normalizes all features within each sample (row).
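The following sketch makes the same point in code for the 2D case from Figure 1: the two normalizations differ only in the axis along which the mean and variance are computed (the example data and the eps value are arbitrary):

```python
import numpy as np

# Rows are samples in the batch, columns are features.
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
eps = 1e-5

# BatchNorm direction: statistics per feature, computed across the batch (axis 0).
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LayerNorm direction: statistics per sample, computed across its features (axis 1).
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)
```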

Applicable Fields

In the field of computer vision, features depend on statistics computed across different samples, so BatchNorm is more effective: it removes the scale differences between features while preserving the relative magnitudes between samples.

In the field of NLP, LayerNorm is more appropriate. The features of a single sample are the word representations as they vary over the positions of the sequence, so the relationships among features within a sample are very close. A PyTorch sketch of how the two layers are applied to a sequence tensor follows below.
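As a rough illustration (the tensor sizes here are arbitrary), PyTorch’s nn.LayerNorm normalizes over the last dimension and works directly on a (batch, seq_len, d_model) tensor, whereas nn.BatchNorm1d expects the feature channels in the second dimension, so the tensor has to be transposed around the call:

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 4, 10, 16
x = torch.randn(batch, seq_len, d_model)

# LayerNorm: normalizes each token's d_model features independently,
# with no dependence on the other samples in the batch.
layer_norm = nn.LayerNorm(d_model)
out_ln = layer_norm(x)  # shape: (4, 10, 16)

# BatchNorm1d: normalizes each feature channel across the batch (and positions);
# it expects input of shape (N, C, L), hence the transposes.
batch_norm = nn.BatchNorm1d(d_model)
out_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)  # shape: (4, 10, 16)
```

Because BatchNorm’s statistics are shared across the whole batch, its behavior degrades when sequences have varying lengths or batches are small, which is one practical reason LayerNorm dominates in NLP models.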

Furthermore, the latest AI-related content can be found in my newsletter.

Florian June

AI researcher, focusing on LLMs, RAG, Agent, Document AI, Data Structures. Find the newest article in my newsletter: https://florianjune.substack.com/