Batch Normalization–BNᵧ ᵦ


On a post-it note:

Batch norm is an algorithm that lets a neural network standardize the inputs to every node, so that its weights and biases can train on whatever range and distribution works best.

Putting it even more casually: we want all our data to be normal.

Batch Normalization allows us to do this.

The premise stems from something we already know: input layer 1 (your processed training data, with values x₁, x₂, …, xₙ) trains its parameters (weights w₁, bias b₁) better and faster when normalized, i.e. centered around a normal distribution X ~ 𝒩(μ, σ²).

Batch Normalization (BN) asks: can we normalize the values in all subsequent hidden layers too, such as activated layer 1 (a₁), so as to train w₂ and b₂ better and faster?

Source: Niranjan Kumar, Batch Normalization and Dropout in Neural Networks with Pytorch [2]

Exploring the figure above: after moving from layer 1 (the input layer) with values x₁, x₂, …, xₙ, through parameters w₁ and b₁, to activated layer a₁, we would normalize the values z₁, z₂, …, zₘ in a₁ with Batch Norm so that parameters w₂ and b₂ train faster and better. Then the same for hidden layer a₂, and so on. And it really does train MUCH faster.

This simple, intuitive, marginal adjustment to the neural network architecture procedure was revolutionary.

Surprisingly, this wasn't convention until a seminal 2015 paper by a pair of Google researchers, Sergey Ioffe and Christian Szegedy, who showed that applying Batch Normalization to the best-performing ImageNet classification network matched its performance using only 7% of the training steps.

Source: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [3]

What exactly is the Batch Normalization algorithm, though? And what even are the concepts of normalization and batches?

Normalization is the process of bringing scattered data spanning a large range of values into a standardized, or "normal", range. This is usually done with a standardization formula like the one sketched below, which here brings our data from a range of 0–60 and a domain of -200 to 0 into a "normalized" x-y plane of roughly -2 to 2.

Source: Andre Ye, Batch Normalization: The Greatest Breakthrough in Deep Learning [4]
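As a quick sketch (a toy example of my own, not taken from the figure), here is what that standardization looks like in NumPy:

```python
import numpy as np

# Toy values in a wide range (these numbers are just an illustration)
x = np.array([-200.0, -150.0, -90.0, -30.0, 0.0])

# Standardize: subtract the mean, divide by the standard deviation
x_norm = (x - x.mean()) / x.std()

print(x_norm)  # the values now cluster around 0, roughly within [-2, 2]
```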

A batch is a subset, or sample, of the larger dataset. It usually contains around 64, 128, or 512 unique data points, depending on the size of the dataset, and essentially acts like a representative sample.
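As a rough sketch of what working in batches looks like (the dataset shape and batch size here are just illustrative assumptions of mine):

```python
import numpy as np

# Hypothetical dataset: 10,000 examples with 20 features each
data = np.random.randn(10_000, 20)
batch_size = 128  # common choices: 64, 128, 512

for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]   # one representative sample
    mu = batch.mean(axis=0)                  # per-feature mini-batch mean
    var = batch.var(axis=0)                  # per-feature mini-batch variance
    # ...the network trains on, and Batch Norm normalizes with, just this batch
```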

Putting this together, we get Batch Normalization: standardizing a sample of the dataset at every layer. We don't use the entire dataset each time because that would be too computationally heavy; as a bonus, using mini-batches gives a slight regularization effect. More on that later.

This function, linear map, transformation, etc, is called the Batch Normalizing Transform.

Why isn't it called the Batch Normalizing Transformation? No idea.

The BN algo is as follows:

Implementing BN — Given some intermediate values (z₁, z₂, …, zₘ) of an arbitrary hidden layer, we apply the algorithm:

//mini-batch mean

μ_𝔅 = (1/m) · Σᵢ zᵢ

μ_𝔅 is the empirical mean, calculated just as you did in 4th grade: add up all the values (z₁, z₂, …, zₘ) and divide by the number of values you counted (m).

//mini-batch variance

σ²_𝔅 = (1/m) · Σᵢ (zᵢ − μ_𝔅)²

σ²_𝔅 is calculated by summing each data point's squared difference from the mean (squaring avoids negative differences cancelling out), then dividing by the number of values you counted (m).

//normalize

x̂ᵢ = (zᵢ − μ_𝔅) / √(σ²_𝔅 + ε)

x̂ᵢ is the normalized value; the constant ε is added to prevent dividing by zero or by a very small variance. This more or less gives us the standard mean = 0 and unit variance = 1.

//scale and shift

yᵢ = γ·x̂ᵢ + 𝛽 ≡ BNᵧ ᵦ(zᵢ)

We don't always want the distribution of our hidden nodes (z₁, z₂, …, zₘ) to have mean 0 and variance 1, so we adjust the normalized values with a scaler gamma (γ) and a shifter (offset) beta (𝛽), where γ and 𝛽 are learnable parameters of the model, hence the notation BNᵧ ᵦ. This lets the network set the standard deviation of yᵢ (through γ) and the mean of yᵢ (through 𝛽) to whatever it wants them to be.
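Putting the four steps together, here is a minimal NumPy sketch of the Batch Normalizing Transform for one layer (my own illustration, not code from the paper):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform for one mini-batch z of shape (m, hidden_units)."""
    mu = z.mean(axis=0)                      # //mini-batch mean
    var = z.var(axis=0)                      # //mini-batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)    # //normalize
    y = gamma * z_hat + beta                 # //scale and shift
    return y, z_hat, mu, var

# Example: a mini-batch of m=4 values for a 3-unit hidden layer
z = np.random.randn(4, 3) * 5.0 + 2.0
gamma, beta = np.ones(3), np.zeros(3)
y, z_hat, mu, var = batch_norm_forward(z, gamma, beta)

print(np.allclose(y, z_hat))  # True: with gamma=1 and beta=0, y is just z_hat (see below)
```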

If it's decided that γ = 1 and 𝛽 = 0, yᵢ is no different from x̂ᵢ. Put another way, we can always rearrange this final equation to recover the previous value:

x̂ᵢ = (yᵢ − 𝛽)/γ

When is this "scale and shift" helpful? For example, with a sigmoid activation function we might want a larger (wider) variance, or a mean other than zero, to take advantage of the non-linearity of the sigmoid. Rather than centering the cluster of all our values in the linear region of the sigmoid (seen in graph a), the scaling and shifting parameters γ and 𝛽 let the learning algorithm expand the range of values and center it wherever it wants, with whatever variance it wants (graph b). The standardized mean and variance of each hidden layer are thus determined by whatever the algorithm thinks is appropriate, which can still turn out to be mean = 0 and variance = 1.

Source: [Deep Learning] Batch Normalization (批归一化) [5]
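To make that concrete, here is a small sketch (the specific numbers are mine, chosen purely for illustration) of how a larger learned γ pushes normalized values out of the sigmoid's nearly linear region and into its non-linear tails:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x_hat = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # normalized values clustered around zero

# gamma=1, beta=0 keeps the values in sigmoid's nearly linear region around 0.5
print(sigmoid(x_hat))                 # ~[0.27, 0.38, 0.50, 0.62, 0.73]

# A larger learned gamma spreads them into sigmoid's saturating, non-linear tails
gamma, beta = 4.0, 0.0
print(sigmoid(gamma * x_hat + beta))  # ~[0.02, 0.12, 0.50, 0.88, 0.98]
```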

Going further, the BN transform is a differentiable transformation that introduces normalized activations into the network. During training we need to backpropagate the gradient of the loss (ℓ) through this transformation, and also compute the gradients with respect to the parameters (γ, 𝛽) of the BN transform. Thus, working backwards from the final formula of the BN transform, we can return to the hidden layer before it.

Source: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [3]

In equation (1) we use the gradient with respect to yᵢ to get the gradient with respect to x̂ᵢ. Using that equation, we can get the derivative with respect to the variance (σ²_𝔅) in (2).

Breaking things into parts with the chain rule, we get the derivative with respect to the mean (μ_𝔅) in (3). Combining (1), (2) and (3), we can get the gradient with respect to each of the original inputs xᵢ in (4).

Equations (5) and (6) give us the gradients for the parameters (γ, 𝛽) of the BN transform. For a clear tutorial of each step, watch Yannic Kilcher's tutorial on YouTube.
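Sketched in NumPy (my own translation of equations (1) through (6), reusing the quantities from the forward pass sketch above; dl_dy stands for the incoming gradient ∂ℓ/∂yᵢ):

```python
import numpy as np

def batch_norm_backward(dl_dy, z, z_hat, mu, var, gamma, eps=1e-5):
    """Backprop through the BN transform, following equations (1)-(6) above."""
    m = z.shape[0]
    std_inv = 1.0 / np.sqrt(var + eps)

    dl_dzhat = dl_dy * gamma                                            # (1) d(loss)/d(z_hat)
    dl_dvar = np.sum(dl_dzhat * (z - mu) * -0.5 * std_inv**3, axis=0)   # (2) d(loss)/d(variance)
    dl_dmu = (np.sum(dl_dzhat * -std_inv, axis=0)
              + dl_dvar * np.mean(-2.0 * (z - mu), axis=0))             # (3) d(loss)/d(mean)
    dl_dz = (dl_dzhat * std_inv
             + dl_dvar * 2.0 * (z - mu) / m
             + dl_dmu / m)                                              # (4) d(loss)/d(each input)
    dl_dgamma = np.sum(dl_dy * z_hat, axis=0)                           # (5) d(loss)/d(gamma)
    dl_dbeta = np.sum(dl_dy, axis=0)                                    # (6) d(loss)/d(beta)
    return dl_dz, dl_dgamma, dl_dbeta
```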

Note

There is an unintended side-effect — BN has an ever-so-slight regularization effect.

Each mini-batch 𝔅 of size m (z₁, z₂, …, zₘ) is scaled by the mean and variance computed on just that one mini-batch. Because it's scaled by the mean and variance of a single sample subset of size m = 64 or 128, rather than the mean and variance of the entire dataset, that mean and variance carry some noise. And because the mean and variance are noisy (estimated from a small sample of the data), the scaling process, going from (z₁, z₂, …, zₘ) to (z̃₁, z̃₂, …, z̃ₘ), is a bit noisy as well.

Similar to dropout, this adds noise to each hidden layer's activations. Dropout has multiplicative noise from multiplying by 0 or 1, whereas BN has multiplicative noise (from dividing by the noisy standard deviation σ) and additive noise (from subtracting the noisy mean μ). By adding noise to each hidden node, similar to dropout, the downstream hidden nodes are forced not to rely on any one hidden unit, which has a regularization effect.

However, this in itself is a very small effect. If you want more substantial regularization, you can use BN together with dropout for a powerful combined effect.

Also note that by using a bigger mini batch size, say 512 instead of 64, you reduce the noise (the sample is larger and more representative of the dataset), and as such you reduce the regularization effect.

Conclusion

We know that normalizing the input features into a smaller range of values (e.g. between 0 and 1, between -2 and 2, or via the logistic sigmoid) can speed up training of the parameters, the weights w and biases b. The intuition behind Batch Normalization is that we do the same thing, but for every hidden layer of our neural network, not just the initial input matrix.
