Understanding Deep Neural Networks for beginners — Part 3

Chamuditha Kekulawala
5 min read · Jul 13, 2024

--

In part 2 we discussed non-saturating activation functions. Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn’t guarantee that they won’t come back during training.

Batch Normalization

In a 2015 paper, a technique called Batch Normalization (BN) was proposed to address the vanishing/exploding gradients problems. The technique consists of adding an operation in the model just before or after the activation function of each hidden layer, simply zero-centering and normalizing each input, then scaling and shifting the result using two new parameter vectors per layer: one for scaling, the other for shifting.

In other words, this operation lets the model learn the optimal scale and mean of each of the layer’s inputs. In many cases, if you add a BN layer as the very first layer of your neural network, you do not need to standardize your training set (e.g., using a StandardScaler): the BN layer will do it for you (well, approximately, since it only looks at one batch at a time, and it can also rescale and shift each input feature).
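For example, here is a minimal tf.keras sketch (the layer sizes, activations, and the 28×28 input shape are illustrative assumptions, not taken from this post) of a network with a BN layer as the very first layer and after each hidden layer:

```python
from tensorflow import keras

# A small illustrative classifier: a BN layer right after the input
# (roughly replacing a StandardScaler), and one after each hidden layer.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),   # e.g. 28x28 grayscale images
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd", metrics=["accuracy"])
```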

In order to zero-center and normalize the inputs, the algorithm needs to estimate each input’s mean and standard deviation. It does so by evaluating the mean and standard deviation of each input over the current mini-batch (hence the name “Batch Normalization”). The whole operation is summarized in the following equations:
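Written out in LaTeX notation (following the standard formulation from the 2015 paper), the operation is:

$$\mu_B = \frac{1}{m_B} \sum_{i=1}^{m_B} \mathbf{x}^{(i)}$$

$$\sigma_B^2 = \frac{1}{m_B} \sum_{i=1}^{m_B} \left(\mathbf{x}^{(i)} - \mu_B\right)^2$$

$$\hat{\mathbf{x}}^{(i)} = \frac{\mathbf{x}^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$$\mathbf{z}^{(i)} = \gamma \otimes \hat{\mathbf{x}}^{(i)} + \beta$$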

  • μB is the vector of input means, evaluated over the whole mini-batch B (it contains one mean per input).
  • σB is the vector of input standard deviations, also evaluated over the whole mini-batch (it contains one standard deviation per input).
  • mB is the number of instances in the mini-batch.
  • x̂(i) is the vector of zero-centered and normalized inputs for instance i.
  • γ is the output scale parameter vector for the layer (it contains one scale parameter per input).
  • ⊗ represents element-wise multiplication (each input is multiplied by its corresponding output scale parameter).
  • β is the output shift (offset) parameter vector for the layer (it contains one offset parameter per input). Each input is offset by its corresponding shift parameter.
  • ϵ is a tiny number to avoid division by zero (typically 10⁻⁵). This is called a smoothing term.
  • z(i) is the output of the BN operation: it is a rescaled and shifted version of the inputs.

So during training, BN just standardizes its inputs then rescales and offsets them. Good!

What about at test time?

Well, it is not that simple. Indeed, we may need to make predictions for individual instances rather than for batches of instances, in which case we will have no way to compute each input’s mean and standard deviation.

Moreover, even if we do have a batch of instances, it may be too small, or the instances may not be independent and identically distributed (IID), so computing statistics over the batch would be unreliable. (During training, batches should not be too small, ideally more than 30 instances, and all instances should be IID.)

One solution could be to wait until the end of training, then run the whole training set through the neural network, and compute the mean and standard deviation of each input of the BN layer. These “final” input means and standard deviations can then be used instead of the batch input means and standard deviations when making predictions.

However, it is often preferred to estimate these final statistics during training using a moving average of the layer’s input means and standard deviations. To sum up, four parameter vectors are learned in each batch-normalized layer:

  • γ (the output scale vector) and β (the output offset vector) are learned through regular backpropagation.
  • μ (the final input mean vector) and σ (the final input standard deviation vector) are estimated using an exponential moving average.

Note that μ and σ are estimated during training, but they are not used at all during training, only after training (to replace the batch input means and standard deviations in the equations above).
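To make the training-versus-inference distinction concrete, here is a minimal NumPy sketch of a BN layer (the class name, the momentum value, and the simplified update rule are illustrative assumptions; real implementations also backpropagate through γ and β):

```python
import numpy as np

class SimpleBatchNorm:
    """Toy BN layer: batch statistics during training, moving averages at test time."""

    def __init__(self, n_features, momentum=0.99, eps=1e-5):
        self.gamma = np.ones(n_features)          # output scale (learned by backprop)
        self.beta = np.zeros(n_features)          # output offset (learned by backprop)
        self.moving_mean = np.zeros(n_features)   # estimated with an exponential moving average
        self.moving_var = np.ones(n_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, X, training):
        if training:
            mean = X.mean(axis=0)                 # mu_B: one mean per input feature
            var = X.var(axis=0)                   # sigma_B^2: one variance per input feature
            # Update the moving averages (used only after training, at prediction time).
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        X_hat = (X - mean) / np.sqrt(var + self.eps)   # zero-center and normalize
        return self.gamma * X_hat + self.beta          # z = gamma ⊗ x_hat + beta

# Usage (illustrative):
# bn = SimpleBatchNorm(n_features=3)
# out = bn.forward(np.random.randn(32, 3), training=True)
```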

The authors demonstrated that this technique considerably improved all the DNNs they experimented with, leading to a huge improvement in the ImageNet classification task (ImageNet is a large database of images classified into many classes and commonly used to evaluate computer vision systems). The vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as the tanh and even the logistic activation function.

The networks were also much less sensitive to the weight initialization, and the authors were able to use much larger learning rates, significantly speeding up the learning process. Specifically, they note that “Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.” They also report that “Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.”

You may find that training is rather slow, because each epoch takes much more time when you use Batch Normalization. However, this is usually counterbalanced by the fact that convergence is much faster with BN, so it will take fewer epochs to reach the same performance. All in all, wall time (the time measured by the clock on your wall) will usually be shorter.

Finally, like a gift that keeps on giving, Batch Normalization also acts like a regularizer, reducing the need for other regularization techniques (such as dropout).

Batch Normalization does, however, add some complexity to the model (although it can remove the need for normalizing the input data, as we discussed earlier). Moreover, there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer. So if you need predictions to be lightning-fast, you may want to check how well plain ELU + He initialization performs before playing with Batch Normalization.

Batch Normalization has become one of the most used layers in DNNs, to the point that it is often omitted in diagrams, as it is assumed that BN is added after every layer. However, a 2019 paper showed that, by using a novel fixed-update (Fixup) weight initialization technique, the authors managed to train a very deep neural network (10,000 layers!) without BN, achieving state-of-the-art performance on complex image classification tasks.

Gradient Clipping

Another popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold. This is called Gradient Clipping. This technique is most often used in Recurrent Neural Networks, as Batch Normalization is tricky to use in RNNs. For other types of networks, BN is usually sufficient.
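In tf.keras, for instance, gradient clipping is just a matter of setting the clipvalue or clipnorm argument when creating an optimizer (the threshold of 1.0 below is an illustrative choice):

```python
from tensorflow import keras

# Clip each component of the gradient vector to the range [-1.0, 1.0].
# Note that clipping by value can change the direction of the gradient vector.
optimizer = keras.optimizers.SGD(clipvalue=1.0)

# Alternatively, clip the whole gradient so its L2 norm never exceeds 1.0,
# which preserves the gradient's direction.
optimizer = keras.optimizers.SGD(clipnorm=1.0)

# Then compile your model with this optimizer, e.g.:
# model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)
```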
