Glossary of Deep Learning: Batch Normalisation

Batch normalisation is a technique for improving the performance and stability of neural networks. It also helps more sophisticated deep learning architectures, such as DCGANs, work in practice.

The idea is to normalise the inputs of each layer so that they have a mean of zero and a standard deviation of one. This is analogous to how the inputs to networks are standardised.
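
As a rough illustration of that standardisation step, here is a minimal sketch in plain Python. The function name `standardise_batch`, the `eps` constant and the sample values are assumptions made for this example, not part of any library.

```python
import math

def standardise_batch(batch, eps=1e-5):
    """Standardise each feature (column) of a mini-batch to mean 0, std 1.

    `batch` is a list of samples, each a list of feature values.
    `eps` is a small constant (assumed here) that avoids division by zero.
    """
    n = len(batch)
    n_features = len(batch[0])
    # per-feature mean and (biased) variance, computed across the batch
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(n_features)
    ]
    # subtract the mean and divide by the standard deviation, feature by feature
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps)
         for j in range(n_features)]
        for row in batch
    ]

# hypothetical layer outputs: 4 samples, 2 features on very different scales
layer_outputs = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]]
normalised = standardise_batch(layer_outputs)
```

After this step both features sit on the same scale, whatever their original ranges were.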

How does this help? We know that normalising the inputs to a network helps it learn. But a network is just a series of layers, where the output of one layer becomes the input to the next. That means we can think of any layer in a neural network as the first layer of a smaller subsequent network.

If we think of the network as a series of smaller networks feeding into each other, we normalise the output of one layer before applying the activation function, and then feed it into the following layer (sub-network).

In Keras, it is implemented using the following code. Note how the BatchNormalization call occurs after each fully-connected layer, but before the activation function and dropout.

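from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers.normalization import BatchNormalization
model = Sequential()
# think of this as the input layer
model.add(Dense(64, input_dim=16, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))
# think of this as the hidden layer
model.add(Dense(64, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))
# think of this as the output layer
model.add(Dense(2, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('softmax'))
# optimiser and loss function
model.compile(loss='binary_crossentropy', optimizer='sgd')
# train the model (X_train and y_train are your training data)
model.fit(X_train, y_train, nb_epoch=50, batch_size=16)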

Batch normalisation was introduced in Ioffe & Szegedy’s 2015 paper. The idea is that, instead of just normalising the inputs to the network, we normalise the inputs to layers within the network. It’s called “batch” normalisation because, during training, we normalise the activations of the previous layer for each batch, i.e. we apply a transformation that keeps the mean activation close to 0 and the activation standard deviation close to 1.
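
For reference, the per-batch transformation can be written out as below. This follows the notation of the paper as I recall it: a mini-batch of m values x_1 … x_m of a single activation, a small constant ε for numerical stability, and a learned scale γ and shift β that let the network move away from strict zero mean and unit variance if that helps.

```latex
\begin{aligned}
\mu_B &= \frac{1}{m}\sum_{i=1}^{m} x_i \\
\sigma_B^2 &= \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
y_i &= \gamma \hat{x}_i + \beta
\end{aligned}
```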

Beyond the intuitive reasons, there are good mathematical reasons why it helps the network learn better, too. It helps combat what the authors call internal covariate shift. This is discussed in the original paper and in the Deep Learning book (Goodfellow et al.), in section 8.7.1 of Chapter 8.

Benefits of Batch Normalisation

The intention behind batch normalisation is to optimise network training. It has been shown to have several benefits:

  1. Networks train faster — Each training iteration will be slower because of the extra normalisation calculations during the forward pass and the additional parameters to train during back propagation. However, the network should converge much more quickly, so training should be faster overall.
  2. Allows higher learning rates — Gradient descent usually requires small learning rates for the network to converge. As networks get deeper, gradients get smaller during back propagation, and so require even more iterations. Using batch normalisation allows much higher learning rates, increasing the speed at which networks train.
  3. Makes weights easier to initialise — Weight initialisation can be difficult, especially when creating deeper networks. Batch normalisation helps reduce the sensitivity to the initial starting weights.
  4. Makes more activation functions viable — Some activation functions don’t work well in certain situations. Sigmoids lose their gradient quickly, which makes them hard to use in deep networks, and ReLUs often die out during training (stop learning completely), so we must be careful about the range of values fed into them. But as batch normalisation regulates the values going into each activation function, nonlinearities that don’t work well in deep networks tend to become viable again.
  5. Simplifies the creation of deeper networks — The previous 4 points make it easier to build and faster to train deeper neural networks, and deeper networks generally produce better results.
  6. Provides some regularisation — Batch normalisation adds a little noise to your network, and in some cases (e.g. Inception modules) it has been shown to work as well as dropout. You can consider batch normalisation as a bit of extra regularisation, allowing you to reduce some of the dropout you might add to a network.
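
To make the point about activation functions a little more concrete, here is a small sketch comparing the sigmoid’s gradient on raw and on standardised pre-activations. The pre-activation values are invented for illustration, and `standardise` is a simplified stand-in for batch normalisation (no learned scale or shift).

```python
import math

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid at pre-activation z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def standardise(values, eps=1e-5):
    """Shift and scale a batch of scalars to roughly zero mean, unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# hypothetical pre-activations of one unit across a batch: large and off-centre,
# so the sigmoid is saturated for every sample
pre_activations = [8.0, 12.0, 15.0, 9.0, 11.0, 14.0]

n = len(pre_activations)
mean_grad_raw = sum(sigmoid_grad(z) for z in pre_activations) / n
mean_grad_bn = sum(sigmoid_grad(z) for z in standardise(pre_activations)) / n

print("mean sigmoid gradient, raw inputs:         ", mean_grad_raw)
print("mean sigmoid gradient, standardised inputs:", mean_grad_bn)
```

With the raw values the sigmoid sits on its flat tail and passes back almost no gradient. Once the same values are centred and rescaled they land in the region where the sigmoid still has a useful slope.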

As batch normalisation helps train networks faster, it also facilitates greater experimentation — as you can iterate over more designs more quickly.

It’s also worth bearing in mind that other techniques for inter-layer normalisation exist, such as instance normalisation (Ulyanov et al.), which is used in the CycleGAN architecture.