Glossary of Deep Learning: Batch Normalisation

Jaron Collis
Jun 27, 2017 · 4 min read

Batch normalisation is a technique for improving the performance and stability of neural networks, and also makes more sophisticated deep learning architectures work in practice (like DCGANs).

The idea is to normalise the inputs of each layer in such a way that they have a mean output activation of zero and standard deviation of one. This is analogous to how the inputs to networks are standardised.
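To make the standardisation concrete, here is a minimal NumPy sketch (the toy values are illustrative, not from the article) of normalising a batch of activations to zero mean and unit standard deviation, feature by feature:

```python
import numpy as np

# a toy batch of 4 samples, each with 3 activations (illustrative values)
batch = np.array([[1.0, 200.0, -3.0],
                  [2.0, 220.0, -1.0],
                  [3.0, 240.0,  1.0],
                  [4.0, 260.0,  3.0]])

# standardise each feature across the batch: subtract the batch mean,
# divide by the batch standard deviation
normalised = (batch - batch.mean(axis=0)) / batch.std(axis=0)

print(normalised.mean(axis=0))  # ~[0, 0, 0]
print(normalised.std(axis=0))   # ~[1, 1, 1]
```

Each column now has mean zero and standard deviation one, regardless of the scale of the raw values.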

How does this help? We know that normalising the inputs to a network helps it learn. But a network is just a series of layers, where the output of one layer becomes the input to the next. That means we can think of any layer in a neural network as the first layer of a smaller subsequent network.

Thinking of the network as a series of sub-networks feeding into each other, we normalise the output of one layer before applying the activation function, and then feed it into the following layer (sub-network).

In Keras, it is implemented using the following code. Note how the BatchNormalization call occurs after each fully-connected layer, but before the activation function and dropout.

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers.normalization import BatchNormalization
from keras.optimizers import SGD
model = Sequential()
# think of this as the input layer: Dense, then BatchNormalization,
# then the activation, then dropout
model.add(Dense(64, input_dim=16, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))
# think of this as the hidden layer, with the same ordering
model.add(Dense(64, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))
# think of this as the output layer
model.add(Dense(2, init='uniform'))
model.add(Activation('softmax'))
# optimiser and loss function
model.compile(loss='binary_crossentropy', optimizer=SGD(lr=0.1))
# train the model
model.fit(X_train, y_train, nb_epoch=50, batch_size=16)

Batch normalisation was introduced in Ioffe &amp; Szegedy's 2015 paper. The idea is that, instead of just normalising the inputs to the network, we normalise the inputs to layers within the network. It's called "batch" normalisation because during training, we normalise the activations of the previous layer for each batch, i.e. apply a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.
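The training-time transformation from the paper can be sketched in a few lines of NumPy. Besides the normalisation itself, the paper adds learnable per-feature scale and shift parameters (gamma and beta), so the layer can undo the normalisation if that turns out to be useful; the running averages used at inference time are omitted here for brevity:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch-norm forward pass (sketch).

    x: (batch, features) activations from the previous layer.
    gamma, beta: learnable per-feature scale and shift.
    """
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalise to ~zero mean, unit std
    return gamma * x_hat + beta            # learnable rescale and shift

# activations far from zero mean / unit std
x = np.random.randn(16, 4) * 5.0 + 3.0
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With gamma = 1 and beta = 0, the output has approximately zero mean and unit standard deviation per feature; during training, gamma and beta are updated by gradient descent like any other weights.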

Beyond the intuitive reasons, there are good mathematical reasons why it helps the network learn better, too. It helps combat what the authors call internal covariate shift: the change in the distribution of a layer's inputs as the parameters of the preceding layers are updated during training. This is discussed in the original paper and in the Deep Learning book (Goodfellow et al.), section 8.7.1.

Benefits of Batch Normalization

The intention behind batch normalisation is to optimise network training, and it has been shown to have several practical benefits.

Because batch normalisation helps networks train faster, it also facilitates greater experimentation, since you can iterate over more designs more quickly.

It's also worth bearing in mind that other inter-layer normalisation techniques exist, such as instance normalisation (Ulyanov et al.), which is used in the CycleGAN architecture.

Deeper Learning

Learning and Applying Artificial Intelligence
