Breaking the ice with Batch Normalization

Anuj Shah (Exploring Neurons)
Aug 2, 2018


Batch normalization has been around for a while, but I did not get much opportunity to try it out and experience its power until recently, when I was training a 3D CNN model. Applying batch normalization reduced my validation/test loss and stabilized the model, just as the original paper claims.

To appreciate the essence of batch normalization, let's first discuss mean normalization of the input data. As stated in the literature, most of us mean-normalize the input data, i.e. transform it so that its distribution has zero mean and unit variance. The formula to do that is:

x_normalized = (x − µ) / σ

Where,

x — input

µ — mean of the input data

σ — Standard deviation of the input data

In Python, you can get a zero-mean, unit-variance distribution with the code below:

import numpy

mean = numpy.mean(x)  # compute the mean of the data
std = numpy.std(x)    # compute the standard deviation of the data
x = x - mean          # subtract the mean from every data sample to get zero mean (centered data)
x = x / std           # divide by the standard deviation to get unit variance

Centering the data to zero mean and scaling it to unit variance has several advantages in neural networks:

1. It makes your training faster.
2. It helps prevent training from getting stuck in local minima.
3. It gives a better-behaved error surface, so the loss is lower and converges to a low training/test error.

I trained a 2-layer fully connected network on a custom 4-class dataset and observed that, without normalization, the training error got stuck in a local optimum: the error did not decrease at all and the cost function never converged. The training steps over 300 epochs are shown below.

Training behavior without normalizing the input data: loss stuck at a local minimum of 12.1289 and accuracy not improving beyond 24%

As you can see, after just 2-3 iterations the loss settled at 12.1289 and never improved over the full 300 iterations; the same happened with accuracy, which never rose above 24%. The loss and accuracy curves are depicted below:

Loss and Accuracy curve — Training loss of 12.1289 and accuracy of 24%: Stalled due to un-normalized input data

Then I decided to mean-normalize the input data and train again for 300 iterations, and yay! The network converged and learned well. In fact it learned so quickly, within a few iterations, that I retrained it for just 60 iterations; the results are presented below.

Training behavior with normalized data: the loss converged to 0.0097 and accuracy to 100% in just 60 iterations
Loss & accuracy curves with normalized data: the network learned very quickly and converged to a low error in just 50 iterations

As we can see, just by transforming the input distribution to zero mean and unit variance:

1. The network came out of the local minimum and started training well.

2. The loss reduced to a very small value and the accuracy increased to almost 100%.

3. Normalizing the data sped up the training, and the network converged in very few iterations (just 50).

You can see from the experiment above that just by normalizing the input distribution we get a better model: it learns faster, converges to a lower error, and does not get stuck in local minima. Why normalizing the input improves learning is well explained by Andrew Ng in his deeplearning.ai course.

Assume you have 2D input data X = [x1 x2]. The figure below shows, from left to right: the un-normalized distribution → the distribution after subtracting the mean (zero-mean, centered distribution) → the distribution after dividing by the standard deviation (zero-mean, unit-variance distribution).

Normalizing training data. source: Andrew Ng deep learning course by deeplearning.ai

Assume, the x1 variable is in the range of 1 to 1000.

x2 variable is in the range of 0 to 1.

Since the x1 and x2 features lie in very different ranges, the ranges of values that the learnable parameters w1 and w2 take on are also quite different.

As explained by Andrew Ng, if you use un-normalized input features your cost function will be elongated, whereas if you normalize your features the cost function will be more symmetric. Both cost functions are depicted in the figure below:

(a) Elongated cost function for un-normalized input data; (b) Spherical cost function for normalized input data. Source: Andrew Ng deep learning course by deeplearning.ai

In Figure (a), with the elongated cost function, you may have to use a small learning rate: if w is initialized at the elongated end (shown by the red dot), it will need many steps (iterations), oscillating back and forth, before it reaches the minimum, i.e. the center of the contour plot of the cost function.

In Figure (b), when the data is normalized, you get a spherical cost function. As you can see in the contour plot, it doesn't matter much where w is initialized: we can use a higher learning rate and descend to the minimum without the oscillation. So training is fast and converges to a lower error value.
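To make the x1/x2 example concrete, here is a minimal NumPy sketch of per-feature normalization (the ranges and sample count are made up purely for illustration): the mean and standard deviation are computed column-wise, so both features end up on a comparable scale.

import numpy as np

# Toy data: feature x1 roughly in [1, 1000], feature x2 roughly in [0, 1]
X = np.column_stack([
    np.random.uniform(1, 1000, size=500),  # x1
    np.random.uniform(0, 1, size=500),     # x2
])

# Normalize each feature (column) separately: zero mean, unit variance per feature
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

print(X_norm.mean(axis=0))  # approximately [0, 0]
print(X_norm.std(axis=0))   # approximately [1, 1]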

However, in deep networks the input distribution of the hidden layers is no longer normalized; it keeps changing as the weights/parameters are updated, and this can hamper learning. The idea of batch normalization is to ensure that the input distribution of every hidden layer is normalized. The abstract of the 2015 paper, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", is reproduced below, and readers are advised to read the paper for better clarity:

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters. — Sergey Ioffe & Christian Szegedy

In batch normalization, "batch" signifies that the zero-mean, unit-variance normalization is done over a mini-batch of data samples.

The process of applying batch normalization is described by the equations below, reproduced from the original paper.

µ_B = (1/m) Σ x_i (mini-batch mean)
σ²_B = (1/m) Σ (x_i − µ_B)² (mini-batch variance)
x̂_i = (x_i − µ_B) / sqrt(σ²_B + ε) (normalize)
y_i = γ · x̂_i + β ≡ BN_γ,β(x_i) (scale and shift)

Applying Batch Normalization to a mini-batch of data. Source: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

I hope the equations are self-explanatory, but let me still walk you through them.

Step-1: You have a mini-batch of data, denoted B. In the first step, compute the mean of this mini-batch.

Step-2: Compute the variance of the mini-batch.

Step-3: Subtract the mean computed in step 1 from each sample in the mini-batch and divide by the standard deviation computed in step 2 (the standard deviation is the square root of the variance). A small epsilon is added to var(x) to avoid division by zero.

After step-3 the mini-batch B has a distribution with zero mean and unit variance.

Step-4: The normalized values are then passed through a linear transform with parameters gamma (γ) and beta (β). These are learnable parameters, and they make batch normalization quite powerful.

Say the network doesn't want zero mean and unit variance; the original distribution can then be restored if gamma (γ) and beta (β) are learned such that

gamma (γ) = sqrt(var(x))

beta (β) = mean(x)

These learnable parameters are a huge advantage: depending on the complexity of the problem, they can learn any distribution during training. So for any layer you start by transforming the data to zero mean and unit variance, but during training, with the help of gamma (γ) and beta (β), the layer can learn any other distribution that may suit the model better.
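Putting the four steps together, here is a minimal NumPy sketch of the training-time forward pass of batch normalization (the function and variable names are my own, just for illustration; this covers only the computation described above):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: mini-batch of shape (batch_size, num_features)
    mu = x.mean(axis=0)                     # Step-1: mini-batch mean, per feature
    var = x.var(axis=0)                     # Step-2: mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # Step-3: normalize (epsilon avoids division by zero)
    return gamma * x_hat + beta             # Step-4: scale and shift with learnable gamma and beta

# Example usage with a toy mini-batch of 4 samples and 3 features
x = np.random.randn(4, 3) * 10 + 5
gamma, beta = np.ones(3), np.zeros(3)       # with this initialization BN is a plain normalization
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))        # approximately 0 and 1 per feature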

The network with and without batch normalization is illustrated in the figure below.

source: https://wiki.tum.de/display/lfdv/Batch+Normalization

Since BN has learnable parameters, in an implementation we can treat it as just another layer.

So, after understanding a little about what batch normalization does, I tried implementing it in my 3D CNN, and of course I used the most amazing library, Keras (thanks, fchollet). After some research online, I found that people place the BN layer before the activation layer: https://www.dlology.com/blog/one-simple-trick-to-train-keras-model-faster-with-batch-normalization/

I used batch normalization for the convolution layers and the dense layer. My model is described below.

My 3D CNN model with batch normalization for convolution layers and dense layer
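My exact architecture is shown in the figure above; as a rough Keras sketch of the pattern I followed (a BatchNormalization layer inserted after each Conv3D or Dense layer and before its activation), it would look something like the code below. The layer sizes, input shape, and optimizer here are placeholders for illustration, not my actual configuration.

from keras.models import Sequential
from keras.layers import Conv3D, MaxPooling3D, Flatten, Dense, Activation, BatchNormalization

model = Sequential()

# Convolution block: Conv3D -> BatchNormalization -> activation
model.add(Conv3D(32, kernel_size=(3, 3, 3), input_shape=(16, 64, 64, 1)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling3D(pool_size=(2, 2, 2)))

model.add(Flatten())

# Dense block: Dense -> BatchNormalization -> activation
model.add(Dense(128))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Dense(4, activation='softmax'))   # e.g. 4 classes, as in the earlier experiment
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

The important part is simply that BatchNormalization sits between the linear layer and its non-linearity, as suggested in the link above.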

The loss curves for the training and test data, without and with batch normalization, are shown below.

Loss curves. Left: without BN; right: with BN. We can clearly see that applying batch normalization improves the test/validation loss.

As we can see from the curves above, batch normalization gives a similar near-zero loss on the training data, but the test/validation loss improves: learning is more stable and converges to a lower error value.

So this was my first experience with Batch normalization and I am quite happy with its performance. I hope my experience will be helpful to others too. Thank you.

If you find my articles helpful and wish to support them — Buy me a Coffee
