Understanding Batch Normalization

Introduction

Krishna D N
Aug 10, 2018

Batch Normalization is a very well-known method for training deep neural networks. It was introduced by Sergey Ioffe and Christian Szegedy from Google's research lab. Batch Normalization is about normalizing the hidden unit activation values so that the distribution of these activations remains the same during training. During training of a deep neural network, if the distribution of the hidden activations at a layer changes because of changes in that layer's weights and biases, it causes rapid changes in the layers above it, which slows down training considerably. This change in the distribution of the hidden activations during training is called internal covariate shift, and it affects the training speed of the network. Batch Normalization was conceived and designed to overcome the problem of internal covariate shift. In 1998, the paper "Efficient BackProp" by Yann LeCun, Léon Bottou, and others showed that normalizing or whitening the input data helps train neural networks very efficiently. The same idea can be applied at the activation level in order to train deep neural networks efficiently. Consider the picture shown below.

Here we have a deep neural network with 3 hidden layers along with an input and an output layer. Each hidden layer has its own weight matrix and bias vector, as shown in the figure. The input at each layer goes through an affine transformation using that layer's weight matrix and bias vector. For example, the output from hidden layer L2 acts as the input to hidden layer L3: the layer 2 (L2) hidden activation values are multiplied by the layer 3 weight matrix and added to its bias values. This result is passed through an activation function like sigmoid, relu, or tanh to obtain the output of hidden layer L3. The process is repeated at every hidden layer. As we saw, the layer 3 activations are directly affected by the layer 2 activations, so if the distribution of the layer 2 activation values changes rapidly, it hurts the efficiency of training the deep neural network.
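As a minimal sketch of this forward pass (the layer sizes, weight names W3 and b3, and the use of NumPy here are my own illustrative assumptions, not part of the original figure):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical shapes: x holds the activations coming out of layer L2,
# W3 and b3 are the weight matrix and bias vector of layer L3.
x = np.random.randn(64, 100)   # mini-batch of 64 examples, 100 units in L2
W3 = np.random.randn(100, 50)  # layer L3 weight matrix
b3 = np.zeros(50)              # layer L3 bias vector

z3 = x @ W3 + b3               # affine transformation at layer L3
h3 = sigmoid(z3)               # activation output of hidden layer L3
```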

Consider the picture shown above, where we view one single deep neural network as multiple subnetworks. During deep neural network training we apply normalization to the inputs [x1,x2,…….xN] so that the network trains efficiently. If we split a single DNN into multiple subnetworks, then we can think of the hidden activations from layer 1 as the inputs to the second subnetwork, and we can apply the same normalization to the hidden activations of layer 1 (the inputs to layer 2).

Batch Normalization

In the above section we saw why it is necessary to apply normalization to the hidden unit activation values. In this section we will see how to normalize the hidden activations. Normalizing any data means finding its mean and variance and transforming it so that it has zero mean and unit variance. In our case we want to normalize each hidden unit activation. Suppose we have d hidden units in a hidden layer of a deep neural network. We can represent the activation values of this layer as x=[x1,x2,………xd]. Now we can normalize the kth hidden unit activation using the formula below.
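The referenced formula is the standard per-unit normalization, x̂(k) = (x(k) − E[x(k)]) / sqrt(Var[x(k)]); a small NumPy sketch of it (the sample data and the epsilon added for numerical stability are my own assumptions) looks like this:

```python
import numpy as np

x = np.random.randn(256) * 3.0 + 5.0   # activations of the kth hidden unit over many examples
eps = 1e-5                             # small constant for numerical stability

# x_hat(k) = (x(k) - E[x(k)]) / sqrt(Var[x(k)] + eps)
x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)
print(x_hat.mean(), x_hat.var())       # approximately 0 and 1
```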

Here x̂(k) is the normalized value of the kth hidden unit, E[x(k)] is the expectation (mean) of the kth unit's values, and Var[x(k)] is the variance of the kth hidden unit. After normalization each hidden unit will have zero mean and unit variance, but we typically do not want to force a mean of 0 and a variance of 1. Instead we want the network to learn and adapt these values. For this we introduce two new parameters per unit, a scale and a shift, which are learned and updated along with the weights and biases during training. The final normalized, scaled, and shifted version of the kth hidden unit activation is given below.
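The referenced formula is the learned scale-and-shift step, y(k) = γ(k)·x̂(k) + β(k); a short sketch (the initial values chosen for γ and β here are the usual convention, assumed rather than taken from the missing figure):

```python
import numpy as np

x_hat = np.random.randn(256)   # normalized activations of the kth unit (zero mean, unit variance)

# gamma (scale) and beta (shift) are learned along with the weights and biases;
# they are typically initialized to 1 and 0 respectively.
gamma, beta = 1.0, 0.0
y = gamma * x_hat + beta       # y(k) = gamma(k) * x_hat(k) + beta(k)
```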

Typically when training a deep neural network we don't feed the entire dataset in one shot, because the computational cost becomes too high. Instead, neural networks are trained with stochastic optimization techniques, where a small batch of data is sampled from the whole dataset and the network parameters are updated based on the loss on that batch. The assumption is that the optimization will still find a good minimum, as described in the classic work by Herbert Robbins on stochastic approximation. In practice we train deep neural networks with mini-batches of size 32, 64, 128, etc., and in that case batch normalization can be applied as described in the steps below.

Assume we have a mini-batch of m training examples. We pass this mini-batch to our neural network, and at layer i we get the hidden activation matrix Hi. We then compute the mean and variance of each column, as shown in the figure below, and apply the batch normalization transformation.
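A minimal NumPy sketch of this step, assuming a mini-batch size of 32 and 128 hidden units at layer i (these sizes, and the epsilon term, are my own illustrative choices):

```python
import numpy as np

m, d = 32, 128                      # mini-batch size and number of hidden units at layer i
H = np.random.randn(m, d)           # hidden activation matrix H_i (one row per example)
eps = 1e-5

mu = H.mean(axis=0)                 # per-unit (per-column) mean over the mini-batch
var = H.var(axis=0)                 # per-unit (per-column) variance over the mini-batch
H_hat = (H - mu) / np.sqrt(var + eps)

gamma = np.ones(d)                  # learned scale parameters
beta = np.zeros(d)                  # learned shift parameters
H_bn = gamma * H_hat + beta         # batch-normalized activations passed to the next layer
```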

During training we update the batch normalization parameters along with the neural network's weights and biases. One more important observation is that batch normalization acts as a regularizer, because of the randomness introduced by estimating the mean and variance from mini-batches.

Batch Normalization during inference

During the testing or inference phase we can't apply the same batch normalization as we did during training, because we might pass only one sample at a time, and it doesn't make sense to compute a mean and variance from a single sample. For this reason we keep a running average of the mean and variance of the kth unit during training and use those values, together with the trained batch-norm parameters, during testing or inference. The process can be understood from the picture below, which explains the steps during the inference phase.
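A sketch of this idea in NumPy (the momentum value, the helper names update_running_stats and batch_norm_inference, and the sizes are my own assumptions, not from the missing picture):

```python
import numpy as np

momentum = 0.9   # a common choice for the exponential moving average; it is a hyperparameter
eps = 1e-5

def update_running_stats(running_mean, running_var, batch_mean, batch_var):
    """Exponential moving average of the batch statistics, updated once per mini-batch."""
    new_mean = momentum * running_mean + (1 - momentum) * batch_mean
    new_var = momentum * running_var + (1 - momentum) * batch_var
    return new_mean, new_var

def batch_norm_inference(x, gamma, beta, running_mean, running_var):
    """At test time a single sample is normalized with the stored running statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

d = 128
running_mean, running_var = np.zeros(d), np.ones(d)
gamma, beta = np.ones(d), np.zeros(d)

# During training, after each mini-batch:
batch = np.random.randn(32, d)
running_mean, running_var = update_running_stats(running_mean, running_var,
                                                 batch.mean(axis=0), batch.var(axis=0))

# During inference, on a single sample:
sample = np.random.randn(d)
y = batch_norm_inference(sample, gamma, beta, running_mean, running_var)
```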

How to use Batch-Normalization in Deep learning libraries

Tensorflow

In TensorFlow you can use the tf.nn.batch_normalization API to add batch normalization to your deep neural networks. The original paper reports reaching the same accuracy with 14 times fewer training steps. This API normalizes the input using the given mean and variance and then applies the batch-norm scale and shift.
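A small usage sketch, assuming TensorFlow 2 and illustrative shapes of my own choosing (the batch statistics come from tf.nn.moments):

```python
import tensorflow as tf

x = tf.random.normal([32, 100])              # a mini-batch of hidden activations
mean, variance = tf.nn.moments(x, axes=[0])  # per-unit batch statistics
gamma = tf.Variable(tf.ones([100]))          # learned scale
beta = tf.Variable(tf.zeros([100]))          # learned shift

y = tf.nn.batch_normalization(x, mean, variance,
                              offset=beta, scale=gamma,
                              variance_epsilon=1e-5)
```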

Pytorch

In PyTorch we can use torch.nn.BatchNorm1d or torch.nn.BatchNorm2d to apply batch norm to a neural network layer. The picture below shows the code that I wrote for 1-D convolution over speech signals, which uses batch norm at every convolution layer.
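Since the picture is not reproduced here, the following is only a minimal sketch in the same spirit (the class name ConvBNNet, the channel sizes, and the input shapes are my own assumptions, not the original code):

```python
import torch
import torch.nn as nn

class ConvBNNet(nn.Module):
    """A small 1-D convolutional network for speech features, with batch norm after every convolution."""
    def __init__(self, in_channels=40, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):          # x: (batch, in_channels, time)
        h = self.features(x)
        h = h.mean(dim=-1)         # average over the time dimension
        return self.classifier(h)

model = ConvBNNet()
dummy = torch.randn(8, 40, 200)    # 8 utterances, 40 filterbank features, 200 frames
logits = model(dummy)
```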

About Me

I currently work at an AI company based in Bangalore called Cogknit Semantics. We work on speech, computer vision, and NLP problems and have built strong solutions in all three areas. We have published many papers in both national and international conferences. Our speech team was the runner-up in a speech recognition challenge for three Indian languages conducted by Microsoft. Feel free to chat with us. Visit our company website here.
