9. Introduction to Deep Learning with Computer Vision — Normalization & Batch Normalization
Written by Nilesh Singh & Praveen Kumar.
After learning about activation functions, which are crucial for our deep neural network as we move on to build more complex and amazing architectures, it's time we focus on a few more important concepts.
What is normalization?
Okay, let us suppose you and your friend are working in a company on different projects. After the completion of the projects, you are both supposed to get a bonus. Let's suppose you get a bonus of 1000 rupees per month; not great, but you still accept it. Then you head over to your friend and ask about his bonus. Apparently, he also got a bonus of 1000 rupees. Ahh, the relief. But as it turns out, he got it on a daily basis. Ughh, not fair, so not fair.
Normalization is used to prevent the exact same thing from happening inside a neural network. Normalization, specifically pixel normalization, maps every pixel of an RGB image to an appropriate point on the 0 to 255 scale. This is done by scaling the brightest pixel to 255 and the darkest pixel to 0. In simple terms, it redistributes all the pixels across the whole scale and helps prevent pixel intensities from clustering in a narrow band.
You can observe the difference in the quality of features in the image just with normalization, where the bottom image is, of course, pixel normalized.
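Here is a minimal NumPy sketch of this kind of pixel (min-max) normalization; the input array is a made-up "dull" image whose intensities cluster in a narrow band:

```python
import numpy as np

def pixel_normalize(img):
    """Stretch pixel intensities so the darkest pixel maps to 0
    and the brightest maps to 255 (min-max contrast stretching)."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    # Guard against a constant image, where hi == lo.
    if hi == lo:
        return np.zeros_like(img, dtype=np.uint8)
    stretched = (img - lo) / (hi - lo) * 255.0
    return stretched.astype(np.uint8)

# Hypothetical dull image: intensities cluster between 100 and 140.
img = np.random.randint(100, 141, size=(64, 64), dtype=np.uint8)
out = pixel_normalize(img)
print(out.min(), out.max())  # 0 255 -- the full scale is now used
```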
Now let us talk about image normalization. Let us suppose we have a dataset consisting of the following images.
The idea behind image normalization is that we calculate the mean of all the images in the dataset, and then their variance. The mean (left) and variance (right) would look somewhat like this:
Then we subtract the mean from each image and divide it by the standard deviation. What does it do?
As is evident from the updated dataset, the features on the face, that is, the relevant features, are highlighted, while the features on the sides are distorted, i.e. the irrelevant features are weeded out.
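A minimal NumPy sketch of this procedure, assuming the whole dataset fits in memory as a single array (the dataset here is randomly generated, purely for illustration):

```python
import numpy as np

# Hypothetical dataset: 1000 grayscale face images of size 64x64.
dataset = np.random.rand(1000, 64, 64).astype(np.float32)

# Per-pixel mean and standard deviation across the whole dataset.
mean = dataset.mean(axis=0)        # shape (64, 64), the "mean face"
std = dataset.std(axis=0) + 1e-8   # small epsilon avoids division by zero

# Subtract the mean from each image and divide by the standard deviation.
normalized = (dataset - mean) / std
print(normalized.mean(), normalized.std())  # roughly 0.0 and 1.0
```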
Data normalization, in turn, inherently ensures weight normalization. If our relevant features are uniform and not overblown, then the weights corresponding to them will also be uniform. This ensures that backpropagation works efficiently.
For example, suppose we have two weights, W1 = 0.5 and W2 = 100. After a backpropagation step, if the intended weight update is 0.3, then W1 = 0.5 + 0.3 = 0.8 and W2 = 100 + 0.3 = 100.3. Both weights are updated by the same value, but the effect on W1 is far more pronounced than the effect on W2, rendering the backpropagation step much less potent (think about the bonus example).
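To make the difference concrete, here is a quick back-of-the-envelope check in Python, using just the numbers from the example above:

```python
w1, w2 = 0.5, 100.0   # two weights on very different scales
update = 0.3          # the same backpropagation update for both

print(update / w1 * 100)  # 60.0 -> a 60% relative change in W1
print(update / w2 * 100)  # 0.3  -> a 0.3% relative change in W2
```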
But if our data had been normalized in the first place, then W1 and W2 would have been of a similar scale, and backpropagation could have worked its magic with much more ease.
So far we've discussed normalization that happens at the beginning of each epoch, but what happens during the epoch? Who's to stop the weights from growing non-uniformly?
Well, this is where Batch Normalization comes into the picture.
Batch Normalization (BN)
Before going into BN, we would like to cover Internal Covariate Shift, a very important topic for understanding why BN exists and why it works.
Whenever we want to train a network, we provide a batch of inputs from the complete dataset, and these batches keep changing until the epoch is completed. In simpler words, the whole dataset is divided into batches of images, with every image in a batch being processed in parallel on the GPU. Training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers, so small changes in the network parameters amplify as the network becomes deeper. The training proceeds in steps, at each step considering one mini-batch. Each mini-batch therefore has a different distribution, coming from a different part of the dataset, and the network has to continuously adapt to a new distribution. Whenever this input distribution changes, it is called Covariate Shift. Internal Covariate Shift simply refers to these changes happening inside the network's layers. Let's look at an example.
From fig 5, we can see that the distribution varies across different datasets. This would lead to a failing model when deployed, since the distribution the model is trained on is very different from the test distribution.
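As a rough sketch of what fig 5 is showing, here are two hypothetical mini-batches drawn from different parts of a dataset (the means, spreads, and sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical mini-batches of 10 images (32x32 each), drawn from
# different parts of the dataset: mostly dark vs. mostly bright images.
batch_a = rng.normal(loc=60, scale=20, size=(10, 32, 32))
batch_b = rng.normal(loc=190, scale=25, size=(10, 32, 32))

print(batch_a.mean(), batch_a.std())  # roughly 60, 20
print(batch_b.mean(), batch_b.std())  # roughly 190, 25
# A layer fed these batches sees a different input distribution at
# every training step -- this is the covariate shift described above.
```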
Another intuitive way of putting it: we have many, many kernels in our neural network, and a kernel is responsible for extracting features from input images. Suppose we are trying to predict a cat. We can have different colors of cats as well. For simplicity, assume we have 3 differently colored cats, that is, a white cat, a red cat, and a black cat. Our kernel is supposed to shout, "I have found a cat in this image!" But the confidence value will change as differently colored cats appear in an image. This is because, for a white cat, the input pixel values will be much larger (200–255) when compared to a black cat (0–50), whose pixel values will be smaller. Thus, the kernel will produce different output values, and when these output values are passed on to the successive layers, information is lost between different kinds of layers and activations. So we need to make sure all 3 cat inputs come from the same distribution.
Thus our aim is to centralize these distributions (at least during training and validation), which vary with each mini-batch. This is the main reason why BN comes into the picture.
Let's dive into BN now and the math behind it. To understand it, follow along carefully with the following example.
Assume your batch size is 10. This means that at any given time, you are giving 10 images as input to the network. Also assume that you have 3 layers in your network, with 32, 64, and 128 channels in each consecutive layer respectively. So Layer 1 has 32 channels, Layer 2 has 64 channels, and Layer 3 has 128 channels. Now we will focus on one layer, say the first layer, which has just 32 channels, while our batch size is 10 images. Carefully look at the following diagram (read it from left to right).
The first question is: how are there 320 (32 channels × 10 images for each channel) images even though our batch size was 10? This is what we call mini-batches. We know that the GPU works in parallel. So when we specify 32 channels, we are creating 32 different kernels inside the GPU, where each kernel is responsible for extracting a different feature. Each kernel is given all 10 images at once instead of one by one; thus, 32 kernels work on the same images in parallel. Hence, in memory, this effectively forms 320 images. We still have 10 images as our batch size, but now 32 mini-batches of the same 10 images. This is why BN has the term "Batch" in it, which means we work on batches inside layers; otherwise, it could simply have been coined Normalization.
So, BN calculates the mean and variance for each mini-batch (x1, x2, …, x32), updates that mini-batch (x1′, x2′, …, x32′), and passes it to the next layer, and so on. Thus a very important point to note here is:
The number of means and variances is equal to the number of channels in a particular layer. Also, for each mini-batch, the mean and variance values are different. So, the mean and variance of x1 are different from the mean and variance of x2, and so on.
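A quick way to verify this is with PyTorch, as one concrete implementation (the 28×28 spatial size is an arbitrary assumption for the sketch):

```python
import torch
import torch.nn as nn

# A hypothetical mini-batch: 10 images, 32 channels, 28x28 spatial size.
x = torch.randn(10, 32, 28, 28)

# One mean and one variance per channel, computed over the batch and
# spatial dimensions (the 32 "mini-batches" from the figure above).
per_channel_mean = x.mean(dim=(0, 2, 3))
per_channel_var = x.var(dim=(0, 2, 3))
print(per_channel_mean.shape, per_channel_var.shape)  # torch.Size([32]) each

# PyTorch's built-in layer likewise keeps exactly 32 running statistics.
bn = nn.BatchNorm2d(num_features=32)
print(bn.running_mean.shape)  # torch.Size([32])
```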
Let’s look at what the original paper suggests.
Fig 7 shows exactly what we have seen in the previous figure. For each layer and for each mini-batch, we subtract the mean and divide by the standard deviation to normalize that current mini-batch, and similarly for all the mini-batches. We pass this updated and normalized mini-batch onward and ask the model to learn features from it. Do not mind the parameters Gamma and Beta; those are simply learnable parameters of the model. Basically, they exist to make sure that this mini-batch normalization does not skew the data at hand in a wrong direction. If our normalization were perfect, then Beta would be 0 and Gamma would be 1. Let's look at the batch normalization algorithm now.
The algorithm could not be simpler for us now; we have just seen these exact steps. For each mini-batch, we calculate the mean and variance. We then subtract the mean from the mini-batch and divide by the standard deviation to get the updated mini-batch, and then use this updated mini-batch to learn features.
NOTE: We have an epsilon term added to the variance in the denominator to avoid the pitfall of a divide-by-zero exception.
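Here is a minimal NumPy sketch of the forward pass of this algorithm for a (batch, channels, height, width) mini-batch; the tensor shapes and initial Gamma/Beta values are assumptions for illustration:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-norm forward pass for a (N, C, H, W) mini-batch.

    Mean and variance are computed per channel, over the batch and
    spatial dimensions, as in the algorithm above."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)       # eps avoids divide-by-zero
    return gamma * x_hat + beta                   # learnable scale and shift

x = np.random.randn(10, 32, 28, 28)
gamma = np.ones((1, 32, 1, 1))   # Gamma = 1: start as pure normalization
beta = np.zeros((1, 32, 1, 1))   # Beta = 0: no initial shift
y = batch_norm_forward(x, gamma, beta)
print(y.mean(), y.std())  # roughly 0.0 and 1.0
```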
That's all Batch Normalization is. Easy when visualized, and tricky enough to confuse.
If you need a deeper understanding of Batch Normalization and the complete math behind it, please go through this paper presented by Google at the 32nd International Conference on Machine Learning, Lille, France, 2015.
Hope you enjoyed it. See you soon!
NOTE: We are starting a new Telegram group to tackle all the questions and any sort of queries. You can openly discuss concepts with other participants and get more insights, and this will be more helpful as we move further down the publication. [Follow this LINK to join]