Batch Normalization Explained!
This article explains batch normalization and why it is used when training neural networks.
The objectives of batch normalization are as follows:
1. Speeds up the training process
2. Decreases the importance of initial weights
3. Regularizes the model
Let us consider an example to better understand the concept of batch normalization. Suppose there are two independent variables, or features, “Age” and “Height”, with the values given below.
As we can see in the above table, the range of the “Height” feature is small compared to that of the “Age” feature. This means that small changes or fluctuations in the Age feature can have a significantly larger impact on the dependent variable than changes in the Height feature. The predictive influence of the independent features is then driven by their ranges rather than by their actual linear or non-linear relationships with the dependent variable. In such a scenario we need a small learning rate (LR) so that we don’t overshoot the minimum, which reduces the training speed of the model.
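The original table is not reproduced here, so the numbers below are purely hypothetical, but a short NumPy sketch illustrates how differently scaled two such features can be:

```python
import numpy as np

# Hypothetical stand-ins for the article's table (the original values
# are not shown): "Age" in years spans a wide range, "Height" in
# metres a narrow one.
age    = np.array([18.0, 25.0, 34.0, 47.0, 62.0, 80.0])
height = np.array([1.52, 1.60, 1.68, 1.75, 1.81, 1.90])

print("Age    range:", age.max() - age.min(), " std:", round(age.std(), 2))
print("Height range:", round(height.max() - height.min(), 2), " std:", round(height.std(), 2))

# A weight update of a given size moves the model's output far more along
# the wide-range feature than along the narrow-range one, which is why the
# learning rate has to stay small for un-normalized data.
```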
When batch normalization is applied, the data has a mean of 0 and a standard deviation of 1. This makes the cost function more symmetric in nature.
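A minimal sketch of this standardization step, written here in NumPy, might look as follows. A full batch-normalization layer also learns a scale (gamma) and shift (beta); they are included with their default identity values:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize a mini-batch feature-wise, then scale and shift.

    x     : array of shape (batch_size, num_features)
    gamma : learnable scale (1.0 keeps the unit-variance data as is)
    beta  : learnable shift (0.0 keeps the zero-mean data as is)
    eps   : small constant to avoid division by zero
    """
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # mean 0, std 1
    return gamma * x_hat + beta

# Hypothetical Age/Height mini-batch, continuing the earlier example
batch = np.array([[18.0, 1.52],
                  [47.0, 1.75],
                  [80.0, 1.90]])
normalized = batch_norm(batch)
print(normalized.mean(axis=0).round(3))  # ~[0, 0]
print(normalized.std(axis=0).round(3))   # ~[1, 1]
```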
The diagram below depicts the shape of the data before and after batch normalization.
Once the shape of the data becomes symmetric, it is easy to train the model regardless of the point from which training starts, because the global minimum is roughly equidistant from all the outer edges of the cost surface. Conversely, training without batch normalization is a more difficult process, as it is inconvenient to traverse a terrain with a non-symmetric shape.
When the data is non-symmetric, reaching the global minimum can take a longer or shorter time depending on the point from which we start to train the model.
The diagram below gives a better understanding of the concept.
As we can see in the above diagram, if we start to train the model from Point A it takes 1000 iterations to reach the global minimum, whereas starting from Point B it reaches the global minimum in only 100 iterations. This gives rise to uncertainty about the time required to reach the global minimum.
If we use batch normalization, this uncertainty is significantly reduced. For instance, if we start to train the model from Point C it takes 150 iterations to reach the global minimum, and starting from Point D it takes 155 iterations. This symmetric nature of the data brings stability to the model.
Batch normalization also regularizes the model: every mini-batch is composed of random samples, so the batch statistics vary slightly from batch to batch. This noise works in a way similar to a dropout layer, which randomly drops units, making the model more robust.
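In practice the layer usually comes from a framework rather than being written by hand. The sketch below assumes PyTorch and a made-up two-feature network purely for illustration; it shows a batch normalization layer placed after a linear layer, next to dropout for comparison:

```python
import torch
import torch.nn as nn

# Hypothetical two-feature (Age, Height) regression network.
# BatchNorm1d normalizes each activation over the current mini-batch,
# so different mini-batches see slightly different statistics --
# the mild noise that gives the regularizing, dropout-like effect.
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.BatchNorm1d(16),   # normalize the 16 hidden activations
    nn.ReLU(),
    nn.Dropout(p=0.2),    # explicit regularizer, for comparison
    nn.Linear(16, 1),
)

x = torch.randn(8, 2)     # a mini-batch of 8 random samples
model.train()             # batch statistics are used in training mode
print(model(x).shape)     # torch.Size([8, 1])
```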