Batch Normalization: a technique that enhances training
Why is this the most commonly used normalization technique in neural architectures?
While reading the Batch Normalization (BN) paper by Sergey Ioffe and Christian Szegedy, I noticed that it has been cited around 29.5k times so far. I also think that we, as ML practitioners, use batch normalization very often.
- Which brings up the question: why are we using this technique?
- What are the benefits of using such techniques in the neural architecture we build?
In the paper, the authors state why they use batch normalization: to accelerate deep network training. The next question is what keeps a network from learning faster in the first place. Again, the answer comes from the paper itself.
Training a deep neural network (DNN) is complicated by the fact that the distribution of each layer's inputs changes during training, because the parameters of the previous layers change. The authors call this internal covariate shift (ICS). It slows down training by requiring lower learning rates and careful parameter initialization, and it makes it notoriously hard to train models with saturating nonlinearities. The authors claim that BN suppresses ICS and thereby accelerates the training of DNNs.
To optimize the problem at hand, we use some variant of gradient descent (GD). Stochastic gradient descent is simple and effective, but it requires careful tuning of hyper-parameters, specifically the learning rate and the parameter initialization. Internal covariate shift complicates training further, because small changes to the parameters of early layers amplify as the network gets deeper, so each layer has to keep adapting to a new input distribution. That is why lower learning rates are needed, which increases the training time.
BN reduces the internal covariate shift, so the distribution of the nonlinearity inputs remains more stable during training and the optimizer is less likely to get stuck in the saturated regime. As a result, training accelerates.
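To make the saturation point concrete, here is a minimal NumPy sketch. The helper names `sigmoid` and `sigmoid_grad`, as well as the mean and spread of the fake pre-activations, are my own choices for illustration and do not come from the paper. It compares the average sigmoid gradient for raw inputs that sit far from zero with the same inputs after they are normalized to zero mean and unit variance.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; it is close to 0 when |z| is large (saturation).
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Hypothetical pre-activations with a shifted mean and a large spread.
z = rng.normal(loc=6.0, scale=4.0, size=10_000)

# The same values normalized to zero mean and unit variance.
z_norm = (z - z.mean()) / z.std()

mean_grad_raw = sigmoid_grad(z).mean()
mean_grad_norm = sigmoid_grad(z_norm).mean()
print("mean gradient, raw inputs:       ", mean_grad_raw)
print("mean gradient, normalized inputs:", mean_grad_norm)
```

The normalized inputs sit mostly in the steep part of the sigmoid, so on average much more gradient passes through the nonlinearity.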
Along with this benefit, there are others too:
- BN has a beneficial effect on the gradient flow through the network, as it reduces the dependence of the gradients on the scale of the parameters/weights (W) or on their initial values.
- Consequently, it allows us to use higher learning rates without the risk of divergence.
- It regularizes the model and so reduces the need for dropout.
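The first point can be checked numerically. The paper notes that for a scalar a, BN(Wu) = BN((aW)u), so rescaling the weights does not change the layer output. Below is a small sketch of that check; `batch_norm` is an illustrative helper I wrote for this post (not a library function), and I set the small stabilizing constant `eps` to 0 so that the equality holds up to floating-point rounding.

```python
import numpy as np

def batch_norm(x, eps=0.0):
    # Normalize every feature (column) over the mini-batch (rows).
    # eps=0.0 is an assumption made here so the scale invariance is exact;
    # with eps > 0 it only holds approximately.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 5))   # mini-batch of 64 inputs with 5 features
W = rng.normal(size=(5, 3))    # weights of a linear layer
a = 10.0                       # arbitrary rescaling of the weights

out = batch_norm(u @ W)
out_scaled = batch_norm(u @ (a * W))

same = np.allclose(out, out_scaled)
print("BN(Wu) equals BN((aW)u):", same)
```

Because the output does not depend on the scale of W, a larger learning rate that inflates the weights does not by itself inflate the activations that later layers see.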
As LeCun et al. and Wiesler and Ney suggest, network training converges faster if its inputs are whitened, i.e. linearly transformed to have zero mean and unit variance, and decorrelated. Since each layer takes the output of the previous layer as its input, it would be advantageous to whiten the inputs of every layer as well.
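For readers who have not seen whitening in code, here is a minimal sketch of one common way to do it (PCA whitening via an eigendecomposition of the covariance matrix). The toy data and the mixing matrix `A` are made up for illustration; the paper does not prescribe this particular procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2-D inputs with a non-zero mean.
A = np.array([[2.0, 0.0],
              [1.5, 0.5]])
x = rng.normal(size=(5000, 2)) @ A.T + np.array([3.0, -1.0])

# Step 1: zero mean.
x_centered = x - x.mean(axis=0)

# Step 2: decorrelate and rescale to unit variance.
cov = np.cov(x_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
x_white = (x_centered @ eigvecs) / np.sqrt(eigvals)

# The covariance of the whitened data is the identity matrix (up to rounding).
cov_white = np.cov(x_white, rowvar=False)
print(np.round(cov_white, 3))
```

Note that this needs the full covariance matrix and its eigendecomposition, which already hints at why doing it for every layer at every training step is expensive.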
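There is a catch with whitening the inputs, though. If the normalization parameters are computed outside the GD step, the gradient update does not account for the normalization, and the model blows up: a parameter such as a bias can grow indefinitely while the loss stays fixed, because the normalization cancels out the update. So the normalization happens, but it does not help to reduce the loss. The issue is that the GD optimization does not take into account the fact that the normalization takes place.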
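The paper illustrates this with a layer that adds a learned bias b to its input u and then subtracts the mean. The following toy sketch reproduces that situation; the regression targets, the squared-error loss, the learning rate, and the number of steps are my own assumptions, chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=100)              # layer inputs over the training set
target = rng.normal(size=100) + 1.0   # hypothetical regression targets
b = 0.0                               # learned bias
lr = 0.1                              # learning rate (assumed)

def forward(u, b):
    x = u + b
    return x - x.mean()               # normalization: subtract the mean

losses, biases = [], []
for step in range(50):
    x_hat = forward(u, b)
    loss = 0.5 * np.mean((x_hat - target) ** 2)
    # Gradient computed as if the mean were a constant, i.e. the update
    # ignores that the mean itself depends on b.
    grad_b = np.mean(x_hat - target)
    b -= lr * grad_b
    losses.append(loss)
    biases.append(b)

print("loss at first and last step:", losses[0], losses[-1])
print("bias at first and last step:", biases[0], biases[-1])
```

The bias keeps moving in the same direction at every step, yet the output of the layer, and therefore the loss, never changes, because subtracting the mean removes whatever was added to b.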
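Full whitening of each layer's inputs is costly, however, so instead we normalize each scalar feature independently, giving it zero mean and unit variance. That is, for a layer with a d-dimensional input

$$x = (x^{(1)}, \dots, x^{(d)})$$

we normalize each dimension as:

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

where the expectation and variance are computed over the training data (in practice, over each mini-batch).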
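The per-dimension normalization described above can be sketched in a few lines of NumPy. The function name `normalize_per_dimension` is my own, the mini-batch statistics stand in for the expectation and variance, and the small constant `eps` is added to the variance for numerical stability (the paper's algorithm does the same, but the default value here is an assumption).

```python
import numpy as np

def normalize_per_dimension(x, eps=1e-5):
    """Normalize each column (dimension) of x to zero mean and unit variance.

    x has shape (batch_size, d); statistics are taken over the batch axis.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)

# Hypothetical mini-batch: 256 examples, 3 features on very different scales.
x = rng.normal(loc=[5.0, -2.0, 0.5], scale=[3.0, 0.1, 10.0], size=(256, 3))

x_hat = normalize_per_dimension(x)
print("per-dimension mean:    ", x_hat.mean(axis=0))
print("per-dimension variance:", x_hat.var(axis=0))
```

Each dimension is treated on its own, so no covariance matrix is needed. The features are not decorrelated, which is the price paid for the cheaper computation.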
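But normalizing a layer's inputs may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To keep the representation intact, we need to make sure that the transformation inserted in the network can represent the identity transform. That is why, for each dimension k, a pair of learnable parameters $\gamma^{(k)}$ and $\beta^{(k)}$ comes into play. They scale and shift the normalized value $\hat{x}^{(k)}$ and restore the representation power of the network:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$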
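Here is a minimal sketch of the forward pass with the scale and shift included. `batch_norm_forward` is an illustrative helper of my own, not the API of any particular library, and `eps` is a small stabilizing constant whose value I picked. The second half shows the identity-transform argument: choosing the scale as the standard deviation and the shift as the mean undoes the normalization.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each dimension over the mini-batch, then scale and shift.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=[2.0, -1.0], scale=[3.0, 0.5], size=(128, 2))

# With gamma = 1 and beta = 0 the output is just the normalized input.
y_default = batch_norm_forward(x, gamma=np.ones(2), beta=np.zeros(2))

# With gamma = sqrt(var + eps) and beta = mean the layer recovers its input,
# i.e. the transformation can represent the identity.
eps = 1e-5
gamma_id = np.sqrt(x.var(axis=0) + eps)
beta_id = x.mean(axis=0)
y_identity = batch_norm_forward(x, gamma_id, beta_id, eps=eps)

recovers_input = np.allclose(y_identity, x)
print("identity recovered:", recovers_input)
```

In a real network the two parameters are learned along with the weights, so the model itself decides how much of the normalization to keep.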
That is how normalization, together with these two parameters per dimension, suppresses the ICS and accelerates training without costing the network its representation power.
To summarize:
- ICS slows down training because it requires lower learning rates and careful parameter initialization.
- BN is often used in neural architectures to reduce ICS, which ultimately accelerates training.
- To make GD aware of the normalization, it is made part of the network itself, and to keep the representation power of the network intact, two learnable parameters are introduced per dimension.
Please feel free to correct me if I have interpreted something wrong, and share this with others.