A summary of the paper Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models.
Before discussing batch renormalization, I assume you are already familiar with Batch Normalization (BN) and how it helps training converge faster. If not, please read batch normalization - a technique that enhances training first.
Let’s briefly summarize BN:
- It helps to reduce the internal covariate shift (ICS) so the distribution of inputs to the activations remains more stable.
- BN lets us be less careful about the scale of the parameters and their initialization.
- It allows us to use higher learning rates, which speeds up training.
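To make the summary above concrete, here is a minimal sketch of the BN transform for a single feature dimension over one minibatch. It uses plain Python rather than a deep learning framework, and the function name `batch_norm` and the default values for `gamma`, `beta`, and `eps` are illustrative assumptions, not taken from the paper.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a minibatch, then scale and shift it."""
    m = len(xs)
    mu_b = sum(xs) / m                              # minibatch mean
    var_b = sum((x - mu_b) ** 2 for x in xs) / m    # biased minibatch variance
    sigma_b = math.sqrt(var_b + eps)                # eps guards against division by zero
    return [gamma * (x - mu_b) / sigma_b + beta for x in xs]

# The same pattern of values at two very different scales (hypothetical data).
small = batch_norm([1.0, 2.0, 3.0, 4.0])
large = batch_norm([100.0, 200.0, 300.0, 400.0])
```

Both `small` and `large` come out almost identical, which is one way to see why BN makes the network less sensitive to the scale of its inputs and parameters.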
As the list above shows, BN is useful and effective at accelerating convergence. But the way it normalizes activations also has drawbacks. In this article we will try to understand those drawbacks and how Batch Renormalization helps to address them.
As Sergey Ioffe concludes in Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models, “it offers the promise of improving the performance of any model that would use batchnorm.” Let’s try to understand why that is so.
As we know, the BN transform cannot process each training example independently: the normalized activation depends on both that example and the other examples in the minibatch. This dependence is what makes BN powerful, but it is also the source of its drawbacks.
As we reduce the size of the minibatches, the per-dimension mean and variance that we use to normalize the inputs to a layer become less accurate estimates. These inaccuracies are compounded with depth, which hurts the quality of the model. The paper also points out that non-iid minibatches can have a detrimental effect on models with batchnorm.
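The effect of minibatch size on the statistics can be illustrated with a small simulation. The sketch below is not from the paper: it draws synthetic standard-normal activations (an assumption made purely for illustration) and measures how much the minibatch mean fluctuates from batch to batch. The helper name `spread_of_batch_means` is hypothetical.

```python
import random
import statistics

def spread_of_batch_means(batch_size, trials=1000, seed=0):
    """Standard deviation of the minibatch mean across many random minibatches."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)

noisy = spread_of_batch_means(batch_size=4)
stable = spread_of_batch_means(batch_size=64)
```

The true mean is 0 in both cases, yet the estimate from 4 examples scatters far more widely than the estimate from 64. Every BN layer normalizes with such a noisy estimate, which is how the error builds up with depth.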
By now it should be clear where and why BN might fail. To solve this issue, Sergey Ioffe, one of the authors of the BN paper, introduced Batch Renormalization, which removes these drawbacks while retaining the benefits of BN, such as insensitivity to parameter initialization and training efficiency.
How is batch renormalization different from BN?
In BN, moving averages of the mean and variance are computed over the last several minibatches during training but are used only for inference. Batch Renorm, in contrast, also uses these moving averages during training, as a correction to the minibatch statistics.
Batch Renormalization is an augmentation of a network that contains batch normalization layers: a per-dimension affine transformation is applied to the normalized activations.
Suppose we have a minibatch and want to normalize a particular node x using either the minibatch statistics (μ_B, σ_B) or their moving averages (μ, σ). The results of the two normalizations are related by an affine transform: (x − μ)/σ = ((x − μ_B)/σ_B) · r + d, where r = σ_B/σ and d = (μ_B − μ)/σ.
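The affine relation between the two normalizations can be checked numerically. In the sketch below, `mu_b` and `sigma_b` stand for the minibatch mean and standard deviation, `mu` and `sigma` for their moving averages, and the correction terms are r = sigma_b / sigma and d = (mu_b - mu) / sigma as defined in the paper. The helper name `renorm_coefficients` and all numeric values are made up for illustration.

```python
import math

def renorm_coefficients(mu_b, sigma_b, mu, sigma):
    """Return (r, d) relating minibatch normalization to moving-average normalization."""
    r = sigma_b / sigma
    d = (mu_b - mu) / sigma
    return r, d

batch = [0.5, 1.5, 2.0, 4.0]  # hypothetical activations of one node
mu_b = sum(batch) / len(batch)
sigma_b = math.sqrt(sum((x - mu_b) ** 2 for x in batch) / len(batch))
mu, sigma = 1.0, 2.0          # hypothetical moving averages

r, d = renorm_coefficients(mu_b, sigma_b, mu, sigma)

with_moving_avg = [(x - mu) / sigma for x in batch]
with_batch_stats = [(x - mu_b) / sigma_b * r + d for x in batch]

# The two lists agree up to floating-point rounding.
assert all(math.isclose(a, b) for a, b in zip(with_moving_avg, with_batch_stats))
```

When the minibatch statistics match the moving averages exactly, r = 1 and d = 0, and the transform reduces to ordinary batch normalization.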
In practice, r and d are treated as constants when computing gradients, so no gradient is propagated through them. During the training phase, we start with batchnorm alone for a certain number of iterations by keeping r = 1 and d = 0, and then gradually allow these parameters to vary within certain bounds. The paper lays out the full batch renormalization algorithm.
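The training and inference steps described above can be sketched for a single feature dimension as follows. This is a plain-Python illustration under stated assumptions, not the author's implementation: the function names, the default values of `alpha` and `eps`, and the test data are hypothetical. The clipping bounds `r_max` and `d_max` and the moving-average update follow the description in the paper. Since plain Python has no automatic differentiation, the rule that r and d receive no gradient appears only as a comment.

```python
import math

def clip(value, low, high):
    return max(low, min(high, value))

def batch_renorm_train_step(xs, mu, sigma, gamma=1.0, beta=0.0,
                            r_max=1.0, d_max=0.0, alpha=0.01, eps=1e-5):
    """One training step; returns outputs and updated moving averages."""
    m = len(xs)
    mu_b = sum(xs) / m
    sigma_b = math.sqrt(sum((x - mu_b) ** 2 for x in xs) / m + eps)
    # In a real framework, r and d would be wrapped in a stop-gradient.
    r = clip(sigma_b / sigma, 1.0 / r_max, r_max)
    d = clip((mu_b - mu) / sigma, -d_max, d_max)
    ys = [gamma * ((x - mu_b) / sigma_b * r + d) + beta for x in xs]
    # Update the moving averages with the minibatch statistics.
    new_mu = mu + alpha * (mu_b - mu)
    new_sigma = sigma + alpha * (sigma_b - sigma)
    return ys, new_mu, new_sigma

def batch_renorm_inference(x, mu, sigma, gamma=1.0, beta=0.0):
    """At inference, only the moving averages are used."""
    return gamma * (x - mu) / sigma + beta

xs = [1.0, 2.0, 3.0, 4.0]
# r_max = 1 and d_max = 0 force r = 1, d = 0: plain batchnorm.
bn_like, _, _ = batch_renorm_train_step(xs, mu=0.0, sigma=1.0)
# Loose bounds leave r and d unclipped, so training matches inference.
renormed, new_mu, new_sigma = batch_renorm_train_step(
    xs, mu=0.0, sigma=1.0, r_max=100.0, d_max=100.0)
```

With the default bounds the step behaves exactly like batchnorm. As `r_max` and `d_max` are relaxed, the training-time outputs move toward what the inference function would produce for the same inputs.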
The paper reports that as the minibatch size is reduced, networks trained with Batch Renormalization achieve significantly better accuracy than networks trained with BN. It also states that when the examples in a minibatch are not sampled independently and identically (iid), BN can perform poorly.
- Batch Renormalization reduces the dependence of each example's activations on the other examples in the minibatch while retaining the benefits of BN.
- It works significantly better than BN when training with small minibatches.
- It gives significantly better results than BN on non-iid minibatches.
I hope this helps you understand when BN might fail and when Batch Renormalization will help.
Please feel free to point out any error or misinterpretation of a concept; it will help me a lot. Also, if you think this article will help your friends understand the concept, please share it with them.