[ Archive Post ] Batch Normalization and Internal Covariate Shift

Jae Duk Seo
3 min read · Nov 17, 2018

Please note that this post is for my own educational process.


Main Theory Behind Why Batch Norm Works (Motivation / Abstract)

After each layer the distribution of the data can change, and this change makes it harder for the network to extract useful features, since when we train in batches, every batch has a different distribution.

In other words, BN fixes the first and second moments of the data (the mean and the variance) to zero and one respectively, and this is the most widely accepted explanation for why batch norm works so well (this is the reduction of internal covariate shift).
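To make the mechanics concrete, here is a minimal NumPy sketch of the batch norm transform for a fully connected layer; the learnable scale (gamma) and shift (beta) follow the original formulation, while the shapes and values below are just toy assumptions of mine.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch to zero mean and unit variance,
    then apply a learnable scale (gamma) and shift (beta).
    x has shape (batch_size, num_features)."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # first and second moments fixed to 0 and 1
    return gamma * x_hat + beta

# toy batch of 4 samples with 3 features, deliberately off-center
x = np.random.randn(4, 3) * 10 + 5
out = batch_norm(x)
print(out.mean(axis=0), out.var(axis=0))     # approximately zeros and ones
```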

Questions About Batch Normalization

And from the figure in the paper, we notice that when we use batch norm we are able to train with a higher learning rate. However, the distribution of the data at each layer does not seem to differ that much when we compare networks trained with and without batch norm.

When we inject noise after the batch norm layer to create a severe covariate shift, the network is still shown to perform much better than the original network. So the stability of the input distribution might not be the best explanation for why batch norm works.
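The experiment roughly amounts to injecting random noise right after each batch norm layer, so that layer's output distribution shifts on every pass. Below is a minimal PyTorch-style sketch of the idea; the noise distribution, its scale, and the tiny architecture are my own placeholder assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class NoisyBatchNorm(nn.Module):
    """BatchNorm1d followed by additive random noise, deliberately
    re-introducing a covariate shift after the normalization step."""
    def __init__(self, num_features, noise_std=0.5):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features)
        self.noise_std = noise_std

    def forward(self, x):
        x = self.bn(x)
        if self.training:  # only perturb during training
            x = x + torch.randn_like(x) * self.noise_std
        return x

# drop the noisy layer in wherever plain BatchNorm would normally go
model = nn.Sequential(
    nn.Linear(784, 256), NoisyBatchNorm(256), nn.ReLU(),
    nn.Linear(256, 10),
)
```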

The smoothing effect of BatchNorm

So, long story short, it turns out that changing the distribution of the data at each layer is not the direct reason why batch norm works so well; rather, the reason is that it smooths the error surface of the network being optimized.

As the paper puts it: ‘loss changes at a smaller rate and the magnitudes of the gradients are smaller too’.

So with batch norm the error surface has fewer local minima as well as fewer flat regions, and in conclusion the training itself becomes much more stable.
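One way to probe this empirically, roughly in the spirit of the paper's loss-landscape measurements (the model, data shapes, and step sizes below are my own simplified assumptions), is to step along the current gradient direction and watch how quickly the loss changes.

```python
import torch
import torch.nn as nn

def loss_along_gradient(model, loss_fn, x, y, step_sizes):
    """Evaluate the training loss after moving the parameters by -step * gradient
    for several step sizes; on a smoother landscape the loss changes more
    gradually as the step grows."""
    params = [p for p in model.parameters() if p.requires_grad]
    base_loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(base_loss, params)
    losses = []
    for step in step_sizes:
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= step * g                          # take the step
            losses.append(loss_fn(model(x), y).item())
            for p, g in zip(params, grads):
                p += step * g                          # undo the step
    return losses

# toy comparison setup: swap the BatchNorm1d layer out to compare landscapes
model = nn.Sequential(nn.Linear(20, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 2))
x, y = torch.randn(64, 20), torch.randint(0, 2, (64,))
print(loss_along_gradient(model, nn.CrossEntropyLoss(), x, y, [0.01, 0.05, 0.1]))
```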

The image from the paper tells it all: when we use an lp norm instead of batch norm, we can expect performance similar to (and sometimes better than) batch normalization.

And using those kinds of normalization does not make the distribution any more Gaussian-like.
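As a rough illustration of such an alternative, here is an l1-style normalization that centers each feature and scales by the mean absolute deviation instead of the standard deviation; the exact lp schemes tested in the paper may differ, so treat this as an assumption-laden sketch rather than their method.

```python
import numpy as np

def l1_normalize(x, eps=1e-5):
    """Center each feature and scale by its mean absolute deviation
    (an l1 statistic) rather than by the standard deviation."""
    mean = x.mean(axis=0)
    mad = np.abs(x - mean).mean(axis=0)   # mean absolute deviation per feature
    return (x - mean) / (mad + eps)

x = np.random.exponential(size=(8, 3))    # deliberately non-Gaussian activations
print(l1_normalize(x).std(axis=0))        # rescaled, but not forced to unit variance
```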

What is a Lipschitz condition?

Video on the Lipschitz condition (see reference 1 below).
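For context, the standard definition is simple: a function f satisfies a Lipschitz condition with constant L if |f(x) − f(y)| ≤ L·|x − y| for all x and y, so its rate of change is bounded by L. The smoothing argument above is usually phrased in these terms: with batch norm, the loss and its gradients behave as if they had smaller effective Lipschitz constants.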

Reference

  1. What is a Lipschitz condition? (2018). YouTube. Retrieved 17 November 2018, from https://www.youtube.com/watch?v=Cnc83B3C2pY
  2. (2018). Arxiv.org. Retrieved 17 November 2018, from https://arxiv.org/pdf/1805.11604.pdf
  3. (2018). Proceedings.mlr.press. Retrieved 17 November 2018, from http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
  4. (2018). Arxiv.org. Retrieved 17 November 2018, from https://arxiv.org/pdf/1805.10694.pdf
