Deeper Understanding of Batch Normalization with Interactive Code in Tensorflow [ Manual Back Propagation ]

Jae Duk Seo
May 24, 2018 · 10 min read
GIF from this website

Batch Normalization is a technique that normalizes (standardizes) the internal representation of the data for faster training. However, I wanted to know more about this method and the theory behind it, and I had a couple of questions I wanted to answer for myself, such as….

Q1) Does Batch Normalization act as a regularizer?
Q2) What are the benefits of Batch Normalization?
Q3) What are the drawbacks of Batch Normalization?
Q4) What is covariate shift / internal covariate shift?
Q5) What are exponentially weighted averages?

Additionally, I wanted to implement a batch normalization (BN) layer to see how the results differ between a model that does not have BN, a model using tf.layers.batch_normalization, and a model with a manually back-propagated BN layer trained using AMS Grad. Below is the list of all of the cases that we are going to implement. (Please note the base model is from The All Convolutional Net.)

Case a) No Batch Norm with Auto Differentiation Adam
Case b) Batch Norm with Auto Differentiation Adam
Case c) Batch Norm with Manual Back Prop AMS Grad

Below I have attached the original paper, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”; I will also cite every other source that I used while writing this post.

Please note that this post is for improving my own understanding of batch normalization and why it is used in deep learning.



Standardization / Normalization

Image from this website

Before moving on, I will assume that you already have a concrete understanding of the difference between standardization and normalization. If you are not sure, please read my blog post about this matter here.


Batch Normalization as Regularization

Image from this website

Dropout is already a well-known technique that regularizes a network. However, I had never thought of it from the point of view of adding noise to the network. As seen above, if we think of dropout as multiplying the activations by a noise vector (containing the numerical values 0 and 1), we can think of it as injecting noise. It is also a known result that adding noise to the gradient can improve the accuracy of a model (as presented in the paper Adding Gradient Noise Improves Learning for Very Deep Networks). I have made a blog post about this; please click here to view the implementation as well as the blog post.
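To make the noise-vector view concrete, here is a tiny NumPy sketch (my own toy example, not code from any of the models below) of dropout as multiplication by a random binary mask:

```python
import numpy as np

# Dropout viewed as multiplying activations by a random binary "noise" vector.
np.random.seed(0)
activations = np.random.randn(4, 5)        # a toy batch of hidden activations
keep_prob = 0.8

# Bernoulli mask of 0s and 1s -- the "noise vector" mentioned above.
mask = (np.random.rand(*activations.shape) < keep_prob).astype(activations.dtype)

# Inverted dropout: scale so the expected activation stays the same at test time.
dropped = activations * mask / keep_prob
print(dropped)
```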

Image from this website

However, we need to take note of one thing: the regularization effect of batch normalization is a side effect rather than the main objective, meaning we shouldn’t rely on it as our main method of regularization. The reason why it acts as a regularizer can be seen below.

Image from this website

Theoretical Benefits of Batch Normalization

Image from this website

This blog post does an amazing job describing the benefits of batch normalization. It seems that, in theory, there are multiple benefits to using batch normalization. Another good post on why batch normalization works can be seen below.

Image from this website

As a one-sentence summary: it limits internal covariate shift by normalizing (or rather standardizing, to a mean of 0 and variance of 1) the data over and over again at every layer.
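To make that one-sentence summary concrete, here is a minimal NumPy sketch of the batch norm forward pass from the original paper (my own toy example, not code from the models below): standardize each feature over the mini-batch, then scale and shift with the learnable parameters gamma and beta.

```python
import numpy as np

# Batch norm forward pass: standardize over the mini-batch, then scale/shift.
def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardized activations
    out = gamma * x_hat + beta             # restore representational power
    return out, (x_hat, mu, var, gamma, eps)

x = np.random.randn(32, 10)                # mini-batch of 32 examples, 10 features
out, cache = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3), out.var(axis=0).round(3))  # ≈ 0 and ≈ 1
```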


Drawbacks of Batch Normalization

Image from Agustinus Kristiadi’s Blog

Agustinus did an amazing job explaining what batch normalization is, and he provided some additional experiments as well. In the end, the network with batch normalization gave higher accuracy; however, it took more time to train. This is expected, since with batch normalization we have two more parameters to optimize per layer (gamma and beta).

Image from this website

Additionally, this post explains the precautions we have to take when using batch normalization.

Because of the exponential moving average, if the mini-batches do not properly represent the entire data distribution (both training and testing data, since we are going to use the saved exponential moving averages at test time), the model’s performance could be heavily degraded.

However, with all due respect, I don’t think that would usually be a problem. If the model was trained on MNIST data, it would only make sense to test it on test images from the same dataset; otherwise, the shift in data distribution would hinder the performance of the model regardless.

Still, the above post does an amazing job of pointing out the precautions to take when using batch normalization.


What is Covariate Shift / Internal Covariate Shift

Image from this website

This blog post does an amazing job explaining both covariate shift and internal covariate shift. I understand it simply as a change in the distribution of the data: if my parameters were trained on distribution A and we then feed the model data from a different distribution, let’s say B, the trained model will not perform very well.

Image from this website

I understand internal covariate shift as the change in the distribution of the data (activations) within the inner layers of the network. (Typically we have networks with more than one layer.) However, if anyone wants to read a fully detailed description of the term, please click here.


What are Exponentially Weighted Averages

Image from this website

Yellow Line → Description of which values of μ and σ are used on the test set.

One tricky aspect of batch normalization is deciding which mean and variance to standardize with. Naturally, we want our model’s prediction to depend only on the given test data (during the testing phase).

To make sure that happens, we can keep an exponentially weighted average of the mean and variance values computed during the training phase, and use those averaged values to perform the standardization at test time. (I know that my explanation was horrible; luckily, Dr. Andrew Ng did an amazing job explaining this, and I added two more videos explaining this matter in detail.)

Video from this website
Video from this website
Video from this website

For more information about bias correction, and whether or not we need it, please click on this link.
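As a quick illustration, here is a minimal sketch (my own toy example, not the post’s training code) of an exponentially weighted average with bias correction, the same kind of running estimate batch norm keeps for the mean and variance it uses at test time.

```python
import numpy as np

# Exponentially weighted average with bias correction.
def ema_with_bias_correction(values, beta=0.9):
    v = 0.0
    corrected = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * x        # exponentially weighted average
        corrected.append(v / (1.0 - beta**t))  # bias correction for early steps
    return corrected

batch_means = [0.52, 0.48, 0.51, 0.47, 0.50]   # per-batch means seen during training
print(ema_with_bias_correction(batch_means))   # reasonable estimate even at t = 1
```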


Implementation in Tensorflow

Red Box → Code to distinguish the training phase from the testing phase

Very smart researchers have already done an amazing job explaining how to implement a batch normalization layer, so thanks to their contributions it was quite easy to implement in Tensorflow. One tricky part was distinguishing the training phase from the testing phase, but with a little help from tf.cond() that can be handled easily, as in the sketch below.
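Here is a simplified sketch of that idea (my own version, not the exact layer used in the experiments): a TF 1.x batch norm layer that uses tf.cond() to pick the batch statistics during training and the stored exponential moving averages during testing.

```python
import tensorflow as tf

def batch_norm_layer(x, is_training, decay=0.9, eps=1e-5):
    # Assumes 4-D NHWC input; gamma/beta are learned, the moving stats are not.
    depth = x.get_shape().as_list()[-1]
    gamma = tf.get_variable('gamma', [depth], initializer=tf.ones_initializer())
    beta = tf.get_variable('beta', [depth], initializer=tf.zeros_initializer())
    moving_mean = tf.get_variable('moving_mean', [depth],
                                  initializer=tf.zeros_initializer(), trainable=False)
    moving_var = tf.get_variable('moving_var', [depth],
                                 initializer=tf.ones_initializer(), trainable=False)

    def train_phase():
        mean, var = tf.nn.moments(x, axes=[0, 1, 2])   # per-channel batch stats
        update_mean = tf.assign(moving_mean, decay * moving_mean + (1 - decay) * mean)
        update_var = tf.assign(moving_var, decay * moving_var + (1 - decay) * var)
        with tf.control_dependencies([update_mean, update_var]):
            return tf.nn.batch_normalization(x, mean, var, beta, gamma, eps)

    def test_phase():
        # At test time, use the saved exponential moving averages instead.
        return tf.nn.batch_normalization(x, moving_mean, moving_var, beta, gamma, eps)

    return tf.cond(is_training, train_phase, test_phase)
```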

Please check this blog, this blog, or this blog for a REALLY amazing explanation of the implementation.
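And since Case c back-propagates through the batch norm layer by hand, here is a compact NumPy sketch of the backward pass (my own illustration of the simplified gradient derived in the blogs above; it pairs with the forward sketch shown earlier):

```python
import numpy as np

# Manual backward pass for batch norm. `cache` is the tuple returned by
# batchnorm_forward() in the earlier sketch.
def batchnorm_backward(dout, cache):
    x_hat, mu, var, gamma, eps = cache
    N = dout.shape[0]

    dbeta = dout.sum(axis=0)                # gradient w.r.t. the shift beta
    dgamma = (dout * x_hat).sum(axis=0)     # gradient w.r.t. the scale gamma

    dx_hat = dout * gamma
    inv_std = 1.0 / np.sqrt(var + eps)
    # Simplified closed-form gradient w.r.t. the input x
    dx = (inv_std / N) * (N * dx_hat
                          - dx_hat.sum(axis=0)
                          - x_hat * (dx_hat * x_hat).sum(axis=0))
    return dx, dgamma, dbeta
```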


Result: Case a) No Batch Norm with Auto Differentiation Adam

Left Image → Train Accuracy Over Time / Cost Over Time
Right Image → Test Accuracy Over Time / Cost Over Time

Since the base model (The All Convolutional Net) already performs so well on the CIFAR-10 dataset, it wasn’t surprising to see the model achieve 88 percent accuracy by just the 21st epoch. However, we can observe that the model is suffering from over-fitting.


Result: Case b) Batch Norm with Auto Differentiation Adam

Left Image → Train Accuracy Over Time / Cost Over Time
Right Image → Test Accuracy Over Time / Cost Over Time

With tf.layers.batch_normalization, the model was able to achieve higher accuracy on the test data while achieving lower accuracy on the training data. So, in conclusion, batch normalization really does help with generalization as well as faster convergence.
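For reference, this is roughly how tf.layers.batch_normalization is wired up in TF 1.x (a minimal sketch of my own, not Case b’s exact network): pass a training flag, and make the train op depend on the moving-average update ops, otherwise the statistics used at test time are never updated.

```python
import tensorflow as tf

is_training = tf.placeholder(tf.bool, name='is_training')
x = tf.placeholder(tf.float32, [None, 32, 32, 3])
labels = tf.placeholder(tf.int64, [None])

# A tiny toy network: conv -> batch norm -> relu -> dense.
net = tf.layers.conv2d(x, filters=96, kernel_size=3, padding='same')
net = tf.layers.batch_normalization(net, training=is_training)
net = tf.nn.relu(net)
logits = tf.layers.dense(tf.layers.flatten(net), 10)

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Without this dependency, the moving mean/variance are never updated.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
```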


Result: Case c) Batch Norm with Manual Back Prop AMS Grad

Left Image → Train Accuracy Over Time / Cost Over Time
Right Image → Test Accuracy Over Time / Cost Over Time

With AMS Grad paired with batch normalization, the model wasn’t able to achieve an accuracy of 88 percent within the same number of epochs. However, the model does a (pretty) good job at generalization.


Interactive Code

For Google Colab, you need a Google account to view the code. Also, you can’t run read-only scripts in Google Colab, so make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy coding! Also, for transparency, I uploaded all of the logs generated during training.

To access the code for Case a please click here, to access the logs click here.
To access the code for Case b please click here, to access the logs click here.
To access the code for Case c please click here, to access the logs click here.


Final Words

I have wanted to write this blog post for so long, since I was just so curious about batch normalization. I am happy that I finally did.

If any errors are found, please email me at jae.duk.seo@gmail.com; if you wish to see the list of all of my writing, please view my website here.

Meanwhile, follow me on Twitter here, and visit my website or my YouTube channel for more content. I also implemented Wide Residual Networks; please click here to view the blog post.


Reference

  1. Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Arxiv.org. Retrieved 23 May 2018, from https://arxiv.org/pdf/1502.03167.pdf
  2. Data science. (2018). Pinterest. Retrieved 23 May 2018, from https://www.pinterest.ca/pin/463870830351205673/
  3. Understanding Batch Normalization with Examples in Numpy and Tensorflow with Interactive Code. (2018). Towards Data Science. Retrieved 23 May 2018, from https://towardsdatascience.com/understanding-batch-normalization-with-examples-in-numpy-and-tensorflow-with-interactive-code-7f59bb126642
  4. What is the difference between dropout and batch normalization? (2018). Quora. Retrieved 23 May 2018, from https://www.quora.com/What-is-the-difference-between-dropout-and-batch-normalization
  5. Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding Deep Learning Requires Rethinking Generalization. Arxiv.org. Retrieved 23 May 2018, from https://arxiv.org/pdf/1611.03530.pdf
  6. Neelakantan, A., et al. (2015). Adding Gradient Noise Improves Learning for Very Deep Networks. Arxiv.org. Retrieved 23 May 2018, from https://arxiv.org/pdf/1511.06807.pdf
  7. Only Numpy: Implementing “ADDING GRADIENT NOISE IMPROVES LEARNING FOR VERY DEEP NETWORKS” from…. (2018). Becoming Human: Artificial Intelligence Magazine. Retrieved 23 May 2018, from https://becominghuman.ai/only-numpy-implementing-adding-gradient-noise-improves-learning-for-very-deep-networks-with-adf23067f9f1
  8. Is there a theory for why batch normalization has a regularizing effect? (2018). Quora. Retrieved 23 May 2018, from https://www.quora.com/Is-there-a-theory-for-why-batch-normalization-has-a-regularizing-effect
  9. Is adding random noise to hidden layers considered a regularization? What is the difference between doing that and adding dropout and batch normalization? (2018). Quora. Retrieved 23 May 2018, from https://www.quora.com/Is-adding-random-noise-to-hidden-layers-considered-a-regularization-What-is-the-difference-between-doing-that-and-adding-dropout-and-batch-normalization
  10. Implementing BatchNorm in Neural Net — Agustinus Kristiadi’s Blog. (2018). Wiseodd.github.io. Retrieved 23 May 2018, from https://wiseodd.github.io/techblog/2016/07/04/batchnorm/
  11. Glossary of Deep Learning: Batch Normalisation — Deeper Learning — Medium. (2017). Medium. Retrieved 23 May 2018, from https://medium.com/deeper-learning/glossary-of-deep-learning-batch-normalisation-8266dcd2fa82
  12. Why does batch normalization help? (2018). Quora. Retrieved 23 May 2018, from https://www.quora.com/Why-does-batch-normalization-help
  13. On The Perils of Batch Norm. (2018). Alexirpan.com. Retrieved 23 May 2018, from https://www.alexirpan.com/2017/04/26/perils-batch-norm.html
  14. Learning, D., & Deng, Y. (2017). Understanding Batch Norm. MutouMan. Retrieved 23 May 2018, from http://dengyujun.com/2017/09/30/understanding-batch-norm/
  15. Batch Normalization — What the hey? — Gab41. (2016). Gab41. Retrieved 23 May 2018, from https://gab41.lab41.org/batch-normalization-what-the-hey-d480039a9e3b
  16. Exponentially Weighted Averages (C2W2L03). (2018). YouTube. Retrieved 23 May 2018, from https://www.youtube.com/watch?v=lAq96T8FkTw
  17. tf.constant | TensorFlow. (2018). TensorFlow. Retrieved 23 May 2018, from https://www.tensorflow.org/api_docs/python/tf/constant
  18. tensorflow/tensorflow. (2018). GitHub. Retrieved 23 May 2018, from https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/python/ops/nn_impl.py
  19. What is the equivalent of np.std() in TensorFlow? (2018). Stack Overflow. Retrieved 23 May 2018, from https://stackoverflow.com/questions/39354566/what-is-the-equivalent-of-np-std-in-tensorflow/39354802
  20. tf.layers.batch_normalization | TensorFlow. (2018). TensorFlow. Retrieved 24 May 2018, from https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
  21. Bias Correction of Exponentially Weighted Averages (C2W2L05). (2018). YouTube. Retrieved 24 May 2018, from https://www.youtube.com/watch?v=lWzo8CajF5s
  22. Why is it important to include a bias correction term for the Adam optimizer for Deep Learning? (2018). Cross Validated. Retrieved 24 May 2018, from https://stats.stackexchange.com/questions/232741/why-is-it-important-to-include-a-bias-correction-term-for-the-adam-optimizer-for
  23. tf.cond | TensorFlow. (2018). TensorFlow. Retrieved 24 May 2018, from https://www.tensorflow.org/api_docs/python/tf/cond
  24. Kratzert, F. (2018). Understanding the backward pass through Batch Normalization Layer. Kratzert.github.io. Retrieved 24 May 2018, from https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
  25. Thorey, C. (2016). What does the gradient flowing through batch normalization looks like? Cthorey.github.io. Retrieved 24 May 2018, from http://cthorey.github.io./backpropagation/
  26. Implementing BatchNorm in Neural Net — Agustinus Kristiadi’s Blog. (2018). Wiseodd.github.io. Retrieved 24 May 2018, from https://wiseodd.github.io/techblog/2016/07/04/batchnorm/
  27. How and why does Batch Normalization use moving averages to track the accuracy of the model as it trains? (2018). Cross Validated. Retrieved 24 May 2018, from https://stats.stackexchange.com/questions/219808/how-and-why-does-batch-normalization-use-moving-averages-to-track-the-accuracy-o
  28. tf.layers.batch_normalization | TensorFlow. (2018). TensorFlow. Retrieved 24 May 2018, from https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
  29. Implementation of Optimization for Deep Learning Highlights in 2017 (feat. Sebastian Ruder). (2018). Medium. Retrieved 24 May 2018, from https://medium.com/@SeoJaeDuk/implementation-of-optimization-for-deep-learning-highlights-in-2017-feat-sebastian-ruder-61e2cbe9b7cb
  30. [ ICLR 2015 ] Striving for Simplicity: The All Convolutional Net with Interactive Code [ Manual…. (2018). Towards Data Science. Retrieved 24 May 2018, from https://towardsdatascience.com/iclr-2015-striving-for-simplicity-the-all-convolutional-net-with-interactive-code-manual-b4976e206760
