A High-Level Overview of Batch Normalization

Jason Jewik
Apr 23, 2020 · 6 min read

Based on this paper: https://arxiv.org/pdf/1502.03167.pdf

Written by Maya Raman and Jason Jewik, as part of a series by UCLA ACM AI. Check out the other articles in the series, written by our fellow officers.

📌 Note: if you see an asterisk next to an italicized phrase, that means more info can be found in the appendix at the bottom of the article.

😕 The Problem

Key Ideas

Training deep networks is complicated by internal covariate shift: the distribution of each layer’s inputs keeps changing during training as the parameters of the preceding layers change. This:

  • Requires lower learning rates and careful parameter initialization, resulting in slower training
  • Makes it hard to train models with saturating nonlinearities*
Three matryoshka, or nesting, dolls in a row.
Photo by Blake Weyland on Unsplash

Matryoshka dolls are nested figurines, which get smaller and less ornate as you remove each layer. We can imagine a deep neural network like these dolls. That is to say, let’s think of the second through Nth layers as a smaller, less complex neural network nested inside the outermost one. Then the third through Nth layers would be another neural network within the former! And so on. With this view in mind, we can form an intuitive understanding of the problem.

Picture a deep neural network in training as a set of matryoshka dolls. The data starts at the “outermost network (doll)”, where changes caused by the layer parameters could potentially result in a different distribution of data. This is the key dilemma! Each nested network applies its own changes to the data, which can create a wildly different distribution of inputs by the time the data reaches the “innermost network”. This makes training very inefficient, because each nested network has to keep re-adapting to an input distribution that shifts whenever the parameters of the layers before it change.

💡 The Solution(s)

Key Ideas

Batch Normalization:

  • Reduces dependence of gradients on the scale of the parameters or of their initial values
  • Makes it possible to use saturating nonlinearities
  • Allows for higher learning rates without risk of divergence
  • Reduces the need for dropout

Whitening the Inputs

It has long been known that training converges faster when network inputs are whitened, that is, linearly transformed to have zero means and unit variances* and decorrelated. In principle we could whiten every layer’s inputs, but doing so in full is expensive and not everywhere differentiable, so the authors make two simplifications.

Normalization via Mini-Batch Statistics

  1. Instead of whitening the features in layer inputs and outputs jointly, they normalized each scalar feature independently, taking care to ensure that the transformation inserted in the network can represent the identity transform.
  2. Since mini-batches are used in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation. Thus, the statistics used for normalization can fully participate in the gradient backpropagation. (A code sketch of the resulting transform follows below.)
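Here is that sketch: a minimal NumPy version of the batch normalization transform for a single fully connected layer. The function name batch_norm_forward and the default eps value are our own choices; gamma and beta correspond to the learned scale and shift parameters (γ and β) that the paper introduces so the transform can represent the identity.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over a mini-batch, then scale and shift.

    x:     (batch_size, num_features) activations for one mini-batch
    gamma: (num_features,) learned scale, typically initialized to ones
    beta:  (num_features,) learned shift, typically initialized to zeros
    eps:   small constant for numerical stability
    """
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta             # scale and shift
```

Because gamma and beta are learned along with the rest of the model’s parameters, the layer can undo the normalization if that is what training calls for, which is exactly the identity-transform concern raised in point 1.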

Through some calculus that won’t be covered here, the authors then prove that the batch normalization transform is a differentiable transformation that introduces normalized activations into the network.

What this means is that, as the model trains, each layer’s input distribution experiences less internal covariate shift, allowing the “nested networks” (and thus the entire network) to avoid the problems outlined in the previous section and to learn more efficiently.

Higher Learning Rates

In traditional deep networks, training with a high learning rate raises the risk of:

  1. Exploding/vanishing gradients
  2. Getting stuck in poor local (rather than global) minima

Batch Normalization addresses these issues because normalizing activations means that:

  1. Backpropagation through a layer is unaffected by the scale of its parameters (a short numeric sketch of this follows the list)
  2. Changes in the parameters of one layer will not have an outsized effect on the next layer’s ability to learn its parameters
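Here is the promised sketch of the first point (a toy example of our own, not code from the paper): rescaling a layer’s weight matrix by any constant leaves the batch-normalized output unchanged, so backpropagation through the layer does not depend on that scale.

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    # Normalize each feature over the mini-batch (learned scale/shift omitted for brevity).
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(32, 16))    # a mini-batch of layer inputs
W = rng.normal(size=(16, 8))     # the layer's weight matrix
a = 1000.0                       # an arbitrary rescaling of the weights

print(np.allclose(batch_norm(u @ W), batch_norm(u @ (a * W))))  # True
```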

To further understand this, let’s use an analogy: picture a snowball gaining mass as it rolls down a hill. Without normalization, data just “snowballs” its way through a network, meaning that later layers may have to learn with really large (or really small) input values. With normalization, the output of a layer is mapped to fit within a certain range before being handed off as input to the next layer (in our analogy, maybe it’s an exceptionally sunny day, and the snowball is melting at the same rate it accumulates more snow).
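To make the snowball picture concrete, here is another toy example of our own: push a mini-batch through ten random linear-plus-ReLU layers, once without and once with normalization, and compare the scale of the resulting activations.

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 100))            # a mini-batch of 64 examples, 100 features

h_plain, h_norm = x, x
for _ in range(10):                       # ten layers deep
    W = rng.normal(scale=0.5, size=(100, 100))
    h_plain = np.maximum(h_plain @ W, 0)            # the raw "snowball"
    h_norm = batch_norm(np.maximum(h_norm @ W, 0))  # melted back down at each layer

print(h_plain.std())   # grows to a huge value
print(h_norm.std())    # stays close to 1
```

With normalization in place, each layer hands the next one inputs on a consistent scale, no matter how deep the stack gets.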

Regularizing the Model

With Batch Normalization, a training example is seen in conjunction with the other examples in its mini-batch, so the network no longer produces deterministic values for any given example. The authors found that this acts as a regularizer, which is why they suggest reducing or even removing Dropout.

Note: The authors of the paper note that the exact effects of Batch Normalization on gradient propagation remain an area of further study. For instance, the recommendation to drop Dropout is simply based on their own observations. Many other machine learning algorithms also rest atop empirical evidence, sometimes more so than theory. ¯\_(ツ)_/¯

Accelerating Batch Normalization Networks

To squeeze the most out of a batch-normalized network, the authors also made the following changes to their model and training setup (a rough, hypothetical tf.keras sketch follows the list):

  • Increase the learning rate
  • Remove Dropout
  • Reduce the L2 weight regularization*
  • Accelerate the learning rate decay (i.e., how quickly the learning rate decreases in value)
  • Remove local response normalization*
  • Shuffle training examples more thoroughly
  • Reduce distortions in the input data
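Here is that hypothetical sketch. It only shows where a few of the checklist items might live in code; it is not the paper’s actual Inception setup, and every layer size and hyperparameter value below is made up:

```python
import tensorflow as tf

# A batch-normalized model with no Dropout layers anywhere.
# Following the paper, BatchNormalization sits before the nonlinearity.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256),                # no L2 kernel regularizer here
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# A higher initial learning rate with an accelerated decay schedule.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,   # made-up value, higher than one might otherwise risk
    decay_steps=1000,            # made-up value: decay kicks in sooner than usual
    decay_rate=0.9,
)

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9),
    loss="sparse_categorical_crossentropy",
)
```

Shuffling the training examples more thoroughly and reducing input distortions would happen in the data pipeline (e.g., when calling model.fit) rather than in the model definition itself.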

💭 Appendix

🧽 What does “saturating nonlinearities” mean?

Saturation arithmetic is a version of arithmetic in which all operations such as addition and multiplication are limited to a fixed range between a minimum and maximum value.

Intuitively, something that is saturated — like a fully soaked sponge — cannot have any more added to it. Applying that analogy, a saturating function stops growing/shrinking as its inputs approach positive/negative infinity.

Nonlinearities refers to the activation functions we apply to the outputs of our neural network’s layers. Putting the two ideas of saturation arithmetic and nonlinear activation functions together, let’s look at some examples:

  • The Rectified Linear Unit (ReLU) function is non-saturating because as x → ∞, f(x) → ∞. Using the sponge imagery, it would be like a sponge that can soak up an infinite amount of water — not very realistic.
A graph of the ReLU function.
  • The sigmoid function is saturating because as x → ∞, f(x) → 1 and as x → -∞, f(x) → 0. This would behave like an actual sponge: you can keep adding more water, but once it is completely soaked, it is unable to hold more. (A quick numeric check of both functions follows below.)
A graph of the sigmoid function.
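Here is that quick check (a trivial sketch of our own):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [1.0, 5.0, 10.0, 100.0]:
    print(x, relu(x), sigmoid(x))
# relu(x) keeps growing with x, while sigmoid(x) saturates toward 1.0;
# in that flat region the sigmoid's gradient is nearly zero, which is
# why saturating nonlinearities are hard to train without extra care.
```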

📈 What does “zero means and unit variances” mean?

Normalizing data to have zero means and unit variances simply means shifting each feature so that its average value is 0 and scaling it so that its variance (and standard deviation) is 1. For example, let’s say we have a black-and-white image that is 300 by 300 pixels, and each pixel can take on values in the range [0, 255] (e.g., pixel 1 is 128, pixel 2 is 60, pixel 3 is 207, …).

  1. To get a mean of zero, we shift every pixel by the same negative amount, so that the average value of all the pixels becomes zero. (This means that some values will become negative!)
  2. Then, to get a unit variance, we divide each shifted value by the standard deviation of the pixel values, so that the spread of the data becomes 1. (A short code sketch of both steps follows this list.)
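Here is that sketch, using a randomly generated stand-in image (all names and values below are ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(300, 300)).astype(float)  # fake 300x300 grayscale image

zero_mean = pixels - pixels.mean()       # step 1: shift so the average pixel value is 0
normalized = zero_mean / pixels.std()    # step 2: divide by the standard deviation

print(normalized.mean(), normalized.std())   # approximately 0.0 and 1.0
```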

Empirically, it has been found that gradient descent converges much faster with feature scaling than without it. So, machine learning engineers/scientists often pre-process data before handing it to a neural network.

High-level descriptions of some methods used to do that can be found on the Wikipedia page.

🏋️‍♀️ What is “L2 weight regularization”?

L2 weight regularization (also known as weight decay) adds a penalty to the loss that is proportional to the sum of the squared weights, which discourages the network from learning very large weights. Additional information can be found at this Towards Data Science article.

🔲 What is “Local Response Normalization”?

Local Response Normalization normalizes a neuron’s activation using the activations of its neighbors (it was used, for example, in AlexNet). Additional information can be found at this Towards Data Science article.
