Internal Covariate Shift: How Batch Normalization Can Speed Up Neural Network Training

Jamie Dowat · Published in Analytics Vidhya · Mar 29, 2021
Source: Understanding the Structure of Neural Networks (if you need a refresher on the inner workings of neural networks, check this out!)

When you begin building your first neural network, the process can feel like searching for a needle in a haystack, blindfolded. There are a million hyper-parameters (learning rate, number of layers, number of nodes, batch size, activation function), and finding the optimal combination seems like a stab in the dark, at best.

Today, we’re hopefully going to shed some light on this “black box” with the help of Sergey Ioffe and Christian Szegedy’s paper, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, and take a closer look at one of these ambiguous hyper-parameters in particular: batch size.

A Quick Overview

In 2015, Ioffe & Szegedy published this paper proposing a Neural Network training strategy that, after thorough experimentation, was shown to:

  • Substantially decrease training time
  • Remove the need for Dropout
  • Decrease the amount of regularization needed
  • Allow for a higher learning rate

They termed this strategy Batch Normalization.

For their main experiments, they trained a variant of the Inception network on the ImageNet classification task: a network with a large number of convolutional and pooling layers, topped by a softmax layer to predict the image class.

They found that their batch-normalized network was able to match the accuracy of the control model with 14 times fewer training steps, and that an ensemble of batch-normalized networks reached a 4.82% top-5 test error.

Considering the computational cost of training a neural network anywhere near the scale of ImageNet classification, any strides that help the network learn faster have real, positive implications for the deep learning community.

If you’d like to take a glance at the paper yourself, you can refer to the dictionary at the bottom of this post to keep track of some of the specialized terms they use throughout their analysis.

So, what is Internal Covariate Shift??

Ioffe & Szegedy’s definition is as follows:

“Internal Covariate Shift is the change in the distribution of network activations due to the change in network parameters during training.”

The deeper your network, the more tangled a mess internal covariate shift can cause. Remember that neural networks learn and adjust their weights through a mathematical game of telephone: the more people (or ‘layers’) in the chain, the more garbled the message gets. As builders of neural networks, our job is to stabilize and strengthen the connection between our output layer’s results and each hidden layer’s nodes.

Our authors reason that if we stabilize the input values for each layer (defined as z = Wx + b, where W is the layer’s weight matrix, x its input, and b its bias vector), we can keep those inputs from drifting into the extreme ends of our activation function. To illustrate this concept, they highlight the Sigmoid activation function, shown below.

Looking at this graph, we can see that the larger z gets, the closer the output comes to what they term the “saturated regime” (or “area”) of the function.
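For reference, the sigmoid being plotted is the standard one, applied to the layer’s pre-activation z (nothing here is specific to the paper):

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = Wx + b$$

As z grows large and positive, σ(z) flattens out just below 1; as z grows large and negative, it flattens out just above 0. Those flat tails are the saturated regime.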

For a great article explaining Saturation, click here.

Why do we want to stay out of the “saturated regime”?

Well, let’s take a look at our activation function’s derivative:

For a review about how neural network gradient calculation works, check out 3Blue1Brown’s amazing video here.

As z increases, the derivative quickly drops toward ZERO. Why does this matter? Well, even if our cost function still has a huge value we need to minimize, each weight update is scaled by this derivative (through the chain rule of the gradient), so our “steps” toward convergence are going to get really tiny.
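A quick numerical sketch (plain NumPy, my own illustration rather than anything from the paper) shows how fast that gradient disappears:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # derivative: sigma'(z) = sigma(z) * (1 - sigma(z))

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:4.1f}   sigmoid = {sigmoid(z):.6f}   gradient = {sigmoid_grad(z):.6f}")

# Output (excerpt):
# z =  0.0   sigmoid = 0.500000   gradient = 0.250000
# z = 10.0   sigmoid = 0.999955   gradient = 0.000045   <- saturated: almost no learning signal
```

At z = 10 the gradient is more than 5,000 times smaller than at z = 0, so any weight update flowing through that node via the chain rule is scaled down accordingly.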

This steady drift of our activation functions’ inputs (activation functions are also known as “non-linearities”) into the saturated range can be prevented with, you guessed it, Batch Normalization.

How does Batch Normalization work?

We already know that scaling our data (mean of 0, standard deviation of 1) is essential for any model we’re building. Ioffe & Szegedy want to make sure the data is not only scaled before it enters the network, but also stays scaled as it flows through the layers during training.

But, how is this accomplished, exactly?

Ioffe & Szegedy, page 4

After some trial and error (see pages 2 and 3), they found that each layer’s inputs could be stabilized with a Batch Normalizing Transform. In the figure above, the new transformed value is written as y, which represents the transform of x (a.k.a. the “non-linearity input values”) using two unfamiliar-looking parameters, γ and β.

These parameters serve as a “standardizer”: learned throughout training, they scale and shift the normalized input values (written as x-hat).
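Written out, the Batch Normalizing Transform from the paper computes, for a mini-batch $B = \{x_1, \dots, x_m\}$:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$$

Here ε is just a small constant added for numerical stability.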

In their words…

“Any layer that previously received x as the input, now receives BN(x).”

The inputs (x) were normalized using the mini-batch’s mean and variance.

Important note here: that’s during training. Once the network has been trained with this extra Batch Normalization (BN) step, the entire training set’s mean and variance (the population statistics) are used instead of per-batch statistics when making predictions.
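To make the training-time versus inference-time distinction concrete, here is a minimal NumPy sketch of both modes (my own simplified illustration, not the authors’ code; it ignores back-propagation through the batch statistics):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training mode: normalize each feature with the current mini-batch's
    statistics, then scale and shift with the learned gamma and beta.
    x has shape (batch_size, num_features)."""
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta

def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference mode: normalize with mean/variance estimated over the whole
    training set (most frameworks approximate these with running averages)."""
    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta
```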

Let’s look at some Graphs!

Ioffe & Szegedy, page 7

Inception is the name of their control model (no Batch Normalization). The graph above plots each model’s validation accuracy against the number of training steps taken. It becomes painfully clear here how substantially Batch Normalization reduces training time. Besides Inception, the other lines each show a slightly adjusted version of a batch-normalized network:

  • BN-Baseline: Same learning rate as Inception.
  • BN-x5: Initial learning rate of 0.0075 (5 times Inception’s learning rate).
  • BN-x30: Initial learning rate 0.045 (30 times that of Inception).
  • BN-x5-Sigmoid: Uses Sigmoid activation function (non-linearity) instead of ReLU.

We see that BN-x5 stands as the winner, needing but a tiny fraction (6.7%, to be exact) of Inception’s training steps to match its 72.2% accuracy (it eventually reached 73.0%), while poor non-normalized Inception needed almost 15 times as many steps to get there.

To get the impressive 4.82% (top-5) test error I mentioned at the beginning, they used ensemble classification: 6 networks based on BN-x30, each with some modified hyper-parameters (see page 7).

Another important note: to use Batch Normalization with a Convolutional Neural Network, normalization has to happen in a slightly different way in order to honor the “convolutional property” (page 4: different elements of the same feature map, at different locations, should be normalized in the same way):

“To achieve this, we jointly normalize all the activations in a mini-batch, over all locations.” (page 4)
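In practice, that means computing one mean and one variance per feature map (channel), pooled over both the batch and all spatial positions, with a single γ, β pair per channel. A rough NumPy sketch of the idea (my own illustration, assuming activations shaped (batch, height, width, channels)):

```python
import numpy as np

def conv_batch_norm(x, gamma, beta, eps=1e-5):
    """x: conv activations shaped (batch, height, width, channels).
    Statistics are computed jointly over the batch and all spatial
    locations, so every position in a feature map is normalized
    the same way."""
    mu = x.mean(axis=(0, 1, 2))            # one mean per channel
    var = x.var(axis=(0, 1, 2))            # one variance per channel
    x_hat = (x - mu) / np.sqrt(var + eps)  # identical normalization at every location
    return gamma * x_hat + beta            # one gamma/beta pair per channel
```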

Takeaway time…

In far fewer words: while you’re trying to string together your neural network, you may want to try some Batch Normalization. Check out this great article here that provides further resources on the subject, and this post for a tutorial on how to implement BN with Keras. Happy Networking!
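If you just want to see roughly what that looks like in code, here is a minimal tf.keras sketch (a toy architecture I made up for illustration, not the paper’s network); following the paper, BatchNormalization is placed on the pre-activation, before the non-linearity:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Toy classifier: Dense -> BatchNormalization -> activation, repeated.
# BatchNormalization uses mini-batch statistics during training and
# running (population) statistics at inference time.
model = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(100),
    layers.BatchNormalization(),
    layers.Activation("sigmoid"),
    layers.Dense(100),
    layers.BatchNormalization(),
    layers.Activation("sigmoid"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```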

A Short lil’ Dictionary!

  • Batch Normalization: A transformation given to a network’s hidden layer inputs.
  • Non-linearity (noun): A given activation function (ex: Sigmoid non-linearity == Sigmoid activation function)
  • Saturated Regime: the region where an activation’s output is pinned near its extreme values (close to 0 or 1 for a sigmoid, close to -1 or +1 for tanh) because the pre-activation sum-of-products is large in magnitude. When your network’s nodes sit in this region, training slows down significantly, since gradient values shrink toward zero.
  • Whitening: linearly transforming the components of a random vector so they are decorrelated and each has zero mean and unit variance
  • Jacobian: for a function of n variables with n outputs, the n x n matrix of its partial derivatives evaluated at a point (its determinant is often also called the Jacobian).
  • Learned Affine Transform: a transformation that preserves collinearity (i.e., all points lying on a line initially still lie on a line after transformation) and ratios of distances; in Batch Normalization, this is the scale-and-shift y = γx̂ + β, with γ and β learned during training.
