Batch Normalization Notes

moshe roth
Jul 15, 2017 · 3 min read


This is my summary of the Batch Normalization lesson in the Udacity Deep Learning Nanodegree; you can find their public notebooks here. The lesson is part of the section on Deep Convolutional GANs.

It is standard practice to normalize the inputs to a network, so that the entire dataset is “scaled” to the same ruler. This is done in the preprocessing stage of your work.
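To make the preprocessing step concrete, here is a minimal NumPy sketch of scaling a dataset to the same ruler. The function name `standardize` and the random data are my own illustration, not from the lesson, and computing the statistics on the training split only is an assumption about how you would typically apply it.

```python
import numpy as np

def standardize(train, other):
    """Scale both arrays using the per-feature mean/std of the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8  # small constant avoids division by zero
    return (train - mean) / std, (other - mean) / std

# Hypothetical data: 3 features with a mean far from 0 and a large spread
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

train_scaled, test_scaled = standardize(train, test)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

After this step every feature of the training set has a mean close to 0 and a standard deviation close to 1, and the test set is measured with the same ruler.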

In 2015, an additional approach was introduced to decrease training time: Batch Normalization (BN). During training, you send batches through your network and use the outcome at the other end to optimize the network’s parameters (with Stochastic Gradient Descent or another optimizer).

Batch Normalization is another layer to add to a deep network. On the input side it receives whatever data you want. On the output side you get the same data, but normalized across the batch to a mean of 0 and a variance of 1 (before a learned scale and shift are applied). And if you view a deep network as a chain of smaller sub-networks, the input to every sub-network is normalized.
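What the layer does to one batch can be sketched in a few lines of NumPy. This is only an illustration of the training-time computation under my own naming (`batch_norm_forward`, `gamma`, `beta`, `eps` are assumptions for the sketch), not the TensorFlow implementation.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)  # mean 0, variance ~1
    return gamma * x_hat + beta  # gamma=1, beta=0 leaves x_hat unchanged

# Hypothetical batch: 64 examples, 4 features, far from mean 0 / variance 1
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.var(axis=0))
```

In a real layer `gamma` and `beta` are learned, so the network can move away from mean 0 and variance 1 if that helps.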


Of the different benefits of BN, the one I find most intriguing is that it provides a bit of regularization. You can use less dropout in your network because BN adds “noise”, which adds robustness. This is because, during training, each data point passes through the network alongside different data points in different batches, so with BN the network sees the same data point through different normalizing “glasses” in each batch.
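The “different glasses” effect is easy to reproduce. In this small sketch (the helper `normalize_batch` and the data are hypothetical), the same data point is placed in two different batches and comes out with two different normalized values.

```python
import numpy as np

def normalize_batch(batch, eps=1e-5):
    """Normalize each feature using the statistics of this batch only."""
    return (batch - batch.mean(axis=0)) / np.sqrt(batch.var(axis=0) + eps)

rng = np.random.default_rng(0)
point = np.array([[1.0, 2.0]])  # the data point we follow

# The same point, accompanied by different neighbours in each batch
batch_a = np.vstack([point, rng.normal(size=(7, 2))])
batch_b = np.vstack([point, rng.normal(size=(7, 2))])

seen_in_a = normalize_batch(batch_a)[0]
seen_in_b = normalize_batch(batch_b)[0]
print(seen_in_a, seen_in_b)  # two different views of the same input
```

That batch-dependent jitter is the noise that acts as a mild regularizer.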

The time and memory benefits are straightforward: in the Udacity notebook, the dense and convolutional layers we create have fewer parameters, because the normalization makes their biases unnecessary (“because the batch normalization already has terms for scaling and shifting”). The network also takes less time to converge. Putting BN before the activation layer is considered best practice, but it is not mandatory.
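Why the bias becomes unnecessary can be checked numerically: subtracting the batch mean cancels any constant added to every example. The sketch below uses made-up shapes and my own helper name `normalize_batch`.

```python
import numpy as np

def normalize_batch(z, eps=1e-5):
    """Normalize each feature using the statistics of this batch only."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))  # hypothetical batch of 32 inputs
W = rng.normal(size=(10, 5))   # dense layer weights
b = rng.normal(size=5)         # dense layer bias

with_bias = normalize_batch(x @ W + b)
without_bias = normalize_batch(x @ W)

# The bias shifts the batch mean by exactly b, so normalization removes it
print(np.allclose(with_bias, without_bias))
```

Any shift the network still needs is supplied by BN's own learned shift term, which is why the layers in front of it can be created with `use_bias=False`.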

With BN the network is more robust. One plot in the notebook shows a scenario where a large learning rate (=1) breaks training for a network without BN, while the network with BN stays stable.

Won’t fix everything

If you have problems with your network (Udacity used a ‘bad weight initialization’ example), BN probably won’t help. They showed a scenario in which the network didn’t converge at all.

I like how they concluded the paragraph:

The examples in this notebook are meant to show that batch normalization can help your networks train better. But these last two examples should remind you that you still want to try to use good network design choices and reasonable starting weights.

How to implement with and without BN?

The Udacity example of a ConvNet with and without BN (see the notebook) shows in detail (a) how to use it and (b) how the results differ.

Without BN, we create a fully connected layer and a stack of convolutional layers like this:

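import tensorflow as tf  # TensorFlow 1.x, which provides tf.layers

def fully_connected(prev_layer, num_units):
    # Dense layer with its own bias and a built-in ReLU activation
    layer = tf.layers.dense(prev_layer, num_units, activation=tf.nn.relu)
    return layer

def conv_layer(prev_layer, layer_depth):
    # Use stride 2 whenever layer_depth is a multiple of 3, otherwise stride 1
    strides = 2 if layer_depth % 3 == 0 else 1
    conv_layer = tf.layers.conv2d(prev_layer, layer_depth*4, 3, strides, 'same', activation=tf.nn.relu)
    return conv_layer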

Nothing special there. Now, with BN it looks like this:

def fully_connected(prev_layer, num_units, is_training):
    # No bias and no activation here: BN supplies the shift, ReLU comes after BN
    layer = tf.layers.dense(prev_layer, num_units, use_bias=False, activation=None)
    layer = tf.layers.batch_normalization(layer, training=is_training)
    layer = tf.nn.relu(layer)
    return layer

def conv_layer(prev_layer, layer_depth, is_training):
    # Use stride 2 whenever layer_depth is a multiple of 3, otherwise stride 1
    strides = 2 if layer_depth % 3 == 0 else 1
    # Same pattern as above: linear convolution, then BN, then ReLU
    conv_layer = tf.layers.conv2d(prev_layer, layer_depth*4, 3, strides, 'same', use_bias=False, activation=None)
    conv_layer = tf.layers.batch_normalization(conv_layer, training=is_training)
    conv_layer = tf.nn.relu(conv_layer)
    return conv_layer

Essentially, the difference is that you remove the activation and bias from the original layers (tf.layers.dense and tf.layers.conv2d) with `use_bias=False, activation=None`, then attach a BN layer with `tf.layers.batch_normalization`, and finally a ReLU activation.
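The `is_training` flag matters because a BN layer behaves differently at training and at inference time. The NumPy class below is my own simplified sketch of that behaviour, not TensorFlow's code: the class name `BatchNorm1D` and the `momentum` and `eps` values are illustrative assumptions. During training it normalizes with the current batch statistics and updates moving averages; at inference it uses the stored moving averages instead.

```python
import numpy as np

class BatchNorm1D:
    """Simplified BN layer: batch statistics in training, moving averages otherwise."""

    def __init__(self, num_features, momentum=0.9, eps=1e-3):
        self.gamma = np.ones(num_features)   # scale (learned in a real network)
        self.beta = np.zeros(num_features)   # shift (learned in a real network)
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Exponential moving averages, kept for use at inference time
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=3)

# "Training": 200 hypothetical batches with mean 4 and standard deviation 2
for _ in range(200):
    bn(rng.normal(loc=4.0, scale=2.0, size=(32, 3)), training=True)

# "Inference": a single example is normalized with the stored statistics
single = np.array([[4.0, 4.0, 4.0]])
out = bn(single, training=False)
print(bn.moving_mean, bn.moving_var, out)
```

With a single example there are no meaningful batch statistics, which is why inference relies on the moving averages. In TensorFlow 1.x, as far as I know, the moving-average updates of `tf.layers.batch_normalization` are collected under `tf.GraphKeys.UPDATE_OPS` and need to be run together with the training op, so check the notebook for how it wires that up.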
