Batch Normalization with TensorFlow

In the previous post, I introduced Batch Normalization (BN) and hopefully gave a rough understanding of it. Here we shall see how BN can be incorporated into your models. I am using TensorFlow as my platform.

TensorFlow offers a lot of flexibility and ease of use. It provides both high-level and low-level APIs. The Estimator API provides a fast way to create, train and test a model. TensorFlow also provides TensorBoard, an interactive visualization tool that helps you view training plots and the model graph.

Dataset

The dataset I have chosen to work with is CIFAR-10, a well-known dataset for image recognition. The images are 32x32 pixels in size and fall into 10 categories. It has 50,000 training images and 10,000 test images. A simple neural network with 3 conv layers and 2 dense layers is enough for a reasonable first model on this dataset.

[Figure: Convolution layer]

Google Colaboratory

Colaboratory is a Google research project created to help disseminate machine learning education and research. It’s a Jupyter notebook environment that requires no setup to use and runs entirely in the cloud. It even provides a GPU accelerator to speed up training. All of this is free if you have a Google account.

The model built here is available as a Colab notebook here. You can experiment with the notebook as you read through this post. On opening it, you first need to save a copy to your Google Drive. Then go to Edit -> Notebook Settings -> Select Accelerator -> GPU so that training runs faster.

If you want to run the notebook on your local machine, clone this GitHub repo:

Define the Model

The model is first defined as a generic convnet specification:

Lines 43–46 are where we tell TensorFlow to update the moving averages of the mean and variance at training time. This step is easy to forget, and without it the statistics used at inference time never get updated, so make sure you include it.

Since this model is generic, it builds a convolutional neural network from the hyper-parameters passed in as params. For my model, I have set the hyper-parameters as follows:
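To see why the moving-average update matters, here is a minimal sketch in plain Python. It is not the notebook's code: the class name, `MOMENTUM`, `EPSILON` and the `update_moving_averages` switch are assumptions made for illustration. During training the layer normalizes with the statistics of the current batch; at inference it falls back on the moving averages, which stay at their initial values if they are never updated.

```python
# Illustrative sketch only: batch norm for a single feature, without the
# learned scale and shift. MOMENTUM and EPSILON are assumed values.
MOMENTUM = 0.9
EPSILON = 1e-5


class BatchNormOneFeature:
    def __init__(self):
        # Statistics used at inference time; they start at mean 0, variance 1.
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training, update_moving_averages=True):
        if training:
            # Training: normalize with the statistics of the current batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            if update_moving_averages:
                # The step that is easy to forget.
                self.moving_mean = MOMENTUM * self.moving_mean + (1 - MOMENTUM) * mean
                self.moving_var = MOMENTUM * self.moving_var + (1 - MOMENTUM) * var
        else:
            # Inference: normalize with the accumulated moving averages.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / (var + EPSILON) ** 0.5 for x in batch]


bn_updated = BatchNormOneFeature()
bn_forgotten = BatchNormOneFeature()
for _ in range(200):
    bn_updated([8.0, 10.0, 12.0], training=True)
    bn_forgotten([8.0, 10.0, 12.0], training=True, update_moving_averages=False)

# A value at the training mean should normalize to roughly 0 at inference.
out_updated = bn_updated([10.0], training=False)[0]
# Without the updates, inference still uses mean 0 and variance 1.
out_forgotten = bn_forgotten([10.0], training=False)[0]
```

With the updates, `out_updated` ends up close to 0, while `out_forgotten` stays close to the raw input of 10. In TensorFlow 1.x, one common place where this shows up is `tf.layers.batch_normalization`, whose update ops are collected under `tf.GraphKeys.UPDATE_OPS` and have to be run together with the training op.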

hparams = {
    "filters": [30, 50, 60],
    "kern": [[3, 3]]*3,
    "strides": [[2, 2], [1, 1], [1, 1]],
    "dense": [3500, 700],
    "n_classes": 10,
    "with_bn": True
}

You can tweak parameters such as the filters, kernel sizes, strides, or even the number of dense layers and their units after the conv layers. Experiment with the values and see what you can find. The ‘with_bn’ flag can also be changed to include BatchNorm or not. First we disable it and train on the dataset; then we enable BatchNorm and train again to compare the results.
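When tweaking these values, the three per-conv-layer lists have to stay the same length. A small sanity check along the following lines can catch a mismatch before training starts. `check_hparams` is a hypothetical helper written for this sketch, not part of the notebook or of TensorFlow.

```python
hparams = {
    "filters": [30, 50, 60],
    "kern": [[3, 3]] * 3,
    "strides": [[2, 2], [1, 1], [1, 1]],
    "dense": [3500, 700],
    "n_classes": 10,
    "with_bn": True,
}


def check_hparams(hparams):
    """Hypothetical helper: check that the per-conv-layer lists line up.

    Returns (number of conv layers, number of hidden dense layers).
    """
    n_conv = len(hparams["filters"])
    for key in ("kern", "strides"):
        if len(hparams[key]) != n_conv:
            raise ValueError(
                f"'{key}' has {len(hparams[key])} entries, expected {n_conv}"
            )
    return n_conv, len(hparams["dense"])


n_conv, n_dense = check_hparams(hparams)

# Note: [[3, 3]] * 3 repeats the *same* inner list three times, so an
# in-place change to one kernel entry would show up in all three.
shared = hparams["kern"][0] is hparams["kern"][1]   # True

# Independent inner lists, if the entries are going to be edited in place.
safe_kern = [[3, 3] for _ in range(3)]
independent = safe_kern[0] is not safe_kern[1]      # True
```

The aliasing is harmless as long as the lists are only read, which is all a model-building function normally needs to do.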

BatchNorm before or after Activation

In the previous post, we saw that BatchNorm can be applied before or after the non-linearity, which is still a matter of debate. In the current model, I have decided to use BatchNorm after the activation, so that it normalizes the input of each layer, since BatchNorm was introduced to reduce internal covariate shift. As for the scaling and shifting parameters, I have decided to keep them for now.
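The ordering described above can be written out as a plain list of layer names. This is only a sketch under a few assumptions that the post does not spell out: `layer_plan` is a hypothetical helper, the raw input images are not batch-normalized, a flatten step sits between the conv and dense layers, and the final layer produces the logits without BN after it.

```python
def layer_plan(hparams):
    """Hypothetical helper: list the layers in order, with BN after activation."""
    plan = []
    conv_specs = zip(hparams["filters"], hparams["kern"], hparams["strides"])
    for filters, kern, strides in conv_specs:
        plan.append(f"conv {filters} filters, kernel {kern}, strides {strides}")
        plan.append("activation")
        if hparams["with_bn"]:
            # BN normalizes what the next layer receives as input.
            plan.append("batch_norm")
    plan.append("flatten")
    for units in hparams["dense"]:
        plan.append(f"dense {units}")
        plan.append("activation")
        if hparams["with_bn"]:
            plan.append("batch_norm")
    plan.append(f"dense {hparams['n_classes']} (logits)")
    return plan


hparams = {
    "filters": [30, 50, 60],
    "kern": [[3, 3]] * 3,
    "strides": [[2, 2], [1, 1], [1, 1]],
    "dense": [3500, 700],
    "n_classes": 10,
    "with_bn": True,
}

plan_with_bn = layer_plan(hparams)
plan_without_bn = layer_plan(dict(hparams, with_bn=False))

for name in plan_with_bn:
    print(name)
```

Switching to "BatchNorm before activation" would amount to swapping the two `plan.append` calls for the activation and the BN step inside each loop.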

Pooling Layers

Pooling layers are mostly used in convolutional networks when the images are large and need to be scaled down. Pooling also provides a limited degree of translational and rotational invariance, i.e., the output stays the same if the values inside a pooling region of the feature map are shifted or rotated. But in most cases, the size reduction can be achieved by increasing the stride of a conv layer. This paper studied such models and shows that pooling layers are not necessary at all.

Some image recognition models use pooling, and some use only conv layers with a stride greater than 1. In my model, I have decided not to use pooling layers. Instead, I used a stride of 2x2 in the first layer to reduce the resolution and a stride of 1x1 in the subsequent layers.
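To make the size reduction concrete, the sketch below traces the spatial size of a 32x32 image through three 3x3 conv layers with strides 2, 1 and 1. The padding mode is not stated in this post, so both of TensorFlow's modes are shown as assumptions; the function names are made up for this example.

```python
import math


def conv_output_size(size, kernel, stride, padding):
    """Output width (or height) of a conv layer for TensorFlow-style padding."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")


def trace_sizes(input_size, kernels, strides, padding):
    """Spatial size after each conv layer, starting with the input size."""
    sizes = [input_size]
    for kernel, stride in zip(kernels, strides):
        sizes.append(conv_output_size(sizes[-1], kernel, stride, padding))
    return sizes


kernels = [3, 3, 3]   # square 3x3 kernels
strides = [2, 1, 1]   # 2x2 in the first layer, then 1x1

sizes_same = trace_sizes(32, kernels, strides, "same")    # [32, 16, 16, 16]
sizes_valid = trace_sizes(32, kernels, strides, "valid")  # [32, 15, 13, 11]

# Number of values fed to the first dense layer with 60 filters in the last conv.
flat_same = sizes_same[-1] ** 2 * 60     # 15360
flat_valid = sizes_valid[-1] ** 2 * 60   # 7260
```

Either way, the single stride-2 layer does the downscaling that a pooling layer would otherwise provide.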

Let’s Start Training!

I highly recommend using Google Colab with the GPU accelerator for training, unless you have a powerful GPU in your system. Let us start training the model. Since the Estimator API is used, the learnt parameters are saved automatically during training. The training graph can be viewed in TensorBoard.

The version without BN is run for 400 steps, and the version with BN is run for only 100 steps. After every 2 training epochs, a test run is performed. The results can be seen here:

[Figure: Loss without BatchNorm]
[Figure: Loss with BatchNorm]

As we can see, what the model without BN achieves in 350 steps, the model with BN achieves in just 70! Training with BN is faster by a factor of 5. Batch Normalization is working after all!

With this model, only 65 to 70% accuracy can be achieved. Many models employ data augmentation to improve accuracy, which I will explain in a later post.

Overfitting

As we can see in the plots, the testing loss (blue) goes up during training, an obvious indication of overfitting. For now, I have concentrated only on how BN speeds up regular networks, so I didn’t use any regularization.

BN also provides a weak form of regularization. But since it cannot prevent overfitting here, it should be combined with another regularization technique such as dropout.

But how should dropout be used? Like the activation, should it come before or after BN? And how would the current model perform if BN is applied before the activation, or when the scaling and shifting parameters are left out? We shall find out in the next post.


Do try this out with the Colab notebook, and comment about any difficulties you face or anything you would like explained.