BatchNorm: Fine-Tune your Booster

Ilango Rajagopal
Jun 8, 2018

In the previous posts, I explained how Batch Norm works and showed how it can be used in TensorFlow. In this post, I'll cover best practices, tips and tricks, and points to remember, to wrap up this series on Batch Normalization. So let's get into it!

All the examples provided here are also in the GitHub repository, with explanations. You can try those notebooks out.

In the previous post, where I provided links to experiment with a CNN, if you ran it in its default state, the model's accuracy would not have risen immediately, making BN look useless. That is because I didn't set an important hyperparameter: the momentum of BN. So let's start with that.

Momentum

moving_mean = α * moving_mean + (1 - α) * batch_mean
moving_var = α * moving_var + (1 - α) * batch_var

Here, α is the momentum.

As discussed earlier, momentum is the importance given to the previous moving average when calculating the population statistics for inference. If momentum still feels abstract, it is nothing but the smoothing slider we can adjust in TensorBoard. Momentum is the "lag" in learning the mean and variance, so that the noise due to the mini-batch can be ignored.

Actual (light) and lagged (bold) values with momentum 0.99 and 0.75

By default, momentum is set to a high value, around 0.99, meaning high lag and slow learning. When batch sizes are small, the number of steps run will be larger, so a high momentum results in slow but steady learning (more lag) of the moving mean. In that case, it is helpful.

But when the batch size is big, as I have used, i.e. 5K images (out of 50K) in a single step, the number of steps is small. Also, the statistics of the mini-batch are almost the same as those of the population. In that case, momentum has to be lower, so that the mean and variance are updated quickly. Hence a ground rule is:

Small batch size => High Momentum (0.9–0.99)

Big batch size => Low Momentum (0.6–0.85)

For big batch sizes, an even better scheme is inverse decay momentum, where momentum is very low at the start and keeps increasing as training goes on:

Inverse decay momentum for bigger batches
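
A rough sketch of one such schedule (the exact form is a design choice; inverse_decay_momentum and its decay_rate are illustrative, not taken from the notebook):

    def inverse_decay_momentum(step, max_momentum=0.99, decay_rate=0.01):
        # Starts near 0 (moving statistics track the batch statistics closely)
        # and approaches max_momentum as training progresses.
        return max_momentum * (1.0 - 1.0 / (1.0 + decay_rate * step))

    # Example: ~0.0 at step 0, ~0.5 around step 100, ~0.9 around step 1000.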

In the Colab notebook I have provided, you can now set the momentum as a hyperparameter and see how training goes.
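
For reference, momentum is just an argument to the BN layer. A rough sketch of a conv block with BN after the activation, assuming the tf.layers API (conv_relu_bn and is_training are placeholder names):

    import tensorflow as tf

    def conv_relu_bn(x, filters, is_training, bn_momentum=0.75):
        # Conv -> ReLU -> BN, with the BN momentum exposed as a hyperparameter.
        # Small batch size => momentum around 0.9-0.99; big batch size => 0.6-0.85.
        x = tf.layers.conv2d(x, filters, kernel_size=3, padding='same',
                             activation=tf.nn.relu)
        return tf.layers.batch_normalization(x, momentum=bn_momentum,
                                             training=is_training)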

Scale and Shift

Scale and shift parameters are applied to the normalized values to move the distribution away from around 0. This is mostly useful when BN is used before the activation. Nowadays, BN is widely applied after the activation, so these parameters can be dropped. The TensorFlow way to do that would be:
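
A minimal sketch, assuming the tf.layers API (is_training stands in for the usual training/inference flag):

    # BN after the activation: disable the learned scale (gamma) and
    # shift (beta), saving two trainable parameters per BN layer.
    x = tf.layers.batch_normalization(x, center=False, scale=False,
                                      training=is_training)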

There is no difference whether scale and shift are used or not. Note that BN is applied after the activation.

The difference is that two extra parameters per batch_norm layer no longer need to be learned, which gives a small speed-up. It also means fewer parameters and a slimmer model. You can see the changes in accuracy and training time with and without scale and shift in this Colab Notebook.

BN before non-linearity => Use scale, shift

BN after non-linearity => No need for scale, shift

BatchNorm and Dropout

So far we have used only Batch Normalization and seen our model overfit the data. We have also learnt that BN provides only weak regularization, so it should be used together with another regularizer such as Dropout. But how should we use Dropout and BatchNorm together?

The paper Disharmony between Dropout and Batch Normalization studies in detail the problems faced when dropout and BN are used together. Its main conclusion is that dropout should be used only after all the BN layers, i.e. in the final dense layers. BatchNorm should not be used after a dropout layer.

(a) When BN layers are applied after dropout (b) When dropout layers are applied after all BN

As we can see, when BN is used after dropout, the accuracy is not stable. Dropout should be used at the end of the network, so that no BN layer comes after it.

Use dropout after all BN layers

This Colab Notebook provides a framework to check how batchnorm performs after dropout and in the recommended arrangement.
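
A rough sketch of the recommended arrangement, with BN in the conv blocks and dropout only in the final dense layers (layer sizes here are just for illustration):

    import tensorflow as tf

    def model(images, is_training):
        # Conv blocks: activation first, then BN (as used throughout this series).
        x = tf.layers.conv2d(images, 32, 3, padding='same', activation=tf.nn.relu)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.layers.conv2d(x, 64, 3, padding='same', activation=tf.nn.relu)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.layers.flatten(x)
        x = tf.layers.dense(x, 256, activation=tf.nn.relu)
        # Dropout only after all the BN layers, in the final dense part.
        x = tf.layers.dropout(x, rate=0.5, training=is_training)
        return tf.layers.dense(x, 10)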

Batch Renormalization

Batch Renormalization is another technique, used when the batch size is very small, like 32 or 64. In that regime the mini-batch mean and variance may be too noisy, so the moving statistics are used during training itself. This leads to more accurate training, but you then need to set another set of hyperparameters, like the momentum for these moving statistics.

To enable batch renorm in TensorFlow, set the renorm parameter to True and also set renorm_momentum:
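
A minimal sketch, assuming the tf.layers API (is_training is a placeholder, and 0.9 is just an example value for renorm_momentum):

    # Batch Renormalization: moving statistics are also used during training.
    x = tf.layers.batch_normalization(x, training=is_training,
                                      renorm=True, renorm_momentum=0.9)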

Tiny batch size => Use Batch Renormalization

Update moving statistics

One important thing people tend to forget when using Batch Normalization is updating the moving averages during training. As a result, at test time, the inference values for the mean and variance are missing, which ends up in low accuracy.

In TensorFlow, add the update ops as a dependency to the training step:
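
A minimal sketch, assuming loss and optimizer are already defined elsewhere in the graph:

    # The ops that update the moving mean/variance live in the UPDATE_OPS
    # collection; make the train step depend on them so they run every iteration.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = optimizer.minimize(loss)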

To summarize:

  1. Set momentum accordingly.
  2. Use scale and shift only when needed.
  3. Use dropout correctly: Dropout to be applied after all the BN layers.
  4. When batch size is very small, use batch renormalization.
  5. Don’t forget to update population statistics while training.

This wraps up the series about Batch Normalization. If you have any queries or suggestions, feel free to comment.
