Ghost BatchNorm explained

From paper to code

Alvaro Durán Tovar
Deep Learning made easy
4 min read · Dec 17, 2020


Related papers:

  • Train longer, generalize better: closing the generalization gap in large batch training of neural networks (Hoffer, Hubara & Soudry, 2017)

Description

This paper aims to solve the issue of the “generalization gap”: neural networks tend to do worse on unseen data when they are trained with large batch sizes.

One of the proposed fixes is to change how batchnorm layers calculate their statistics (remember that BatchNorm layers transform the input to make it “normally distributed”, with mean 0 and standard deviation 1): instead of using the whole batch, they use small parts of it, which we can call nano batches (I just made that term up, it doesn’t appear in the paper).

Why does this work?

The paper doesn’t explain why this helps. My intuition is that since we normalize small parts of the batch independently (and differently on each epoch), it’s like adding noise, or maybe like doing random data augmentation. We make it harder for the network to learn, so it must become smarter!

I wonder if it’s the same as adding different noise to different parts of the batch

Indeed, it seems to be the case! I ran some tests in this colab notebook https://colab.research.google.com/drive/1cRNltVKTpkO47Wk85AaGPdmAkQmNHpyP#scrollTo=jtVydfmCY-2X and found similar results between GhostBatchNorm and what I’ll call NoisyBatchNorm, a method that adds different noise to different slices of the batch.
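For the record, the NoisyBatchNorm I tested is essentially this idea (a rough sketch of it, not the exact notebook code; the class name, slice size and noise scale are just illustrative):

import torch
import torch.nn as nn

class NoisyBatchNorm(nn.Module):
    """Adds a different random offset to each slice of the batch,
    then applies a regular BatchNorm."""
    def __init__(self, num_features, slice_size=32, noise_std=0.01):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features)
        self.slice_size = slice_size
        self.noise_std = noise_std

    def forward(self, x):
        if self.training:
            slices = []
            for s in x.split(self.slice_size, dim=0):
                # every slice gets its own noise, so different parts of the
                # batch are perturbed differently, like ghost batch statistics
                noise = torch.randn(1, s.shape[1], device=s.device) * self.noise_std
                slices.append(s + noise)
            x = torch.cat(slices, dim=0)
        return self.bn(x)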

The algorithm

Now let’s get to the meat and potatoes. The algorithm from the paper:

It might look a bit cryptic, but the idea is simple.

  • Calculate the mean of each nano batch.
  • Calculate the std of each nano batch.
  • Update the running mean using an exponential moving average.
  • Update the running std using an exponential moving average.
  • Return the whole batch, normalized (sketched in code right after this list).
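Step by step in code, that could look roughly like this (a minimal sketch; names like ghost_batch_norm and nano_batch_size are mine, and I leave out the learnable scale/shift for brevity):

import torch

def ghost_batch_norm(x, running_mean, running_std,
                     nano_batch_size=32, momentum=0.1, eps=1e-5):
    out = []
    for nano in x.split(nano_batch_size, dim=0):      # one slice per nano batch
        mean = nano.mean(dim=0)                        # mean of this nano batch
        std = torch.sqrt(nano.var(dim=0, unbiased=False) + eps)  # std of this nano batch
        # update running statistics with an exponential moving average
        with torch.no_grad():
            running_mean.mul_(1 - momentum).add_(momentum * mean)
            running_std.mul_(1 - momentum).add_(momentum * std)
        out.append((nano - mean) / std)                # normalize this nano batch
    return torch.cat(out, dim=0)                       # the whole batch, normalized

At inference time you would normalize with running_mean and running_std instead of the per-nano-batch statistics.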

Knowing this, the following image should be easier to understand:

There is not much to say about the mean and standard deviation, they are what they are. Note the blue lines; I added them to make it clear that the 𝜖 is not inside the sum.

What is 𝜖 doing here? It helps with numerical stability. What if, for whatever reason, the value under the square root is 0? We would end up dividing by 0 and everything would fail. Adding a small value (like 1e-5) ensures that never happens. This is one of those little details you must know when implementing papers, but it isn’t mentioned in the paper.
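In code that placement looks like this (a tiny standalone sketch, with arbitrary shapes):

import torch

eps = 1e-5
nano = torch.randn(32, 8)          # one nano batch: (nano_batch_size, num_features)
mean = nano.mean(dim=0)
# eps goes outside the sum/mean of squared deviations, but inside the square root
std = torch.sqrt(((nano - mean) ** 2).mean(dim=0) + eps)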

The exponential moving average is a simple trick to calculate a moving average using only the last value and the new value:

momentum = 0.1
new_running = old * (1 - momentum) + new * momentum

Reviewing the algorithm

First of all, this paper is pretty cool, and I don’t consider myself smarter than the authors! But we all make mistakes; that’s why there are reviews, etc. You can expect to see errors or things that aren’t well explained in papers (indeed, papers have different versions with fixes). I think the following changes fix some problems with the original algorithm, namely missing/incorrect indices.

Look at the red indices.

Implementation

One naive way to implement this would be to do everything with loops, like the step-by-step sketch above, and that would be very inefficient. Instead I’m going to show you the vectorized version directly:
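A vectorized version can look roughly like this (a sketch with illustrative names and defaults, assuming the batch size is divisible by the nano batch size):

import torch
import torch.nn as nn

class VectorizedGhostBatchNorm(nn.Module):
    """Ghost BatchNorm without Python loops: reshape the batch into
    (num_nano_batches, nano_batch_size, num_features) and normalize
    every nano batch at once."""
    def __init__(self, num_features, nano_batch_size=32, momentum=0.1, eps=1e-5):
        super().__init__()
        self.nano_batch_size = nano_batch_size
        self.momentum = momentum
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_features))   # learnable scale
        self.bias = nn.Parameter(torch.zeros(num_features))    # learnable shift
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_std", torch.ones(num_features))

    def forward(self, x):
        if not self.training:
            x = (x - self.running_mean) / self.running_std
            return x * self.weight + self.bias
        batch_size, num_features = x.shape
        # assumes batch_size is a multiple of nano_batch_size
        chunks = x.view(-1, self.nano_batch_size, num_features)
        mean = chunks.mean(dim=1, keepdim=True)
        std = torch.sqrt(chunks.var(dim=1, unbiased=False, keepdim=True) + self.eps)
        out = (chunks - mean) / std
        # update running statistics with an exponential moving average
        with torch.no_grad():
            self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean.mean(dim=(0, 1)))
            self.running_std.mul_(1 - self.momentum).add_(self.momentum * std.mean(dim=(0, 1)))
        return out.view(batch_size, num_features) * self.weight + self.bias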

And now the super simple & smart way:

What a difference! This last option is pretty cool: it takes every nano batch and passes it through an embedded batchnorm layer, instead of re-implementing batchnorm from scratch. This is from the pytorch-tabnet repo.
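The gist of that approach, sketched here with illustrative names (the real code lives in the pytorch-tabnet repo):

import torch
import torch.nn as nn

class ChunkedGhostBatchNorm(nn.Module):
    """Split the batch into nano batches and pass each one through a
    regular BatchNorm1d layer, which keeps the running statistics for us."""
    def __init__(self, num_features, nano_batch_size=128, momentum=0.01):
        super().__init__()
        self.nano_batch_size = nano_batch_size
        self.bn = nn.BatchNorm1d(num_features, momentum=momentum)

    def forward(self, x):
        if not self.training:
            return self.bn(x)
        chunks = x.split(self.nano_batch_size, dim=0)   # one chunk per nano batch
        return torch.cat([self.bn(chunk) for chunk in chunks], dim=0)

Because nn.BatchNorm1d already handles 𝜖, the running statistics and the learnable scale and shift, the whole trick fits in a few lines.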

Benchmark time

Which option is faster? Running both versions gives me this:

We can see that for small ghost batch sizes (< 512) the vectorized version is faster, because we aren’t looping in Python, and as the ghost batch size gets closer to the real batch size the second option (calling a batchnorm layer multiple times) becomes better. This is because the PyTorch implementation of batchnorm is highly optimized in C.
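If you want to reproduce the comparison, a rough timing loop like the one below is enough (a sketch; the batch size, feature count and nano batch sizes are arbitrary, and the two classes are the sketches from earlier in this post):

import time
import torch

def time_module(module, x, repeats=100):
    module.train()
    for _ in range(10):          # warm-up calls, not measured
        module(x)
    start = time.perf_counter()
    for _ in range(repeats):
        module(x)
    return (time.perf_counter() - start) / repeats

x = torch.randn(2048, 64)        # (batch_size, num_features)
for nano in (32, 128, 512, 2048):
    vectorized = VectorizedGhostBatchNorm(64, nano_batch_size=nano)
    chunked = ChunkedGhostBatchNorm(64, nano_batch_size=nano)
    print(nano, time_module(vectorized, x), time_module(chunked, x))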

Conclusions

  • Implementing papers can be pretty hard, even for simple algorithms like this one. It isn’t always easy to express with math what you can explain quite easily in a few words.
  • It’s worth knowing how to write vectorized code; the improvement can be substantial.
  • If you want to use large batch sizes consider using GhostBatchNorm.
