Batch Normalization made fast
As shown in many modern Deep Learning architectures, Batch Normalization greatly helps avoid overfitting and speeds up training while achieving high accuracy. By normalizing the data (subtracting the mean and dividing by the standard deviation), BN reduces the range of the data. This operation is commonly placed between a Fully Connected (FC) or Convolutional (Conv) layer and the subsequent activation function, packing the values entering the activation function around its non-linearity.
If you are willing to dive into some math to greatly speed up BN during inference/classification in your Neural Network, please read ahead. This article assumes you have some knowledge of modern Neural Networks, including Convolutional Neural Networks (CNN), and of the Batch Normalization algorithm, originally proposed by Ioffe and Szegedy in:
ORIGINAL BATCH NORMALIZATION
If you are seeking to deepen your understanding of BN, I recommend the following article:
The original batch normalization function after training (that is, the function applied during inference) is defined as:

$$\mathrm{BN}(x) = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where ε is just a small constant added for numerical stability, and:
- μ (mu) is the mean of the input data, estimated over batches of data following the procedure described in the original paper (above).
- σ² (sigma squared) is the variance of the input data, also estimated over batches.
- γ (gamma) is the scaling parameter, trained normally via backpropagation.
- β (beta) is the shifting parameter, also trained normally via backpropagation.
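Here is a minimal NumPy sketch of that formula (the function name and the eps default are mine, not taken from the demo code):

```python
import numpy as np

def batchnorm_inference(x, mu, sigma2, gamma, beta, eps=1e-5):
    """Inference-time BN: normalize x with the trained statistics (mu, sigma2),
    then scale by gamma and shift by beta."""
    return gamma * (x - mu) / np.sqrt(sigma2 + eps) + beta
```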
Let’s go over a quick example of how it works: given an input distribution with values ranging from -90 to 50, like the one depicted below,

the inclusion of a Batch Normalization layer greatly reduces the input data range; in our case we obtained outputs ranging from -4.5 to 3:


The main interest of placing a Batch Normalization layer right before an activation function is to smartly concentrate the values around the non-linearity that activation functions present around x = 0, as shown in the image depicting the most typical activation functions: sigmoid, hyperbolic tangent and Rectified Linear Unit. Indeed, one can intuitively see how the data distribution after applying BN will be much more affected by the non-linearities.
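Going back to the toy example above, here is how that squeeze looks with the batchnorm_inference sketch (purely illustrative numbers: the statistics are taken from the toy data itself, with γ = 1 and β = 0, so the exact output range differs from the figure):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-90, 50, size=10_000)   # toy data spanning -90..50

# Stand-ins for the trained values: statistics taken from the data itself,
# scale and shift left at their neutral values.
mu, sigma2, gamma, beta = x.mean(), x.var(), 1.0, 0.0
y = batchnorm_inference(x, mu, sigma2, gamma, beta)

print(round(x.min()), round(x.max()))          # roughly -90 and 50
print(round(y.min(), 2), round(y.max(), 2))    # roughly -1.7 and 1.7
```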
SPEEDING UP BATCH NORMALIZATION DURING INFERENCE
Once you have your trained Neural Network (NN) with one or several Batch Normalization operations inside it, and before deploying the NN to perform inference/classification, you can greatly speed up BN computations by applying two smart mathematical transformations to your NN parameters: reformulation and reparametrization.
Reformulation involves aggregating the parameters inside the Batch Normalization, smartly reducing them from four (μ, σ², γ, β) to just two, while also getting rid of some of the operations (square, square root, division). The new parameters, arbitrarily named ν (nu) and τ (tau), redefine BN as just one multiplication and one addition:

$$\nu = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}, \qquad \tau = \beta - \nu \, \mu, \qquad \mathrm{BN}(x) = \nu \, x + \tau$$
This transformation alone already gave a 9% boost to our total inference speed when tested on a network using several BN layers.
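A minimal sketch of the reformulation, with a quick numerical check that it matches the original BN (function name and test values are mine):

```python
import numpy as np

def reformulate_bn(mu, sigma2, gamma, beta, eps=1e-5):
    """Collapse the four trained BN parameters into the two parameters
    of the equivalent affine map BN(x) = nu * x + tau."""
    nu = gamma / np.sqrt(sigma2 + eps)
    tau = beta - nu * mu
    return nu, tau

# Quick equivalence check with random stand-ins for the trained parameters.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                          # a batch of 16-dim features
mu, sigma2 = rng.normal(size=16), rng.uniform(0.5, 2.0, size=16)
gamma, beta = rng.normal(size=16), rng.normal(size=16)

eps = 1e-5
original = gamma * (x - mu) / np.sqrt(sigma2 + eps) + beta
nu, tau = reformulate_bn(mu, sigma2, gamma, beta, eps)
assert np.allclose(original, nu * x + tau)
```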
Reparametrization goes even further. Provided that a BN is located right before either a Fully Connected (FC) or a Convolutional (Conv) layer in your network, the FC/Conv layer can completely absorb all the operations related to the BN. For this, we transform the weights (W) and biases (b) of the FC/Conv layer in the following way:

$$W' = W \, \mathrm{diag}(\nu), \qquad b' = b + W \, \tau$$
The FC layer would then compute:

$$\mathrm{FC}(\mathrm{BN}(x)) = W \, (\nu \odot x + \tau) + b = W' x + b'$$
Similarly, for the Conv layer each input-channel slice of the kernel gets scaled by the corresponding ν, and the τ term is folded into the bias, so the convolution applied to the BN output reduces to a plain convolution with the new kernel and bias. This completely eliminates the BN operations. In our test network, it reduced inference time by over 20%!
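Here is a minimal sketch of the FC case (the Conv case is analogous, scaling each input-channel slice of the kernel by the matching ν; with zero padding the bias fold is exact only away from the borders). The helper name and test values are mine:

```python
import numpy as np

def fold_bn_into_fc(W, b, mu, sigma2, gamma, beta, eps=1e-5):
    """Absorb a BN located right before an FC layer into the FC parameters.
    W has shape (n_out, n_in); the BN parameters are per-input-feature vectors."""
    nu = gamma / np.sqrt(sigma2 + eps)
    tau = beta - nu * mu
    W_new = W * nu           # scale each input column of W by the matching nu
    b_new = b + W @ tau      # the constant shift tau folds into the bias
    return W_new, b_new

# Equivalence check: FC(BN(x)) == FC'(x) with the folded parameters.
rng = np.random.default_rng(1)
n_in, n_out = 16, 4
x = rng.normal(size=n_in)
W, b = rng.normal(size=(n_out, n_in)), rng.normal(size=n_out)
mu, sigma2 = rng.normal(size=n_in), rng.uniform(0.5, 2.0, size=n_in)
gamma, beta = rng.normal(size=n_in), rng.normal(size=n_in)

eps = 1e-5
bn_out = gamma * (x - mu) / np.sqrt(sigma2 + eps) + beta
W_new, b_new = fold_bn_into_fc(W, b, mu, sigma2, gamma, beta, eps)
assert np.allclose(W @ bn_out + b, W_new @ x + b_new)
```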
REFORMULATION AND REPARAMETRIZATION TOGETHER
Putting both transformations together in a comprehensive flow diagram:

After training or importing a trained NN, we would apply reparametrization to those BN that allow it (those immediately followed by either an FC or a Conv layer), then apply reformulation to the rest, and finally we could deploy our NN and run more efficient inference/classification with it.
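A rough sketch of that flow, assuming a hypothetical list of layer descriptors and reusing the two helpers sketched above (a real implementation would of course operate on the actual TensorFlow graph):

```python
def optimize_network(layers):
    """Single pass over a list of layer descriptors: fold each BN into the FC
    layer that immediately follows it (a Conv layer would use an analogous
    folding helper), otherwise reformulate the BN into its (nu, tau) form."""
    optimized, i = [], 0
    while i < len(layers):
        layer = layers[i]
        nxt = layers[i + 1] if i + 1 < len(layers) else None
        if layer["kind"] == "bn" and nxt is not None and nxt["kind"] == "fc":
            W_new, b_new = fold_bn_into_fc(*nxt["params"], *layer["params"])
            optimized.append({"kind": "fc", "params": (W_new, b_new)})
            i += 2                      # the BN has been absorbed, skip both layers
        elif layer["kind"] == "bn":
            optimized.append({"kind": "affine",
                              "params": reformulate_bn(*layer["params"])})
            i += 1
        else:
            optimized.append(layer)
            i += 1
    return optimized
```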
WHAT IS THE CATCH?
Basically none: the operations are mathematically equivalent. If you don’t believe this, feel free to check our demo project in the following GitHub repo, where we use TensorFlow to train a small CNN to classify MNIST digits and measure the inference time using the original BN and using our reparametrized and reformulated layers.
So, yeah: you get a 10–20% speed increase in NN inference for a network using several BN layers. The exact boost depends on the number of BN layers in your NN, but it is always greater than zero, and it comes at absolutely zero cost in accuracy.
This post is based on “FHE compatible Batch Normalization for Privacy Preserving Deep Learning” by Alberto Ibarrondo and Melek Onen (below!). If you find it interesting, please share!