Paper Review of “Batch Normalization”
Problem Statement:
A proposed solution to internal covariate shift by utilizing the process of batch normalization.
Current Approach:
Ioffe and Szegedy published this paper proposing a neural network training strategy that, after thorough experimentation, was shown to substantially decrease training time, remove the necessity of dropout, decrease the amount of regularization required, and allow for higher learning rates. They termed this strategy Batch Normalization.
The experiment was performed on a convolutional neural network with a large number of convolutional and pooling layers and a softmax layer to predict the image class.
The batch-normalized model was able to achieve the same accuracy as the control model with 14 times fewer training steps.
Experimental Setup:
They used the popular MNIST dataset for an initial demonstration, and ImageNet classification with the Inception network for the main experiments.
The input to each layer's activation is the linear transformation z = W*x + b, where W and b are the layer's weights and biases; batch normalization stabilizes the distribution of these values. This prevents the activation function (such as ReLU or sigmoid) from being pushed into its saturated maximum/minimum regions. Training uses stochastic gradient descent (SGD), which optimizes the parameters Θ of the network as Θ = arg min_Θ (1/N) Σ_i ℓ(x_i, Θ), where ℓ is the loss on training example x_i and N is the number of training examples.
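To make the transform concrete, here is a minimal NumPy sketch of the batch-normalizing step applied to a mini-batch of pre-activations z: compute the mini-batch mean and variance, normalize, then scale and shift with the learned parameters gamma and beta. The function name and toy shapes are illustrative, not taken from the paper.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Training-time batch normalization of a mini-batch of pre-activations z = W*x + b."""
    mu = z.mean(axis=0)                      # per-feature mini-batch mean
    var = z.var(axis=0)                      # per-feature mini-batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * z_hat + beta              # scale and shift with learned parameters

# Toy usage: a mini-batch of 32 examples with 10 features each
z = 3.0 * np.random.randn(32, 10) + 2.0
y = batch_norm_forward(z, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly zero mean, unit std
```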
The performance evaluation criterion was the speed of convergence of the neural network, measured as accuracy against the number of training steps, for the variants listed below:
BN-Baseline: same learning rate as Inception
BN-x5: initial learning rate of 0.0075 (5 times Inception's learning rate)
BN-x30: initial learning rate of 0.045 (30 times that of Inception)
BN-x5-Sigmoid: uses the sigmoid activation function instead of ReLU
Result:
We see that BN-x5 stands out as the winner, needing only a small fraction (6.7%, to be exact) of Inception's training steps to reach 73% accuracy, while the non-normalized Inception needed almost 15 times as many steps to reach 72.2%. To achieve their impressive error rate of 4.82%, the authors used ensemble classification with 6 networks based on BN-x30, each with some modified hyper-parameters.
Conclusion:
Key takeaways:
- Batch normalization is a technique to standardize the inputs to a network, applied either to the activations of a prior layer or to the inputs directly (see the sketch after this list)
- Batch normalization accelerates training, in some cases halving the number of epochs or better, and provides some regularization, reducing generalization error
- Batch normalization allows the learning rate to be increased without destabilizing training, which further speeds up the training of neural networks.
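As a quick illustration of the first takeaway, below is a minimal PyTorch sketch (layer sizes are illustrative and not from the paper) of the typical placement: the BatchNorm layer sits between the linear transform and the non-linearity, so the activation never receives badly scaled inputs.

```python
import torch
import torch.nn as nn

# Hypothetical small classifier; BatchNorm1d normalizes the 256 pre-activations
# across each mini-batch before they reach the ReLU non-linearity.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 10),   # softmax is applied inside nn.CrossEntropyLoss during training
)

x = torch.randn(64, 784)  # a mini-batch of 64 flattened 28x28 images (MNIST-sized)
logits = model(x)
print(logits.shape)       # torch.Size([64, 10])
```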
Questions unanswered:
- During batch normalization is the mini-batch gone through twice, one to calculate mean and variance and then again to normalize them?
- What can be an example other than batch normalization that uses statistics of batches?
Limitations of current paper:
The paper discusses batch normalization and its benefits, and shows how it can improve neural networks by comparing various variants, but it does not discuss the limitations and overheads that batch normalization can introduce.
Suggestion on improving:
Discuss the costs of batch normalization, such as its computation and memory overheads: BN is memory-intensive, since batch statistics and the learned scale and shift parameters must be stored per layer, and the added computation can be expensive depending on the network. Normalization may also limit model capacity, as it forces each layer's inputs toward zero mean and unit variance before the learned scale and shift are applied.
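To give a rough sense of this overhead, the sketch below (channel count chosen arbitrarily) counts the per-layer state a single BatchNorm layer keeps in PyTorch: learned scale and shift parameters plus running mean/variance buffers, on top of the batch statistics computed on every training forward pass.

```python
import torch.nn as nn

bn = nn.BatchNorm2d(256)  # one BN layer over 256 channels (arbitrary example)

params = sum(p.numel() for p in bn.parameters())  # gamma + beta -> 512 values
buffers = sum(b.numel() for b in bn.buffers())    # running_mean + running_var + num_batches_tracked -> 513 values
print(params, buffers)
```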
For a more detailed explanation, the paper can be referred to at: https://arxiv.org/abs/1502.03167