Batch Normalization Explained in Plain English
Theory and implementation in Keras
In neural networks, bringing the input feature values into the same range provides the following benefits.
Faster convergence and speedier training
After normalizing, all feature values lie in a small, homogeneous range (roughly the same scale). This makes learning (converging) faster for neural networks!
Before normalizing the data, the cost (loss) surface has an elongated, non-symmetric shape. After normalizing, it becomes more symmetric (bowl-shaped), which is much easier to optimize: gradient descent can head almost straight toward the minimum instead of oscillating back and forth across the steep directions of the surface, which takes far longer to converge.
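As a quick illustration of putting features on the same scale, here is a minimal sketch using Keras' Normalization layer. The three features and their ranges are made up purely for demonstration; the point is that every feature ends up with roughly zero mean and unit variance.

```python
import numpy as np
import tensorflow as tf

# Illustrative input: 1000 samples, 3 features on very different scales
rng = np.random.default_rng(0)
raw_features = np.column_stack([
    rng.uniform(0, 1, 1000),      # feature already in [0, 1]
    rng.uniform(0, 1000, 1000),   # feature spread over [0, 1000]
    rng.normal(50, 20, 1000),     # feature centered around 50
]).astype("float32")

# The Normalization layer learns a per-feature mean and variance from the data
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(raw_features)

scaled = normalizer(raw_features).numpy()
print(scaled.mean(axis=0))  # approximately 0 for every feature
print(scaled.std(axis=0))   # approximately 1 for every feature
```

After this step, no single feature dominates the gradients just because it happens to be measured on a larger scale.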
Reduced vanishing gradient problem
This problem can arise when a neural network has many layers. During training, the signals flowing forward through the network (and the gradients flowing backward) are repeatedly multiplied by numbers smaller than 1. Over many layers these products keep shrinking until, at some point, the computer hardware can no longer represent such small values and the gradients effectively become zero. That is the vanishing gradient problem.
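To get a feel for why repeated multiplication by numbers below 1 is a problem, here is a tiny numeric sketch. The per-layer shrink factor of 0.1 is an arbitrary illustration, not a value taken from any real network:

```python
import numpy as np

# Repeatedly multiplying by a factor smaller than 1 drives the value toward
# zero; in 32-bit floating point it eventually underflows to exactly 0.0.
value = np.float32(1.0)
for layer in range(1, 101):
    value = np.float32(value * 0.1)  # hypothetical per-layer shrink factor
    if value == 0.0:
        print(f"Underflow to 0.0 after {layer} layers")
        break
```

With a factor of 0.1, float32 runs out of representable values after only a few dozen multiplications, so any gradient passing through that many shrinking layers simply disappears.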