… one feature may be over emphasized by the network since it’s all about dot products and gradients. It would be better if we were scaling both training and test by training statistics, but this shouldn’t matter much. Think of dividing each pixel by 255, same logic.