Batch Normalization and ReLU for solving Vanishing Gradients
A sequential roadmap to understanding these concepts in training deep neural networks.
Agenda
We will break our discussion into 4 logical parts that build upon each other. For the best reading experience, please go through them sequentially:
1. What is Vanishing Gradient? Why is it a problem? Why does it happen?
2. What is Batch Normalization? How does it help in Vanishing Gradient?
3. How does ReLU help in Vanishing Gradient?
4. Batch Normalization for Internal Covariate Shift
Vanishing Gradient
1.1 What is vanishing gradient?
First, let’s understand what vanishing means:
Vanishing means that a value approaches 0 but never actually reaches 0.
Vanishing gradient refers to the phenomenon in deep neural networks where the backpropagated error signal (gradient) typically shrinks exponentially with distance from the output layer.
In other words, the useful gradient information from the end of the network fails to reach the beginning of the network.
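This shrinkage is easy to see numerically. The sketch below (a hypothetical toy network, not any particular architecture) stacks sigmoid layers and backpropagates an error signal through them by hand; because the sigmoid derivative is at most 0.25, the gradient norm collapses as it travels toward the first layer. The layer count, width, and weight scale are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy deep network: 30 sigmoid layers of width 16 with small random weights.
n_layers, width = 30, 16
weights = [rng.normal(0.0, 1.0, size=(width, width)) / np.sqrt(width)
           for _ in range(n_layers)]

# Forward pass, keeping each layer's activation for the backward pass.
a = rng.normal(size=width)
activations = []
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: propagate an error signal from the output toward the input.
# sigmoid'(z) = a * (1 - a), which is at most 0.25, so each layer shrinks
# the gradient; the norms record how much signal survives at each depth.
grad = np.ones(width)
norms = []
for W, a in zip(reversed(weights), reversed(activations)):
    grad = W.T @ (grad * a * (1.0 - a))
    norms.append(np.linalg.norm(grad))

print(f"gradient norm at the last hidden layer: {norms[0]:.3e}")
print(f"gradient norm at the first layer:       {norms[-1]:.3e}")
```

Running this shows the gradient norm at the first layer is many orders of magnitude smaller than at the last: the useful error signal has effectively vanished before it reaches the early layers.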