The Vanishing Gradient Problem
The vanishing gradient problem occurs when we train a neural network model using gradient-based optimization techniques.
For a long time it was a major obstacle to training deep neural network models, leading to long training times and degraded model accuracy.
What happens is that as we keep adding more and more hidden layers to the model, the learning of the earlier hidden layers gets slower and slower.
Generally, adding more hidden layers makes the network able to learn more complex functions, and thus do a better job at predicting future outcomes. This is where deep learning is making a big difference: with networks dozens or even hundreds of layers deep, we can now make sense of highly complicated data such as images, speech, and video, and perform tasks like speech recognition, image classification, and image captioning.
Now, coming to the point: what is the vanishing gradient problem?
When we do back-propagation, i.e. move backward through the network calculating gradients of the loss (error) with respect to the weights, the gradients tend to get smaller and smaller as we move toward the earlier layers. This happens because, by the chain rule, each earlier layer's gradient is a product of terms contributed by all the later layers, and each term includes the derivative of the activation function; when those derivatives are small (the sigmoid's derivative is at most 0.25), the product shrinks exponentially with depth. This means that the neurons in the earlier layers learn very slowly compared to the neurons in the later layers. The earliest layers in the network are the slowest to train.
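We can see this shrinking effect with a minimal NumPy sketch (an illustration I am adding here, not from the original article): a toy 10-layer chain of one-unit sigmoid layers, where the gradient at each earlier layer is the product of one more `w * sigmoid'(z)` term.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy network: 10 layers, one unit each, sigmoid activations.
n_layers = 10
weights = [rng.normal() for _ in range(n_layers)]

# Forward pass: record each layer's pre-activation z.
x = 1.0
pre_acts = []
for w in weights:
    z = w * x
    pre_acts.append(z)
    x = sigmoid(z)

# Backward pass: each step multiplies by w * sigmoid'(z),
# and sigmoid'(z) <= 0.25, so the running product keeps shrinking.
grad = 1.0  # gradient of the loss w.r.t. the final activation (taken as 1)
grad_magnitudes = []
for w, z in zip(reversed(weights), reversed(pre_acts)):
    s = sigmoid(z)
    grad *= w * s * (1.0 - s)  # chain rule through one sigmoid layer
    grad_magnitudes.append(abs(grad))

# grad_magnitudes[0] is the last layer; grad_magnitudes[-1] the earliest.
print(grad_magnitudes[0], grad_magnitudes[-1])
```

Running this, the earliest layer's gradient magnitude comes out many orders of magnitude smaller than the last layer's, which is exactly why those layers barely learn.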
Why are the earlier layers of the network so important?
The earlier layers are important because they are responsible for learning and detecting the simple patterns, and are actually the building blocks of our network. Obviously, if they give improper and inaccurate results, we cannot expect the later layers and the complete network to perform well and produce accurate results.
Now, what harm does it do to our model?
The training process takes too long, and the prediction accuracy of the model decreases.
In short, this is what the vanishing gradient problem does to our neural network model. Just think of a deep neural network with hundreds of layers: without countermeasures, training such a deep network to produce good, accurate results becomes very problematic.
This is the reason why we avoid sigmoid and tanh as activation functions in deep networks: their saturating behavior causes the vanishing gradient problem, as mentioned in my article. Hence, nowadays we mostly use ReLU-based activation functions when training deep neural network models to avoid such complications and improve accuracy.
Hope you got the answer to your question.
Here is an amazing video that explains the vanishing gradient problem in detail. Do watch it.