In this post, I will discuss the details of Gradient Descent and Backpropagation in neural networks, and help you understand why the Vanishing Gradient Problem exists. In my earlier post (How to implement Gradient Descent in Python), I discussed a Python implementation of Gradient Descent. Here, I will walk through the math behind Neural Networks step by step.
For the Neural Networks discussed in the post, we will make the following assumptions:
- The Sigmoid function S(x) = 1 / (1 + e^(−x)) is used as the activation function
- Cross-Entropy is used as the error function: E = −∑ [y ln(ŷ) + (1 − y) ln(1 − ŷ)]
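As a quick reference, the two assumed functions can be sketched in Python (the function names are my own):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1 / (1 + np.exp(-x))

def cross_entropy(y, y_hat):
    """Binary cross-entropy error between labels y and predictions y_hat."""
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```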
Gradient Descent: The Math
Assume the following scenario: we need to classify m points into 2 groups, and each point has p features, represented by (X1, X2, ..., Xp). Say we are using a one-layer neural network to do that.
We feed in point P1 (with features X1, X2, ..., Xp); after the neuron we get the sum WX + b. We apply the activation function over the sum, and we get the probability for P1 being in class 1 as ŷ11 = S(WX + b), and the probability for P1 being in class 2 as ŷ12 = 1 − ŷ11 = 1 − S(WX + b).
Up to this point, everything is very straightforward.
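This feedforward step can be sketched in a few lines of NumPy (the point, weights, and bias below are made-up illustration values):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative point P1 with p = 3 features, plus weights and bias.
X = np.array([0.5, -1.2, 3.0])
W = np.array([0.1, 0.4, -0.2])
b = 0.05

y_hat_class1 = sigmoid(np.dot(W, X) + b)   # probability of class 1
y_hat_class2 = 1 - y_hat_class1            # probability of class 2
```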
The loss function we are using, say, is Cross-Entropy (as shown in the equation above). So our work now is trying to minimize the loss, and all we need to do is:
- Find the derivative of the loss function with respect to the weights
- With the derivative we found, we can apply Gradient Descent to update the weights with small steps to move towards the minimum error.
Now let’s start finding the derivatives of the loss function for this point P1 with respect to the weights. Applying the chain rule (and using the fact that S'(x) = S(x)(1 − S(x))), the derivative with respect to weight Wj is:

∂E/∂Wj = −(y − ŷ) Xj

Similarly, we can also get the derivative of the loss function with respect to the bias b:

∂E/∂b = −(y − ŷ)

So we can get the gradient of the error:

∇E = −(y − ŷ) (X1, X2, ..., Xp, 1)
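A single update step using this gradient can be sketched in Python (the learning rate, point, and label are made-up illustration values):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gradient_step(W, b, X, y, lr=0.1):
    """One gradient descent update for a sigmoid neuron with cross-entropy loss."""
    y_hat = sigmoid(np.dot(W, X) + b)
    error = y - y_hat          # dE/dW = -(y - y_hat) * X, so we step by +lr * error * X
    W = W + lr * error * X
    b = b + lr * error
    return W, b

# Example: one point labeled y = 1
W, b = np.array([0.1, 0.4, -0.2]), 0.05
W, b = gradient_step(W, b, np.array([0.5, -1.2, 3.0]), y=1)
```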
Back Propagation: The Math
The Backpropagation logic is described below:
- Doing a feedforward operation.
- Comparing the output of the model with the desired output.
- Calculating the error.
- Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
- Using this to update the weights, and get a better model.
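The steps above can be sketched for a small two-layer network in plain NumPy (the sizes, weights, and point are made-up illustration values, and biases are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Made-up example: one point with 3 features, label y = 1,
# and a 3 -> 2 -> 1 network (biases omitted for brevity).
X = np.array([[0.5, -1.2, 3.0]])
y = np.array([[1.0]])
W1 = np.array([[0.1, 0.4], [-0.3, 0.2], [0.25, -0.1]])   # input -> hidden
W2 = np.array([[0.7], [-0.5]])                           # hidden -> output

# 1. Feedforward
h = sigmoid(X @ W1)
y_hat = sigmoid(h @ W2)

# 2-3. Compare with the desired output; with sigmoid + cross-entropy
# the error signal at the output is simply (y - y_hat).
delta_out = y - y_hat

# 4. Propagate the error backwards; note the extra sigmoid-derivative factor.
delta_hidden = (delta_out @ W2.T) * h * (1 - h)

# 5. Update the weights with a small step.
lr = 0.5
W2 += lr * h.T @ delta_out
W1 += lr * X.T @ delta_hidden
```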
Now let us try to do Backpropagation on the above two-layer neural network. We will calculate the weight update steps for:
- Weights between hidden layer and output layer
- Weights between input layer and hidden layer
Let us first try to calculate the gradient descent step for W21 (the first weight of the layer between the hidden layer and the output layer):
Similarly, we can find the gradients of the other weights between these two layers:
Now let us calculate the gradient descent step for the weights between the input layer and the hidden layer:
Similarly, we can also get the gradients for the other weights between these two layers; I will not include them here.
However, there are a few important points we need to pay attention to:
- As we can see, the gradients between the input layer and the hidden layer are much smaller than the ones between the top layers. By the chain rule, each of them is the gradient in the top layer times the weights, times the inputs, times the sigmoid derivatives.
- As we all know, the sigmoid derivative S'(x) = S(x)(1 − S(x)) has a maximum value of 0.25, reached at x = 0.
- So this means the gradients are reduced by at least 75% each time they pass backwards through a sigmoid layer; this is called the Vanishing Gradient Problem.
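We can check the 0.25 bound numerically with a quick sketch:

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)   # S'(x) = S(x)(1 - S(x))

# Evaluate over a wide range of inputs; the maximum is 0.25, at x = 0.
xs = np.linspace(-10, 10, 10001)
print(sigmoid_derivative(xs).max())   # → 0.25
```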
I will prepare another post to discuss the Vanishing Gradient Problem in detail, along with the different techniques to deal with it.