# Understand the Math for Neural Networks

## A detailed explanation of the math behind Gradient Descent and Back-propagation

In this post, I will discuss the details of Gradient Descent and Back-propagation in neural networks, and help you understand why the **Vanishing Gradient Problem** arises. In my earlier post (How to implement Gradient Descent in Python), I discussed a Python implementation of Gradient Descent. Here, I will walk you step by step through the math behind neural networks.

# Assumptions

For the neural networks discussed in this post, we make the following assumptions:

- **Sigmoid Function** is used as the activation function: `S(x) = 1 / (1 + e^(-x))`
- **Cross Entropy** is used as the error function: `E = -[y·ln(ŷ) + (1 - y)·ln(1 - ŷ)]`

# Gradient Descent: The Math

Assume the following scenario: we need to classify `m` points into 2 groups, and each point has `p` features, represented by `(X1, X2, ..., Xp)`. Say we are using a one-layer neural network to do that.

We feed in point `P1` (with features `X1, X2, ..., Xp`); after the neuron we get the sum `WX + b`. We apply the activation function over the sum, and get the probability of `P1` being in class 1 as `ŷ11 = S(WX + b)`, and the probability of `P1` being in class 2 as `ŷ12 = 1 - S(WX + b)`.

Up to this point, everything is very straightforward.

The loss function we are using, say, is **Cross-Entropy** (as shown in the equation above). Our job now is to minimize the loss, and all we need to do is:

- Find the derivative of the loss function with respect to the **weights**.
- With the derivative we found, apply **Gradient Descent** to update the weights in small steps and move towards the minimum error.
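The two steps above can be sketched for the one-layer case in Python (a minimal illustration with NumPy and made-up OR-style toy data; none of these variable names come from the post):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: m = 4 points, p = 2 features, with OR-style labels
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

W = np.zeros(2)
b = 0.0
lr = 0.5   # learning rate: the size of each small step

for _ in range(1000):
    y_hat = sigmoid(X @ W + b)            # feedforward probabilities
    # Standard sigmoid + cross-entropy gradient: dE/dWj = (y_hat - y) * xj
    grad_W = X.T @ (y_hat - y) / len(y)
    grad_b = np.mean(y_hat - y)
    W -= lr * grad_W                      # step against the gradient
    b -= lr * grad_b
```

After training, `sigmoid(X @ W + b)` should be close to `y` for these four points.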

Now let’s start finding the derivatives of the loss function for this point `P1` against weight `Wj`:

Similarly, we can also get the derivative of the loss function against the bias `b`:

So that we can get the gradient of the error:
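Worked out under the stated assumptions (sigmoid activation, cross-entropy loss), this is the standard derivation — a sketch in my own notation, with `y` the true label and `ŷ = S(WX + b)`:

```latex
E = -\bigl[\, y \ln \hat{y} + (1 - y)\ln(1 - \hat{y}) \,\bigr],
\qquad \hat{y} = S(WX + b)

% Using the sigmoid identity S'(z) = S(z)(1 - S(z)):
\frac{\partial E}{\partial W_j}
  = -\left( \frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}} \right)
    \hat{y}\,(1 - \hat{y})\, x_j
  = (\hat{y} - y)\, x_j

\frac{\partial E}{\partial b} = \hat{y} - y

\nabla E = \bigl( (\hat{y} - y)\,x_1,\ \dots,\ (\hat{y} - y)\,x_p,\ \hat{y} - y \bigr)
```

Note how the `ŷ(1 - ŷ)` factor from the sigmoid derivative cancels against the denominators of the cross-entropy derivative, leaving the clean `(ŷ - y)` error term.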

# Back Propagation: The Math

The Backpropagation logic is described below:

- Doing a feedforward operation.
- Comparing the output of the model with the desired output.
- Calculating the error.
- Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
- Use this to update the weights, and get a better model.
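The steps above can be sketched for a small two-layer network (a minimal illustration assuming NumPy, sigmoid activations in both layers, and toy data I made up; this is not code from the post):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(y, y_hat):
    # Cross-entropy error, averaged over the points
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # 4 points, 3 features (toy data)
y = np.array([1.0, 0.0, 1.0, 0.0])   # desired outputs

W1 = rng.normal(size=(3, 2)); b1 = np.zeros(2)   # input -> hidden
W2 = rng.normal(size=2);      b2 = 0.0           # hidden -> output
lr = 0.5

initial_loss = loss(y, sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2))

for _ in range(2000):
    # 1. Feedforward
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # 2-3. Compare the output with the desired output; for
    #      cross-entropy + sigmoid the error signal is simply (y_hat - y)
    delta_out = y_hat - y
    # 4. Run the error backwards to the hidden layer
    delta_hid = np.outer(delta_out, W2) * h * (1 - h)
    # 5. Update the weights with small gradient descent steps
    W2 -= lr * h.T @ delta_out / len(y)
    b2 -= lr * delta_out.mean()
    W1 -= lr * X.T @ delta_hid / len(y)
    b1 -= lr * delta_hid.mean(axis=0)
```

Each pass through the loop runs all five steps; after enough passes the loss should have dropped well below its initial value.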

Now let us try to do Backpropagation on the two-layer neural network above. We will try to calculate the weight-change steps for:

- Weights between hidden layer and output layer
- Weights between input layer and hidden layer

Let us first try to calculate the gradient descent step for `W21` (the first weight of the layer between the hidden layer and the output layer):

Similarly, we can find the gradients of the other weights between these two layers:

Now let us calculate the gradient descent step for the weights between the input layer and the hidden layer, starting with `W11`:

Similarly, we can also get the gradients for the other weights between these two layers; I will not include them here.
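For reference, these gradients work out, by the chain rule and the same sigmoid identity as before, to the following (a sketch in my own notation, with `h1` the first hidden-layer activation and `ŷ` the network output):

```latex
% Output layer, weight W_{21} (hidden -> output):
\frac{\partial E}{\partial W_{21}} = (\hat{y} - y)\, h_1

% Hidden layer, weight W_{11} (input -> hidden), reached via W_{21}:
\frac{\partial E}{\partial W_{11}}
  = (\hat{y} - y)\, W_{21}\, h_1 (1 - h_1)\, x_1
```

The extra factors `W_{21}`, `h_1(1 - h_1)`, and `x_1` in the second line are exactly the "gradient times weight times input times sigmoid derivative" pattern discussed next.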

However, there are a few **important points** we need to pay attention to:

- As we can notice, the gradients between the input layer and the hidden layer are much smaller than the ones between the top layers: each one is the gradient in the top layer times a weight times an input times a **sigmoid derivative**.
- As we all know, the **sigmoid derivative** has a maximum value of 0.25.
- So this means the gradients are reduced by at least 75% at each layer; this is called the **Vanishing Gradient Problem**.
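A quick numerical check of that 0.25 bound (a small sketch; the helper names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # S'(x) = S(x) * (1 - S(x))
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-10, 10, 10001)
print(sigmoid_derivative(0.0))        # → 0.25, the maximum, reached at x = 0
print(sigmoid_derivative(xs).max())   # never exceeds 0.25 anywhere
```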

I will prepare another post to discuss the **Vanishing Gradient Problem** in detail, and the different techniques to deal with it.