Deep Dive into Back Propagation (Part II)

Aung Kyaw Myint · Published in Analytics Vidhya · Dec 18, 2019 · 3 min read

This is Part II of “Deep Dive into Back Propagation”. Before continuing, I suggest checking out Part I if you haven’t read it yet. Link for Part I can be found here.

We will now continue with an example that focuses on the backpropagation process, using a network with two inputs [x1, x2], three neurons in a single hidden layer [h1, h2, h3], and a single output y.

The weight matrices to update are W1 from the input to the hidden layer, and W2 from the hidden layer to the output. Notice that in our case W2 is a vector, not a matrix, as we only have one output.

We will begin with a feedforward pass of the inputs across the network and calculate the output. Based on the resulting error, we then use backpropagation to calculate the partial derivatives.

Calculating the activation values at each of the three hidden neurons is simple: each is a linear combination of the inputs with the corresponding weight elements of the matrix W1, followed by the activation function.
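Written out, with W^1_ij denoting the element of W1 that connects input i to hidden neuron j, and Φ standing for the activation function (the article does not fix a particular one; the sigmoid is a common choice), each hidden activation is

h_j = Φ(x1 · W^1_1j + x2 · W^1_2j) = Φ(Σ_i x_i · W^1_ij),  for j = 1, 2, 3.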

The output is the dot product of the hidden-layer activation vector H with the weight vector W2.
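In the same notation, and treating the output exactly as described (a plain dot product, with no output activation shown):

y = H · W2 = h1 · W^2_1 + h2 · W^2_2 + h3 · W^2_3 = Σ_j h_j · W^2_j.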

After computing the output, we can finally find the network error.

As a reminder, the two error functions most commonly used are the Mean Squared Error (MSE), usually used in regression problems, and the cross-entropy, often used in classification problems.

In this example, we use a variation of the MSE:
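Concretely, for our single output the loss takes the form

E = (1/2) · (d − y)²,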

where d is the desired output and y is the calculated one. Notice that y and d are not vectors in this case, as we have a single output.

The error is their squared difference and is also called the network’s Loss Function. We divide the error term by 2 to simplify the notation, as will become clear soon.

The aim of the backpropagation process is to minimize the error, which in our case is the Loss Function. To do that we need to calculate its partial derivative with respect to all of the weights.

Since we just found the output y, we can now minimize the error by finding the updated values ΔW^k_ij. The superscript k indicates that we need to update each and every layer k.

As we noted before, the weight update value ΔW^k_ij is calculated using the gradient, in the following way:
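Written out, with α standing for the learning rate (a symbol not shown in the text above, but standard in gradient descent), the update is

ΔW^k_ij = −α · ∂E/∂W^k_ij = α · (d − y) · ∂y/∂W^k_ij.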

(Notice that d is a constant value, so its partial derivative is simply zero.)

This partial derivative of the output with respect to each weight defines the gradient and is often denoted by the Greek letter δ.

We will find all the elements of the gradient using the chain rule.
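For our specific network, the chain rule splits each partial derivative into factors we have already computed during the feedforward pass. In the notation used above (again with a generic activation Φ and its derivative Φ′):

∂E/∂W^2_j = ∂E/∂y · ∂y/∂W^2_j = −(d − y) · h_j

∂E/∂W^1_ij = ∂E/∂y · ∂y/∂h_j · ∂h_j/∂W^1_ij = −(d − y) · W^2_j · Φ′(Σ_i x_i · W^1_ij) · x_i

The short sketch below puts the whole walkthrough together numerically. It is only an illustration of the idea, not code from the course: the sigmoid activation, the learning rate alpha, and all the concrete numbers are my own assumptions.

    # A minimal numeric sketch of the feedforward pass and the backpropagation
    # step described above. Assumed (not from the article): sigmoid activation,
    # the learning rate alpha, and the example values for x, d, W1 and W2.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0])        # inputs [x1, x2]
    d = 0.8                          # desired output
    W1 = np.array([[0.1, 0.2, 0.3],  # 2x3 matrix: input -> hidden layer
                   [0.4, 0.5, 0.6]])
    W2 = np.array([0.7, 0.8, 0.9])   # vector: hidden layer -> single output
    alpha = 0.1                      # learning rate

    # Feedforward pass
    z = x @ W1                       # linear combination for each hidden neuron
    h = sigmoid(z)                   # hidden activations [h1, h2, h3]
    y = h @ W2                       # output: dot product of H and W2
    E = 0.5 * (d - y) ** 2           # loss: half the squared difference

    # Backpropagation (chain rule), matching the expressions above
    dE_dy = -(d - y)                                 # dE/dy
    dE_dW2 = dE_dy * h                               # dE/dW2_j = -(d - y) * h_j
    dE_dW1 = np.outer(x, dE_dy * W2 * h * (1 - h))   # dE/dW1_ij (sigmoid: Φ' = h(1 - h))

    # Gradient-descent weight updates
    W2 = W2 - alpha * dE_dW2
    W1 = W1 - alpha * dE_dW1
    print("loss:", E)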

Recap on Chain Rule
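In its simplest form, for a composition of two functions y = f(g(x)), the chain rule states that

dy/dx = (df/dg) · (dg/dx),

i.e. the derivative of a composition is the product of the derivatives of its pieces. Applying this rule repeatedly is what lets us peel the derivative of the error back through the output layer and then the hidden layer, one factor at a time, as in the expressions above.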

Link for Part III can be found here.

Content Credit: Udacity Deep Learning Program
