Deep Dive into Back Propagation (Part II)

Aung Kyaw Myint · Published in Analytics Vidhya · Dec 18, 2019 · 3 min read

This is Part II of “Deep Dive into Back Propagation”. Before continuing, I suggest checking out Part I if you haven’t read it yet. Link for Part I can be found here.

We will now continue with an example that focuses on the backpropagation process, using a network with two inputs [x1, x2], three neurons in a single hidden layer [h1, h2, h3], and a single output y.

The weight matrices to update are W1 from the input to the hidden layer, and W2 from the hidden layer to the output. Notice that in our case W2 is a vector, not a matrix, as we only have one output.

We will begin with a feedforward pass of the inputs across the network and calculate the output. Based on the resulting error, we then use backpropagation to calculate the partial derivatives.

Calculating the activation values at each of the three hidden neurons is simple: each is a linear combination of the inputs with the corresponding weight elements of the matrix W1, followed by the activation function.
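Written out, with W^1_ij denoting the element of W1 that connects input i to hidden neuron j, and Φ standing for the activation function (the article does not fix a particular one; the sigmoid is a common choice), each hidden activation is

h_j = Φ(x1 · W^1_1j + x2 · W^1_2j) = Φ(Σ_i x_i · W^1_ij),  for j = 1, 2, 3.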

The output is the dot product of the hidden-layer activation vector H with the weight vector W2.
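In the same notation, and treating the output exactly as described (a plain dot product, with no output activation shown):

y = H · W2 = h1 · W^2_1 + h2 · W^2_2 + h3 · W^2_3 = Σ_j h_j · W^2_j.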

After computing the output, we can finally find the network error.

As a reminder, the two error functions most commonly used are the Mean Squared Error (MSE), usually used in regression problems, and the cross-entropy, often used in classification problems.

In this example, we use a variation of the MSE:
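Concretely, for our single output the loss takes the form

E = (1/2) · (d − y)²,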

where d is the desired output and y is the calculated one. Notice that y and d are not vectors in this case, as we have a single output.

The error is their squared difference and is also called the network’s Loss Function. We divide the error term by 2 to simplify the notation, as will become clear soon.

The aim of the backpropagation process is to minimize the error, which in our case is the Loss Function. To do that we need to calculate its partial derivative with respect to all of the weights.

Since we just found the output y, we can now minimize the error by finding the updated values ΔW^k_ij. The superscript k indicates that we need to update each and every layer k.

As we noted before, the weight update value ΔW^k_ij is calculated using the gradient, in the following way:
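Written out, with α standing for the learning rate (a symbol not shown in the text above, but standard in gradient descent), the update is

ΔW^k_ij = −α · ∂E/∂W^k_ij = α · (d − y) · ∂y/∂W^k_ij.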

(Notice that d is a constant value, so its partial derivative is simply zero.)

This partial derivative of the output with respect to each weight defines the gradient and is often denoted by the Greek letter δ.

We will find all the elements of the gradient using the chain rule.
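For our specific network, the chain rule splits each partial derivative into factors we have already computed during the feedforward pass. In the notation used above (again with a generic activation Φ and its derivative Φ′):

∂E/∂W^2_j = ∂E/∂y · ∂y/∂W^2_j = −(d − y) · h_j

∂E/∂W^1_ij = ∂E/∂y · ∂y/∂h_j · ∂h_j/∂W^1_ij = −(d − y) · W^2_j · Φ′(Σ_i x_i · W^1_ij) · x_i

The short sketch below puts the whole walkthrough together numerically. It is only an illustration of the idea, not code from the course: the sigmoid activation, the learning rate alpha, and all the concrete numbers are my own assumptions.

    # A minimal numeric sketch of the feedforward pass and the backpropagation
    # step described above. Assumed (not from the article): sigmoid activation,
    # the learning rate alpha, and the example values for x, d, W1 and W2.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0])        # inputs [x1, x2]
    d = 0.8                          # desired output
    W1 = np.array([[0.1, 0.2, 0.3],  # 2x3 matrix: input -> hidden layer
                   [0.4, 0.5, 0.6]])
    W2 = np.array([0.7, 0.8, 0.9])   # vector: hidden layer -> single output
    alpha = 0.1                      # learning rate

    # Feedforward pass
    z = x @ W1                       # linear combination for each hidden neuron
    h = sigmoid(z)                   # hidden activations [h1, h2, h3]
    y = h @ W2                       # output: dot product of H and W2
    E = 0.5 * (d - y) ** 2           # loss: half the squared difference

    # Backpropagation (chain rule), matching the expressions above
    dE_dy = -(d - y)                                 # dE/dy
    dE_dW2 = dE_dy * h                               # dE/dW2_j = -(d - y) * h_j
    dE_dW1 = np.outer(x, dE_dy * W2 * h * (1 - h))   # dE/dW1_ij (sigmoid: Φ' = h(1 - h))

    # Gradient-descent weight updates
    W2 = W2 - alpha * dE_dW2
    W1 = W1 - alpha * dE_dW1
    print("loss:", E)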

Recap on Chain Rule
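In its simplest form, for a composition of two functions y = f(g(x)), the chain rule states that

dy/dx = (df/dg) · (dg/dx),

i.e. the derivative of a composition is the product of the derivatives of its pieces. Applying this rule repeatedly is what lets us peel the derivative of the error back through the output layer and then the hidden layer, one factor at a time, as in the expressions above.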

Link for Part III can be found here.

Content Credit: Udacity Deep Learning Program
