Deep Dive into Back Propagation (III)

Aung Kyaw Myint · Published in Analytics Vidhya · Dec 24, 2019

This is Part III of the “Deep Dive into Back Propagation” series. Before continuing, I suggest checking out Part I and Part II if you haven’t read them; the links for Part I and Part II can be found here and here.

Now that we understand the chain rule, we can continue with our backpropagation example, where we will calculate the gradient.

In our example we only have one hidden layer, so our backpropagation process will consist of two steps:

Step 1: Calculating the gradient with respect to the weight vector W² (from the output to the hidden layer).
Step 2: Calculating the gradient with respect to the weight matrix W¹ (from the hidden layer to the input).
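Since the original equations in this series are embedded as images, here is a compact sketch of the network assumed throughout: one hidden layer of three neurons with activation function Φ, and a single linear output. The symbols x_i for the inputs and h_j for the hidden activations are my notation and may differ slightly from the figures in Part I and Part II.

```latex
% Assumed forward pass (one hidden layer of three neurons, single linear output):
\[
h_j = \Phi\!\left(\sum_i x_i\, W^1_{ij}\right), \qquad j = 1, 2, 3,
\qquad\qquad
y = \sum_{j=1}^{3} h_j\, W^2_j
\]
```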

Step 1 (Note that the weight vector referenced here will be W². All indices referring to W² have been omitted from the calculations to keep the notation simple).

Equation 1

As you may recall:

In this specific step, since the output is a single value, we can rewrite the equation the following way (in which we have a weight vector):

Since we already calculated the gradient, we now know that the incremental value we need for step one is:

Equation 2

Having calculated the incremental value, we can update the vector W² the following way:

Equation 3
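Because Equations 1–3 appear as images in the original post, here is a hedged reconstruction of the Step 1 result. It assumes, as in the earlier parts, a squared-error loss with target d, a learning rate α, and a linear output unit, so that ∂y/∂W²_j is simply h_j; treat it as a sketch rather than a verbatim copy of the figures.

```latex
% Step 1 (sketch): incremental value and update for the hidden-to-output weights
\[
\Delta W^2_j = \alpha\,(d - y)\,\frac{\partial y}{\partial W^2_j}
             = \alpha\,(d - y)\, h_j,
\qquad\qquad
W^2_j \;\leftarrow\; W^2_j + \Delta W^2_j
\]
```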

Step 2 (In this step, we will need to use both weight matrices. Therefore we will not be omitting the weight indices.)

In our second step we will update the weights of matrix W¹ by calculating the partial derivative of y with respect to the weight matrix W¹.

The chain rule will be used the following way:

We obtain the partial derivative of y with respect to h̄ and multiply it by the partial derivative of h̄ with respect to the corresponding elements in W¹. Instead of referring to the vector h̄, we can look at each of its elements and present the equation the following way:

Equation 4
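In symbols, the chain rule just described reads roughly as follows, where W¹_ij denotes the weight connecting input x_i to hidden neuron j (my reconstruction of Equation 4, since the original is an image):

```latex
% Chain rule for a single weight of the input-to-hidden matrix
\[
\frac{\partial y}{\partial W^1_{ij}}
= \frac{\partial y}{\partial h_j}\,\cdot\,\frac{\partial h_j}{\partial W^1_{ij}}
\]
```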

In this example we have only 3 neurons in the single hidden layer, so this will be a linear combination of three elements:

Equation 5
Equation 6
Equation 7
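The linear combination mentioned above plausibly has the shape below, from which the first factor of the chain rule follows immediately; the exact indexing is an assumption on my part, since Equations 5–7 are images in the original.

```latex
% Output as a linear combination of the three hidden activations (sketch)
\[
y = h_1 W^2_1 + h_2 W^2_2 + h_3 W^2_3
\qquad\Longrightarrow\qquad
\frac{\partial y}{\partial h_j} = W^2_j, \quad j = 1, 2, 3
\]
```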

Therefore:

Equation 8
Equation 9
Equation 10

The second factor in equation 9 can be calculated the following way:

(Notice how simple the result is, as most of the components of this partial derivative are zero).

Equation 11

After understanding how to treat each factor of equation 9 separately, we can now summarize it the following way:

Equation 12
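Putting the two factors together, the summarized gradient of equation 12 plausibly takes the form below, where Φ′ is the derivative of the hidden-layer activation function; again, this is a hedged reconstruction rather than the original figure.

```latex
% Summarized gradient for a single input-to-hidden weight (sketch)
\[
\frac{\partial y}{\partial W^1_{ij}}
= W^2_j\;\Phi'\!\left(\sum_{k} x_k\, W^1_{kj}\right) x_i
\]
```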

We are ready to finalize step 2, in which we update the weights of matrix W¹ by calculating the gradient shown in equation 5. From the above calculations, we can conclude that:

Equation 13
Equation 14

Having calculated the incremental value, we can update the matrix W¹ the following way:

Equation 15
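Analogously to Step 1, the Step 2 incremental value and update (Equations 13–15) can be sketched as follows, under the same squared-error and learning-rate assumptions as before:

```latex
% Step 2 (sketch): incremental value and update for the input-to-hidden weights
\[
\Delta W^1_{ij} = \alpha\,(d - y)\,\frac{\partial y}{\partial W^1_{ij}}
               = \alpha\,(d - y)\, W^2_j\;\Phi'\!\left(\sum_{k} x_k\, W^1_{kj}\right) x_i,
\qquad
W^1_{ij} \;\leftarrow\; W^1_{ij} + \Delta W^1_{ij}
\]
```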

In all of these calculations, we did not emphasize the bias input, as it does not change any of the concepts we covered. As I mentioned before, simply consider the bias as a constant input that is also connected to each of the neurons of the hidden layer by a weight. The only difference between the bias and the other inputs is that it stays the same while each of the other inputs changes.

After updating the weight matrices we begin once again with the Feedforward pass, starting the process of updating the weights all over again.

In this example, for each new input we updated the weights after each calculation of the output. It is often beneficial to update the weights only once every N steps instead. This is called mini-batch training, and it involves averaging the changes to the weights over multiple steps before actually updating the weights (a short sketch follows the list of reasons below).

There are two primary reasons for using mini-batch training:

  1. Reduction of the complexity of the training process
  2. Noise reduction
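To make the mini-batch idea concrete, here is a minimal NumPy sketch that accumulates the weight updates over a batch of N samples and applies their average once per batch. The network shape, sigmoid hidden activation, squared-error update and learning rate are assumptions chosen to match the running example, not code from the original post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_update(W1, W2, X, D, lr=0.1):
    """Average the weight updates over a mini-batch of samples,
    then apply them once (instead of after every single sample)."""
    dW1 = np.zeros_like(W1)      # accumulated updates for the input->hidden weights
    dW2 = np.zeros_like(W2)      # accumulated updates for the hidden->output weights
    N = len(X)
    for x, d in zip(X, D):
        # Feedforward pass: hidden activations and (linear) output
        h = sigmoid(x @ W1)      # shape (3,)
        y = h @ W2               # scalar output
        err = d - y
        # Step 1: update direction for W2 is (d - y) * h_j
        dW2 += lr * err * h
        # Step 2: update direction for W1_ij is (d - y) * W2_j * phi'(.) * x_i
        dW1 += lr * err * np.outer(x, W2 * h * (1.0 - h))
    # Apply the averaged updates once per mini-batch
    return W1 + dW1 / N, W2 + dW2 / N

# Toy usage: 2 inputs, 3 hidden neurons, 1 output, mini-batch of 4 samples
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=3)
X = rng.normal(size=(4, 2))
D = rng.normal(size=4)
W1, W2 = minibatch_update(W1, W2, X, D)
```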

Content Credit: Udacity Deep Learning Program
