My Machine Learning Diary: Day 32
Today I finally somewhat understand what’s going on with back propagation in neural networks. I also used some resources other than Coursera ML. This video by 3Blue1Brown was very helpful for visualizing how back propagation works. This book by Michael Nielsen was brilliant at explaining the math behind back propagation.
Back Propagation
Previously, we saw how the network gives us outputs with forward propagation. Back propagation is the process by which the network adjusts its weights to minimize the error.
Notation
z: weighted input to a neuron. (z = wa)
𝑎: activation. 𝑎 = sigmoid(z)
superscript 𝑙: 𝑙-th layer
subscript k: k-th neuron
subscript jk (of weight, with superscript l): weight from k-th neuron in layer 𝑙-1 to j-th neuron in layer 𝑙
L: number of layers
The Goal
Our final goal of back propagation is to figure out by how much we should adjust each weight to minimize the error. Namely, we want to know the derivative of the cost with respect to each weight and update the weights accordingly.
Code Overview
The pseudo code below briefly shows the process of learning in a neural network.
For each sample, we perform forward propagation and get the activations of each layer. Then we use back propagation to get the derivative of the cost with respect to each weight in each layer. Finally, we take the mean of these derivatives across all the samples.
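The pseudo code was an image in the original post and didn’t survive; here is a minimal Python sketch of the same loop on a toy single-layer network. The helper names (`forward`, `backward`, `mean_gradient`) are my own, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy "network": a single layer of weights w, sigmoid output.
def forward(w, x):
    z = w @ x          # weighted sum (z = wa in the notation above)
    a = sigmoid(z)     # activation
    return z, a

def backward(z, a, x, y):
    # For a sigmoid output with the cross-entropy cost, the error term
    # at the output is (a - y); the gradient w.r.t. each weight is
    # that error times the corresponding input activation.
    delta = a - y
    return delta * x

def mean_gradient(w, X, Y):
    # For each sample: forward propagation, then back propagation,
    # then average the per-sample gradients across all samples.
    grads = []
    for x, y in zip(X, Y):
        z, a = forward(w, x)
        grads.append(backward(z, a, x, y))
    return np.mean(grads, axis=0)
```

The rest of the post derives why `delta = a - y` and `delta * x` are the right quantities.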
Cost Function
To achieve the goal, we first need to find out the cost. The cost function for neural networks is very similar to the one from logistic regression. The difference is that we can now have more than one output; namely, more than one neuron in the output layer. So the cost function is defined as follows:
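The equation itself was an image in the original post; in Coursera’s notation, with K output neurons and m samples, it is presumably:

$$
J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Bigl[y^{(i)}_k \log\bigl(h(x^{(i)})\bigr)_k + \bigl(1-y^{(i)}_k\bigr)\log\bigl(1-(h(x^{(i)}))_k\bigr)\Bigr] \tag{1}
$$

For the derivations below it is enough to look at the cost of a single sample, which I’ll write as C; the mean over samples is taken at the end, as in the code overview.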
Algorithm
This was the most confusing part that I couldn’t understand yesterday. It is very math heavy, but I hope I can summarize it well as I’m editing this. It is always good to remind myself what the goal is because it will be a long math proof. What we want is this:
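The missing equation (2) should simply be the quantity we are after:

$$
\frac{\partial C}{\partial w^l_{jk}} \tag{2}
$$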
Let’s change (2) into something easier to compute. Using chain rule, we get this:
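Since C depends on the weight only through z and then a, the chain rule expansion (3) should read:

$$
\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial z^l_j}{\partial w^l_{jk}}\,\frac{\partial a^l_j}{\partial z^l_j}\,\frac{\partial C}{\partial a^l_j} \tag{3}
$$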
Since I know the second term and third term of (3) appear a lot later, I will put them together and define it as δ:
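(4) would then be the definition of δ, which is just the derivative of the cost with respect to the weighted input:

$$
\delta^l_j \equiv \frac{\partial a^l_j}{\partial z^l_j}\,\frac{\partial C}{\partial a^l_j} = \frac{\partial C}{\partial z^l_j} \tag{4}
$$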
Therefore, the derivative of the cost with respect to a weight can be expressed as follows:
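Substituting (4) into (3), (5) is presumably:

$$
\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial z^l_j}{\partial w^l_{jk}}\,\delta^l_j \tag{5}
$$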
Let’s see if we can compute (5) when 𝑙 is L, namely, the derivative of cost function with respect to weights of the output layer.
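Setting 𝑙 = L, (6) should be:

$$
\frac{\partial C}{\partial w^L_{jk}} = \frac{\partial z^L_j}{\partial w^L_{jk}}\,\delta^L_j \tag{6}
$$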
Let’s see if we can find out what the first term of (6) is. Recall that the notation z is the weighted sum from the previous layer. In the language of math, z can be written as follows:
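(7), in the notation defined above, is likely:

$$
z^l_j = \sum_k w^l_{jk}\,a^{l-1}_k \tag{7}
$$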
Therefore, the first term of (6) gives us this:
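Since the sum in (7) depends on this particular weight only through the single term containing it, (8) should be:

$$
\frac{\partial z^L_j}{\partial w^L_{jk}} = a^{L-1}_k \tag{8}
$$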
We successfully computed the first term of (6)! Next, let’s find out what the second term δ is going to be when 𝑙 = L. From the expression (4) we get this:
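(9) is presumably (4) written out for the output layer:

$$
\delta^L_j = \frac{\partial a^L_j}{\partial z^L_j}\,\frac{\partial C}{\partial a^L_j} \tag{9}
$$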
Recall the notation 𝑎 simply means the sigmoid of z. Therefore, 𝑎 can be written as such:
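So (10) is simply:

$$
a^L_j = \sigma\bigl(z^L_j\bigr) \tag{10}
$$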
Thus, the first term of (9), namely, the derivative of 𝑎 with respect to z, is just the derivative of the sigmoid function.
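That is, (11) should read:

$$
\frac{\partial a^L_j}{\partial z^L_j} = \sigma'\bigl(z^L_j\bigr) = \sigma\bigl(z^L_j\bigr)\Bigl(1-\sigma\bigl(z^L_j\bigr)\Bigr) \tag{11}
$$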
Here we used the fact σ’(z)=σ(z)(1-σ(z)). The proof is shown below:
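The proof (12) is a short calculation from the definition of the sigmoid:

$$
\sigma'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = \frac{e^{-z}}{\bigl(1+e^{-z}\bigr)^2} = \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = \sigma(z)\bigl(1-\sigma(z)\bigr) \tag{12}
$$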
Now let’s get back on track. We got the first term of (9), so let’s see what the second term, the derivative of the cost with respect to 𝑎, is going to be. Recall that we already defined the cost in (1). The hypothesis in this case is simply the sigmoid function. So the cost function can also be written as follows:
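Writing the cost of a single sample in terms of the output activations, (13) is likely:

$$
C = -\sum_{k}\Bigl[y_k \log a^L_k + \bigl(1-y_k\bigr)\log\bigl(1-a^L_k\bigr)\Bigr] \tag{13}
$$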
The second term of (9) is taking the derivative of (13) with respect to 𝑎.
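Differentiating (13) with respect to a single activation, (14) should be:

$$
\frac{\partial C}{\partial a^L_j} = -\frac{y_j}{a^L_j} + \frac{1-y_j}{1-a^L_j} = \frac{a^L_j - y_j}{a^L_j\bigl(1-a^L_j\bigr)} \tag{14}
$$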
Now we got the second term of (9). Therefore, combining (11) and (14), (9) can be expressed as follows:
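The σ′ factor cancels the denominator, so (15) presumably comes out to the famously simple result:

$$
\delta^L_j = a^L_j\bigl(1-a^L_j\bigr)\cdot\frac{a^L_j - y_j}{a^L_j\bigl(1-a^L_j\bigr)} = a^L_j - y_j \tag{15}
$$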
From (8) and (15), (6) can be expressed like this:
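That is, (16) should be:

$$
\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k\bigl(a^L_j - y_j\bigr) \tag{16}
$$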
We finally got what we want when 𝑙 = L. Now, let’s generalize what we just computed. Namely, we want to know what (2) is going to be in general. From (3) and (4), we can express (2) as such:
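(17) is presumably (5) again, now for an arbitrary layer:

$$
\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial z^l_j}{\partial w^l_{jk}}\,\delta^l_j \tag{17}
$$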
Let’s figure out what the second term δ of (17) is going to be.
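Since δ is the derivative of the cost with respect to z (see (4)), and C depends on z at layer 𝑙 only through the weighted inputs of layer 𝑙+1, the chain rule should give (18) and (19):

$$
\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j}\,\frac{\partial C}{\partial z^{l+1}_k} \tag{18}
$$

$$
\delta^l_j = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j}\,\delta^{l+1}_k \tag{19}
$$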
Let’s find out what the first term of (19) is going to be. From (7) and the definition of 𝑎, we get this:
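(20) should be:

$$
z^{l+1}_k = \sum_j w^{l+1}_{kj}\,\sigma\bigl(z^l_j\bigr) \tag{20}
$$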
Taking the derivative of (20) with respect to z at layer 𝑙, we get this:
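Only the j-th term of the sum in (20) depends on z at neuron j, so (21) should be:

$$
\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj}\,\sigma'\bigl(z^l_j\bigr) \tag{21}
$$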
From (19) and (21),
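Combining the two presumably gives (22), the recursive formula for δ:

$$
\delta^l_j = \sum_k w^{l+1}_{kj}\,\delta^{l+1}_k\,\sigma'\bigl(z^l_j\bigr) \tag{22}
$$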
Now let’s take a little side track and look at (17). From (8), (17) can be written as this:
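(23) should be:

$$
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\,\delta^l_j \tag{23}
$$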
Since we already know how to compute δ, that’s it! We have successfully found the derivative of the cost with respect to each weight. Note that to compute the gradients for the weights at layer 𝑙, we need δ at the 𝑙-th layer, which in turn requires δ at the (𝑙+1)-th layer, and so on. Thus, to compute (23), we need to compute (22) for all layers. From (15) we know what δ is going to be when 𝑙=L. Therefore, with (22), we can compute δ when 𝑙=L-1, L-2, …, 2. This is where the name back propagation comes from, as it propagates back to compute the δ values. Thus, the pseudo code we saw previously can be revised as follows:
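The revised pseudo code was also an image; here is a sketch of the same procedure in NumPy, assuming sigmoid activations in every layer and the cross-entropy cost above. The function name `backprop` and the weight layout (a 0-indexed list of matrices, no bias terms, matching the notation in this post) are my own framing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(weights, x, y):
    """Per-sample gradients of the cost w.r.t. every weight.
    weights[l][j][k] connects neuron k in one layer to neuron j in the next.
    """
    # Forward pass: store z and a for every layer, as in the code overview.
    a = x
    activations = [a]
    zs = []
    for W in weights:
        z = W @ a
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Output layer: delta^L = a^L - y, as in (15).
    delta = activations[-1] - y
    grads = [None] * len(weights)
    grads[-1] = np.outer(delta, activations[-2])   # (23) at l = L

    # Propagate delta backwards through the layers, as in (22),
    # and read off the gradient at each layer with (23).
    for l in range(len(weights) - 2, -1, -1):
        sp = sigmoid(zs[l]) * (1 - sigmoid(zs[l]))  # sigma'(z^l)
        delta = (weights[l + 1].T @ delta) * sp
        grads[l] = np.outer(delta, activations[l])
    return grads
```

Averaging the returned gradients over all samples gives the update direction from the code overview.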
Now let’s see if we can vectorize these expressions to make them cleaner. First, (22) can be written as such with vectorization:
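With ⊙ denoting the element-wise product, (24) should be:

$$
\delta^l = \Bigl(\bigl(w^{l+1}\bigr)^{T}\delta^{l+1}\Bigr) \odot \sigma'\bigl(z^l\bigr) \tag{24}
$$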
Similarly, (23) can be written as follows:
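(25), the whole gradient matrix for layer 𝑙 at once, is presumably an outer product:

$$
\frac{\partial C}{\partial w^l} = \delta^l\,\bigl(a^{l-1}\bigr)^{T} \tag{25}
$$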
Phew… that was pretty much it. Big thanks to 3Blue1Brown and Michael Nielsen. Without the video and the book, it would have been difficult to understand the math behind back propagation. Also, Andrew’s video clarified the steps of how to train the neural networks, which was very helpful too.