Backpropagation step by step
Backpropagation
If you understand gradient descent for finding the minimum of a cost function, then backpropagation is going to be a cakewalk. Backpropagation is just gradient descent, except that we have to rely on the chain rule of derivatives because of the way layers are arranged in a neural network. That's it. We will also walk through a Python example of training a network to recognize digits, just to get a feel for the whole thing.
A simple example
Let's start with a simple example: a network with no hidden layers, just an input layer and an output layer. It is meant to show how the errors of the cost function get propagated back from one layer to the previous one. Then we will code an actual network with a hidden layer in between.
So suppose we have a set of MNIST images of digits. These are 28x28 pixel grayscale images of all 10 digits, and we want to classify them accordingly. Of course, we also have the labels for those images. The first question is how we are going to use these images as input to a network. The obvious way is to treat each image as an array of pixels rather than as a matrix. So an image of 28x28 pixels will be treated as an array of 784 pixel values instead. And this will be our input layer. The output layer, though, will be of size 10, since we have 10 possible labels from 0 to 9. When the network is trained and we input, say, an image of the digit zero, we would like the first neuron's value to be as close as possible to 1 and the rest to be as close as possible to 0. The image below shows this. Of course, we are using only 10 boxes to represent the 784 neurons in the input layer and only 5 boxes to represent the 10 neurons in the output layer. I did that to make the image manageable :)
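To make this concrete, here is a minimal sketch of that flattening step. The variable names are mine, and the random array is just a stand-in for a real MNIST image:

```python
import numpy as np

# Stand-in for one MNIST sample: a 28x28 grid of pixel intensities
# and the digit it shows. A real pipeline would load actual data here.
image = np.random.rand(28, 28)
label = 0

x = image.reshape(784)   # the 784-value input layer
y = np.zeros(10)         # the 10-value desired output
y[label] = 1.0           # e.g. for a zero, the first neuron should be 1
```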
Now let’s train this network.
We will initialize the weights with random values in the [-1,1] interval
We will use MSE as the cost function. Cross-entropy actually works better for classification tasks, as we've seen in the logistic regression section, but MSE is fine as well.
We know from the linear regression section that the MSE function looks like this:
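Reconstructing it here (this is the standard MSE form; the constant in front is the usual convention and the article's image may differ):

C(w) = \frac{1}{2n} \sum_{x} \big( y(x) - y'(a(x)) \big)^2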
Where y is the desired output (the label) and y'(a) is the actual output. We sum over all the data points in the training set, and that gives us an error number for the current values of the weights. We would like to minimize this error using gradient descent, i.e. to somehow change the weight values such that the error gets smaller and smaller over time. In our toy network we have just one set of weights, the ones between the input and the output layer. These are the ones that we would like to change. Now let's input an actual image of a zero while the weights are still randomly initialized. Of course, we do not expect much, and the output values won't mean anything. What we'll do is compute the cost function just for this example: we will subtract the desired value of each output neuron from the predicted value. Image below:
Let's look at the first neuron in the output layer. We want it to be 1, and in this case it is 0.3. The value of this neuron is given by the formula that we've seen in the logistic regression section:
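For the first output neuron that is (biases omitted, and with w_{j0} denoting the weight from input pixel j to this neuron, a labeling I'm assuming since the diagram isn't reproduced here):

a_0 = \sigma\Big( \sum_{j=0}^{783} image[j] \cdot w_{j0} \Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}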
Image below:
We now know that to minimize the cost function we need to run the gradient descent algorithm. And for that we need to compute the partial derivative of the cost function w.r.t. each weight w and then update those weights accordingly. So let's just do that. Computing the derivative of C w.r.t. w00 gives us the following, using the chain rule:
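Spelled out, writing z_0 for the weighted sum above and a_0 = \sigma(z_0) for the activation:

\frac{\partial C}{\partial w_{00}} = \frac{\partial C}{\partial a_0} \cdot \frac{\partial a_0}{\partial z_0} \cdot \frac{\partial z_0}{\partial w_{00}}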
Let's unpack this. The cost function for just one example is going to be:
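With a_0 the predicted value and y_0 the desired one, and looking only at the first output neuron for now:

C_0 = \frac{1}{2} (a_0 - y_0)^2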
Differentiating this with respect to the predicted value gives us (predictedValue − desiredValue), because:
But predictedValue itself is a sigmoid. So according to the chain rule, if f(x) = g(h(x)), then f′(x) = g′(h(x)) · h′(x).
So we have to differentiate the sigmoid as well. The derivative of the sigmoid, as we've seen, is:
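That derivative is the well-known identity:

\sigma'(x) = \sigma(x)\,(1 - \sigma(x))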
But again, the sigmoid here is not a function of just one variable x. Its argument is the sum of the image values multiplied by their weights. So we have to apply the chain rule again, and the derivative of the sigmoid term can be rewritten as follows:
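Applied to our neuron (again with z_0 = \sum_j image[j] \cdot w_{j0}):

\frac{\partial a_0}{\partial w_{00}} = \sigma(z_0)\,(1 - \sigma(z_0)) \cdot \frac{\partial z_0}{\partial w_{00}}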
Since we are computing the partial derivative w.r.t. w00, the derivatives of all the terms in the weighted sum except the first one, image[0]·w00, vanish. So we're left with the last chain rule that we need to apply, for a function like this:
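That last derivative is just the pixel value, so the whole chain collapses to (a reconstruction consistent with the labels introduced above):

\frac{\partial z_0}{\partial w_{00}} = image[0]

\frac{\partial C_0}{\partial w_{00}} = (a_0 - y_0) \cdot \sigma(z_0)\,(1 - \sigma(z_0)) \cdot image[0]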
In general, we will refer to the weighted sum going into a neuron i as zi:
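That is, with x_j denoting the values feeding into neuron i (biases omitted, since this walkthrough doesn't use them):

z_i = \sum_j w_{ji} \, x_j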
And the sigmoid of that weighted sum is the neuron's activation, which we'll denote by ai = σ(zi).
So, in a nutshell, backpropagation performs gradient descent to minimize the cost function, but because of the way layers in a neural net are arranged, which makes the network a composition of functions, backpropagation is nothing more than applying the chain rule over and over in order to get the value of the derivative of the cost function with respect to a weight deep down in the layers of the net.
This picture should explain the chains used in our previous example:
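Putting the whole chain into code, here is a minimal numpy sketch of that gradient for the no-hidden-layer network. The function and variable names are my own, not the article's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights between the 784 inputs and the 10 outputs,
# randomly initialized in [-1, 1] as described above.
W = np.random.uniform(-1, 1, (784, 10))

def gradient(x, y, W):
    """dC/dW for one example: dC/da * da/dz * dz/dW, per output neuron."""
    z = x @ W                       # weighted sums z, one per output neuron
    a = sigmoid(z)                  # activations (the predicted values)
    delta = (a - y) * a * (1 - a)   # (a - y) is dC/da, a(1 - a) is da/dz
    return np.outer(x, delta)       # dz/dW contributes the input values

# One gradient descent step with learning rate 0.1 would then be:
# W -= 0.1 * gradient(x, y, W)
```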
Adding a hidden layer
Let's now add a hidden layer to this network. The question now is how we are going to compute the derivatives.
So trying to compute the derivative of the cost function w.r.t. w01 is exactly the process that we've just gone through in the previous section. But now we have a hidden layer, and some weights are "behind" that layer. Look at weight w00 in the image above. Let's try to compute the derivative of the cost function with respect to that weight and see what we get. Let's pretend that we used an image of the digit zero as input and that we got some values on each neuron, as in the image below:
The cost function for one example is as in the previous section:
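In symbols, summed over the output neurons:

C = \frac{1}{2} \sum_i (a_i - y_i)^2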
Here y is the desired output of the output neurons and a is the activation of those neurons. These are vectors, since we have more than one output neuron. The cost function on just the first neuron is:
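That is, for the first output neuron:

C_0 = \frac{1}{2} (a_0 - y_0)^2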
and the derivative of the cost on just this neuron w.r.t. a0 is
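which, with the squared-error cost above, works out to:

\frac{\partial C_0}{\partial a_0} = a_0 - y_0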
But a0 is the sigmoid of the weighted sum of the hidden neurons' values with their respective weights. If we compute the derivative of C0 with respect to h0, we get:
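Writing w'00 for the hidden-to-output weight connecting h0 to a0 (a label I'm assuming, since the diagram isn't reproduced here), and using σ'(z) = σ(z)(1 − σ(z)):

\frac{\partial C_0}{\partial h_0} = (a_0 - y_0) \cdot a_0 (1 - a_0) \cdot w'_{00}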
But h0, just like a0, is the sigmoid of a weighted sum, this time over the first layer. If we go further and compute the derivative of C0 with respect to w00, we get this formula:
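With i0 the first input value and the same assumed labels:

\frac{\partial C_0}{\partial w_{00}} = (a_0 - y_0) \cdot a_0(1 - a_0) \cdot w'_{00} \cdot h_0 (1 - h_0) \cdot i_0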
So are we done? Not quite. If you look once again at the picture above, you'll see that w00 influences not only C0 but also the costs of the other output neurons. Look at the image below:
Following the logic described earlier, we also need to compute the derivatives of C1 and C2 with respect to w00. They look like this:
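Under the same assumed labeling (w'01 and w'02 connecting h0 to the second and third output neurons):

\frac{\partial C_1}{\partial w_{00}} = (a_1 - y_1) \cdot a_1(1 - a_1) \cdot w'_{01} \cdot h_0(1 - h_0) \cdot i_0

\frac{\partial C_2}{\partial w_{00}} = (a_2 - y_2) \cdot a_2(1 - a_2) \cdot w'_{02} \cdot h_0(1 - h_0) \cdot i_0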
We observe that the last terms, h0(1 − h0)·i0, are common in all three derivatives. So we can generalize the formula like this:
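A reconstruction consistent with the three derivatives above:

\frac{\partial C}{\partial w_{00}} = \Big[ \sum_{k=0}^{p-1} (a_k - y_k) \, a_k (1 - a_k) \, w'_{0k} \Big] \cdot h_0 (1 - h_0) \cdot i_0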
Where p is the number of neurons in the output layer.
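To tie the section together, here is a sketch of one training step for the full 784 → hidden → 10 network in numpy. The hidden size of 30 and all names are my choices for illustration, not the article's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 784 inputs -> 30 hidden neurons -> 10 outputs.
W1 = np.random.uniform(-1, 1, (784, 30))   # input  -> hidden weights
W2 = np.random.uniform(-1, 1, (30, 10))    # hidden -> output weights

def train_step(x, y, W1, W2, lr=0.1):
    # forward pass
    h = sigmoid(x @ W1)                          # hidden activations
    a = sigmoid(h @ W2)                          # output activations
    # backward pass: the same chain rule, one layer at a time
    delta_out = (a - y) * a * (1 - a)            # error signal at the output
    delta_hid = (W2 @ delta_out) * h * (1 - h)   # summed over output neurons,
                                                 # as in the generalized formula
    W2 -= lr * np.outer(h, delta_out)
    W1 -= lr * np.outer(x, delta_hid)
    return W1, W2
```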
Deep learning, vanishing gradient and RELU
The term deep learning basically means that we have more than one hidden layer in a network. Any network with more than one hidden layer is considered a deep network. That's the whole mystery behind the term. But the question is: does adding more layers to a network actually work? Looking at some recent examples, it definitely does. But not with the sigmoid as an activation function.
Why sigmoid is not a good choice as an activation function
Let's suppose we have this neural network with one input layer, two hidden layers and one output layer, where all the neurons are sigmoid neurons.
Now let's look at a scenario where we want to compute the derivative of the cost function with respect to w0. We start with randomly initialized weights, as in the image below. And let's suppose that we want the output to be 1.
The derivative of C w.r.t. w0 will look like this:
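Assuming one neuron per layer, as the chain of weights w0, w1, w2 suggests, and writing h1, h2 for the hidden activations and a for the output, the chain is a product of sigmoid derivatives and weights:

\frac{\partial C}{\partial w_0} = (a - y) \cdot a(1-a) \cdot w_2 \cdot h_2(1 - h_2) \cdot w_1 \cdot h_1(1 - h_1) \cdot x

Each sigmoid-derivative factor is at most 0.25, so three of them multiplied together already cap the result at roughly 0.016 times the weights and the input, which is why the number below comes out so tiny.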
Let’s do the computation given the weight values from the above image.
And the result of this calculation is 0.000008821075, which is very, very small. When training, we will update the value of w0 with a fraction of this value, because we also have to take the learning rate into consideration. The result will be an even smaller value, so that weight will "learn very slowly", so to speak. Adding more layers with sigmoid neurons will therefore not make the network work better. In fact, it will make it worse.
Using RELU instead
RELU stands for rectified linear unit. It is a very simple activation function, defined like this:
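That is:

RELU(x) = \max(0, x)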
Yes, the simpler the better. Unlike the sigmoid, which has a rather complicated formula, RELU is very simple. The function's graph looks like this:
Now let's compute the derivative w.r.t. w0 for the same network from above, but with RELU as the activation function for the neurons. These are now the values of the neurons when we use RELU:
The derivative is this:
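Since the derivative of RELU is 1 wherever its input is positive, every sigmoid-derivative factor from the previous version is replaced by 1 (assuming, as in the article's example here, that all pre-activations are positive):

\frac{\partial C}{\partial w_0} = (a - y) \cdot 1 \cdot w_2 \cdot 1 \cdot w_1 \cdot 1 \cdot x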
Since the weights were all positive and the input itself is positive, the results of the neurons are positive as well, and in this case RELU is the identity function. The derivative of the identity function is 1, so those factors no longer shrink the result of the computation, as was the case with the sigmoid derivatives, which were driving the value very, very low.
Leaky RELU for even better results
The example above with RELU looks nice because all the weights and the input were positive. But when we initialize weights, we usually draw them from a normal distribution with mean zero and standard deviation 1, so roughly half of them will be negative. Let's take an example where one of the weights is negative.
The derivative will be:
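If the negative weight makes the output neuron's input negative, the RELU derivative at that neuron is 0 and the whole product collapses:

\frac{\partial C}{\partial w_0} = (a - y) \cdot 0 \cdot w_2 \cdot \ldots = 0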
The derivative is zero because one of the weights, being negative, produced a negative input to the RELU of the output neuron. And when the input is negative, the RELU is the constant function 0, whose derivative is 0. That cancels out the whole computation, and the weight in this case will not be updated at all. This situation is known as a dead RELU.
To mitigate this we can use Leaky RELU:
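A common definition (the 0.01 slope is the usual choice; the article's constant may differ):

LeakyRELU(x) = \begin{cases} x & x > 0 \\ 0.01\,x & x \le 0 \end{cases}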
And here's what the graph looks like:
When x is positive we get the identity function, just like with RELU. But when x is negative we get a small negative value rather than 0. This way we can avoid dead RELUs in the network.
Let's run the previous example, but with leaky RELU as the activation function. This is the definition of the function:
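A minimal Python version, assuming the 0.01 slope from above:

```python
def leaky_relu(x):
    # identity for positive inputs, small slope for negative ones
    return x if x > 0 else 0.01 * x

def leaky_relu_derivative(x):
    # never exactly zero, so no dead neurons
    return 1.0 if x > 0 else 0.01
```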
And these are the neurons' resulting values:
The derivative for the first weight is this:
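The factor that was 0 with plain RELU becomes 0.01, so the gradient is small but nonzero and the weight still gets updated:

\frac{\partial C}{\partial w_0} = (a - y) \cdot 0.01 \cdot w_2 \cdot \ldots \neq 0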
And I guess that's all for today.