Backpropagation step by step
Backpropagation
If you understand gradient descent for finding the minimum of a cost function, then backpropagation is going to be a cakewalk. Backpropagation is just gradient descent, except that we have to rely on the chain rule of derivatives because of the way layers are arranged in a neural network. That's it. We will also walk through a Python example of training a network to recognize digits, just to get a feel for the whole thing.
A simple example
Let's start with a simple example: a network with no hidden layers, just an input layer and an output layer. It is meant to show how the errors of the cost function get propagated back from one layer to the previous one. Then we will code an actual network with a hidden layer in between.
So suppose we have a set of MNIST images of digits. These are 28x28 pixel grayscale images of all 10 digits, and we want to classify them accordingly. Of course, we also have the labels for those images. The first question is how we are going to use these images as input to a network. The obvious way is to treat each image as an array of pixels rather than as a matrix. So an image of 28x28 pixels will be treated as an array of 784 pixel values instead. And this will be our input layer. The output layer, though, will be of size 10, since we have 10 possible labels from 0 to 9. When the network is trained and we input, say, an image of the digit zero, we would like the first neuron's value to be as close as possible to 1 and the rest to be as close as possible to 0. The image below shows this. Of course, we are using only 10 boxes to represent the 784 neurons in the input layer and only 5 boxes to represent the 10 neurons in the output layer. I did that to make the image manageable :)
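To make this concrete, here is a minimal sketch of that flattening step. The variable names are mine, and the random array is just a stand-in for a real MNIST image:

```python
import numpy as np

# Stand-in for one MNIST sample: a 28x28 grid of pixel intensities
# and the digit it shows. A real pipeline would load actual data here.
image = np.random.rand(28, 28)
label = 0

x = image.reshape(784)   # the 784-value input layer
y = np.zeros(10)         # the 10-value desired output
y[label] = 1.0           # e.g. for a zero, the first neuron should be 1
```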
Now let’s train this network.
We will initialize the weights with random values in the [-1,1] interval
We will use MSE as the cost function. Cross-entropy actually works better for classification tasks, as we've seen in the logistic regression section, but MSE is fine as well.
We know from the linear regression section that the MSE function looks like this:
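Reconstructing it here (this is the standard MSE form; the constant in front is the usual convention and the article's image may differ):

C(w) = \frac{1}{2n} \sum_{x} \big( y(x) - y'(a(x)) \big)^2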
Where y is the desired output (the label) and y'(a) is the actual output. We sum over all the data points in the training set, and that gives us an error number for the current values of the weights. We would like to minimize this error using gradient descent, i.e. to somehow change the weight values such that the error gets smaller and smaller over time. In our toy network we have just one set of weights, the ones between the input and the output layer. These are the ones that we would like to change. Now let's input an actual image of a zero while the weights are still randomly initialized. Of course, we do not expect much, and the output values won't mean anything. What we'll do is compute the cost function just for this example: we will subtract the desired value of each output neuron from the predicted value. Image below:
Let's look at the first neuron in the output layer. We want it to be 1, and in this case it is 0.3. The value of this neuron is given by the formula that we've seen in the logistic regression section:
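For the first output neuron that is (biases omitted, and with w_{j0} denoting the weight from input pixel j to this neuron, a labeling I'm assuming since the diagram isn't reproduced here):

a_0 = \sigma\Big( \sum_{j=0}^{783} image[j] \cdot w_{j0} \Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}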
Image below:
We now know that to minimize the cost function we need to run the gradient descent algorithm. And for that we need to compute the partial derivative of the cost function w.r.t. each weight w and then update those weights accordingly. So let's just do that. Computing the derivative of C w.r.t. w00 gives us the following, using the chain rule:
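Spelled out, writing z_0 for the weighted sum above and a_0 = \sigma(z_0) for the activation:

\frac{\partial C}{\partial w_{00}} = \frac{\partial C}{\partial a_0} \cdot \frac{\partial a_0}{\partial z_0} \cdot \frac{\partial z_0}{\partial w_{00}}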
Let's unpack this. The cost function for just one example is going to be:
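With a_0 the predicted value and y_0 the desired one, and looking only at the first output neuron for now:

C_0 = \frac{1}{2} (a_0 - y_0)^2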
Differentiating this with respect to the predicted value gives us (predictedValue − desiredValue), because:
But predictedValue itself is a sigmoid. So according to the chain rule, if f(x) = g(h(x)), then f′(x) = g′(h(x)) · h′(x).
So we have to differentiate the sigmoid as well. The derivative of the sigmoid, as we've seen, is:
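That derivative is the well-known identity:

\sigma'(x) = \sigma(x)\,(1 - \sigma(x))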
But again, the sigmoid here is not a function of just one variable x. Its argument is the sum of the image values multiplied by their weights. So we have to apply the chain rule again, and the derivative of the sigmoid term can be rewritten as follows:
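Applied to our neuron (again with z_0 = \sum_j image[j] \cdot w_{j0}):

\frac{\partial a_0}{\partial w_{00}} = \sigma(z_0)\,(1 - \sigma(z_0)) \cdot \frac{\partial z_0}{\partial w_{00}}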
Since we are computing the partial derivative w.r.t. w00, the derivatives of all the terms in the weighted sum except the first one, image[0]·w00, vanish. So we're left with the last chain rule that we need to apply, for a function like this:
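That last derivative is just the pixel value, so the whole chain collapses to (a reconstruction consistent with the labels introduced above):

\frac{\partial z_0}{\partial w_{00}} = image[0]

\frac{\partial C_0}{\partial w_{00}} = (a_0 - y_0) \cdot \sigma(z_0)\,(1 - \sigma(z_0)) \cdot image[0]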
In general, we will refer to the weighted sum going into a neuron i as zi:
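That is, with x_j denoting the values feeding into neuron i (biases omitted, since this walkthrough doesn't use them):

z_i = \sum_j w_{ji} \, x_j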
And the sigmoid of that weighted sum is the neuron's activation, which we'll denote by ai = σ(zi).
So, in a nutshell, backpropagation performs gradient descent to minimize the cost function, but because of the way layers in a neural net are arranged, which makes the network a composition of functions, backpropagation is nothing more than applying the chain rule over and over in order to get the value of the derivative of the cost function with respect to a weight deep down in the layers of the net.
This picture should explain the chains used in our previous example:
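Putting the whole chain into code, here is a minimal numpy sketch of that gradient for the no-hidden-layer network. The function and variable names are my own, not the article's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights between the 784 inputs and the 10 outputs,
# randomly initialized in [-1, 1] as described above.
W = np.random.uniform(-1, 1, (784, 10))

def gradient(x, y, W):
    """dC/dW for one example: dC/da * da/dz * dz/dW, per output neuron."""
    z = x @ W                       # weighted sums z, one per output neuron
    a = sigmoid(z)                  # activations (the predicted values)
    delta = (a - y) * a * (1 - a)   # (a - y) is dC/da, a(1 - a) is da/dz
    return np.outer(x, delta)       # dz/dW contributes the input values

# One gradient descent step with learning rate 0.1 would then be:
# W -= 0.1 * gradient(x, y, W)
```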
Adding a hidden layer
Let's now add a hidden layer to this network. The question now is how we are going to compute the derivatives.
So trying to compute the derivative of the cost function w.r.t. w01 is exactly the process that we've just gone through in the previous section. But now we have a hidden layer, and some weights are "behind" that layer. Look at weight w00 in the image above. Let's try to compute the derivative of the cost function with respect to that weight and see what we get. Let's pretend that we used an image of the digit zero as input and that we got some values on each neuron, as in the image below:
The cost function for one example is as in the previous section:
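In symbols, summed over the output neurons:

C = \frac{1}{2} \sum_i (a_i - y_i)^2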
Here y is the desired output of the output neurons and a is the activation of those neurons. These are vectors, since we have more than one output neuron. The cost function on just the first neuron is:
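That is, for the first output neuron:

C_0 = \frac{1}{2} (a_0 - y_0)^2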
and the derivative of the cost on just this neuron w.r.t. a0 is
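which, with the squared-error cost above, works out to:

\frac{\partial C_0}{\partial a_0} = a_0 - y_0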
But a0 is the sigmoid of the weighted sum of the hidden neurons' values with their respective weights. If we compute the derivative of C0 with respect to h0, we get:
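Writing w'00 for the hidden-to-output weight connecting h0 to a0 (a label I'm assuming, since the diagram isn't reproduced here), and using σ'(z) = σ(z)(1 − σ(z)):

\frac{\partial C_0}{\partial h_0} = (a_0 - y_0) \cdot a_0 (1 - a_0) \cdot w'_{00}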
But h0, just like a0, is the sigmoid of a weighted sum, this time over the first layer. If we go further and compute the derivative of C0 with respect to w00, we get this formula:
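With i0 the first input value and the same assumed labels:

\frac{\partial C_0}{\partial w_{00}} = (a_0 - y_0) \cdot a_0(1 - a_0) \cdot w'_{00} \cdot h_0 (1 - h_0) \cdot i_0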
So are we done? Not quite. If you look once again at the picture above, you'll see that w00 influences not only C0 but also the costs of the other output neurons. Look at the image below:
Following the logic described earlier, we also need to compute the derivatives of C1 and C2 with respect to w00. They look like this:
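Under the same assumed labeling (w'01 and w'02 connecting h0 to the second and third output neurons):

\frac{\partial C_1}{\partial w_{00}} = (a_1 - y_1) \cdot a_1(1 - a_1) \cdot w'_{01} \cdot h_0(1 - h_0) \cdot i_0

\frac{\partial C_2}{\partial w_{00}} = (a_2 - y_2) \cdot a_2(1 - a_2) \cdot w'_{02} \cdot h_0(1 - h_0) \cdot i_0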
We observe that the last terms, h0(1 − h0)·i0, are common in all three derivatives. So we can generalize the formula like this:
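A reconstruction consistent with the three derivatives above:

\frac{\partial C}{\partial w_{00}} = \Big[ \sum_{k=0}^{p-1} (a_k - y_k) \, a_k (1 - a_k) \, w'_{0k} \Big] \cdot h_0 (1 - h_0) \cdot i_0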
Where p is the number of neurons in the output layer.
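To tie the section together, here is a sketch of one training step for the full 784 → hidden → 10 network in numpy. The hidden size of 30 and all names are my choices for illustration, not the article's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 784 inputs -> 30 hidden neurons -> 10 outputs.
W1 = np.random.uniform(-1, 1, (784, 30))   # input  -> hidden weights
W2 = np.random.uniform(-1, 1, (30, 10))    # hidden -> output weights

def train_step(x, y, W1, W2, lr=0.1):
    # forward pass
    h = sigmoid(x @ W1)                          # hidden activations
    a = sigmoid(h @ W2)                          # output activations
    # backward pass: the same chain rule, one layer at a time
    delta_out = (a - y) * a * (1 - a)            # error signal at the output
    delta_hid = (W2 @ delta_out) * h * (1 - h)   # summed over output neurons,
                                                 # as in the generalized formula
    W2 -= lr * np.outer(h, delta_out)
    W1 -= lr * np.outer(x, delta_hid)
    return W1, W2
```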
Deep learning, vanishing gradient and RELU
The term deep learning basically means that we have more than one hidden layer in a network. Any network with more than one hidden layer is considered a deep network. That's the whole mystery behind the term. But the question is: does adding more layers to a network actually work? Looking at some recent examples, it definitely does. But not with the sigmoid as an activation function.
Why sigmoid is not a good choice as an activation function
Let's suppose we have this neural network with one input layer, two hidden layers and one output layer, where all the neurons are sigmoid neurons.
Now let's look at a scenario where we want to compute the derivative of the cost function with respect to w0. We start with randomly initialized weights, as in the image below. And let's suppose that we want the output to be 1.
The derivative of C w.r.t. w0 will look like this:
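Assuming one neuron per layer, as the chain of weights w0, w1, w2 suggests, and writing h1, h2 for the hidden activations and a for the output, the chain is a product of sigmoid derivatives and weights:

\frac{\partial C}{\partial w_0} = (a - y) \cdot a(1-a) \cdot w_2 \cdot h_2(1 - h_2) \cdot w_1 \cdot h_1(1 - h_1) \cdot x

Each sigmoid-derivative factor is at most 0.25, so three of them multiplied together already cap the result at roughly 0.016 times the weights and the input, which is why the number below comes out so tiny.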
Let’s do the computation given the weight values from the above image.
And the result of this calculation is 0.000008821075, which is very, very small. When training, we will update the value of w0 with a fraction of this value, because we also have to take the learning rate into consideration. The result will be an even smaller value, so that weight will "learn very slowly", so to speak. Adding more layers with sigmoid neurons will therefore not make the network work better. In fact, it will make it worse.
Using RELU instead
RELU stands for rectified linear unit. It is a very simple activation function, defined like this:
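That is:

RELU(x) = \max(0, x)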
Yes, the simpler the better. Unlike the sigmoid, which has a rather complicated formula, RELU is very simple. The function's graph looks like this:
Now let's compute the derivative w.r.t. w0 for the same network from above, but with RELU as the activation function for the neurons. These are now the values of the neurons when we use RELU:
The derivative is this:
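Since the derivative of RELU is 1 wherever its input is positive, every sigmoid-derivative factor from the previous version is replaced by 1 (assuming, as in the article's example here, that all pre-activations are positive):

\frac{\partial C}{\partial w_0} = (a - y) \cdot 1 \cdot w_2 \cdot 1 \cdot w_1 \cdot 1 \cdot x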
Since the weights were all positive and the input itself is positive, the results of the neurons are positive as well, and in this case RELU is the identity function. The derivative of the identity function is 1, so those factors no longer shrink the result of the computation, as was the case with the sigmoid derivatives, which were driving the value very, very low.
Leaky RELU for even better results
The example above with RELU looks nice because all the weights and the input were positive. But when we initialize weights, we usually draw them from a normal distribution with mean zero and standard deviation 1, so roughly half of them will be negative. Let's take an example where one of the weights is negative.
The derivative will be:
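If the negative weight makes the output neuron's input negative, the RELU derivative at that neuron is 0 and the whole product collapses:

\frac{\partial C}{\partial w_0} = (a - y) \cdot 0 \cdot w_2 \cdot \ldots = 0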
The derivative is zero because one of the weights, being negative, produced a negative input to the RELU of the output neuron. And when the input is negative, the RELU is the constant function 0, whose derivative is 0. That cancels out the whole computation, and the weight in this case will not be updated at all. This situation is known as a dead RELU.
To mitigate this we can use Leaky RELU:
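A common definition (the 0.01 slope is the usual choice; the article's constant may differ):

LeakyRELU(x) = \begin{cases} x & x > 0 \\ 0.01\,x & x \le 0 \end{cases}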
And here's what the graph looks like:
When x is positive we get the identity function, just like with RELU. But when x is negative we get a small negative value rather than 0. This way we can avoid dead RELUs in the network.
Let's run the previous example, but with leaky RELU as the activation function. This is the definition of the function:
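A minimal Python version, assuming the 0.01 slope from above:

```python
def leaky_relu(x):
    # identity for positive inputs, small slope for negative ones
    return x if x > 0 else 0.01 * x

def leaky_relu_derivative(x):
    # never exactly zero, so no dead neurons
    return 1.0 if x > 0 else 0.01
```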
And these are the neurons' resulting values:
The derivative for the first weight is this:
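The factor that was 0 with plain RELU becomes 0.01, so the gradient is small but nonzero and the weight still gets updated:

\frac{\partial C}{\partial w_0} = (a - y) \cdot 0.01 \cdot w_2 \cdot \ldots \neq 0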
And I guess that's all for today.