Backpropagation

Jorge Leonel
5 min read · Oct 5, 2018

(header image credits: V.Lavrenko)

Backpropagation is a popular method for training artificial neural networks, especially deep neural networks.

Backpropagation is needed to calculate the gradient of the loss function with respect to the weights of the weight matrices. The weights of the neurons (ie nodes) of the neural network are then adjusted using that gradient, by means of a gradient descent optimization algorithm. Backpropagation is also called backward propagation of errors.
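Concretely, each weight is nudged a small step against its own slice of that gradient. Written out (with η a learning-rate symbol introduced here for illustration, not a value from the text):

wij ← wij − η · ∂E/∂wij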

A metaphor might help: picture yourself being dropped on a mountain by a helicopter at night and/or in fog, not necessarily at the top. Let’s also imagine that this mountain is on an island and you want to reach sea level.

  • You have to go down, but you can hardly see anything, maybe just a few meters ahead; you cannot see the path. You can use the method of gradient descent: examine the steepness at your current position and proceed in the direction of the steepest descent.
  • You take only a few steps and then stop to reorient yourself. Then you apply the same procedure again, ie you look for the direction of steepest descent.

Keep going like this and you will eventually arrive at a position where there is no further descent (ie every direction leads upwards). You may have reached the deepest level (the global minimum), but you could just as well be stuck in a basin, ie a local minimum: depending on where you were dropped, you either make it all the way down to sea level or get trapped in such a local minimum.

In summary, if you are dropped many times at random places on this theoretical island, you will eventually find a way down to sea level. This is essentially what we do when we train a neural network.

The actual backpropagation procedure

Assuming we start with a simple (linear) neural network:
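In such a network each output is simply a weighted sum of the nodes feeding it. For the output node o1 examined below, which is reached through the weights w11, w21, w31 and w41, this would read (h1…h4 are names assumed here for the outputs of those four feeding nodes):

o1 = w11·h1 + w21·h2 + w31·h3 + w41·h4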

with the following example values associated with the weights:

We have labels, i.e. target or desired values t for each output value o.
The error is the difference between the target and the actual output:
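Written out per output node, with tj the target and oj the actual output of node j:

ej = tj − oj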

We will later use a squared error function, because it has better characteristics for the algorithm.

We will have a look at the output value o1, which depends on the values w11, w21, w31 and w41. Let’s assume the calculated value (o1) is 0.92 and the desired value (t1) is 1. In this case the error is:
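e1 = t1 − o1 = 1 − 0.92 = 0.08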

Depending on this error, we have to adjust the weights on the incoming connections accordingly. We have four weights, so we could spread the error evenly. However, it makes more sense to do it proportionally, according to the weight values. This means we can calculate the fraction of the error e1 attributed to w11 as:
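e1 · w11 / (w11 + w21 + w31 + w41)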

This means in our example:

The total error in our weight matrix between the hidden and the output layer looks like this:
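Assuming, for illustration, two output nodes with errors e1 and e2 (and writing eh1…eh4 for the error portions that land on the four hidden nodes), that matrix takes this form:

[ eh1 ]   [ w11/s1   w12/s2 ]
[ eh2 ] = [ w21/s1   w22/s2 ]  ·  [ e1 ]
[ eh3 ]   [ w31/s1   w32/s2 ]     [ e2 ]
[ eh4 ]   [ w41/s1   w42/s2 ]

with s1 = w11 + w21 + w31 + w41 and s2 = w12 + w22 + w32 + w42.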

The denominator in the left matrix is always the same (scaling factor). We can drop it so that the calculation gets simpler:
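Under the same two-output illustration, the simplified version is just the weight matrix, read backwards from the outputs to the hidden nodes, applied to the error vector:

[ eh1 ]   [ w11   w12 ]
[ eh2 ] = [ w21   w22 ]  ·  [ e1 ]
[ eh3 ]   [ w31   w32 ]     [ e2 ]
[ eh4 ]   [ w41   w42 ]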

This example has demonstrated backpropagation for a basic scenario of a linear neural network.

________________________________________________________________

Now let's review backpropagation for a NON-linear neural network (ie with an activation function).

The derivative of the error function describes the slope. As we wish to descend, the derivative describes how the error E changes as the weight w changes:
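∂E/∂wij

(the partial derivative of E with respect to the individual weight wij; this is the quantity we need for the update rule given earlier).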

Given that the error function E over all the output nodes oj (j = 1, …, n), where n is the number of output nodes, is:
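E = ½ · Σj (tj − oj)²

(the factor ½ is a common convention, assumed here because it cancels neatly when we differentiate)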

we can insert this into the derivative above:
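∂E/∂wij = ∂/∂wij ( ½ · Σk (tk − ok)² )

(k runs over the output nodes; the summation index is switched to k here so it is not confused with the fixed j in wij).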

We can calculate the error for each output node independently of the others, which lets us get rid of the sum. The error for a single node j, for example, is:
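Ej = ½ · (tj − oj)²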

Applying the chain rule from calculus to differentiate the previous term:
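∂Ej/∂wij = ∂Ej/∂oj · ∂oj/∂wij = −(tj − oj) · ∂oj/∂wij

where oj is the activation function applied to the weighted sum of the node’s inputs, oj = σ( Σi wij · oi ), with oi denoting the output of node i in the preceding layer (a notation assumed here).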

Assuming a Sigmoid activation function, which is straightforward to differentiate:
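σ(x) = 1 / (1 + e^(−x)),     dσ/dx = σ(x) · (1 − σ(x))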

takes us to the final complete form — the essential neural network training math:
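∂E/∂wij = −(tj − oj) · oj · (1 − oj) · oi

where oi is the output of the preceding node i. The corresponding gradient-descent weight update is Δwij = η · (tj − oj) · oj · (1 − oj) · oi, with η the learning rate introduced earlier.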

In summary:

  • Error is calculated between the expected outputs and the outputs forward propagated from the network.
  • These errors are then propagated backward through the network from the output layer to the hidden layer, assigning blame for the error and updating weights as they go.

Here's the Backpropagation algorithm in pseudocode:
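Below is a minimal sketch in Python/NumPy, assuming a single hidden layer and sigmoid activations; all function and variable names (sigmoid, train, w_ih, w_ho, lr) are illustrative choices rather than anything prescribed above, and the error is propagated back with the simplified transposed-weight rule described earlier:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(inputs, targets, w_ih, w_ho, lr=0.1, epochs=1000):
    """Train a one-hidden-layer network with backpropagation.

    inputs  : array of shape (n_samples, n_in)
    targets : array of shape (n_samples, n_out)
    w_ih    : weights input -> hidden, shape (n_in, n_hidden)
    w_ho    : weights hidden -> output, shape (n_hidden, n_out)
    """
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            # forward pass
            hidden = sigmoid(x @ w_ih)        # hidden-layer outputs
            output = sigmoid(hidden @ w_ho)   # network outputs oj

            # output error: ej = tj - oj
            output_error = t - output
            # propagate the raw error back through the weights, as in the
            # simplified matrix above (exact backprop would propagate
            # output_delta instead of output_error)
            hidden_error = w_ho @ output_error

            # gradient pieces: (tj - oj) * oj * (1 - oj), and the analogue
            # for the hidden layer
            output_delta = output_error * output * (1.0 - output)
            hidden_delta = hidden_error * hidden * (1.0 - hidden)

            # gradient-descent step: w <- w + lr * delta_j * o_i
            w_ho += lr * np.outer(hidden, output_delta)
            w_ih += lr * np.outer(x, hidden_delta)
    return w_ih, w_ho
```

A real implementation would add bias terms, batching and a vectorized update, but the loop above mirrors the steps described in this post: forward pass, output error, error propagated back through the weights, sigmoid derivative, gradient-descent update.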
