Backpropagation from the beginning

Erik Hallström
8 min read · Dec 4, 2016


This article has been republished at Educaora and has also been open sourced. Any help to make it better, such as adding questions or clarifying the explanations, is greatly appreciated!

I have tried to understand backpropagation by reading some explanations, but I’ve always felt that the derivations lack some details. In this article I will try to explain it from the beginning, hopefully without leaving anything out (theory-wise, at least). Let’s get started!

Forward propagation

We have a fully-connected feed-forward neural network. It has L layers (could be any number) and any number of neurons in each layer. The activations of the neurons in layer l are stored in an activations column vector aˡ, where the superscript index denotes the layer. The connections from the neurons in layer l-1 to layer l are stored in a weight matrix Wˡ, and the biases for each neuron are stored in a bias column vector bˡ.

For a simple forward pass we have that:

aˡ = σ( Wˡ aˡ⁻¹ + bˡ )

The matrix multiplication in this formula is visualized in the figure below, where we have introduced a new vector zˡ, which is the activation before the component-wise transfer function is applied, so that aˡ = σ( zˡ ). I will call this value the “input sum” of a neuron.
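As a minimal sketch of this forward pass (the layer sizes, the sigmoid transfer function, and all variable names here are illustrative choices, not from the article):

```python
import numpy as np

def sigmoid(z):
    # Component-wise transfer function sigma
    return 1.0 / (1.0 + np.exp(-z))

# Example network: 3 inputs -> 4 hidden neurons -> 2 outputs (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal((4, 1)), rng.standard_normal((2, 1))]

def forward(x, weights, biases):
    a = x  # a^0 is the input column vector
    for W, b in zip(weights, biases):
        z = W @ a + b    # input sum: z^l = W^l a^(l-1) + b^l
        a = sigmoid(z)   # activation: a^l = sigma(z^l)
    return a

x = rng.standard_normal((3, 1))
print(forward(x, weights, biases).shape)  # (2, 1)
```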

Schematic of a fully connected layer and a matrix multiplication without transfer-function applied.

The whole network is shown below, from the input vector x to the output activation vector aᴸ. The connections leading in to a specific neuron are shown in colors in two layers:

The whole network visualized

One thing we can do, just to get a feel for it, is to write the network computation as one mathematical expression. Let’s write down the formula for calculating the n:th element of the output vector in the final layer:

Formula for the computation of the whole network.

Here we have introduced new notation, where wˡᵤᵥ denotes the connection from the v:th neuron in layer l-1 to the u:th neuron in layer l, and bˡᵤ is the bias of the u:th neuron in layer l. The expression can be a bit confusing, particularly because of all the new indices. But the biggest takeaway is that the neural network is just a mathematical function. And this function can be differentiated with respect to any variable. We will use our newly introduced notation and define an error function, or “cost” function, C using a sample of our training data, and then see how the error changes as we change our weights.
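In this notation, the pictured formula can be reconstructed as follows (the index letters are my own choices, and the nesting continues down to the input vector x):

```latex
a^{L}_{n} = \sigma\!\left( \sum_{k} w^{L}_{nk}\,
  \sigma\!\left( \sum_{j} w^{L-1}_{kj}\, \sigma\!\left( \cdots \right) + b^{L-1}_{k} \right)
  + b^{L}_{n} \right)
```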

Deriving the error

Three adjacent layers anywhere in the network are shown in the figure below. The index letters for the neurons in the layers are j, k and m respectively, and the index letter for the layer is l.

Three layers anywhere in the network, derivative is taken with respect to the weight shown in red. The middle neuron is enlarged for visualization purposes.

First calculate the input sum of a neuron k in layer l:

Then take the transfer function, it could be any function with a first derivative:

Now finally calculate the input sum of a neuron m in layer l+1.
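The three pictured formulas can be reconstructed in the index notation above as (a reconstruction from the surrounding text, not the original images):

```latex
z^{l}_{k} = \sum_{j} w^{l}_{kj}\, a^{l-1}_{j} + b^{l}_{k}, \qquad
a^{l}_{k} = \sigma\!\left(z^{l}_{k}\right), \qquad
z^{l+1}_{m} = \sum_{k} w^{l+1}_{mk}\, \sigma\!\left(z^{l}_{k}\right) + b^{l+1}_{m}
```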

Here we have gone forward one step in the layers, from the activations in layer l-1 to the input sums of neurons in layer l+1. An error function C is defined using one example from our training data, and its derivative is calculated with respect to a single weight in layer l.

You may notice the sum in the expression above. It is due to the chain rule: all contributions from the neurons in layer l+1 have to be accounted for, since their values affect the end error (their values depend on the weight that we are taking the derivative with respect to). This is visualized in the figure shown above.

One important thing to remember is that we have fixed k and j, thus we only see how the error changes when one single weight is manipulated in the calculation. All the other weights are held constant, and the derivative of a constant is simply zero. But the m index in layer l+1 is not fixed; the activations of all neurons in that layer change as we change our specified weight.
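Written out with the chain rule, the derivative described above reads (a reconstruction consistent with the definitions of the input sums):

```latex
\frac{\partial C}{\partial w^{l}_{kj}}
= \sum_{m} \frac{\partial C}{\partial z^{l+1}_{m}}\,
  \frac{\partial z^{l+1}_{m}}{\partial w^{l}_{kj}}
= \sum_{m} \frac{\partial C}{\partial z^{l+1}_{m}}\,
  w^{l+1}_{mk}\, \sigma'\!\left(z^{l}_{k}\right) a^{l-1}_{j}
```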

Error signal

Now let’s make a definition: the “error signal” of a neuron k in layer l is how much the total error changes when the input sum of the neuron is changed:

So every neuron in the whole network now has an error signal defined. But if you look at the equation above, you will see that we have already expanded this expression:

So we have a recursive formula for the error signals, using our definitions:
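In symbols, writing δˡₖ for the error signal, the definition and the recursive formula read (a reconstruction from the derivation above):

```latex
\delta^{l}_{k} \equiv \frac{\partial C}{\partial z^{l}_{k}}, \qquad
\delta^{l}_{k} = \sigma'\!\left(z^{l}_{k}\right) \sum_{m} w^{l+1}_{mk}\, \delta^{l+1}_{m}, \qquad
\frac{\partial C}{\partial w^{l}_{kj}} = \delta^{l}_{k}\, a^{l-1}_{j}
```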

You may wonder what happened to the biases; they are also “weights”, and the error function should be differentiated with respect to them as well.

So the gradient of the cost function with respect to the bias for each neuron is simply its error signal!
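This is a one-line derivation: the bias enters the input sum with coefficient 1, so

```latex
\frac{\partial C}{\partial b^{l}_{k}}
= \frac{\partial C}{\partial z^{l}_{k}}\,
  \frac{\partial z^{l}_{k}}{\partial b^{l}_{k}}
= \delta^{l}_{k} \cdot 1
= \delta^{l}_{k}
```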

Propagating backwards

In order to use this recursive formula we need to obtain the first error signal in the series, i.e. the error signal of the neurons in the final layer L. This is the starting value; once we have it, we can “propagate” backwards, calculating all the error signals.

The only derivative that has to be calculated here is the derivative of the cost function with respect to the activations of the last layer, which is the output of the network. So this will depend on the error-function we choose.
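For instance, with the quadratic cost C = ½ Σₙ (aᴸₙ − yₙ)² (my example choice here; the article leaves the cost function open), the starting error signal is:

```latex
\delta^{L}_{n}
= \frac{\partial C}{\partial a^{L}_{n}}\, \sigma'\!\left(z^{L}_{n}\right)
= \left(a^{L}_{n} - y_{n}\right) \sigma'\!\left(z^{L}_{n}\right)
```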

Matrix form

Since we can now recursively calculate all the error signals for all neurons in the network, it is possible to obtain the derivative of the cost function with respect to all weights. Training is done by subtracting from each weight a small fraction of its corresponding derivative; this is called the delta rule. Since everything is known, the network is trainable! But we would like a more efficient notation, since we previously only worked with scalars. Vector and matrix notation to the rescue! Remember that we stored all weights for layer l in a matrix Wˡ, where the weights connecting from all neurons in l-1 to a neuron in l are stored as rows? Take a look at the first figure in the article to be reminded of this. Now take the transpose of the weight matrix. That means the connections to the neurons in the layer are stored as columns. Put all the error signals for the layer in a column vector, and multiply it with the transposed weight matrix. Sorry, but new notation again: there are a total of M neurons in layer l+1 and K neurons in layer l.

The similarity sign is there because we are still lacking the multiplication by the derivative of the transfer function at the input sums in layer l. Make sure you understand this matrix multiplication; use pen and paper to sketch the calculations if possible.

Finally, using all we derived previously, we can state the formulas for calculating the gradient of the cost function with respect to the weights as follows:

The nabla symbol denotes the derivative with respect to the output activations of the final layer, and the circled dot denotes component-wise multiplication.
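Collected in one place, the matrix-form equations described in this section can be written as (a reconstruction; σ′ is applied component-wise):

```latex
\delta^{L} = \nabla_{a} C \odot \sigma'\!\left(z^{L}\right), \qquad
\delta^{l} = \left( \left(W^{l+1}\right)^{T} \delta^{l+1} \right) \odot \sigma'\!\left(z^{l}\right)

\frac{\partial C}{\partial W^{l}} = \delta^{l} \left(a^{l-1}\right)^{T}, \qquad
\frac{\partial C}{\partial b^{l}} = \delta^{l}
```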

Outline of training algorithm

Using these formulas we can write an algorithm to train the network, using a single training sample at a time.

  1. Do forward propagation: calculate the input sum and activation of each neuron by iteratively doing matrix-vector multiplications and applying the component-wise transfer function in every layer. Save the results for later.
  2. Calculate the error signal of the final layer L by obtaining the gradient of the cost function with respect to the outputs of the network. This expression will depend on the current training sample (input and output), as well as the chosen cost function.
  3. Do backwards propagation: calculate the error signals of the neurons in each layer. The input sum of each neuron is required to do this, and it is also done by iteratively computing matrix-vector multiplications (see the formula above).
  4. Calculate the derivative of the cost function with respect to the weights, the activation of each neuron is required to do this (see the formula above). This will be a matrix with the same shape as the weight matrix.
  5. Calculate the derivative of the cost function with respect to the biases (see the formula above). This will only be a column vector.

  6. Update the weights according to the delta rule.
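The steps above can be sketched as a single NumPy training step (the network sizes, learning rate, sigmoid transfer function, and quadratic cost are all illustrative assumptions, not fixed by the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def train_step(x, y, weights, biases, lr=0.5):
    # 1. Forward propagation, saving input sums and activations for later.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # 2. Error signal of the final layer (quadratic cost assumed here).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    # 3-5. Backwards propagation: error signals, then weight and bias gradients.
    for l in range(len(weights) - 1, -1, -1):
        grads_W[l] = delta @ activations[l].T  # same shape as weights[l]
        grads_b[l] = delta                     # column vector
        if l > 0:
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
    # 6. Delta rule: subtract a small fraction of each derivative.
    for l in range(len(weights)):
        weights[l] -= lr * grads_W[l]
        biases[l] -= lr * grads_b[l]
    return 0.5 * np.sum((activations[-1] - y) ** 2)  # cost before the update

rng = np.random.default_rng(1)
weights = [rng.standard_normal((4, 2)), rng.standard_normal((1, 4))]
biases = [rng.standard_normal((4, 1)), rng.standard_normal((1, 1))]
x, y = np.array([[0.5], [-0.2]]), np.array([[0.3]])
costs = [train_step(x, y, weights, biases) for _ in range(200)]
print(costs[0] > costs[-1])  # True: the cost decreases over training
```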

Next step

This explanation shows how to train a network using only one training sample at a time, but how do we do it using batch learning? What we can do is put the inputs of the training samples as columns in a matrix. Doing the forward propagation, the input sums and activations will then also be matrices, where the column index is the sample index and the row index is the neuron index.

Similarly, the error signals for the different layers and samples will be matrices. However, the derivatives of the cost function with respect to the weights will form a three-dimensional tensor with dimensions [sample_idx, neuron_idx, weight_idx]. Summing over the sample dimension of this tensor gives the gradient matrix for the weights in that layer, and the weights can be updated using the delta rule.
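A small sketch of this under the same NumPy conventions as before (all shapes and names here are illustrative): with samples stored as columns, the sum over the sample dimension of the gradient tensor coincides with a single matrix product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 5))  # weights for one layer: 5 -> 3 neurons
b = rng.standard_normal((3, 1))

# Four training samples stored as columns of a matrix.
A_prev = rng.standard_normal((5, 4))  # activations of layer l-1, one column per sample
Z = W @ A_prev + b                    # the bias vector broadcasts across the sample columns
A = sigmoid(Z)                        # shape (3, 4): row = neuron index, column = sample index

# Suppose Delta holds the error signals for this layer, one column per sample.
Delta = rng.standard_normal((3, 4))

# The per-sample gradients form a tensor [sample_idx, neuron_idx, weight_idx]...
per_sample = np.einsum('ks,js->skj', Delta, A_prev)
# ...and summing over the sample axis gives the gradient matrix for W.
grad_W = per_sample.sum(axis=0)

# The same sum happens implicitly in a single matrix product:
print(np.allclose(grad_W, Delta @ A_prev.T))  # True
```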

In the next article we will be building a simple neural network from scratch, that can be trained using this theory (coming soon).
