A beginner’s guide to deriving and implementing backpropagation

Pranav Budhwant
binaryandmore
Published in
10 min readJul 16, 2018
“Equations written in chalk on a worn-out blackboard” by Roman Mager on Unsplash

Plan of attack

This article is divided into two sections:
1. Derivation — In this section, we will be deriving all the required formulae for performing backpropagation. I strongly recommend that you derive the equations on paper as you read through the article.
2. Implementation — In this part, we will use the derived formulae to implement backpropagation from scratch. We will be solving a binary classification problem in python using numpy.

Disclaimer
This article assumes a basic understanding of neural networks and how they work. If you are not familiar with neural networks, or think your concepts are a little rusty, you may want to review chapter 1 of the amazing book, Neural Networks and Deep Learning, by Michael Nielsen, or if you prefer video lectures, you might want to go through this course on Coursera.

Derivation

Consider the following network architecture:

Neural Network Architecture for reference

Notations
For every layer in the network, there are certain parameters associated with it. Let us define some notations.

Notations used in the derivations

Now that we have the notations in place, let’s dive into the math.

We know that training a network requires three steps.

  1. Forward propagation — The first step is calculating the activation of each layer one by one, starting from the first layer, until we have the activation of the last layer, a[3] or y’.
  2. Computing the cost — The second step is calculating the value of the cost function C(y’, y) and
  3. Backpropagation — The final step is updating the weights and biases of the network using the backpropagation algorithm.

Forward Propagation
Let X be the input vector to the neural network, i.e. a[0] = X.
Now, we need to calculate a[l] for every layer l in the network.

Before calculating the activation, a[l], we will calculate an intermediate value z[l]. Each element k in z[l] is just the sum of bias for the neuron k in the layer l with the weighted sum of the activation of the previous layer, l-1.
We can calculate z[l] from the following equation:

Equation for calculating the weighted input to the neurons in layer l

Where ‘.’ refers to the matrix multiplication operation, and + refers to the matrix addition.
Now that we have z[l], we can compute a[l] easily by applying the activation function g[l] element-wise to the vector z[l].

Equation for calculating the activation of layer l

With the above two equations, we can calculate the activation of each layer in the network. The activation of the last layer gives us the predicted output of the network, y’.

Selecting the Cost Function C, and computing the cost
Okay, now we know how we will forward propagate the input X through the network and obtain the predictions. But now that we have the predictions, we need a way to quantitatively determine how good/bad these predictions are.

So, how do we go about this?

To do this, we define some function, known as the cost function, which will be a function of the predicted values and the ground truth.

This function will estimate how badly the model is performing. Put simply, a cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between the input and the output. This cost function (also referred to as loss or error) can be estimated by iteratively running the model to compare estimated predictions, y’ against “ground truth” — the known values of y.

The objective of a the model, therefore, is to find parameters, weights and biases, that minimize the cost function.

That is why choosing the right cost function for achieving the desired result is a critical point of machine learning problems. The choice of the cost function depends on the application. Here is a good discussion about choosing the right cost function.

In this article we will be using the cross-entropy cost function, which is calculated as follows:

Equation for the Cross Entropy cost

Here y is the actual output, the ground truth, and y’ is the predicted output, or, a[3] in this case.
*Note: Here log refers to the natural logarithm.

Backpropagation
The goal of backpropagation is to compute the partial derivatives of the cost function C with respect to any weight w or bias b in the network.
Once we have these partial derivatives, we will update the weights and biases in the network by the product of some constant alpha and the partial derivative of that quantity with respect to the cost function. This is the popular gradient descent algorithm.
The partial derivatives give us the direction of greatest ascent. So, we take a small step in the opposite direction — the direction of the greatest descent, i.e. the direction which will take us to the local minima of the cost function.

Visualization of the Gradient Descent algorithm Source: Machine Learning Course by Andrew Ng

The update rules looks like this:

Equations of the update rules for the weights and biases of the layer l

Here alpha is known as the learning rate of the network, because it decides how big updates we perform, in other words, it tells us how big steps are we taking in the direction of the local minima, i.e. what is the rate at which we are learning.

To better understand the flow of computations, let us draw a computation graph.

Forward Propagation Computation Graph

From the above computation graph, we can make a few observations which will help us calculate the partial derivatives.
C is a function of a[3], a[3] is in turn a function of z[3], and z[3] is a function of w[3], b[3] and a[2]. So to calculate the partial derivative of the cost function C with respect to w[3], b[3], we will have to use the chain rule.
This gives us:

Equations for calculating the partial derivative of the cost function with respect to the weights and biases of layer 3

Similarly, we need the derivative of C with respect to w[2], b[2]. From the above computation graph, we can see that z[3] is a function of a[2], which(a[2]) is a function of z[2], and in turn z[2] is a function of w[2], b[2] and a[1].
So applying the chain rule again, we get:

Equations for calculating the partial derivative of the cost function with respect to the weights and biases of layer 2

Likewise, for calculating the partial derivatives with respect to w[1], b[1], we would use:

Equations for calculating the partial derivative of the cost function with respect to the weights and biases of layer 1

This can be better understood from the following figure:

Backpropagation Computation Graph

This gives us a better picture of how backpropagation actually works.

So now that we have the understanding of what backpropagation is actually doing, let us generalize the above equations for the partial derivatives, which would make it easier for us to convert them into code.

We have:

Equations for calculating the partial derivative of the cost function with respect to the weights and biases of any layer l

So for calculating the partial derivatives of C with respect to w[l], b[l], we need to calculate

Four fundamental partial derivatives required to perform backpropagation

where L is the last layer.
*We need to calculate dC/dz[L] separately, and then dC/dz[l] will be a general formula to calculate the partial in terms of the partial derivative of the l + 1 layer, i.e. dC/dz[l+1].

For our convenience we will assume the activation function for each layer g[l] to be the sigmoid activation function. As the activation function remains the same for all the layers, we will drop the superscript [l].
The sigmoid function is defined as:

Sigmoid function

Partial derivative of C with respect to z[L]
The partial derivative of C with respect to z[L] can be calculated as follows:

Derivation of the partial derivative of the cost function with respect to the activation of the last layer

Where .* represents element-wise multiplication of the matrices, also known as the Hadamard product. We multiply element-wise to make sure that all the dimensions of our matrix multiplications match up as expected.

Derivative of the activation function:

Derivation of the derivative of the sigmoid activation function

Putting it all together

Derivation of the partial derivative of the cost function with respect to z[L]

Partial derivative of C with respect to z[l]
We want the partial derivative of C with respect to z[l] in terms of the partial derivative of C with respect to the layer l+1, so that once we have z[L], we can calculate z[L-1], z[L-2].. and so on.
We can express C as a function of z[l + 1] for any given l. Therefore, we can write:

Equation for the partial derivative of the cost function with respect to z[l] in terms of the partial derivative of the cost function with respect to z[l + 1]

As we will first be calculating the dC/dz[L] term, we can then calculate dC/dz[l] for every l = L-1, L-2 .. and so on by substituting the values in the above equation.
We still have to calculate dz[l+1]/da[l] and da[l]/dz[l]. Let us see how we can do that.

Derivations of the partial derivative of z[l + 1] with respect to a[l] & the derivative of a[l] with respect to z[l]

Putting it all together we get:

Equation for the partial derivative of the cost function with respect to z[l]

Note that we have adjusted the terms to make sure our matrix multiplication dimensions match as expected. Here ‘.’ represents the matrix multiplication operation and .* represents the element-wise product as above.

Partial derivative of z[l] with respect to w[l]
Now let us calculate dz[l]/dw[l]

Derivation of partial derivative of z[l] with respect to w[l]

Partial derivative of C with respect to w[l]
Using the derived equations,

Equation for the partial derivative of the cost function with respect to w[l]

Partial derivative of z[l] with respect to b[l]

Derivation of the partial derivative of z[l] with respect to b[l]

Partial derivative of C with respect to b[l]
Using the derived equations,

Equation for the partial derivative of the cost function with respect to b[l]

Note that these derivations consider input X as only one training example, for a batch of m examples, we have to average the derivatives over the m examples.

That completes the derivation of the fundamental equations of backpropagation. It’s really just the outcome of carefully applying the chain rule. A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus. That’s all there really is to backpropagation — the rest is details.

While deriving the above equations we have considered the activation function g[l] for all the layers to be the sigmoid activtion function. But that is hardly the case when we implement neural networks for solving real problems.

Here is a list of activation functions along with their equations and the derivatives.

Activation Function Cheatsheet by Towards Data Science

Implementation

Let us now look at the implementation of the backpropagation algorithm.
You need to have a little experience with numpy to be able to understand the code.

We will be implementing a very basic seed classification problem.
The original dataset can be downloaded from the UCI Machine Learning Repository. In this implementation, we will be using only a subset of the original dataset, as we will be implementing a binary classification problem and the original dataset contains 3 classes.

The modified dataset used in this code, along with the code can be downloaded from this GitHub repository.

Let us start by importing the required libraries:

We have assumed the activation function to be the sigmoid activation function, so as per the formulae derived above, we will need to implement the sigmoid function as well as sigmoid_prime, the derivative of the sigmoid function.

Now, we will define a class NeuralNetwork which will have the implementation of all the required functions for creating and training the network.

Let us understand the use of the functions defined above.
1. __init__ : This is the constructor of the class, which is responsible for creating and initializing several class variables which will define the neural network, such as the network architecture, various weight matrices, bias vectors, activation vectors etc.
2. forward_propagate : This function performs the forward propagation step. It accepts the input for the network and calculates the activation layer by layer.
3. compute_cost : This function calculates the cross entropy cost for the given y and the current activation of the final layer, a[L].
Now, the backpropagation is performed by the combination of two functions, compute_derivatives and update_parameters.
4. compute_derivatives : This function is responsible for calculating all the partial derivatives with respect to the cost function.
5. update_parameters : This function performs the updates for the weights and the biases of the network.
6. predict : This function just calculates the predictions for the given input.
7. fit : This function contains the training part of the network. It implements the stochastic gradient descent algorithm.

Finally, let us import the dataset, perform a little preprocessing and train our model.

Putting it all together:

Implementation of backpropagation in python using numpy

The content in this article is free to use as long as the article is cited.

References

If you learned something new from this article, please hit the 👏 icon to support it. This will help other Medium users find it. Share it, so that others can read it.

--

--