How did I understand backpropagation?

Simona Ivanova
Published in pendulibrium
Aug 17, 2018
[Image: Geoffrey Hinton, “godfather” of deep learning]

I got familiar with machine learning quite a while ago, but I never fully understood backpropagation until 6 months ago. I went through Andrew Ng’s ML course, but I just couldn’t grasp how this magic worked. In a quest for a simple explanation, I went through tons of videos and blog posts, but it wasn’t until I decided to look at the math that I fully understood backprop. If you are not excited about understanding equations, you may want to skip this post, but I really advise you not to.

The backpropagation algorithm was first introduced in the 1970s, but it was only after the 1986 publication of “Learning Representations by Back-Propagating Errors” by Rumelhart, Hinton, and Williams that the machine learning community started to appreciate it.

So if you were thinking that it magically appeared on the doorstep of ML (I know I did), be sure it didn’t.

“Backpropagation is a method used in artificial neural networks to calculate a gradient that is needed in the calculation of weights to be used in the network.”

Okay, this is a bit too much information to process from one sentence, so let’s break it down!

Let’s assume we have a fully connected neural network. The network has 4 layers: an input layer, 2 hidden layers, and an output layer. Each layer can have any number of neurons, also known as units. Every neuron, without exception, is connected to all neurons in the previous and in the following layer, forming connections. Each connection has an associated weight, and each neuron has a bias term (together these are the parameters of the network), so signals (information) traveling through connections are scaled by these factors. Every unit takes the weighted sum of all connections from the previous layer, adds its bias, and forwards the activation of that sum to the next layer.

The focus of this post is backpropagation, so in case you are not familiar with the architecture and the forward pass, you can watch the videos from Coursera’s Neural Networks and Deep Learning course here.

💡 Remind yourself!

The activation of a neuron simply determines whether the neuron will fire or not, given a certain input. You can read more about neurons firing in “Neural Networks and Deep Learning” by Michael Nielsen. In this post, we are using the sigmoid activation function, given by the first equation below.
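In standard notation, these two equations are:

$$a^l = \sigma(z^l), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

$$z^l = W^l a^{l-1} + b^l$$

Here $W^l$ is the weight matrix of layer $l$, $b^l$ is its bias vector, and $a^{l-1}$ is the vector of activations of the previous layer.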

The input of the activation function is actually the weighted sum of the activations coming from the previous layer, plus the bias, as presented in the second equation. The equations above are vectorized.

Essentially, every neuron that is part of the network is making a small decision along the way, and the neurons in the final layer end up making the most important decision, the one we care about.
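To make this concrete, here is a minimal sketch of the forward pass in numpy. The 3-4-4-1 layer sizes are an illustrative guess (3 inputs to match the cat example below; the hidden sizes are assumptions):

```python
import numpy as np

def sigmoid(z):
    # Squashes every element into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative layer sizes: 3 inputs, two hidden layers, 1 output unit.
# The hidden sizes (4 and 4) are assumed, not taken from the post.
sizes = [3, 4, 4, 1]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def forward(a):
    # Layer by layer: z = W a + b, then the sigmoid activation
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

x = rng.standard_normal((3, 1))  # one input example
print(forward(x))                # the network's "decision", in (0, 1)
```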

[Image: Architecture of a neural network with 2 hidden layers]

So we have the architecture of the neural network, some inputs (I assume you have something you want to train on), previously initialized weights, and the output of the given network. Now you probably wonder: how do we force the network to make the right decision? How does it learn?

Let’s say the problem you chose to work on is classifying “a cat” vs. “not a cat”, because you either love or despise cats. We have a neural network with the architecture above, which means each input will be represented by three numbers. We choose an input that represents “not a cat” and we hit “play” on the network. Every neuron successively makes its decision until we reach the output neuron with the final decision, e.g. “a cat”. We know that this decision is wrong, so what can we change so that the network makes the right decision?

The only thing we can do is tune the existing weights and biases in a way that makes us confident the output will, in the future, tend towards the right decision. Changing weights and biases anywhere in the network will surely cause a small change in the output; that’s why we can be confident that somehow we can make the network output “not a cat” for our current input.

Remember that when we train neural networks we are not focusing on one input only, but on all the available training data. I refer to a single example only for simplicity.

This is where learning algorithms, such as gradient descent and all its variations, come in handy. The goal is to have the network’s output approximate the target y(x) for all training examples x. In order to know by how much we need to correct the weights, we need to calculate the inaccuracy of the given output. That pushes us towards the next step: finding the error using the so-called cost function, in this case the quadratic cost function.
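For $n$ training examples, the quadratic cost is:

$$C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2$$

where $a^L(x)$ is the output (the activation of the last layer $L$) the network produces for input $x$. For a single example it reduces to $C = \frac{1}{2}(y - a^L)^2$.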

The equation only points to the error in the final layer, where the decision about the input being “a cat” or “not a cat” is made. But we said earlier that this decision depends on other, smaller decisions in previous layers, and supposedly those neurons make mistakes too.

As I said before, in order to change the weights and biases for each layer in the network, we need to know how these changes will affect the output. That’s the main task of the gradient in gradient descent, which in mathematical terms is nothing but a vector of partial derivatives. This tells us we are supposed to find the partial derivatives of the error with respect to the weight and bias terms.
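In symbols, the gradient stacks all of these partial derivatives, and gradient descent nudges every parameter a small step against it (with a learning rate $\eta$):

$$\nabla C = \left( \frac{\partial C}{\partial w}, \frac{\partial C}{\partial b} \right), \qquad w \to w - \eta \frac{\partial C}{\partial w}, \qquad b \to b - \eta \frac{\partial C}{\partial b}$$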

Logically, we’ll start by inspecting how the changes in the last layer affect the error.
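Differentiating the single-example cost with respect to the output activation gives:

$$\frac{\partial C}{\partial a^L} = a^L - y$$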

This means that as the predicted value changes, the error changes at a rate equal to the difference between the predicted and the real value. This makes sense, and we could have come to this conclusion without equations, but as we delve deeper into the network such conclusions become harder to make intuitively.

[Image: Output of the neural network]

So if we don’t want to mistake “a cat” for “not a cat”, we should drive this error to its minimum value. In other words, we should take the derivatives of the error with respect to the weight and bias terms (we shall see how changes in the weights and biases affect the error).

Next, we should inspect how the weights belonging to the connections into the last layer (pointed at in the picture below) affect the error. This means we should find how the weight and bias terms are connected to the error. When we calculate the output of the network, we use the same equations as for calculating the output of every other neuron in the network.

[Image: Connections to the output layer responsible for part of the error in the output]

From there we see that the weights and biases are part of the z term, which means we can calculate the partial derivatives of z with respect to the weight and bias terms.
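Since $z^l = W^l a^{l-1} + b^l$, these partial derivatives are:

$$\frac{\partial z^l}{\partial w^l} = a^{l-1}, \qquad \frac{\partial z^l}{\partial b^l} = 1$$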

If you don’t know where these derivatives come from, look back at the equations at the beginning of the post.

But what these derivatives capture is the effect the weight and bias terms have on the z term, not on the error. How can we calculate the effects of changes in the weight and bias terms on the error? Well, the answer is: we need to calculate the partial derivatives of the error with respect to the weight and bias terms.

We use the chain rule to calculate these partial derivatives. By doing this we arrive at the full form of the derivatives.
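For the last layer $L$, chaining the three pieces computed so far gives:

$$\frac{\partial C}{\partial w^L} = \frac{\partial C}{\partial a^L} \cdot \frac{\partial a^L}{\partial z^L} \cdot \frac{\partial z^L}{\partial w^L} = (a^L - y)\, \sigma'(z^L)\, a^{L-1}$$

$$\frac{\partial C}{\partial b^L} = \frac{\partial C}{\partial a^L} \cdot \frac{\partial a^L}{\partial z^L} \cdot \frac{\partial z^L}{\partial b^L} = (a^L - y)\, \sigma'(z^L)$$

where $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ is the derivative of the sigmoid.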

We continue calculating the derivatives of all weight and bias terms, layer by layer, until we reach the first layer, and then we update the weights and biases using gradient descent. As we can see in the derivatives above, the output error appears in them, and it will appear in every derivative yet to be calculated. The algorithm is called backpropagation because it propagates the error backwards into all of these partial derivatives. Tadaaa!!! 🎉
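To see the whole procedure end to end, here is a minimal sketch continuing the numpy setup from the forward-pass snippet above (the learning rate eta = 0.1 is an illustrative choice):

```python
def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_step(x, y, eta=0.1):
    # Forward pass, storing every z and activation for reuse below
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Error at the output layer: dC/dz^L = (a^L - y) * sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])

    # Walk backwards through the layers, l = L, L-1, ..., 1
    for l in range(1, len(weights) + 1):
        grad_w = delta @ activations[-l - 1].T   # dC/dW for this layer
        grad_b = delta                           # dC/db for this layer
        if l < len(weights):
            # Backpropagate the error to the previous layer
            delta = (weights[-l].T @ delta) * sigmoid_prime(zs[-l - 1])
        # Gradient descent update for this layer's parameters
        weights[-l] -= eta * grad_w
        biases[-l] -= eta * grad_b

# One step on our "not a cat" example (target 0) nudges the output towards 0;
# real training repeats this over all training examples.
backprop_step(x, np.array([[0.0]]))
print(forward(x))
```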

This blog post only explains the mathematics behind backpropagation, but the picture is not complete without understanding gradient descent. Next time I will focus more on gradient descent; until then, there are lots of great resources on the Internet.

Whenever I learn something new in the AI/ML world, my mentor keeps reminding me of something that I’d like to share with you: don’t save time by skipping the math. Do it by hand if needed!

If you need more material on backpropagation and gradient descent, you should read the links below.
