What Makes Backpropagation So Elegant?

Part 4/4 of the Deep Learning Explained Visually series.

Tyron Jung
The Feynman Journal
14 min read · Aug 23, 2021


The full animation for backpropagation which we’ll build up to in this article.

A quick note for SoME1 readers: This article is the final part of a 4-part series and it is my submission for SoME1. The other parts are not part of the submission, as they were previously published. This article is meant to be stand-alone, so please judge it without the context of the first 3 articles in the series.

Why Backpropagation Matters

Welcome to the final part of the Deep Learning Explained Visually series! In this article, we’ll tackle backpropagation. But before we jump into how it works, let’s first talk about why it’s such an important algorithm.

For over a decade, backpropagation has been the most successful learning method for deep neural networks. So it’s not a stretch to say that backpropagation is responsible for some of the most significant breakthroughs in cutting-edge domains such as computer vision, natural language processing, and reinforcement learning.

Not only is backpropagation unreasonably effective, but its design also elegantly combines mathematical theory with computational practicality. Most ideas manage to achieve only one of these two things — they’re either mathematically appealing but computationally infeasible, or they’re computationally efficient but lacking in theory. And I think it’s worthwhile to look at an example that achieves both with flying colors.

Although many key optimizations have been added to backpropagation over the years to enhance its performance, the core algorithm remains the bread and butter of how deep neural networks learn to improve their predictions. So our focus will be to develop a strong intuition for backpropagation without the frills.

What Backpropagation Looks Like

In part 3, we visualized what the learning process looks like for a deep neural network (specifically, a Multilayer Perceptron or MLP):

Left: The output landscape as the MLP learns. Right: The output of the last hidden layer as the MLP learns.

But we didn’t get a chance to talk about how the learning happens. And indeed, the MLP in part 3 was using backpropagation to fit the training data. Over the course of the following sections, we’ll see how backpropagation translates to the learning process shown above.

Most people find the math behind backpropagation quite intimidating when they encounter it for the first time (I know I certainly did). But the good news is that we can approach it with some prior intuition that makes it easier to understand. We’ve done a lot of this groundwork in part 1 already, so let’s do a quick recap.

The Perceptron Learning Algorithm, but Better

In the Perceptron algorithm, we computed the dot product between a single weight vector and the input vector to make a prediction. If the prediction was wrong, we simply added/subtracted the input vector to/from the weight vector. The goal was to increase (by adding) or decrease (by subtracting) the dot product between the weight vector and the input vector.

The learning process for a single Perceptron, where a scaled version of the input vector (blue/red) is being added to or subtracted from the weight vector (orange) at each step.

And because we wanted to improve our predictions more carefully, we scaled the input vector by a small learning rate before using it to update the weight vector. This worked out pretty well for the Perceptron. But the MLP doesn’t just have a single weight vector — it has many weight matrices, each one made up of many weight vectors. So this simple strategy of scaling the input vector by the same number — the learning rate — isn’t going to be all that effective.
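
As a refresher, here’s a minimal sketch of that Perceptron update in NumPy. It assumes the common ±1 label convention; the function name and default learning rate are illustrative, not from the series:

```python
import numpy as np

def perceptron_step(w, x, target, lr=0.1):
    """One Perceptron update: nudge w toward/away from x if the prediction is wrong."""
    prediction = 1 if np.dot(w, x) > 0 else -1
    if prediction != target:
        # Adding the input (target = +1) increases the dot product;
        # subtracting it (target = -1) decreases the dot product.
        w = w + lr * target * x
    return w
```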

Still, it would be nice to adjust the weight vectors in the MLP using a similar approach. And as luck would have it, the only adjustment we need to make in order to adapt the Perceptron learning procedure for the MLP is to be smarter about how we scale the input vector before using it to update each weight vector.

The overarching goal would then be to update the weight vectors in a way such that the combination of all updates results in the MLP becoming more accurate in making predictions. But the term “accurate” is somewhat subjective, so we first need to concretely define what it means.

The Loss Function

In part 2, we saw that this is what the feedforward process looks like:

In a nutshell, feedforward is a series of matrix multiplications and nonlinear activations that ultimately results in a useful prediction. And to define what we mean by “accurate”, let’s start at the end:

For a binary classification problem, the neural network can output any value in the range [0, 1]. In the above case, the MLP has made a prediction of 0.55. But let's say that the target label for the input feature vector is 1. That means our MLP missed the mark, and by quite a significant margin.

We can quantify how much the MLP “missed the mark” by using what’s known as a loss function. Defining the loss will allow us to come up with a plan for reducing it. Given a prediction y and a target value t, we’ll use the binary cross-entropy loss: L(y, t) = -(t log(y) + (1 - t) log(1 - y)).

At first glance, the loss function might seem a little complicated. But we can use the fact that binary classification only has two possible targets — positive (1) or negative (0) — to simplify the function. When we consider each of these two mutually exclusive scenarios in isolation, we see that it's really not that complicated:

  • If the target is 1, our loss function will be L(y, 1) = -log(y).
  • If the target is 0, our loss function will be L(y, 0) = -log(1 - y).
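
In code, the two cases collapse into a single function, as shown in this minimal sketch (the clipping constant eps is my own addition to avoid taking log(0)):

```python
import numpy as np

def cross_entropy_loss(y, t):
    # L(y, t) = -(t*log(y) + (1 - t)*log(1 - y))
    eps = 1e-12                      # avoid log(0) at the extremes
    y = np.clip(y, eps, 1 - eps)
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

print(cross_entropy_loss(0.55, 1))  # ~0.598: the loss for the prediction above
```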

And to get a better feel for what the loss function looks like, we can try plotting the two cases:

The cross-entropy loss function.

The blue curve is what the loss function looks like when the target is 1. The red curve is what the loss function looks like when the target is 0. In both cases, we see that as the prediction y gets farther and farther away from the target, the loss increases significantly.

This gives us a solid definition of what “accurate” means. Now we need a way to orchestrate the weight vectors so that they’re all collaborating towards the common goal of reducing the loss. To do this, we need to hold each weight vector accountable for its part in creating the gap between the expected output t and the actual output y.

For each bad prediction, some weight vectors will be more responsible for this gap than others, so we want to capture the relative magnitudes of their mistakes through a number. And then we’ll use this number to scale the weight vector updates, which will allow each weight vector to proportionally compensate for its mistake.

Looking at the shape of each curve, you might imagine placing yourself on the curve at a point where the loss is high and sliding down to a point where the loss is low. That seems pretty easy to do. But how can we tell a mathematical variable to “slide down” the slope? Well, we know that the slope is just the derivative — and to compute the derivative for each variable in a multilayered function, we can use the chain rule.

The Chain Rule

To understand how the MLP might use the chain rule to update its weight vectors, let’s look at another multilayered function that’s simpler than the MLP: h(x) = sin(x^2). We can simplify this a little by breaking the function down into two layers: an inner function f(x) = x^2 and an outer function g(u) = sin(u), so that h(x) = g(f(x)).

This isn’t exactly a deep neural network, but it sort of behaves like one: the input is transformed at each layer, and the output of one layer is used as the input for the next. x is first passed into f(), and the result of that is passed into g() to get the final output.

Using derivatives, we know how to get the rate of change of f(x) with respect to the input x. And we know that if the derivative is positive, it means that an increase in x will result in an increase in f(x). If it's negative, then an increase in x will result in a decrease in f(x).

By applying the chain rule, we can compute the rate of change of the final output h(x) with respect to the raw input x. If we set u = f(x) and v = g(u) = h(x), then the derivative of the output with respect to the raw input is dv/dx. The chain rule lets us simply multiply the derivative of each layer with respect to its input, which gives us dv/dx = dv/du * du/dx = g'(u) * f'(x) = cos(x^2) * 2x.

A visualization of the chain rule.

And this multilayered derivative follows the same rules as any old derivative — if it’s positive, that means an increase in x will result in an increase in h(x). If it's negative, that means an increase in x will result in a decrease in h(x). This allows us to know how h(x) will change when we nudge the raw input x in either direction, which means we can exert some control over h(x) by changing x.
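
You can sanity-check this with a quick numerical experiment, sketched below (the test point x = 1.3 is an arbitrary choice):

```python
import numpy as np

def h(x):
    return np.sin(x ** 2)

def dh_dx(x):
    # chain rule: dv/dx = g'(u) * f'(x) = cos(x^2) * 2x
    return np.cos(x ** 2) * 2 * x

x = 1.3
numerical = (h(x + 1e-6) - h(x - 1e-6)) / 2e-6  # central-difference approximation
print(numerical, dh_dx(x))                       # the two values should agree closely
```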

It’s important to realize that the MLP is just a slightly more complicated multilayered function compared to h(x). We can apply the chain rule and nudge its variables — the weight vectors — towards minimizing its final output — the loss. And this is what backpropagation ultimately boils down to.

Backpropagation = Chain Rule

Now, we could take the standard approach to understanding backpropagation, which is to expand the mathematical equation and compute the derivative of the loss with respect to each weight variable. But that approach is prone to getting lost in the weeds, so I want to take a slightly higher-level approach that I think will give you a much stronger intuition for how backpropagation works.

The first step in backpropagation is to compute what we’ll call the error delta. The error delta δ for each neuron is the derivative of the loss with respect to the neuron’s input. It asks the question: How would the loss change if the input into the neuron were to change?

And remember, the input into a neuron is just the dot product between its weight vector and the input vector at the neuron’s layer, where the dot product measures the alignment between two vectors. This means we can change the weight vector with an understanding of how it will affect the dot product, and therefore the loss.

Using the chain rule, the error delta for the output neuron turns out to be δ3 = y - t = -0.45. Here, δ3 being negative means that if the dot product into the neuron were to increase, that would cause the loss to decrease.
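
If you’d like to see where that comes from, here’s the short derivation, assuming (as in our MLP) a sigmoid output neuron paired with the cross-entropy loss above: the derivative of the loss with respect to the prediction is dL/dy = -t/y + (1 - t)/(1 - y), and the derivative of the sigmoid output with respect to its input is dy/dz = y(1 - y). Multiplying the two, as the chain rule tells us to, and simplifying gives dL/dz = y - t. With y = 0.55 and t = 1, that works out to 0.55 - 1 = -0.45.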

If we apply the chain rule again to compute the derivative of the loss with respect to the weight vector, that’s simply the derivative we’ve computed thus far, δ3, multiplied by the input vector at the final layer, H2. This gives us ΔW3.

What ΔW3 tells us is that adding a scaled version of the input vector H2 to the weight vector W3 would result in a greater loss. Since our goal is to slide down the slope of the loss function, we just have to do the opposite of that and subtract ΔW3 from W3. But to make our update more careful, we first scale ΔW3 by a small learning rate.
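
In code, the whole output-layer update is just a few lines. This is a sketch where the values of H2 and W3 are hypothetical, chosen only to make the shapes concrete:

```python
import numpy as np

y, t = 0.55, 1.0                   # prediction and target from the example above
H2 = np.array([0.8, 0.2, 0.6])     # hypothetical input vector at the final layer
W3 = np.array([0.5, -0.3, 0.1])    # hypothetical output weight vector
lr = 0.1

delta3 = y - t                     # error delta for the output neuron (-0.45)
dW3 = delta3 * H2                  # derivative of the loss w.r.t. W3
W3 = W3 - lr * dW3                 # step down the slope of the loss
```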

The amazing thing about this operation is that it’s almost exactly like the Perceptron learning procedure. In the Perceptron learning procedure, the sign of the output tells the weight vector to be more/less like the input vector to increase/decrease the dot product:

A Perceptron learning step where a scaled version of the input vector (blue) is added to the weight vector (orange) to increase their alignment and therefore increase their dot product.

In backpropagation, the sign of the error delta tells the weight vector the same thing — whether to add/subtract the input vector to/from the weight vector. But the error delta also has a magnitude, which captures how much we should increase or decrease the dot product by. So what we have here is like the Perceptron learning procedure except smarter thanks to the chain rule!

Now we see how useful the error delta can be. And to compute the error deltas for neurons in the previous layers, we simply use the chain rule again. When we do this, we see that it’s the error delta δ3, multiplied by the weight vector W3, multiplied by the derivatives of the nonlinear activations σ'(Z2), giving us δ2:

A hollow circle means element-wise multiplication.

This is where backpropagation completely blows my mind. Thanks to the multiplicative nature of the chain rule, the process by which the error delta is backpropagated to compute more error deltas mathematically works out to be a near-perfect reflection of the feedforward process.

In feedforward, we multiply the input vector by the weight matrix then pass the result into a nonlinear activation to get the output vector. In backpropagation, we multiply the error deltas for the output vector by the transpose of the weight matrix, then multiply the resulting vector by the derivative of the nonlinear activation to get the error deltas for the input vector.
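
Here’s what that mirrored step looks like in NumPy, again as a sketch with hypothetical values to make the shapes concrete:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Z2 = np.array([0.4, -1.2, 0.7])     # hypothetical pre-activation inputs at the last hidden layer
W3 = np.array([[0.5, -0.3, 0.1]])   # weight matrix: 1 output neuron, 3 inputs
delta3 = np.array([-0.45])          # error delta for the output neuron

# Feedforward multiplies by W3; backpropagation multiplies by its transpose,
# then scales element-wise by the activation derivative sigma'(Z2):
sigma_prime = sigmoid(Z2) * (1 - sigmoid(Z2))
delta2 = (W3.T @ delta3) * sigma_prime
```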

There is this elegant symmetry between feedforward (inference) and backpropagation (learning) that lies at the heart of deep neural networks. It’s fascinating that such a mathematically symmetric process can exhibit artificially intelligent behavior.

If this seems too good to be true, I encourage you to try working out the math for yourself! It’s a good exercise and you’ll get a better feel for how the details manifest into the overall process.

Now that we have a new set of error deltas, we can use each one to update its corresponding weight vector:

Note that because ΔW2 is an update matrix, we can simply scale the whole matrix by the learning rate and subtract it from the weight matrix at this layer. As a side note, since matrix operations are what computers are naturally good at (particularly GPUs), we can see a huge efficiency gain by using matrix operations in both feedforward and backpropagation.
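
In NumPy terms, the whole-layer update is a single outer product followed by a single subtraction. The values below are again hypothetical:

```python
import numpy as np

delta2 = np.array([-0.02, 0.05, -0.01])   # hypothetical error deltas at this layer
H1 = np.array([0.3, 0.9, 0.5])            # hypothetical input vector at this layer
W2 = np.zeros((3, 3))                     # weight matrix, one row per neuron
lr = 0.1

dW2 = np.outer(delta2, H1)  # every weight vector's update, computed in one matrix op
W2 = W2 - lr * dW2          # scale by the learning rate and subtract
```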

Moving on, the chain rule tells us to multiply the error deltas at the current layer, δ2, by the transpose of the weight matrix W2 and then by the nonlinear derivatives σ'(Z1) to compute the error deltas for the previous layer, δ1. Because this layer involves a full matrix multiplication, it’s even more obvious how much backpropagation reflects feedforward:

Here’s an additional perspective that you might find useful: If we imagine that each step in feedforward is like the input neurons “voting” on what each output neuron should be, then each step in backpropagation is like the output neurons voting on what the input neurons should have been to achieve a lower loss.

Another interesting thing to note here is how each error delta is scaled by its corresponding nonlinear activation derivative. This has quite a significant implication: The nonlinear activation you choose will directly affect how error deltas are backpropagated through the neural network.

Remember that a weight vector “detects a feature” by measuring its alignment with the input vector via a dot product.

When using sigmoid activation, the derivative flattens out toward zero for very positive and very negative dot products, which means the learning slows down in those regions. This vanishing of the backpropagated signal is why neural networks using sigmoid activations often struggle to achieve higher accuracies.

For ReLU activation, the derivative is always 1 for positive dot products and always 0 for negative dot products. This means that while the learning is faster when the weight vector and the input vector are aligned, the backpropagation signal is completely nullified if the vectors are unaligned. This is why neural networks that use ReLU activation tend to suffer from “dead” neurons that never activate.
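
Comparing the two derivatives side by side makes the tradeoff concrete. Here’s a minimal sketch:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)              # peaks at 0.25, flattens toward 0 for large |z|

def relu_prime(z):
    return (z > 0).astype(float)    # exactly 1 for positive z, exactly 0 otherwise

z = np.array([-6.0, -1.0, 0.5, 6.0])
print(sigmoid_prime(z))  # [~0.002, 0.197, 0.235, ~0.002] -- vanishing at the extremes
print(relu_prime(z))     # [0, 0, 1, 1] -- full signal or no signal at all
```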

Every nonlinear activation comes with pros and cons and it’s important to think about these tradeoffs when you’re designing a neural network, particularly around how your design choices will impact backpropagation.

With that, we’ve reached the final layer of backpropagation, which is the first layer of the MLP. This time, computing the update matrix is shown as a one-step matrix multiplication to emphasize that all the update vectors can be computed simultaneously:

Then once again, we take that update matrix, scale it by the learning rate, and subtract it from the weight matrix at this layer.

That’s it! That’s backpropagation. It’s essentially performing the Perceptron learning procedure for each subnetwork in the MLP, except we’re smarter about the scaling factor thanks to the chain rule. We’re still adding/subtracting scaled versions of the input vector to/from the weight vectors at each layer, with the ultimate goal of reducing the loss.

An interesting outcome of this is that the combination of all weight vector updates is what we call gradient descent. If we consider the set of all weight vectors as one giant weight vector, then the overall update vector resulting from the combination of all individual weight vector updates gives us the direction of steepest descent when minimizing the loss. This process is repeated over and over again for many training examples until the neural network lands on a set of weights that results in a high accuracy overall.
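
To tie everything together, here is a minimal end-to-end sketch of the whole procedure: an MLP with two hidden layers trained by backpropagation on a toy binary classification task. The architecture, dataset, and hyperparameters are illustrative choices of mine, not the ones used elsewhere in the series:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy dataset: label 1 if the point falls inside the unit circle, else 0.
X = rng.uniform(-2, 2, size=(500, 2))
T = (np.sum(X ** 2, axis=1) < 1.0).astype(float)

# A 2 -> 8 -> 8 -> 1 MLP with sigmoid activations throughout.
W1 = rng.normal(0, 1, size=(8, 2)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, size=(8, 8)); b2 = np.zeros(8)
W3 = rng.normal(0, 1, size=(1, 8)); b3 = np.zeros(1)
lr = 0.5

for epoch in range(300):
    for x, t in zip(X, T):
        # Feedforward: matrix multiply, then nonlinear activation, layer by layer.
        H1 = sigmoid(W1 @ x + b1)
        H2 = sigmoid(W2 @ H1 + b2)
        y = sigmoid(W3 @ H2 + b3)

        # Backpropagation: the near-perfect mirror image of feedforward.
        d3 = y - t                             # output error delta
        d2 = (W3.T @ d3) * H2 * (1 - H2)       # transpose matmul, then sigmoid'
        d1 = (W2.T @ d2) * H1 * (1 - H1)

        # Updates: subtract scaled outer products (error delta x input vector).
        W3 -= lr * np.outer(d3, H2); b3 -= lr * d3
        W2 -= lr * np.outer(d2, H1); b2 -= lr * d2
        W1 -= lr * np.outer(d1, x);  b1 -= lr * d1

# Evaluate with one batched feedforward pass (matrix ops, as noted earlier).
Y = sigmoid(sigmoid(sigmoid(X @ W1.T + b1) @ W2.T + b2) @ W3.T + b3).ravel()
print("training accuracy:", np.mean((Y > 0.5) == (T == 1)))
```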

The activation landscapes for the 2 hidden layers (leftmost and middle columns) and the output layer (rightmost column) of an MLP.

To come full circle, when we visualize the learning process for all the hidden neurons and the output neuron of the MLP from part 3, we can see how backpropagation translates to the neural network learning how to combine simpler basis functions into a more complex function.

Summary

The elegant symmetry between feedforward (left) and backpropagation (right).

In this article, we developed an intuition for how backpropagation works and explored the symmetry between feedforward and backpropagation. We saw how deep neural networks are a compelling amalgamation of mathematical theory and computational practicality.

It’s inspiring to think that simple mathematical operations such as the dot product and the chain rule can be combined to make something as complex and effective as backpropagation. It really goes to show that you just have to be creative with how you put the building blocks together.

Acknowledgments

A huge thanks goes out to the following people who made this series possible:

  • Jinoo Baek, a Machine Learning Engineer with whom I had many conversations about the ideas presented in the series. We originally collaborated on a presentation about deep learning, which eventually evolved into these articles.
  • Dustin Stansbury, a Data Scientist who was generous enough to share his expertise around deep neural networks, giving me confidence in the mathematical correctness of the series. Check out his writing at The Clever Machine.
  • Kejun Jiang, a Software Engineer and close friend of mine who patiently reviewed each article and gave me detailed feedback on where I could improve the clarity of my writing.
  • All of my wonderful Focusmate partners who provided an endless source of encouragement and accountability.
  • You, the reader!

All images are homemade and the source code can be found here.
