Backpropagation algorithm part 2

Herman Van Haagen
4 min read · Jun 19, 2023


Gradient Descent

In the first part, we saw how to calculate an average using a simple update algorithm. You may wonder why this works at all. Implicitly, we made two assumptions that we will explain here. The first assumption is that the average is the best prediction (or the minimal error). This is a given in statistics:

When there is no independent variable X, the average of the dependent variable Y is the best prediction you can make (think of linear regression).

The second assumption is the way we defined the error:

error = average - score

But what if we had done it the other way around:

error = score - average

Would it have worked too? The answer is no: the sign of the error determines the direction in which we update our estimate. The deeper reason becomes clear once we look at the quadratic error, which is defined as follows:

error² = (average - score)²

Why squared? Take a look at the following graph of a quadratic function.

A quadratic function f(x)=x²

A quadratic function is also known as a parabola. As you can see, it has a minimum at 0. More generally, every quadratic function with a positive leading coefficient has exactly one minimum. That minimum is the smallest error we can achieve. Let’s combine the first statement about the average with this quadratic error. We’ll use the same sequence of numbers from lesson 1:

[8, 5, 4, 8, 3, 6, 3, 3, 1, 8, 7, 1, 6, 8, 7, 4, 6, 4, 4, 3]

And we’ll calculate the quadratic error for a certain value. If we take the starting value as 100, the error is defined as follows:

error = (100 - 8)² + (100 - 5)² + (100 - 4)² + … + (100 - 3)²

You can see that we take the squared difference for every number and then sum them all up. Now we can vary the candidate value and see for which value the error is minimal. The script for this is as follows:
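Here is a minimal sketch of that script (the exact version is on my github page; the variable names here are illustrative):

import numpy as np

scores = [8, 5, 4, 8, 3, 6, 3, 3, 1, 8, 7, 1, 6, 8, 7, 4, 6, 4, 4, 3]

# Try a range of candidate predictions and compute the quadratic error for each
candidates = np.linspace(0, 10, 201)
errors = [sum((c - s) ** 2 for s in scores) for c in candidates]

# The candidate with the smallest quadratic error
best = candidates[np.argmin(errors)]
print(best)  # ~4.95, which is exactly the average of the numbers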

Finally, we can plot all the quadratic errors against the candidate predictions of the average. The resulting curve is itself a parabola, with its minimum at the true average of 4.95.
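A sketch of the plotting code, continuing from the script above (matplotlib is assumed):

import matplotlib.pyplot as plt

plt.plot(candidates, errors)
plt.xlabel("prediction of the average")
plt.ylabel("quadratic error")
plt.title("Quadratic error as a function of the prediction")
plt.show()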

We achieved the same result earlier with the simple update algorithm from lesson 1. Now we can explain why it works: the technique behind it is Gradient Descent.

The derivative (the basis for Gradient Descent)

The derivative of a function is a way to determine the tangent line to the graph at a certain point. This is shown in the figure below:

The tangent line is defined as:

y = ax + b

where a is the slope and b is the intercept (the point where the line intersects the y-axis). The function is

f(x) = x².

This is a parabola. The tangent line at the point x = -1.5 is y = -3x - 2.25. How do we get this result?

First, we calculate the derivative of the function. It is

f’(x) = 2x.

Now we substitute the point x = -1.5. This gives us f’(-1.5) = 2 * -1.5 = -3. This is the slope of the tangent line, so a = -3. With the point on the curve, (x, f(x)) = (-1.5, 2.25), we can then calculate the intercept: b = f(x) - a * x = 2.25 - (-3) * (-1.5) = -2.25. All of this can be found in the Python script on my github page.
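As a quick check, the tangent-line computation takes only a few lines of Python (a sketch; the full script is on my github page):

def f(x):
    return x ** 2          # the parabola

def f_prime(x):
    return 2 * x           # its derivative

x0 = -1.5
a = f_prime(x0)            # slope: 2 * -1.5 = -3.0
b = f(x0) - a * x0         # intercept: 2.25 - (-3) * (-1.5) = -2.25
print(f"y = {a}x + ({b})") # y = -3.0x + (-2.25)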

The derivative is useful because it tells us in which direction the minimum lies. For example, if we had chosen the point x = 2 on the parabola, the derivative would be f’(2) = 2 * 2 = 4 (instead of -3). A positive derivative means the minimum lies to the left of our point; a negative derivative means it lies to the right.

In other words, by taking a step against the sign of the derivative, we decrease the error. You can also think of the parabola as a bowl in which you roll a ball: due to gravity, the ball rolls toward the minimum.

Now let’s program gradient descent in Python. It is nearly the same as the code from lesson 1 for calculating the average; the only difference is that the derivative now appears explicitly in the update rule.
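A minimal sketch of that code (the full version is on my github page; the learning rate of 0.01 is an illustrative choice):

scores = [8, 5, 4, 8, 3, 6, 3, 3, 1, 8, 7, 1, 6, 8, 7, 4, 6, 4, 4, 3]

prediction = 100.0       # deliberately bad starting value
learning_rate = 0.01

for epoch in range(100):                        # one epoch = one pass over the data
    for score in scores:
        gradient = 2 * (prediction - score)     # derivative of (prediction - score)²
        prediction -= learning_rate * gradient  # step against the gradient

print(prediction)  # roughly 4.95, the average of the numbers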

The complete code can be found on my github page.

Let’s go back to the original example from lesson 1, where we calculate an average.

[8, 5, 4, 8, 3, 6, 3, 3, 1, 8, 7, 1, 6, 8, 7, 4, 6, 4, 4, 3]

We chose 100 as the starting value. Essentially, what we did was minimize the quadratic error. For the first number, the error becomes (100 - 8)², and its derivative is 2 * (100 - 8). We use this derivative in the update rule, scaled down by a small fraction: the learning rate.
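Worked out for that first number, with a learning rate of 0.01 (again an illustrative choice), a single update step looks like this:

prediction = 100
learning_rate = 0.01                    # illustrative choice
gradient = 2 * (100 - 8)                # = 184
prediction -= learning_rate * gradient  # 100 - 1.84
print(prediction)                       # 98.16 (up to floating-point rounding)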

Congratulations! You are one step closer to learning backpropagation. In addition to the terms epoch, learning rate, and update rule, you have now also learned about gradient descent. We have had two lessons so far and now know how to calculate an average using Gradient Descent. In the next lessons, we will explore more derivatives and the chain rule.
