Backpropagation Algorithm Part 1

Herman Van Haagen
4 min read · May 28, 2023

[Figure: a perceptron, the building block of a neural network]

Introduction to learning

In this series of lessons, we will learn how the backpropagation algorithm works. Backpropagation is used to train neural networks, and it is also at the heart of Deep Learning. Since this topic can be quite complex, especially for non-technical readers without a math background, we will build up our understanding gradually. The following subjects will be covered over several lessons:

  • the derivative
  • the partial derivative
  • the chain rule
  • gradient descent
  • loss function
  • epoch
  • weight initialization
  • learning rate
  • update rule
  • matrix notation (efficiency)

But as mentioned, let’s start with something simple: a sequence of numbers.

[8, 5, 4, 8, 3, 6, 3, 3, 1, 8, 7, 1, 6, 8, 7, 4, 6, 4, 4, 3]

And we want to calculate the average. You can do this manually or with a calculator, and you will find that it is 4.95.
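A quick check in Python:

numbers = [8, 5, 4, 8, 3, 6, 3, 3, 1, 8, 7, 1, 6, 8, 7, 4, 6, 4, 4, 3]
print(sum(numbers) / len(numbers))  # 4.95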

Now, we will calculate the same average using a very circuitous algorithm. At first glance, this algorithm may seem cumbersome, but it forms the basis of how neural networks are trained. We will also explain why it works and its advantages.

The algorithm starts with an initial guess of what the answer might be. Let’s choose a value far from the actual answer to demonstrate the power of the algorithm. The initial guess is:

average = 100

Now, we will iterate through the numbers one by one and see how far off we are. The first number is 8, and we are off by 100 - 8 = 92. Alternatively, in formula form:

error = average - number

We can correct the result by subtracting the error from the original value.

So, 100 - 92 = 8.

However, 8 is not the answer; it is only one number in a series whose average we are after. So we correct only a fraction of the error. For example,

100 - 0.1 * 92 = 90.8.

Here, we call 0.1 the learning rate (lr), and it is a number between 0 and 1. In formula form, we now have the following:

new_average = average - error * lr

We refer to this formula as the update rule. Essentially, it updates the average we are trying to calculate. We repeat these steps, moving on to the next number in the sequence, which is 5. We calculate the error again:

90.8 - 5 = 85.8

We again use a fraction of the error to correct it:

90.8 - 0.1 * 85.8 = 82.22

And we continue this process. Now we encounter a small problem: at some point we run out of numbers. When we reach the last number, 3, we have no ‘data’ left. But fear not: we simply start over at the first number, 8, and reuse the ‘data’. Going through the data once is called an epoch.

Congratulations!! You have already learned a significant portion of the backpropagation algorithm. The following elements have been discussed: initial guess (start value), error, learning rate (lr), and epoch. We can write this in Python code, as follows:
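A minimal sketch, reconstructed from the steps above (variable names are illustrative):

numbers = [8, 5, 4, 8, 3, 6, 3, 3, 1, 8, 7, 1, 6, 8, 7, 4, 6, 4, 4, 3]

average = 100  # initial guess
lr = 0.1       # learning rate
epochs = 10    # number of passes over the data

for epoch in range(epochs):
    for number in numbers:
        error = average - number        # how far off are we?
        average = average - lr * error  # the update rule

print(average)  # converges towards 4.95 (a smaller lr gets you closer)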

Exercise

Copy this code into your own notebook and check if it works. You probably have a few questions:

• How do I choose the learning rate (lr)?

• How do I choose the number of epochs?

• How do I choose the initial guess?

To answer these questions, you need to keep track of the learning process. Take a look at the following code and add the extra lines (trainresults):
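A sketch of that extended loop, where trainresults records the running estimate after every update:

numbers = [8, 5, 4, 8, 3, 6, 3, 3, 1, 8, 7, 1, 6, 8, 7, 4, 6, 4, 4, 3]

average = 100
lr = 0.1
epochs = 10
trainresults = []  # the extra lines: keep track of the learning process

for epoch in range(epochs):
    for number in numbers:
        error = average - number
        average = average - lr * error
        trainresults.append(average)  # one entry per update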

With this, you can observe how the error decreases as you train for a longer period. The progression can look like this, for example:
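A few lines of matplotlib are enough to draw that curve, using the trainresults list from the sketch above:

import matplotlib.pyplot as plt

plt.plot(trainresults)                   # the running estimate
plt.axhline(4.95, color='gray', ls=':')  # the true average
plt.xlabel('iteration (update step)')
plt.ylabel('estimated average')
plt.show()

The estimate drops steeply at first and then flattens out near 4.95.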

When it comes to the learning rate: if you choose it too large, the estimate keeps overshooting and never converges to the correct answer (the average of 4.95). If you choose it too small, it takes an eternity to get there.

Regarding the number of epochs: if you choose it too large, training simply takes longer than necessary. As seen above, the training stabilizes after about 50 iterations; training longer has no effect. The right number of epochs also depends on the size of the dataset. With a very large dataset (>10,000 examples, such as MNIST), a small number of epochs suffices, for example 10. With a small dataset like Iris (150 instances), you will need many more epochs.

Setting the learning rate and the number of epochs is somewhat a matter of trial and error. You tune them by visualizing the training process and building up experience.

You can find the code snippets (Jupyter notebooks) on my GitHub page.

Exercise

• Try to achieve an error smaller than 0.01 with the fewest possible epochs and the correct learning rate. That is, 4.95±0.01.

• Try changing the initial guess value. For example, make it negative (-100). Does the algorithm still work?

This concludes part 1. In part 2 we will discuss Gradient Descent.
