Deep learning from scratch 2: backpropagation

Niels Verleysen
5 min read · Sep 14, 2022


In the previous blog, we learned how a neural network is structured and how it makes a prediction through the forward pass. To do so, we manually filled in the weights and biases, but in practice this is not feasible. The strength of machine learning lies in its ability to learn the mapping between inputs and outputs by itself.

So, we will continue our story by giving the neural network its ability to learn from examples. As with the other blogs, there is a notebook available here where you can dive a bit deeper into this topic.


How to improve?

A neural network learns from examples. This learning comes down to changing weights and biases to better model these examples. But how can the network know what the new weights and biases should be?

Each example comes with a desired output. This desired output can be compared with the prediction of the network. This is done through something called the loss. The loss can be seen as a metric telling us how well a machine learning model is performing. We want this loss to be as small as possible. When learning, the weights and biases should therefore be changed in such a way that the loss decreases. The gradient gives us exactly what we need: it points along the steepest upward slope around the current set of weights and biases, so by taking a step in the opposite direction (updating weights and biases) the loss decreases and the network improves.

A simplified example with two weights. The weights are updated by following the steepest slope, resulting in a decreased loss after each step.
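To make the idea of following the slope concrete, here is a minimal sketch of gradient descent on a toy loss with two weights. The quadratic loss and all names here are illustrative, not part of the network code:

import numpy as np

# Toy loss: L(w) = (w1 - 3)^2 + (w2 + 1)^2, minimal at w = (3, -1)
def loss(w):
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2

def gradient(w):
    # Partial derivatives of the loss with respect to each weight
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.array([0.0, 0.0])      # initial weights
alpha = 0.1                   # learning rate (step size)
for step in range(50):
    w -= alpha * gradient(w)  # step against the gradient, i.e. downhill

print(w, loss(w))             # w approaches (3, -1), loss approaches 0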

Backpropagation

So, to improve the model we need to know the gradient for some examples and then update the weights and biases accordingly. Of course, this is easier said than done. Weights and biases in earlier layers have an impact on the inputs of later layers. One idea could be to fix all weights and biases and then compute the gradient for one parameter at a time. For a small network this won’t be too big an issue. The real issue arises once you start adding more layers and neurons, as the computational cost grows with every extra parameter.
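To see why that becomes expensive, consider what computing the gradient one parameter at a time actually means: a numerical, finite-difference estimate needs an extra loss evaluation for every single parameter. A minimal sketch (loss_fn and params are illustrative placeholders, not names from the notebook):

import numpy as np

def numerical_gradient(loss_fn, params, eps=1e-6):
    # Estimate dL/dp for every parameter by nudging it and re-running the loss
    grads = np.zeros_like(params)
    base = loss_fn(params)
    for i in range(params.size):  # one extra loss evaluation PER parameter
        nudged = params.copy()
        nudged.flat[i] += eps
        grads.flat[i] = (loss_fn(nudged) - base) / eps
    return grads

A network with a million weights would need a million loss evaluations for a single update step. Backpropagation replaces all of that with one backward sweep.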

The solution lies in using the chain rule and computing the gradients layer per layer from back to front. This is the backward pass in the backpropagation algorithm.
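To make the chain rule concrete: for a single layer computing Z = W ⋅ X + B and A = activation(Z), the gradient of the loss with respect to the weights factors into local pieces:

∂loss/∂W = ∂loss/∂A * activation′(Z) * X

Each layer therefore only needs the gradient arriving from the layer behind it (∂loss/∂A) and its own local derivatives, which is exactly what the backward pass reuses as it moves from back to front.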

I will leave the full mathematical proof for what it is. If you are interested, there are some great resources out there on the internet. In its essence, the backward pass boils down to the following:

  1. We initialize the gradients with the difference between the predicted and desired outputs.
  2. We pass the gradients backwards through the activation function: gradients := gradients * activation′(forward input of this activation)
  3. These are now the gradients of the next layer to visit (moving from back to front)
  4. We then pass the gradients backwards through that layer’s weights: gradients := gradients ⋅ Wᵀ (in the code below the weights are stored transposed, so this becomes a dot product with W itself)
  5. Repeat from step 2 until we have gone through the whole network

When coding this up we get something like this:

# dZ = dA * act'(Z)
self.gradients = loss * self.activations[-1].derivative_activation()
self.layers[-1].update_gradients(self.gradients)
# Loop through other layers from back to front
for i in range(len(self.layers) - 2, -1, -1):
    # dA = dZ dot W
    self.gradients = np.dot(self.gradients, self.layers[i + 1].weights)
    # dZ = dA * act'(Z)
    self.gradients = self.gradients * self.activations[i].derivative_activation()
    self.layers[i].update_gradients(self.gradients)

I store the gradients inside the layers with the update_gradients function. The next step is to actually update the weights and biases. With the gradients we know in which direction to take a step; the size of this step is controlled by the learning rate α. It then comes down to the following formulas:

W := W - α * ((gradientsᵀ ⋅ inputs) / number of inputs)

B := B - α * mean gradients

Or in Python:

self.weights -= learning_rate * (np.dot(self.gradients.T, self.inputs)
                                 / self.input_size)
self.biases -= learning_rate * np.mean(self.gradients, axis=0)

And that’s it: the backpropagation algorithm, implemented using only core Python and NumPy!
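To see where these snippets live, here is a minimal sketch of what a matching Dense layer could look like. This is my own reconstruction, not the notebook code: it assumes weights are stored as an (outputs × inputs) matrix, which is consistent with the np.dot calls above:

import numpy as np

class Dense:
    def __init__(self, input_size, output_size):
        self.input_size = input_size
        # Weights stored as (outputs x inputs), small random initialization
        self.weights = np.random.randn(output_size, input_size) * 0.1
        self.biases = np.zeros(output_size)

    def forward(self, inputs):
        self.inputs = inputs  # remembered for the weight update
        return np.dot(inputs, self.weights.T) + self.biases

    def update_gradients(self, gradients):
        self.gradients = gradients  # dZ for this layer, set during the backward pass

    def update(self, learning_rate):
        self.weights -= learning_rate * (np.dot(self.gradients.T, self.inputs)
                                         / self.input_size)
        self.biases -= learning_rate * np.mean(self.gradients, axis=0)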

An example

To show that this actually works, we can take a quick look at an example. Let’s say we have a trapezoidal function. We can sample points from this function as training examples, and a separate set of points for testing the performance.
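For concreteness, the data could be sampled like this. The exact trapezoid used in the notebook may differ; the shape below is just one plausible choice, with x_train, y_train, x_test and y_test matching the names used in the training code further down:

import numpy as np

def trapezoid(x):
    # Ramp up, flat top, ramp down: a simple trapezoidal shape on [0, 1]
    return np.clip(np.minimum(x / 0.25, (1 - x) / 0.25), 0, 1)

x_train = np.random.uniform(0, 1, size=(200, 1))
y_train = trapezoid(x_train)
x_test = np.random.uniform(0, 1, size=(50, 1))  # different points for testing
y_test = trapezoid(x_test)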

We then create a small neural network to learn this function and give it a learning rate. Figuring out the right architecture and learning rate requires some trial and error. The learning rate cannot be too large, otherwise the training process can become very unstable. On the other hand, selecting a learning rate that is too low might cause the training process to get stuck on a suboptimal solution.

I selected a network with five layers. Maybe a bit excessive for this example, but who cares? Here is the code for training:

network = Network([Dense(1, 32), Dense(32, 32), Dense(32, 16),
                   Dense(16, 16), Dense(16, 1)],
                  [Sigmoid(), Sigmoid(), Sigmoid(), Sigmoid(), Tanh()],
                  0.001)
losses = []

# Training loop
for epoch in range(500):
    y_pred = network.forward(x_train)
    loss = rmse(y_pred, y_train)
    losses.append(loss)
    network.backward(y_pred - y_train)
    network.optimize()

# Visualize training result
plt.plot(losses)
plt.ylabel('RMSE')
plt.xlabel('Epoch')

# Test the network
y_hat = network.forward(x_test)
loss = rmse(y_hat, y_test)
print(loss)
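The training loop above relies on an rmse helper. The notebook has its own implementation; any root-mean-squared-error function with this signature will do:

import numpy as np

def rmse(y_pred, y_true):
    # Root of the mean squared difference between predictions and targets
    return np.sqrt(np.mean((y_pred - y_true) ** 2))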

Below you can see the results of some of my trials. I started with a rather high learning rate (note the instability of the loss) and decreased it until I got an acceptable result. An important remark to make here is that a lower learning rate requires more training loops or epochs. As you take smaller steps it takes longer to get to your destination.

Training loss over time for a decreasing learning rate. From left to right the learning rates are 0.001, 0.0001 and 0.00001. You can see that only the rightmost case shows a stable loss decrease.

In the end I got the following result, which isn’t too bad. The neural network clearly learns from the examples, with a final RMSE of 0.065. It did require 250,000 epochs, which evidently took some time. Of course this is not the standard; neural networks in academia and industry have to learn exponentially more difficult mappings without taking a few millennia of training.

Predictions versus true values of the test set. The neural network has clearly learnt to model this function, but it is not perfect yet.

So what can we do differently? Let’s apply some techniques to improve the training process. In the next blog we use minibatches and the Adam optimization algorithm for incredible improvements.

Thanks for reading! If you want to try this out for yourself, you can take a look at the notebooks here. This is my second deep learning from scratch post, which continues on an earlier post you can find here. Happy learning!


Niels Verleysen

Senior Data Scientist @ Verhaert, performing applied data science research for digital & physical product development.