# Implementing Backpropagation with Plain Python Lists: It’s Simple!

Sources: I learned this material through the exercises of the Deep Learning course at the Vrije Universiteit Amsterdam. Check https://dlvu.github.io/ and the course’s YouTube channel for the course material.

Although many high-level frameworks for deep learning exist these days, understanding the basics is still essential for implementing neural networks. Plenty of tutorials on the internet implement neural networks with NumPy. In this post, we take it one step further: we implement a neural network using only Python lists and the math library. Let’s dive in!

The neural network that we will implement is shown in figure 1. In this network, we have:

• 2 inputs x₁ and x₂ with one bias b₁.
• These inputs flow through a fully connected linear layer with weights w to the first hidden layer, which consists of k₁, k₂ and k₃. For k₁, the mathematical expression looks like:
k₁ = w₁₁ ⋅ x₁ + w₂₁ ⋅ x₂ + b₁
• The k values then pass through a non-linearity; we will implement a sigmoid. We name the output h. The expression for h₂ looks like:
h₂ = 1/(1+ exp(- k₂))
• The h values plus bias b₂ then flow via the second fully connected linear layer with weights v to produce the output value y. This looks like:
y = v₁ ⋅ h₁ + v₂ ⋅ h₂ + v₃ ⋅ h₃ + b₂
• This output value y is then compared with the target output t via the Mean Squared Error loss computation to get the loss l:
l = (y - t)²
• The loss is then used to update the parameters of our network via backpropagation using the chain rule, which we will now code!
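As a quick numeric sanity check of the expressions above, a single hidden unit can be computed directly. The values below are made-up example numbers, not values from the figure:

```python
import math

# Made-up example values, just for illustration:
x1, x2 = 1.0, -1.0
w11, w21 = 0.5, 0.25
b1 = 0.1

# k1 = w11*x1 + w21*x2 + b1
k1 = w11 * x1 + w21 * x2 + b1        # 0.5 - 0.25 + 0.1 = 0.35

# h1 = sigmoid(k1)
h1 = 1 / (1 + math.exp(-k1))

print(round(k1, 2))   # 0.35
print(round(h1, 4))   # 0.5866
```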

Let’s first implement and understand the forward pass using Python lists.

```python
# The only library we use is the math library:
import math

# The parameters (example values; pick any numbers you like):
x = [1.0, -0.5]                 # [x_1, x_2]
w = [[0.2, -0.1, 0.3],          # [[w_11, w_12, w_13],
     [0.4, 0.25, -0.2]]         #  [w_21, w_22, w_23]]
b = [0.1, -0.05]                # [b_1, b_2]
v = [0.5, -0.3, 0.8]            # [v_1, v_2, v_3]
t = 1.0                         # target

# The sigmoid function:
def sig(k):
    return 1 / (1 + math.exp(-k))

# The first layer:
k = [0.0, 0.0, 0.0]
for j in range(3):
    for i in range(2):
        k[j] += w[i][j] * x[i]
    k[j] += b[0]

# The second layer (non-linearity):
h = [0.0, 0.0, 0.0]
for i in range(3):
    h[i] = sig(k[i])

# The output layer:
y = 0.0
for i in range(3):
    y += v[i] * h[i]
y += b[1]

# The MSE loss:
l = (y - t) ** 2
```

Now that the forward pass is complete, let’s work out the backward pass via the chain rule. All parameters are updated with respect to the loss, so we can walk backward through the computation graph in figure 1 and compute the derivative of the loss for every node. For nodes earlier in the network, we just multiply in the local gradient. This lets us compute all the derivatives we need very efficiently.

To calculate the derivative of the loss l with respect to, for example, v₂, we need the derivative of l with respect to y multiplied by the derivative of y with respect to v₂:
dl_dv₂ = dl_dy ⋅ dy_dv₂

Recall that the loss function was:
l = (y - t)²
And that the linear layer to calculate y was:
y = v₁ ⋅ h₁ + v₂ ⋅ h₂ + v₃ ⋅ h₃ + b₂
By calculating the derivatives, we get:
dl_dv₂ = dl_dy ⋅ dy_dv₂
dl_dv₂ = 2 ⋅ (y - t) ⋅ h₂
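Plugging in some hypothetical numbers (chosen only to make the formula concrete) shows the gradient of the loss with respect to v₂ in action:

```python
# Hypothetical example values:
y, t, h2 = 0.8, 1.0, 0.6

dl_dy = 2 * (y - t)       # 2 * (-0.2) = -0.4
dl_dv2 = dl_dy * h2       # -0.4 * 0.6 = -0.24
print(round(dl_dv2, 6))   # -0.24
```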

For b₂, we have:
dl_db₂ = dl_dy ⋅ dy_db₂
Which gives:
dl_db₂ = 2 ⋅ (y - t) ⋅ 1

We can now update these weights and biases with learning rate α:
v₂ = v₂ - α ⋅ dl_dv₂
b₂ = b₂ - α ⋅ dl_db₂
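The update step itself is a one-liner per parameter. As a sketch, with hypothetical values for the parameters, gradients, and learning rate:

```python
# Hypothetical current values and gradients:
v2, b2 = 0.5, 0.1
dl_dv2, dl_db2 = -0.24, -0.4
alpha = 0.1  # learning rate

# Gradient descent step: move against the gradient.
v2 = v2 - alpha * dl_dv2   # 0.5 - 0.1 * (-0.24) = 0.524
b2 = b2 - alpha * dl_db2   # 0.1 - 0.1 * (-0.4)  = 0.14
print(round(v2, 3), round(b2, 3))  # 0.524 0.14
```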

The intermediate derivatives we already computed can be reused to calculate derivatives earlier in the network, which in our case are the weights w and the bias b₁. For every weight, we multiply all the derivatives along the path from that weight to the loss l. For the weight w₁₂, the path is shown in figure 2.

All in all, we get:
dl_dw₁₂ = dl_dy ⋅ dy_dh₂ ⋅ dh₂_dk₂ ⋅ dk₂_dw₁₂

We already calculated dl_dy. The other derivatives are:
dy_dh₂ = v₂
For dh₂_dk₂, we need the derivative of the sigmoid function:
h₂ = 1/(1 + exp(-k₂)) = sigmoid(k₂)
dh₂_dk₂ = sigmoid(k₂) ⋅ (1 - sigmoid(k₂))
Lastly, we need the derivative dk₂_dw₁₂. Note that:
k₂ = w₁₂ ⋅ x₁ + w₂₂ ⋅ x₂ + b₁
So:
dk₂_dw₁₂ = x₁

Filling in the derivatives along the full path:
dl_dw₁₂ = 2 ⋅ (y - t) ⋅ v₂ ⋅ sigmoid(k₂) ⋅ (1 - sigmoid(k₂)) ⋅ x₁
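A good way to convince yourself that this chain of derivatives is right is a finite-difference check: nudge w₁₂ slightly, recompute the loss along the k₂ path, and compare the loss change with the analytic formula. The values below are arbitrary example numbers:

```python
import math

def sig(k):
    return 1 / (1 + math.exp(-k))

# Arbitrary example values for the path through k2:
x1, x2 = 0.3, -0.2
w12, w22 = 0.4, 0.1
b1 = 0.05
v2, t = 0.7, 1.0

def loss(w12_val):
    # Only the k2 path depends on w12; the other
    # contributions to y are treated as constant (here 0).
    k2 = w12_val * x1 + w22 * x2 + b1
    h2 = sig(k2)
    y = v2 * h2
    return (y - t) ** 2

# Analytic gradient from the chain rule:
k2 = w12 * x1 + w22 * x2 + b1
h2 = sig(k2)
y = v2 * h2
analytic = 2 * (y - t) * v2 * sig(k2) * (1 - sig(k2)) * x1

# Numeric gradient via central differences:
eps = 1e-6
numeric = (loss(w12 + eps) - loss(w12 - eps)) / (2 * eps)

print(abs(analytic - numeric) < 1e-8)  # True
```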

Lastly, we calculate the derivative with respect to b₁, which influences k₁, k₂ and k₃! We need to add up the derivatives along all three paths:
dl_db₁ = 2 ⋅ (y - t) ⋅ v₁ ⋅ sigmoid(k₁) ⋅ (1 - sigmoid(k₁)) ⋅ 1
    + 2 ⋅ (y - t) ⋅ v₂ ⋅ sigmoid(k₂) ⋅ (1 - sigmoid(k₂)) ⋅ 1
    + 2 ⋅ (y - t) ⋅ v₃ ⋅ sigmoid(k₃) ⋅ (1 - sigmoid(k₃)) ⋅ 1

Now we have mathematically expressed all the derivatives. Finally, let’s program this!

```python
dl_dy = 2 * (y - t)

# Gradients for the second layer (v and b_2):
dl_dv = [0.0, 0.0, 0.0]
dl_db = [0.0, 0.0]
for i in range(3):
    dl_dv[i] += dl_dy * h[i]
dl_db[1] += dl_dy

# Gradients for the first layer (w and b_1);
# dl_dw has the same 2x3 shape as w:
dl_dw = [[0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0]]
for j in range(2):
    for i in range(3):
        dl_dw[j][i] += dl_dy * v[i] * sig(k[i]) * (1 - sig(k[i])) * x[j]
for i in range(3):
    dl_db[0] += dl_dy * v[i] * sig(k[i]) * (1 - sig(k[i]))
```
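To check the whole backward pass at once, the snippet below (a self-contained sketch with arbitrary example values) recomputes one gradient with central finite differences and compares it to the chain-rule result:

```python
import math

def sig(k):
    return 1 / (1 + math.exp(-k))

# Arbitrary example parameters:
x = [1.0, -0.5]
w = [[0.2, -0.1, 0.3],
     [0.4, 0.25, -0.2]]
b = [0.1, -0.05]
v = [0.5, -0.3, 0.8]
t = 1.0

def forward(w, b, v):
    k = [sum(w[i][j] * x[i] for i in range(2)) + b[0] for j in range(3)]
    h = [sig(kj) for kj in k]
    y = sum(v[j] * h[j] for j in range(3)) + b[1]
    return k, h, y, (y - t) ** 2

k, h, y, l = forward(w, b, v)

# Analytic backward pass (chain rule):
dl_dy = 2 * (y - t)
dl_dv = [dl_dy * h[j] for j in range(3)]
dl_db2 = dl_dy
dl_dw = [[dl_dy * v[j] * sig(k[j]) * (1 - sig(k[j])) * x[i]
          for j in range(3)] for i in range(2)]
dl_db1 = sum(dl_dy * v[j] * sig(k[j]) * (1 - sig(k[j])) for j in range(3))

# Numeric check for one weight, w[0][1], via central differences:
eps = 1e-6
w[0][1] += eps
l_plus = forward(w, b, v)[3]
w[0][1] -= 2 * eps
l_minus = forward(w, b, v)[3]
w[0][1] += eps  # restore the original value
numeric = (l_plus - l_minus) / (2 * eps)

print(abs(dl_dw[0][1] - numeric) < 1e-8)  # True
```

The same perturb-and-compare loop can be run over every entry of w, b, and v to validate the full gradient.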

Finally, all you need to do is update the parameters with your learning rate α, and you have implemented backpropagation in a neural network using only plain Python!
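Putting the forward pass, backward pass, and parameter updates together gives a minimal training loop. This is a sketch with arbitrary starting values, target, and learning rate, fitting a single example:

```python
import math

def sig(k):
    return 1 / (1 + math.exp(-k))

# Arbitrary starting values:
x = [1.0, -0.5]
t = 0.5                        # target output
w = [[0.2, -0.1, 0.3],
     [0.4, 0.25, -0.2]]
b = [0.1, -0.05]
v = [0.5, -0.3, 0.8]
alpha = 0.1                    # learning rate

losses = []
for step in range(200):
    # Forward pass:
    k = [w[0][j] * x[0] + w[1][j] * x[1] + b[0] for j in range(3)]
    h = [sig(kj) for kj in k]
    y = sum(v[j] * h[j] for j in range(3)) + b[1]
    losses.append((y - t) ** 2)

    # Backward pass:
    dl_dy = 2 * (y - t)
    dl_dv = [dl_dy * h[j] for j in range(3)]
    grad_k = [dl_dy * v[j] * sig(k[j]) * (1 - sig(k[j])) for j in range(3)]

    # Parameter updates:
    for j in range(3):
        v[j] -= alpha * dl_dv[j]
        for i in range(2):
            w[i][j] -= alpha * grad_k[j] * x[i]
    b[0] -= alpha * sum(grad_k)
    b[1] -= alpha * dl_dy

print(losses[0] > losses[-1])  # True: the loss decreases
```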

Try for yourself: implement the network in figure 3 below, which consists of a linear layer with weights W and biases b, followed by a sigmoid non-linearity, then another linear layer with weights V and biases c, and lastly a softmax layer. Calculate the loss with respect to the target using the cross-entropy loss. With the weights and biases initialized as in figure 3, the derivatives after one forward and backward pass should look like:

derivatives wrt W: [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
derivatives wrt b: [0.0, 0.0, 0.0]
derivatives wrt V: [[-0.4404, 0.4404], [-0.4404, 0.4404], [-0.4404, 0.4404]]
derivatives wrt c: [-0.5, 0.5]

I hope this post helped you to better understand backpropagation. Good luck on your journey in Deep Learning!

Master graduate AI @VU Amsterdam. Currently learning and writing about building brain-computer interfaces. Support me: https://timdb877.medium.com/membership