
Courage to Learn ML: Explain Backpropagation from Mathematical Theory to Coding Practice

Amy Ma
19 min read · Jan 17, 2024


Image created by the author using ChatGPT.

What is backpropagation and how is it related to gradient descent?
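
Before going deeper, it helps to pin the relationship down in one line of code: gradient descent only needs the gradient of the loss with respect to each parameter, and backpropagation is the procedure that computes that gradient layer by layer. A minimal sketch (the function and variable names here are mine, not from any library):

# Gradient descent update: backpropagation's only job is to supply `grads`.
def gradient_descent_step(params, grads, l_rate):
    # params and grads are parallel lists of floats (illustrative structure)
    return [w - l_rate * g for w, g in zip(params, grads)]

# Example usage with made-up numbers; in practice `grads` comes from a backward pass.
params = [0.5, -0.3]
grads = [0.12, -0.04]
params = gradient_descent_step(params, grads, l_rate=0.1)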

So the hard part of training DNNs comes from having multiple layers. But before talking more about backpropagation, I’m curious why DNNs typically go deeper rather than wider. Why aren’t shallow but wide networks popular?

Now, let’s explore backpropagation in detail and use the code snippets below to deepen our understanding of the concept.

We will study backpropagation based on this network. Source: https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s

What is the chain rule?

Image created by the author.
Imagine derivatives as streams of water in a landscape. Image created by the author using ChatGPT.
This image shows a simple but incorrect way to understand the chain rule. Image created by the author.
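
To see the chain rule do its job, here is a minimal numeric sketch (my own toy composition, not the article's network): a single weight feeds a sigmoid, which feeds a squared error, and the product of the three local derivatives matches a finite-difference estimate of ∂L/∂w.

from math import exp

# Toy composition: z = w * x, a = sigmoid(z), L = 0.5 * (a - y)**2.
# The chain rule says dL/dw = (dL/da) * (da/dz) * (dz/dw).
def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

w, x, y = 0.8, 1.5, 1.0
z = w * x
a = sigmoid(z)

dL_da = a - y                       # derivative of 0.5 * (a - y)**2 with respect to a
da_dz = a * (1.0 - a)               # sigmoid derivative
dz_dw = x                           # z is linear in w
chain_rule_grad = dL_da * da_dz * dz_dw

# Finite-difference check of the same derivative
def loss(w_):
    return 0.5 * (sigmoid(w_ * x) - y) ** 2

eps = 1e-6
numeric_grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(chain_rule_grad, numeric_grad)  # the two values should agree to several decimals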

How is the chain rule applied to our calculation of ∂L(θ)/∂w1?

Image by source https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s, annotated by the author.
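
To spell out what the annotated diagram says (notation follows the figure): with z = w1x1 + w2x2 + b as the neuron's pre-activation and a as its activation output, the chain rule splits the gradient into ∂L/∂w1 = ∂z/∂w1 · ∂a/∂z · ∂L/∂a. The first two factors are local to the neuron, while the last one, ∂L/∂a, is the piece that depends on everything that happens after this neuron.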

How do we compute ∂z/∂w1, which is the partial derivative of z with respect to the weight that directly determines it in a linear fashion?

Image by source https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s, annotated by the author.

So, the calculation of ∂z/∂w1 isn’t actually a “backward” process. It’s simply equal to the input of the neuron. How about the partial derivative of the activation output with respect to its input, ∂a/∂z?
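
Here is a small sketch of those two local factors with made-up numbers (variable names are mine): ∂z/∂w1 is simply the input x1 that was fed forward into the neuron, and for a sigmoid activation ∂a/∂z can be written entirely in terms of the neuron's own output.

from math import exp

# Two "local" factors for z = w1*x1 + w2*x2 + b and a = sigmoid(z) (illustrative values).
w1, w2, b = 0.4, -0.2, 0.1
x1, x2 = 2.0, 3.0

z = w1 * x1 + w2 * x2 + b
a = 1.0 / (1.0 + exp(-z))

dz_dw1 = x1               # dz/dw1 is just the forward-pass input attached to this weight
da_dz = a * (1.0 - a)     # sigmoid derivative, computed from the output alone

# finite-difference check that dz/dw1 really is x1
eps = 1e-6
numeric = (((w1 + eps) * x1 + w2 * x2 + b) - ((w1 - eps) * x1 + w2 * x2 + b)) / (2 * eps)
print(dz_dw1, numeric)    # both are x1 = 2.0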

Considering our discussion thus far, when computing ∂L/∂w1, two of the three components do not require loss information or input from later layers. Then why do we refer to the calculation process as ‘backpropagation’?

Imagine the entire neural network as a cake factory’s production line. Image created by the author using ChatGPT.
Image by source https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s, annotated by the author.
This is key to understanding why the process is named backpropagation.
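
A minimal sketch of the recursion behind the name (reusing the same dict-based neuron structure as the code snippets later in this article): the ∂L/∂a of a hidden neuron is a weighted sum of the deltas already computed for the next layer, so the computation can only proceed from the output layer backwards.

# ∂L/∂a for hidden neuron j, given the next layer's already-computed deltas.
# `next_layer` is a list of dicts with 'weights' and 'delta', as in the snippets below.
def hidden_dL_da(j, next_layer):
    return sum(neuron['weights'][j] * neuron['delta'] for neuron in next_layer)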

So, we recursively compute ∂L/∂z for each layer backwards. Then, how do we calculate the partial derivatives of the loss with respect to the output layer’s z, which serves as the first value of ∂L/∂z to start the process?
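
If we assume a squared-error loss L = ½ Σ (a − y)² over the output neurons (the same choice the code below makes implicitly), then ∂L/∂a at the output layer is simply a − y, and multiplying by the activation derivative gives the first ∂L/∂z that seeds the recursion. A hedged sketch:

# Output-layer delta for one sigmoid output neuron, assuming L = 0.5 * sum((a - y)**2).
def output_delta(output, expected):
    dL_da = output - expected           # matches `neuron['output'] - expected[j]` below
    da_dz = output * (1.0 - output)     # sigmoid derivative in terms of the output
    return dL_da * da_dz                # this is ∂L/∂z for the output neuron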

Our discussions so far about backpropagation have focused on using SGD (Stochastic Gradient Descent) with a batch size of 1. If we use a larger batch size for training, would that alter the calculations?
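
In terms of the mechanics, each example's gradient is computed exactly as before; a mini-batch simply averages (or sums) those per-example gradients before the single weight update. A minimal sketch (the structure here is mine):

# Average per-example gradients into one mini-batch gradient.
# `per_example_grads` is a list of gradient lists, one per training example.
def batch_gradient(per_example_grads):
    n = len(per_example_grads)
    return [sum(g[k] for g in per_example_grads) / n
            for k in range(len(per_example_grads[0]))]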

Based on our discussion, how do we combine these elements to construct the backpropagation calculation? And how can we use our insights to decode the following code snippets?

from math import exp

# Transfer neuron activation (sigmoid)
def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))

# Calculate the derivative of a neuron's output (sigmoid derivative, written in terms of the output)
def transfer_derivative(output):
    return output * (1.0 - output)

Calculate ∂L/∂a.

# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        # calculate ∂L/∂a for each layer
        if i != len(network) - 1:
            ...
        else:
            # ∂L/∂a of the output layer
            for j in range(len(layer)):
                neuron = layer[j]
                errors.append(neuron['output'] - expected[j])

# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        # calculate ∂L/∂a for each layer
        if i != len(network) - 1:
            # ∂L/∂a of a hidden layer: weighted sum of the next layer's deltas
            for j in range(len(layer)):
                error = 0.0
                for neuron in network[i + 1]:
                    error += (neuron['weights'][j] * neuron['delta'])
                errors.append(error)
        else:
            ...

Calculate ∂L/∂z.

# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        ...  # fill `errors` with ∂L/∂a for this layer (output or hidden case, as above)
        # calculate the partial derivative of the loss with respect to the input before activation:
        # ∂L/∂z = ∂L/∂a * ∂a/∂z
        for j in range(len(layer)):
            neuron = layer[j]
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])

Calculate ∂L/∂w to adjust weights.

# Update network weights with error
def update_weights(network, row, l_rate):
    for i in range(len(network)):
        inputs = row[:-1]
        if i != 0:
            inputs = [neuron['output'] for neuron in network[i - 1]]
        for neuron in network[i]:
            for j in range(len(inputs)):
                # ∂L/∂w = ∂L/∂z * ∂z/∂w = delta * input, applied as a gradient descent step
                neuron['weights'][j] -= l_rate * neuron['delta'] * inputs[j]
            # bias term: its input is 1
            neuron['weights'][-1] -= l_rate * neuron['delta']
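
One thing the snippets above take for granted is that every neuron already holds an 'output' from the forward pass. For completeness, here is a hedged sketch of that forward pass in the same dict-based style (my reconstruction, not the article's exact code), followed by how the three functions fit into one SGD step.

# Forward pass in the same style: the last weight of each neuron is its bias.
# `inputs` should be the feature part of a row (e.g. row[:-1]).
def forward_propagate(network, inputs):
    for layer in network:
        new_inputs = []
        for neuron in layer:
            activation = neuron['weights'][-1]            # bias term
            for j in range(len(inputs)):
                activation += neuron['weights'][j] * inputs[j]
            neuron['output'] = transfer(activation)
            new_inputs.append(neuron['output'])
        inputs = new_inputs
    return inputs

# One SGD step (batch size = 1) then reads as: forward, backward, update.
# forward_propagate(network, row[:-1])
# backward_propagate_error(network, expected)
# update_weights(network, row, l_rate)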

You know, backpropagation seems kind of like the reverse of the forward pass, doesn’t it? Is that a fair way to look at it?

Why is it important for us to understand the intricate details of backpropagation’s calculations?

How does backpropagation inform the choice of activation functions?

Activation Functions and their Derivatives. Source: https://dwaithe.github.io/images/activationFunctions.png
Vanishing and exploding gradients in neural network models. Source: https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Vanishing-and-Exploding-Gradients-in-Neural-Network-Models-Debugging-Monitoring-and-Fixing-Practical-Guide_7.png?resize=636%2C497&ssl=1
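
A quick hedged illustration of why this matters (toy numbers, not from the article): every layer of backpropagation multiplies in one activation derivative, and the sigmoid's derivative is at most 0.25, so a stack of sigmoid layers shrinks the gradient roughly geometrically, which is the vanishing-gradient problem shown above. ReLU's derivative is 1 for positive inputs, so it avoids this particular shrinkage.

from math import exp

# Each backprop step through a sigmoid layer multiplies in a factor of at most 0.25.
def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

grad = 1.0
for depth in range(1, 11):
    a = sigmoid(0.0)             # the sigmoid derivative is largest at z = 0
    grad *= a * (1.0 - a)        # multiply in one layer's derivative (= 0.25 here)
    print(depth, grad)           # shrinks like 0.25 ** depth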

