Implementing Backpropagation from Scratch — Part 1: Fundamentals

Valantis K.
Dec 28, 2022

Introduction

So, you’ve heard about backpropagation, you have already used it countless times, and you’ve decided it’s finally time to traverse the dark paths behind that magical black-box thingy. OK, maybe that is not you, but in fact, that was exactly me not too long ago. Now, if that’s indeed not you, I may still have something for you in the practical implementation section. In this tutorial, we will go through the theory of backpropagation and implement some basic functionality from scratch in Python.

Photo by Eilis Garvey on Unsplash

Why is it important to understand the mechanics?

Since you are already here, this section is probably redundant, but in case anyone’s wondering, let’s take a brief look. So, why is it important to understand the mechanics? It’s the kind of question data scientists debate about almost everything in the field. I mean, data science is easy, right? You just throw a couple of copied-and-pasted lines of sklearn, tensorflow, you-name-it code into your script and you’re done. Well, not so fast. Especially when it comes to real-world data, that’s when the real problems begin, but that is a story for another night (or article). So, let’s cut to the chase.

If you have ever used PyTorch, you have probably created something custom and you are probably familiar with parameters like requires_grad. But how can knowing what backpropagation does behind the scenes help you build better models? Oftentimes, while working with state-of-the-art technologies, you may find yourself in a situation where you need to create custom methods yourself because no code is available yet, or where you need to freeze some network layers in order to fine-tune others. In such situations, it is insightful to understand how gradients are calculated and what happens behind the scenes when you set requires_grad to False.

In general, having a good intuition about how each neural network component works helps you understand why a network fails to converge and generalize properly: whether the loss is NaN (exploding gradients), performance is stalling (the network weights don’t change, i.e., vanishing gradients), or, ultimately, the data itself is the problem. Such knowledge can save tons of debugging hours, since you will have an intuition about the problem that lets you efficiently track it down and handle it, properly combine components, or even introduce novel methods yourself, especially if you are a researcher.

Theory of Backpropagation

Backpropagation allows a neural network to adjust its layer weights so that its outputs better match the expected outputs for a given set of inputs, by computing the gradient of the loss function. The computation is performed efficiently, in contrast to a naive approach where the gradient is computed individually for each weight. As a result, we can optimize the weights via gradient methods to minimize the loss, which makes it feasible to train deep multilayer networks. Commonly used variants of gradient descent include mini-batch gradient descent and Adam.
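For reference, the vanilla gradient descent update for each weight w with learning rate η (a standard formula, included here only as an illustration) is:

w ← w − η · ∂L/∂w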

Starting from the last layer, backpropagation computes the gradient of the loss function with respect to each weight, one layer at a time, using the chain rule. Note that each layer can consist of multiple operations and weights; e.g., the Linear layer contains a multiplication and an addition (y = xw + b). We iterate from the last layer backwards in a dynamic programming fashion to avoid redundant calculations of intermediate terms in the chain rule.

A common misconception is that backpropagation refers both to the process of computing the gradient and to how the gradient is used afterwards. In fact, backpropagation is only the process of computing the gradient. Using that gradient is another beast of its own, and an example of such an algorithm is stochastic gradient descent. The two components combined constitute the learning algorithm.

As you might already know, calculating the gradient of the loss relies on derivatives and the chain rule of derivatives, which are the backbone of backpropagation. So, it is reasonable to briefly scratch their surface next.

Derivatives

Derivatives are the building blocks of calculus.

The derivative of a function measures its sensitivity to fluctuations of its input. In other words, it’s a measure of how much the function’s output changes when the input is changed.

For example, the velocity of a moving object is the derivative of the position of that object with respect to time. Basically, this measures how quickly the position of the object changes in regard to time.

Geometrically, the derivative of a single-variable function at a specific point equals the slope of the tangent line at that point, if the derivative exists (Figure 1). Basically, the derivative of a function maps each point of the original function’s domain to the slope (or gradient) of the tangent line at that point. The tangent line is the best linear approximation of the function near that point. Hence, the derivative is often described as the “instantaneous rate of change”. In this frame, the derivative measures the (instantaneous) rate of change in the dependent variable y = f(x) caused by changing the independent variable x.

Figure 1: The derivative of f(x) at point x equals the slope of the tangent line at that point

Near a point x, the function and its linear approximation have nearly the same slope, and since they both pass through the point (x, y), they also have nearly the same value close to x.

For completeness’ sake, the equation of the linear approximation at a point a is given by:

L(x) = f(a) + f′(a)(x − a)

Equation of the linear approximation at a point a
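For instance, take f(x) = x² near a = 1: f(1) = 1 and f′(1) = 2, so L(x) = 1 + 2(x − 1). Then L(1.1) = 1.2, close to the true value f(1.1) = 1.21.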

The process of computing a derivative is called differentiation, while the reverse is called antidifferentiation.

A function of a real variable f(x) is differentiable at a point a, if the following limit exists:

L = lim(h→0) [f(a + h) − f(a)] / h

Derivative definition at point a

Then, the value of the derivative of f(x) at point a equals L. Generally, the derivative of f(x) can be calculated as:

f′(x) = lim(h→0) [f(x + h) − f(x)] / h

Derivative definition

Let’s take a look at a couple of geometric representations. Consider first a quadratic function and its derivative:

f(x) = 3x² − 4x + 5 & f′(x) = 6x − 4

and their plots look like:

Geometric representation of the quadratic function and its derivative
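As a quick numerical sanity check (a small sketch, not part of the original article’s code), we can approximate f′(2) with the limit definition and compare it against the analytical derivative:

def f(x):
    return 3 * x**2 - 4 * x + 5  # the quadratic from above

h = 1e-6  # a small step, standing in for h -> 0
x = 2.0
numerical = (f(x + h) - f(x)) / h  # forward-difference approximation
analytical = 6 * x - 4             # f'(x) = 6x - 4
print(numerical, analytical)       # ~8.000003 and 8.0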

Then, for a cubic function and its derivative:

f(x) = 3x³ − 4x + 5 & f′(x) = 9x² − 4

and geometrically they look like:

Geometric representation of the cubic function and its derivative

Chain Rule

Given two differentiable functions f and g, the derivative of their composition is given by the chain rule. More precisely, for a function h such that h(x) = f(g(x)), the chain rule in Lagrange’s notation (or what I call high-school notation) is

h′(x) = f′(g(x)) · g′(x)

Chain Rule (Lagrange’s notation)

The chain rule may also be expressed in Leibniz’s notation (or what I call academic notation), which is the one we’ll stick with from now on. If a variable z depends on a variable y, and the variable y in turn depends on a variable x, then z depends on x via the intermediate variable y (see the equation below). Hence, y and z are dependent variables and x is independent.

In other words, if we know the rate of change of z relative to y and that of y relative to x, then the rate of change of z relative to x equals to the product of the two. In this case, the chain rule is expressed as

dz/dx = dz/dy · dy/dx

Chain Rule (Leibniz’s notation)

Let’s take a look at an intuitive example:

Given a car, a bicycle and a walking man, if the car travels with twice the speed of the bicycle and the bicycle moves four times faster than the walking man, then the car travels 2 × 4 = 8 times faster than the man.

In terms of the chain rule, let z, y, and x be the positions of the car, the bicycle, and the walking man, respectively. The rate of change of z relative to y is 2, while the rate of change of y relative to x is 4. So, the rate of change of the position of the car relative to the walking man is

dz/dx = dz/dy · dy/dx = 2 × 4 = 8

Intuitive example of the chain rule: the rate of change of the position of the car relative to the walking man
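To see the chain rule in code, here is a tiny sketch (with hypothetical functions, not from the original article) where z = y² and y = 3x, so that dz/dx = (dz/dy)(dy/dx) = 2y · 3 = 18x:

g = lambda x: 3 * x   # y = g(x), dy/dx = 3
f = lambda y: y**2    # z = f(y), dz/dy = 2y

h = 1e-6
x = 2.0
numerical = (f(g(x + h)) - f(g(x))) / h  # derivative of the composition
analytical = 18 * x                      # chain rule result
print(numerical, analytical)             # ~36.00001 and 36.0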

Implementation

Let’s now try to implement a simple version of backpropagation in Python. First, we need to define the building blocks of the process, namely a class that implements a Tensor-like structure. Why do we need to keep values in a class object, you may ask? Well, for three main reasons. First, the obvious one: we need to store the value of the variable. Second, we need to store the gradient in a convenient manner. And third, when an operation takes place, we need to keep track of the Tensors involved, so that we know how to compute the gradient of their outcome. In actual fact, there are many more reasons, such as methods like backward for computing gradients, sum for summing all tensor values, view for reshaping, or detach for removing gradients.

Let’s name the class MyTensor for uniformity. Now into this class:

  1. We need to define some operations in order to be able to perform calculations with class objects.
  2. We also need to define a backward function that computes the gradient throughout the computational graph.
  3. For simplicity, we will also include some activation functions, like relu and tanh.

But hold on a minute. Mathematical operations between class objects, you said? To support such operations, we need to define how they take place by using operator overloading. If this is the first time you’re hearing of this, don’t worry, it’s pretty simple. Basically, we write functions that tell Python what happens when there is an operation between two class objects, or between a number and a class object. For this article, we will keep it simple and implement only addition, subtraction, and the backpropagation algorithm; in the second part, we will also include multiplication, division, exponentiation, and some activation functions. A minimal constructor for the class is sketched below.
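This sketch is modeled on micrograd’s Value class; the field names (data, grad, _backward, _prev, and an optional label) are exactly the ones the snippets below rely on:

class MyTensor:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data                # the stored value
        self.grad = 0.0                 # accumulated gradient w.r.t. this tensor
        self._backward = lambda: None   # gradient propagation; set by each operation
        self._prev = set(_children)     # the tensors involved in producing this one
        self._op = _op                  # the producing operation, e.g. '+'
        self.label = label              # optional human-readable name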

Addition

So, let’s start with addition. Did you know that typing x + y makes you a magician? What am I talking about? When Python sees x + y, it doesn’t perform the operation directly; it actually attempts to call x.__add__(y).

These functions are called “Dunder Methods”, short for “Double Underscore Methods” (also called “magic methods”).

def __add__(self, other):  # self (MyTensor) + other (MyTensor)
    other = other if isinstance(other, MyTensor) else MyTensor(other)
    out = MyTensor(self.data + other.data, (self, other), '+')

    def _backward():
        self.grad += 1.0 * out.grad
        other.grad += 1.0 * out.grad
    out._backward = _backward

    return out

Basically, in the __add__ function, we define that the data fields of the class instances self and other are going to be summed. We also need to keep track of the effect that this operation has on the gradient. Hence, _backward is used to update the gradients of both the self and other instances. This function basically implements the chain rule, where the multiplying factor out.grad is the gradient accumulated from the steps already processed in the backward pass and 1.0 is the local gradient of the current step. Why is it 1.0?

Let’s look at an example. Consider the equation out = self + other.

The gradient of self (and similarly for other) is equal to

∂out/∂self = ∂(self + other)/∂self = 1.0

And that’s it for addition. OR … is it? __add__ covers the cases where we add two MyTensor instances, like MyTensor(6.0) + MyTensor(3.0), or a MyTensor with a numerical value, like MyTensor(4.0) + 5.0. But what happens if we type 5.0 + MyTensor(4.0)? We get a TypeError. This is because Python first tries to call the left object’s __add__() method, i.e., x.__add__(y), and float’s implementation fails because it doesn’t know how to add a MyTensor.

So, when __add__ fails, Python tries to recover by calling y.__radd__(x) on the right operand of x + y. This method implements the reflected addition operation, that is, addition with swapped operands. __radd__ is only called when Python doesn’t find a working implementation of x.__add__(y) on the left operand of x + y. If __radd__ isn’t implemented either, Python raises a TypeError.

OK, that’s nice, but why doesn’t Python just call y.__add__(x) instead of x.__add__(y) automatically? This could be wrong, because a custom operation may be non-commutative. By calling __radd__ instead, Python lets the right operand define explicitly what addition with swapped operands means, so non-commutative custom operations remain correct. __radd__ is defined as:

def __radd__(self, other):  # other (int, float) + self (MyTensor)
    return self + other
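A quick usage check (with hypothetical values):

a = MyTensor(4.0)
b = 5.0 + a    # float's __add__ gives up, so Python falls back to a.__radd__(5.0)
print(b.data)  # 9.0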

And that’s it. Now that we finished the ‘heavy’ lifting, let’s move on to subtraction.

Subtraction

Now that we have gotten the operator overloading and Dunder Method shenanigans out of the way, subtraction basically follows suit from addition, but with a twist. Apart from the obvious change of operation, there is a change in the gradient calculation. The gradient for self remains the same, equal to 1.0:

∂out/∂self = ∂(self − other)/∂self = 1.0

This is not the case for other. A minor change, but an important one:

∂out/∂other = ∂(self − other)/∂other = −1.0

And nothing fancy for the code:

def __sub__(self, other):  # self (MyTensor) - other (MyTensor)
    other = other if isinstance(other, MyTensor) else MyTensor(other)
    out = MyTensor(self.data - other.data, (self, other), '-')

    def _backward():
        self.grad += 1.0 * out.grad
        other.grad += -1.0 * out.grad
    out._backward = _backward

    return out

def __rsub__(self, other):  # other (int, float) - self (MyTensor)
    return -(self - other)

Accumulating the Gradient

You may have noticed a plus sign when setting the variable’s gradient, self.grad += 1.0 * out.grad, instead of simply assigning it, self.grad = 1.0 * out.grad. This is to handle cases where a single variable is used multiple times in an expression, for example out = (a - b) + (a + b):

a = MyTensor(-2.0, label='a')
b = MyTensor(3.0, label='b')
d = a - b ; d.label = 'd'
e = a + b ; e.label = 'e'
f = d + e ; f.label = 'f'

f.backward()

In this case, if the gradients were assigned with = instead of accumulated with +=, each gradient would reflect only one of the operations instead of all of them. The gradient of a would then be 1.0 instead of 2.0, and the gradient of b would be -1.0 or 1.0 (depending on traversal order) instead of 0.0.
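We can verify this by printing the gradients after calling f.backward():

print(a.grad, b.grad)  # 2.0 0.0, thanks to the += accumulation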

Negative and Positive Numbers

So, we covered self + other and self - other, but what if we type something like -self + other or +self + other (the latter not something you’d really write, but included for completeness’ sake)? Well, you guessed it right. You will be welcomed by a controversially beautiful TypeError. To cover these cases, we need to define __neg__ and __pos__, which return a negated and an unchanged instance of the class, respectively.

def __neg__(self):  # -self
    # build a real graph node (not a detached MyTensor(self.data * -1))
    # so gradients keep flowing through __rsub__ and the __sub__ rewrite below
    out = MyTensor(-self.data, (self,), 'neg')
    def _backward():
        self.grad += -1.0 * out.grad
    out._backward = _backward
    return out

def __pos__(self):  # +self
    return self

A Smarter Way

Now let’s take a look at a smarter approach that leverages a bit of algebraic manipulation to implement the subtraction operation.

Instead of re-writing the __sub__ function from scratch, we can simply frame it as an addition with a minus sign in front of the second operand (-other):

def __sub__(self, other):  # self - other
    return self + (-other)
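With __neg__ in place, reverse subtraction now composes correctly as well. A small sanity check with hypothetical values (using the backward method defined in the next section):

a = MyTensor(2.0)
b = 5.0 - a            # __rsub__ builds -(a - 5.0)
b.backward()
print(b.data, a.grad)  # 3.0 and -1.0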

Backpropagation Gradient Updates

Now that we have prepared some basic operations, it’s time to take a look at the algorithm that updates the gradients backwards. The backward function defines the order in which we traverse the computation graph, starting from the latest operation all the way back to the first one. Note that we don’t want to compute the gradient of a node (by calling _backward) before we have updated everything that comes after it. The ordering that guarantees this is known as a topological sort.

To determine the order of the gradient updates for the chain rule, we traverse each node recursively until we reach the leaf nodes, which are stored first in a list. Finally, we walk this list in reverse and call the _backward function of each node to calculate the gradient of each arithmetic operation in the proper order, applying the chain rule along the way.

def backward(self):
    # topological order all of the children in the graph
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # go one variable at a time and apply the chain rule to get its gradient
    self.grad = 1
    for v in reversed(topo):
        v._backward()

Unit Testing

It’s time for some testing. To verify the validity of our implementation, we test it against PyTorch’s autograd using a small unit test. Spoilers ahead; it passes.

import torch

# pytorch's implementation
w_t = torch.Tensor([-3.0]).double(); w_t.requires_grad=True
v_t = torch.Tensor([3.0]).double(); v_t.requires_grad=True
x_t = torch.Tensor([-1.0]).double(); x_t.requires_grad=True
u_t = torch.Tensor([4.0]).double(); u_t.requires_grad=True

out_t = x_t - (w_t + v_t) - u_t
out_t.backward()  # populate the .grad fields

# our implementation
w = MyTensor(-3.0)
v = MyTensor(3.0)
x = MyTensor(-1.0)
u = MyTensor(4.0)

out = x - (w + v) - u
out.backward()  # populate the .grad fields

# confirm that gradients are identical
assert w.grad == w_t.grad.item()
assert v.grad == v_t.grad.item()
assert x.grad == x_t.grad.item()
assert u.grad == u_t.grad.item()

print('Test Passed!')

You can find the code used for this article at: Google Colab - Implementing Backpropagation from Scratch - Part 1

And that’s it for now folks. In the next article, we will keep building it up and add multiplication, division, exponentiation and some activation functions.

If you liked the content, consider leaving a like and subscribing to keep up with AI and Machine Learning news and tutorials. I write about Natural Language Processing, Financial Time-series, Deep Learning, Transformers, and Feature Engineering, among other topics.

Sources:

  1. Andrej Karpathy — The spelled-out intro to neural networks and backpropagation: building micrograd
  2. Wikipedia — Backpropagation
  3. Wikipedia — Derivative
  4. Wikipedia — Chain rule
  5. PyTorch Docs — Autograd mechanics
  6. Operator Overloading in Python
  7. Wikipedia — Linear approximation
  8. Calculus I — Linear Approximations (lamar.edu)
  9. Python __radd__() Magic Method

Disclosure

This series of articles is meant to serve as a quick summary of the micrograd project of Andrej Karpathy [1], for quick revision and comprehension of the backpropagation mechanism. Code is extracted and re-purposed as needed from micrograd; all credit goes to the original creator, Andrej Karpathy.
