DIY Deep Learning: Crafting Your Own Autograd Engine from Scratch for Effortless Backpropagation

Bhavya Sankhla
SFU Professional Computer Science
17 min read · Feb 11, 2023

Authors: Abhishek Deshpande, Bhavya Sankhla, Ayush Sharma, Rituraj Ojha

This blog is written and maintained by students in the Master of Science in Professional Computer Science Program at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit {sfu.ca/computing/mpcs}.

Deep learning has revolutionized many industries, but building and training these complex models can be a challenging task. Attempting to train neural networks without comprehending the optimization process is akin to observing a magic performance. While deep learning frameworks like PyTorch and TensorFlow offer robust and optimized implementations of backpropagation, developing a deeper understanding of the underlying mechanics is highly useful. It not only helps in creating a solid foundational knowledge of how learning is done from data but is also crucial for practitioners to:

  • Better diagnose and debug their training issues.
  • Optimize their models by making informed decisions about hyperparameters, regularization, and other factors that affect training.
  • Implement custom loss functions and layer types, which can be essential for certain applications.
  • Develop a better understanding of the computational graph, which is the foundation of PyTorch’s dynamic computational model.

In this article, we will dive into the heart of deep learning by crafting our own tiny, scalar-valued autograd engine from scratch, enabling us to perform effortless backpropagation for training neural networks. This hands-on approach will give us a deeper understanding of the mechanics behind deep learning and provide us with greater control over the training process. Whether you are a seasoned deep learning practitioner or just getting started, this article will provide you with a solid foundation for building and training your own neural networks.

This guide is a walkthrough of parts of Andrej Karpathy’s popular micrograd repository (approach and code), which implements a tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API.

Overview

Unlike conventional guides, we’ll start off by directly showing you the usage and capabilities of the engine that you’ll build and tinker around with. This is done to provide a good idea of what you’re building and why you’re building it.

At their core, neural networks are simply directed acyclic graphs built from many underlying multiply-and-accumulate operations. Consider the following contrived example, which shows a number of the supported operations:

a = Value(-4.0)
b = Value(2.0)
c = a + b
d = b * c
c += 1 + c + (-a)
d += d * 2 + (b + a).tanh()
g = c - d
print(f'{g.data:.4f}') # prints 13.9640, the outcome of this forward pass
g.backward()
print(f'{a.grad:.4f}') # prints -5.0707, i.e. the numerical value of dg/da
print(f'{b.grad:.4f}') # prints 1.9293, i.e. the numerical value of dg/db

Functionality

In the above example, a mathematical expression is built out, where a and b are transformed into c, and eventually into d and g. The scalar values (stored in a and b) are wrapped in Value objects. All the functionality of the engine and the operations it supports (which we'll build from scratch) lives in the Value class. Given two Value objects, you can add them, multiply them, raise them to a constant power, apply activation functions, divide by constants, and so on. By composing Value objects, we build an expression graph that maps inputs (a and b) to an output value (g). Behind the scenes, our engine (which is essentially the Value class) builds out this mathematical expression while keeping track of the child nodes of every operation (e.g. a and b are the children of c, produced by the addition applied to a and b) by maintaining pointers to the child Value objects.

Usage

Having built an expression graph in the way described above, we can tap into information about any node through its attributes. Not only can we run the forward pass and look at the value of g (accessible as g.data), but we can also call g.backward() (like we do in PyTorch) to initiate backpropagation at node g and calculate the gradients of all nodes automatically. This starts a backward journey through the expression graph that recursively applies the chain rule from calculus, evaluating the derivative of g with respect to every internal node. To query the derivative of g with respect to any node, we just read <node>.grad. These derivatives are the most important pieces of information, as they tell us how a given node (say a) affects the output (g) through this mathematical expression.
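As a sanity check on those numbers, the same expression can be written with plain Python floats and the derivatives approximated with finite differences. This is a quick verification sketch, not part of the engine we'll build:

```python
import math

# re-create the forward pass from the example with plain floats
def forward(a, b):
    c = a + b
    d = b * c
    c = c + (1 + c + (-a))
    d = d + (d * 2 + math.tanh(b + a))
    return c - d  # this is g

h = 1e-6
g = forward(-4.0, 2.0)
print(f'{g:.4f}')                                 # 13.9640, same as the forward pass
print(f'{(forward(-4.0 + h, 2.0) - g) / h:.4f}')  # ≈ -5.0707, i.e. dg/da
print(f'{(forward(-4.0, 2.0 + h) - g) / h:.4f}')  # ≈ 1.9293, i.e. dg/db
```

The autograd engine computes the same gradients exactly via the chain rule, instead of approximating them numerically.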

We will also be able to generate computational graphs like the following to visualize our expressions:

NOTE: A scalar-valued autograd engine is overkill for production, where you'll always be dealing with n-dimensional tensors. This guide is built so that you can understand backpropagation and the chain rule in isolation, and thereby understand how neural network training works. The underlying math and the fundamental logic remain the same in implementations involving high-dimensional tensors; any changes (made to exploit parallelism and run faster) are purely for efficiency reasons.

Okay! So now that we have a fair idea of what we’re dealing with, let’s dive right in and implement everything step-by-step.

Understanding Derivatives

We know that the derivative of a function f(x) is given by:

f'(x) = lim (h → 0) [f(x + h) - f(x)] / h

The above equation asks: if you slightly bump up your sample input x by h, with what sensitivity does the function respond, i.e. what is the slope of the response? Let's say we have multiple inputs to a simple function:

h = 0.0001

#inputs
a = 3.0
b = -2.0
c = 12.0

output1 = a * b + c
c += h
output2 = a * b + c

print(f'output1 : {output1}') #prints output1: 6.0
print(f'output2 : {output2}') #prints output2: 6.0001
print(f'slope : {(output2 - output1)/h}') #prints slope : 0.9999999999976694

In the above example, we nudge one of the inputs, c, and recalculate the function output. We then calculate the slope (the normalized rise over run), which prints a positive value of approximately 1. We can make intuitive sense of this positive slope by looking at the equation: c is a "child node" of the + operator, so a small increment in the value of c increases the overall function value by the same small amount. The same intuition can be applied to reason through the slopes obtained by nudging the other inputs (taking note of their signs).
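For instance, nudging a instead of c should produce a slope of b = -2.0, since a is multiplied by b in the expression. A quick check along the same lines:

```python
h = 0.0001

#inputs
a = 3.0
b = -2.0
c = 12.0

output1 = a * b + c
a += h  # nudge a this time
output2 = a * b + c

print(f'slope : {(output2 - output1)/h:.4f}') #prints slope : -2.0000, i.e. b
```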

This gives us an intuitive sense of what the derivative of a function represents. Now, we’d like to move towards neural networks. Since neural networks are massive mathematical expressions, we need some data structures to maintain these expressions. Now that we’re equipped with the required reasoning and background, we’re ready to start building out the Value object that we previously saw!

Value Object

Let’s start with a skeleton Value class:

class Value:

    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

a = Value(2.0)
print(a) #prints Value(data=2.0)

So a simple Value object wraps and keeps track of a single scalar value. The __init__ method stores the scalar data, and Python internally calls the __repr__ method (e.g. when printing the object) to get its string representation.

Say, we want to add two Value objects. Let’s update our class:

class Value:

    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data)
        return out

a = Value(3.0)
b = Value(-1.0)
print(a + b) #prints Value(data=2.0)

If we use the regular + operator, Python will internally call a.__add__(b). Our implementation returns a new Value object that simply wraps the sum of the data values of a and b.
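To make that dispatch concrete, here is a tiny illustration with a hypothetical Box class (not part of our engine): the + operator and an explicit __add__ call are the same thing.

```python
class Box:
    def __init__(self, v):
        self.v = v
    def __add__(self, other):
        return Box(self.v + other.v)

p, q = Box(3.0), Box(-1.0)
print((p + q).v)        # 2.0 -- Python rewrites this as p.__add__(q)
print(p.__add__(q).v)   # 2.0 -- the same call, written explicitly
```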

Let’s also add multiply:

class Value:

    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data)
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data)
        return out

a = Value(3.0)
b = Value(-2.0)
c = Value(12.0)
d = a*b + c

print(d) #prints Value(data=6.0)

We're able to calculate simple expressions using our Value class. However, we're still missing the "connective tissue" of the expression: the pointers that keep track of which values produce which other values. Let's add this logic to our Value class by introducing a _children argument and a _prev attribute:

class Value:

    def __init__(self, data, _children=()): #_children is an empty tuple by default
        self.data = data
        self._prev = set(_children) #empty set by default

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other)) #add children tuple
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other)) #add children tuple
        return out

a = Value(3.0)
b = Value(-2.0)
c = Value(12.0)
d = a*b + c

print(d) #prints Value(data=6.0)
print(d._prev) #prints {Value(data=12.0), Value(data=-6.0)}

Accessing d._prev prints the values of the child nodes. We now know the children of every single value, but the last piece of information that's missing is which operation created each value. So we add another attribute to our Value class called "_op":

class Value:

    def __init__(self, data, _children=(), _op='', label=''): #_op is an empty string by default
        self.data = data
        self._prev = set(_children) #empty set by default
        self._op = _op
        self.label = label #optional node labels (used for visualization later)

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+') #add operator string
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*') #add operator string
        return out

a = Value(3.0)
b = Value(-2.0)
c = Value(12.0)
d = a*b + c

print(d) #prints Value(data=6.0)
print(d._prev) #prints {Value(data=12.0), Value(data=-6.0)}
print(d._op) #prints +

Now we can calculate the expression using our Value class while keeping track of everything we want.

Visualization

Since these expressions are about to get considerably larger, we need a tool to visualize the graphs we're tracking (similar to the visualization in the Overview section). The following code snippet (taken from micrograd/trace_graph.ipynb at master · karpathy/micrograd · GitHub) uses graphviz to nicely visualize the expressions we build:

from graphviz import Digraph

def trace(root):
    # builds a set of all nodes and edges in a graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'}) #LR = left to right

    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        #for any value in the graph, create a rectangular ('record') node for it
        dot.node(name=uid, label = "{data %.4f }" % (n.data, ), shape='record')
        if n._op:
            #if this value is a result of some operation, create an op node for it
            dot.node(name = uid + n._op, label = n._op)
            #and connect this node to it
            dot.edge(uid + n._op, uid)

    for n1, n2 in edges:
        #connect n1 to the op node of n2
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)

    return dot

Calling the draw_dot function on the output node (d) will help us visualize our previous expression:

draw_dot(d)

Our forward pass looks good. Now, in order to initialize backprop, we need a way to store gradients as we move along the expression graph. To do this, we'll add another attribute to our Value class called "grad":

class Value:

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0 #initialize to zero
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

a = Value(3.0)
b = Value(-2.0)
c = Value(12.0)
d = a*b + c

print(d) #prints Value(data=6.0)
print(d._prev) #prints {Value(data=12.0), Value(data=-6.0)}
print(d._op) #prints +

We initialize self.grad to 0.0 as we assume that no value initially impacts or affects the output. We can also update our label definition in the draw_dot visualization function from before to incorporate gradients (and node labels):

def draw_dot(root, format='svg', rankdir='LR'):
    assert rankdir in ['LR', 'TB']
    nodes, edges = trace(root)
    dot = Digraph(format=format, graph_attr={'rankdir': rankdir})

    for n in nodes:
        #modified label definition
        dot.node(name=str(id(n)), label = "{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad), shape='record')
        if n._op:
            dot.node(name=str(id(n)) + n._op, label=n._op)
            dot.edge(str(id(n)) + n._op, str(id(n)))

    for n1, n2 in edges:
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)

    return dot

Backpropagation

Consider the following graph (obtained from the previous expression and values that we were working with but with the added gradient tracking):

Our goal is to update the gradient value of each box by implementing backprop, which is, simply put, a recursive application of the chain rule backward through the computation graph. In the context of the above graph, we need to understand the two base nodes it is built on: the addition node and the multiplication node.

Addition

The expression for the above graph is e = c + d. From calculus, we know that de/dc = 1 (and likewise de/dd = 1).

Multiplication

The expression for the above graph is c = a*b. From calculus, we know that dc/db is a (a.data in our case). Similarly, dc/da is b (or b.data).

Initialization

As a base case, we initialize the gradient (or node.grad) of the final output node (in our case e) to 1 (since the derivative of a variable with respect to itself is 1).

With e.grad set to 1 in the graph above, every remaining gradient will be computed as a local derivative times the gradient already tracked at that node's output (this is the chain rule). So our graph now looks as follows:

backprop initialization with e.grad=1.0

The next step is to traverse backward and calculate de/dc and de/dd. We encounter a + operator, which (from our base-node explanation) tells us that the gradient for nodes c and d will be 1. To put it more formally: the addition operator routes the incoming gradient backward unchanged (the incoming gradient from the output node is just multiplied by 1). So we can update both c.grad and d.grad to 1 and update our graph:

Now, we travel further backward from our last visited node to calculate de/da and de/db. We come across the multiplication operator. While it isn't directly connected to the output node e (only via node c), we know (from the multiplication base case) the local influence c's inputs have: locally, a.grad (dc/da) equals b.data and b.grad (dc/db) equals a.data; the multiplication operator routes each input's data as the gradient for the other input. To find the influence a and b have on the overall output, we multiply (following the chain rule) the local derivative (or "influence") by the overall derivative being tracked from the output, stored in c.grad. So a.grad = b.data * c.grad = -2 * 1 = -2. Similarly, b.grad = a.data * c.grad = 3 * 1 = 3. We now have all our gradients, and our updated graph looks as follows:

Computational Graph with updated gradients
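These hand-derived gradients can be double-checked numerically with the same nudge-and-measure trick from earlier. A small verification sketch, where e(a, b, d) mirrors the graph above (c = a*b, then e = c + d):

```python
# e = a*b + d, with a = 3, b = -2, d = 12 (matching the graph above)
def e(a, b, d):
    return a * b + d

h = 1e-6
base = e(3.0, -2.0, 12.0)
print(f'de/da : {(e(3.0 + h, -2.0, 12.0) - base)/h:.4f}')  # ≈ -2.0000, i.e. b.data
print(f'de/db : {(e(3.0, -2.0 + h, 12.0) - base)/h:.4f}')  # ≈  3.0000, i.e. a.data
print(f'de/dd : {(e(3.0, -2.0, 12.0 + h) - base)/h:.4f}')  # ≈  1.0000, routed by +
```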

Now that we have a clear approach to computing gradients as well, let’s remove the training wheels and implement more complex mathematical expressions (neurons) and activation functions that are better suited to multi-layer perceptrons. Also, let’s finally ditch the manual gradient updates and build out automated backpropagation logic.

Let’s start by defining the tanh activation function in our Value class:

def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1) #note: requires `import math` at the top of the file
    out = Value(t, (self, ), 'tanh')
    return out
One thing worth noting here is that we chose to define tanh directly from its closed-form expression rather than compose it from existing (+ and *) and additional (exponentiation and division) operators. This is deliberate, to show that while we could build functions from the most "atomic" Value operations, we don't need to; we can create functions at arbitrary levels of abstraction. They can be relatively complicated functions (like tanh) or very simple ones (like the + operator), and the choice is entirely up to us (and our optimization/efficiency requirements). The only requirement is that we know how to differentiate the function we define, i.e. how to compute its local derivative (how its inputs affect its output).
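For tanh, the local derivative we'll need shortly is d/dx tanh(x) = 1 - tanh(x)^2, which we can confirm numerically with a quick standalone check:

```python
import math

x = 0.5
h = 1e-6
numeric = (math.tanh(x + h) - math.tanh(x)) / h  # finite-difference slope
analytic = 1 - math.tanh(x)**2                   # closed-form local derivative
print(f'{numeric:.4f} vs {analytic:.4f}')        # both ≈ 0.7864
```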

Let’s also update our mathematical expression to make it resemble a neuron. We build it in steps so that we have pointers to all the intermediate nodes:

# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2 (synaptic strengths)
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias
b = Value(7.0, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1 * w1; x1w1.label='x1*w1'
x2w2 = x2 * w2; x2w2.label='x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh()

draw_dot(o)
A Neuron (W*X+b)

Now, to automate our backpropagation logic, we add a function (_backward) to each operator definition in the Value class, which defines how we chain the output's gradient into the input gradients. Let's start by looking at the addition function:

class Value:

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0 #at initialization, we assume this value does not affect the output
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self._backward = lambda: None # no-op by default (leaf nodes have nothing to propagate)

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward(): # defining backward logic for addition
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

We initialize _backward to a no-op function (lambda: None), so by default it does nothing. This also handles the case where a node is a leaf node (it has no inputs into which the output gradient needs to be chained). The _backward() function above defines the gradient-routing logic for the + operator that we discussed previously. We call out._backward() (similar in spirit to PyTorch's backward()) to propagate the gradient as we traverse backward through the graph. Let's define _backward() for our remaining operators and activation functions and complete our Value class:

import math

class Value:

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0 #at initialization, we assume this value does not affect the output
        self._backward = lambda: None # no-op by default (leaf nodes have nothing to propagate)
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self, ), 'tanh')
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

All we need to do now is seed the output gradient and call _backward() on each non-leaf node, moving backward through the graph, and all our gradients get computed automatically!

o.grad = 1.0 #seed the output gradient (the base case)
o._backward()
n._backward()
x1w1x2w2._backward()
x2w2._backward()
x1w1._backward()

draw_dot(o)

Calling _backward() on each non-leaf node is still fairly tedious, so let’s do one final addition to our Value class after which one backward call at the output node (like PyTorch) is all we need to auto-calculate all gradients in the graph. We add topological sort to our Value class to backpropagate through the entire expression graph:

def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    self.grad = 1.0 #base-case
    for node in reversed(topo):
        node._backward()
Using this sorting technique, we start at o and visit the nodes in topological order: o only adds itself to the topo list (in the code above) after all of its children have been processed, so iterating over reversed(topo) starts the backward pass at the output.
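The same post-order traversal can be sketched on a plain dictionary DAG (with hypothetical node names mirroring our neuron example) to see the ordering that backward() relies on:

```python
# children of each node; leaves have no children
graph = {
    'o': ['n'],
    'n': ['x1w1x2w2', 'b'],
    'x1w1x2w2': ['x1w1', 'x2w2'],
    'x1w1': [], 'x2w2': [], 'b': [],
}

topo, visited = [], set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in graph[v]:
            build_topo(child)
        topo.append(v)  # appended only after all children are processed
build_topo('o')

print(topo)                   # 'o' appears last...
print(list(reversed(topo)))   # ...so reversed(topo) starts the backward pass at 'o'
```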

Putting everything together!

  • The final version of the Value class (tip: check out the comments in the following code, which fix a few small bugs for a more robust implementation):
import math

class Value:

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0 #at initialization, we assume this value does not affect the output
        self._backward = lambda: None # no-op by default (leaf nodes have nothing to propagate)
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        # wrap non-Value operands in a Value so they get a data attribute,
        # e.g. to allow a + 1 (where 1 is not a Value object)
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # use += to accumulate gradients instead of overwriting them
            # (needed when a node is used more than once, e.g. a + a)
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __neg__(self): # -self
        return self * -1

    def __radd__(self, other): # other + self
        return self + other

    def __sub__(self, other): # self - other
        return self + (-other)

    def __rsub__(self, other): # other - self
        return other + (-self)

    def __rmul__(self, other): # other * self
        return self * other

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self, ), 'tanh')
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        self.grad = 1.0 #base-case: the derivative of a value with respect to itself is 1
        for node in reversed(topo):
            node._backward()
  • The final version of forward and backward pass:
#Forward Pass
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
b = Value(7, label='b')
x1w1 = x1 * w1; x1w1.label='x1*w1'
x2w2 = x2 * w2; x2w2.label='x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'

#Backward Pass
o.backward()

#Graph
draw_dot(o)
Computational Graph with auto-calculated gradients

Wrapping up

And that’s it! This should provide you with sufficient background and boilerplate code to:

  • Expand your implementation’s capabilities (by adding more operators, activation functions, and complex mathematical expressions).
  • Dig deeper into Karpathy’s micrograd repository and understand his neural network library implementation, which builds Neuron, Layer, and MLP abstractions on top of this engine.
  • Complete your neural network training process by further understanding and implementing micrograd’s code for loss calculations and gradient descent optimization.

Happy Learning :)
