Back(prop) to Basics — Fundamentals of Neural Networks

Elad Rapaport
31 min read · May 26, 2023


Andrej Karpathy’s YouTube video course — Lecture Notes. Part 1.

Photo by JJ Ying — https://unsplash.com/photos/8bghKxNU1j0

This post is based entirely on Andrej Karpathy’s video tutorial “The spelled-out intro to neural networks” — https://www.youtube.com/watch?v=VMj-3S1tku0&t=4822s

In this post, we will learn about the very basic mechanisms of neural networks — how does a network learn to classify dog and cat images? How does a network learn to generate realistic text? The basics behind all of these tasks are the same!

Let’s start by asking ourselves some questions about neural networks (and answering them):

  1. What are neural networks? They are simply mathematical functions that have an input and an output.
  2. What does it mean to “train” a neural network? It means tuning the function’s parameters such that its output is as close as possible to the truth.
  3. How do we tune these parameters? Using an algorithm called backpropagation, which calculates the derivative of the target term we aim to minimize with respect to each parameter of the network, and then uses these derivatives to nudge each parameter in the direction that shrinks the target term (a sketch of this nudge appears right after this list).
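In code, the nudge from question 3 boils down to a one-liner that we will meet again in Part 11 (a sketch; parameters and learning_rate are placeholder names):

for p in parameters:                  # all tunable weights and biases of the network
    p.data -= learning_rate * p.grad  # step against the gradient to shrink the loss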

In essence, after you read the above questions, you know how neural networks work. But of course, we wish to go (a bit) deeper than that. Hence, we will follow Andrej Karpathy’s excellent course on YouTube, and dive into the nitty-gritty of neural networks, piece by piece.

This post will be organized as follows:

  • Part 1. What is a Derivative?
  • Part 2. Micrograd — A Minimalistic Autodiff Tool
  • Part 3. Micrograd — Manual Calculation of Gradients
  • Part 4. Micrograd — Building a simple neuron
  • Part 5. Manual Backpropagation through a Neuron
  • Part 6. Implementing Backward Operations
  • Part 7. Adding support for constants
  • Part 8. Breaking up Tanh into its constituent parts
  • Part 9. Comparing Micrograd to Pytorch
  • Part 10. Building a multi-layer perceptron (MLP)
  • Part 11. Training the MLP
  • Part 12. Summary

Just like Andrej, we will start at the very (very) basics. What is a Derivative? It might seem basic but it’s important to get a very firm grasp on this subject before moving on. Derivatives are the core of the backpropagation algorithm, which is the fuel behind (almost) every modern neural network.

Part 1. What is a Derivative?

The layman’s definition of a derivative is this:

“I have a function in which y is dependent on the value of x. If I nudge x just a tiny bit at a certain point, what would be the rate of change of y in response to that?”

For those of you that prefer equations, let’s see what Wikipedia has to say about this:

f′(a) = lim (h→0) [f(a + h) − f(a)] / h

Eq. 1 — The definition of a derivative. Taken from Wikipedia.

As we see in Eq. 1 — we take the input a and nudge it by a tiny amount — h. We take the difference between the nudged version f(a + h) and the original f(a). Then we divide it by the amount that we nudged — h, to get the rate of change. Also, notice the lim in the equation. This means that the true slope is obtained when the nudging amount h is very close to 0.

Let’s see some Python code to make this clearer. We will first create a polynomial function and plot it using matplotlib.

import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return 3 * x ** 2 - 4 * x + 5

xs = np.arange(-5, 5, 0.25)
ys = f(xs)
plt.plot(xs, ys)
Fig. 1 — f(x) plotted.

What is the slope at x=3? Before reading on, ask yourself — is it positive? negative?

Let’s calculate it using the definition of the derivative we just saw.

h = 0.00001
x = 3
(f(x + h) - f(x)) / h

Out: 14.00003000000538

And indeed, we get the correct answer (you can validate this with calculus: the derivative is f′(x) = 6x − 4, so at x = 3 the slope is exactly 14).
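To see the lim in Eq. 1 at work, here is a quick sketch (not in the original notebook) that shrinks h and watches the estimate approach the true slope of 14:

x = 3
for h in [0.1, 0.001, 0.00001]:
    # the smaller the nudge, the closer we get to the true slope
    print(h, (f(x + h) - f(x)) / h)  # 14.3, then 14.003, then 14.00003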

Let us now do the same thing with a slightly more complex function that depends on several parameters. We will check what is the derivative of the function with respect to each one of its parameters.

def g(a, b, c):
    return a * b + c

a = 2
b = -3
c = 10

Before reading on, try to guess what the slope of g(a, b, c) will be with respect to each of its parameters.

print(f'Slope w.r.t a: {(g(a + h, b, c) - g(a, b, c)) / h}')
print(f'Slope w.r.t b: {(g(a, b + h, c) - g(a, b, c)) / h}')
print(f'Slope w.r.t c: {(g(a, b, c + h) - g(a, b, c)) / h}')

Out:

Slope w.r.t a: -3.000000000064062
Slope w.r.t b: 2.0000000000131024
Slope w.r.t c: 0.9999999999621422

We see that if we nudge a by a tiny amount, the slope is negative. This makes sense: a is multiplied by b, which is negative, so increasing a adds more of that negative contribution to the function. The logic is similar for the other parameters :)

Part 2. Micrograd — A Minimalistic Autodiff Tool

In this tutorial, we are going to build Micrograd — a minimalistic auto-differentiation utility in Python. Autodiff is the basis of modern neural network libraries such as TensorFlow and Pytorch, and is an algorithm for automatically calculating gradients in computational graphs (such as neural networks). It is tempting to think of Autodiff as a very complicated piece of machinery, yet Andrej shows in his course that we can create a fully functional Autodiff tool (Micrograd) in just ~100 lines of code.

The full version of Micrograd (by Andrej Karpathy) can be found here — https://github.com/karpathy/micrograd.

The purpose of an Autodiff tool is to produce the slope of any function with respect to any one of its parameters. This is useful in neural network training because neural networks are giant functions with many parameters, and to improve them we need to calculate the slope of our target with respect to each of the function’s parameters, to understand how we should nudge it.

An Autodiff tool works by keeping track of a computational graph, which it uses to understand which parameters affected which parts of the function calculation and how. As we will see, each type of mathematical operator has a different derivative, and by chaining together these derivatives using the chain rule we will be able to obtain the slope of a very deep set of operations.

Let’s dive in. First, we need some way to represent a number in Python which will allow us to follow the calculations that it is involved in. This will help us build the computational graph needed for Autodiff.

class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f'Value(data={self.data})'

a = Value(2.0)
b = Value(-3.0)
print(a)
print(b)

Out:

Value(data=2.0)
Value(data=-3.0)

We created a class called Value which keeps track of the value of a number, and implements the __repr__ function, which tells Python how to print the representation of this object. Not very useful, right? What if we want to add two numbers, for example? Let’s add some computational abilities to this class.

class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f'Value(data={self.data})'

    def __add__(self, other):
        return Value(self.data + other.data)

    def __mul__(self, other):
        return Value(self.data * other.data)

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
a * b + c # equivalent to (a.__mul__(b)).__add__(c)

Out:

Value(data=4.0)

Great! We are now able to reproduce our function from before using the Value class. Notice that a * b + c is equivalent to (a.__mul__(b)).__add__(c); Python knows to use the __add__ and __mul__ functions (when they are implemented) upon encountering the + and * operators. But wait, this still doesn’t give us the computational graph we need in order to calculate derivatives. Let’s make some changes to the class to keep track of the calculations that occur.

class Value:
    def __init__(self, data, prev=(), op='', label=''):
        self.data = data
        self._prev = set(prev)
        self._op = op
        self.label = label
        self.grad = 0

    def __repr__(self):
        return f'Value(data={self.data})'

    def __add__(self, other):
        return Value(self.data + other.data, prev=(self, other), op='+')

    def __mul__(self, other):
        return Value(self.data * other.data, prev=(self, other), op='*')

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
ans = a * b + c # equivalent to (a.__mul__(b)).__add__(c)
print(f'ans._prev: {ans._prev}')
print(f'ans._op: {ans._op}')

Out:

ans._prev: {Value(data=10.0), Value(data=-6.0)}
ans._op: +

We added new attributes to the class: self._prev and self._op. These are used to keep track of the elements that participated in each calculation, and of the operator that was used. We printed the _prev and _op values of the function’s final value: its last operation was an addition, in which c and a*b participated (if we were to print the _prev and _op values of the inner element a*b, we would see 2, -3, and *). We also added a label parameter that will help us visualize the computational graph soon, and a grad parameter that will hold derivative information for each node.
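As a quick sanity check (a small sketch, reusing the objects above), we can fish the inner a*b node out of ans._prev and inspect it:

inner = [v for v in ans._prev if v._op == '*'][0]  # the node holding a * b
print(inner._prev)  # {Value(data=2.0), Value(data=-3.0)} (set order may vary)
print(inner._op)    # *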

We will now add a function that will help us visualize the computational graph which was built by the Value objects.

from graphviz import Digraph

def trace(root):
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})
    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        dot.node(name=uid, label=f'{n.label} | data: {n.data:.4f} | grad: {n.grad:.4f}', shape='record')
        if n._op:
            dot.node(name=uid + n._op, label=n._op)
            dot.edge(uid + n._op, uid)
    for n1, n2 in edges:
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'

draw_dot(L)

Out:

output of draw_dot — simple computational graph

We created a draw_dot function that takes a Value object and builds the computational graph associated with it. First, it recursively finds all nodes and edges using the trace function and then continues to draw these elements using the graphviz library. Notice that we explicitly had to label the e , d and L nodes for their names to appear in the visualization.

We will now recap what we did so far. We created a Value class that allows us to create a simple computational graph built of addition and multiplication operators. We are also able to visualize this graph using the graphviz library. But we already saw before that in order to learn neural network parameters, we need to be able to calculate the gradients for each parameter in the computational graph with respect to the output node. As a first step, we will calculate each of the gradients manually and populate the respective gradient parameter.

Part 3. Micrograd — Manual Calculation of Gradients

Our function consists of seven parameters — a, b, c, d, e, f, L. Let us think about the gradient of the output L with respect to each of them.

  • The gradient of L with respect to itself is simply 1. Why? How much will L change when we add h to it? Well of course, by h. Since the gradient is the relative change, the answer will be h/h = 1. That was easy, right?
  • Now, what is the gradient of L with respect to d? We have that L = d*f. From calculus, we know that dL/dd = f. Let’s also prove this using the definition of a derivative. As a reminder, the definition is (f(x + h) - f(x))/h (rise over run). So if we want the derivative of L with respect to d we get ((d + h)*f - d*f)/h, which is (df + hf - df)/h, which is hf/h, which is f = -2.
  • Similarly, dL/df = d = 4.
  • Now we are getting to the tricky part. What is dL/dc? What is the rate of change of L when we slightly change c? It is trickier because the effect isn’t direct, but rather passes through the variable d. Luckily for us, calculus has us covered with the chain rule. What does it entail? That if we want to calculate dL/dc, which passes through d, we simply need to calculate dL/dd and dd/dc and multiply between them.
    Easy!
    So, we already know dL/dd. As for dd/dc: d = c + e, so how much does d change when we add h to c? By h of course. Since a derivative is the rate of change, it will be h/h = 1. This means that according to the chain rule, we get dL/dc = dL/dd * dd/dc = f*1 = f = -2.
  • By symmetry we get dL/de = f*1 = -2.
  • What about dL/da? From the chain rule, we get that it is dL/de * de/da. We already know dL/de. de/da for e = a*b is just b, as we saw previously. This means that dL/da = dL/de * de/da = f * b = -2 * -3 = 6.
  • By symmetry we get dL/db = dL/de * de/db = f * a = -2 * 2 = -4.

In the video course, Andrej proves all of the gradients are correct using the definition of the gradient, as we did previously (checking rise over run). For brevity, we will skip this in this article.
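If you’d like to perform one such check yourself, here is a sketch of rise over run for dL/da (it simply rebuilds the graph twice):

def forward(a_val):
    # rebuild the whole expression with a nudgeable a
    a = Value(a_val); b = Value(-3.0); c = Value(10.0); f = Value(-2.0)
    return ((a * b + c) * f).data

h = 0.0001
print((forward(2.0 + h) - forward(2.0)) / h)  # ~6.0, matching dL/da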

Finally, let’s manually assign all the gradients we found to their respective parameters and view the computational graph again.

a.grad = 6
b.grad = -4
c.grad = -2
d.grad = -2
e.grad = -2
f.grad = 4
L.grad = 1
draw_dot(L)

Out:

output of draw_dot — added gradients

Let’s now imagine that the function above is a neural network that we want to optimize. Our optimization goal will be to increase the value of L. For each parameter, we calculated a gradient which tells us in which direction this parameter should move in order to increase the value of L. We will use this information to optimize the function (note that we will do this only for the leaf nodes which we have control over — those that are not the direct result of an operation):

a.data += a.grad * 0.01
b.data += b.grad * 0.01
c.data += c.grad * 0.01
f.data += f.grad * 0.01

e = a * b
d = e + c
L = d * f

print(L.data)

Out: -7.286496

Indeed we see that after we optimized our parameters using the gradients we manually calculated, and ran the forward pass of our network again, we managed to increase the value of L.

Part 4. Micrograd — Building a simple neuron

We are now ready to move on to real neural networks. But first — what is an artificial neuron?

Fig. 2 — An artificial neuron. Image by the author.

In Figure 2 we see a depiction of an artificial neuron that contains 3 inputs — X1, X2, and X3. Each input is multiplied by its corresponding weight and these multiplications are added up. An additional bias term b is added as well. This bias term is inherent to the neuron and will be the same for every input. Finally, an activation function is applied. In this tutorial, we will use the Tanh function, which squashes the input between -1 and 1. The artificial neuron vaguely mimics what we think happens in a real neuron, where neurons receive input from other neurons via synapses and send their outputs via axons.

Now, let us implement an artificial neuron and understand how to use backpropagation to tune its weights in order to maximize a certain target function.

x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
b = Value(6.8813735870195432, label='b')  # bias chosen so the tanh output comes out to a nice 0.7071

x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
# o = n.tanh()
draw_dot(n)

Out:

output of draw_dot — a neuron

We implemented a simple neuron with two inputs: x1 and x2. Notice that we commented out the use of the Tanh activation function. Why? Because we haven’t yet implemented it in our Value class! So let’s do that. But first — what is exactly a Tanh function?

This is the definition of Tanh taken from Wikipedia:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) = (e^(2x) − 1) / (e^(2x) + 1)

Eq. 2 — The Tanh function (from Wikipedia).

We have two options now — one will be to implement the exp function in Value. This will allow us to create the Tanh function out of its basic building blocks of addition (subtraction), multiplication (division), and exponentiation. The second option will be to implement the Tanh function directly. In backpropagation, we don’t have to explicitly implement the smallest parts of the computational graph — we can use arbitrarily large building blocks as long as we can implement their forward pass and derivative. We will choose the second option and implement Tanh in the Value class.

import math

class Value:
    def __init__(self, data, prev=(), op='', label=''):
        self.data = data
        self._prev = set(prev)
        self._op = op
        self.label = label
        self.grad = 0

    def __repr__(self):
        return f'Value(data={self.data})'

    def __add__(self, other):
        return Value(self.data + other.data, prev=(self, other), op='+')

    def __mul__(self, other):
        return Value(self.data * other.data, prev=(self, other), op='*')

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self,), 'tanh')
        return out

We added the tanh function, which contains the calculation for the Tanh operation as we saw earlier. Notice that we pass a 1-item tuple as the prev parameter because Tanh operates only on one input.
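As a quick check (a sketch; it assumes the neuron n from before is rebuilt with the updated class), we can now apply the activation and see the nicely rounded output the bias value was chosen for:

o = n.tanh(); o.label = 'o'
print(o)  # Value(data=0.7071067811865476)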

Now, after we uncomment the Tanh operation (o = n.tanh()), we can plot the full computational graph:

output of draw_dot — neuron with Tanh

Now it’s time to backpropagate through our neuron!

Part 5. Manual Backpropagation through a Neuron

I think this part is a nice exercise to do on your own, so if you are up to it — stop here and work out the grad values of each of the parameters with respect to o (you don’t have to :)).

Let’s work through it:

  • The gradient of o with respect to itself is 1.
  • The gradient of o with respect to n is the local gradient of the Tanh function, which is defined by dtanh(x)/dx = 1 - tanh(x)^2. We also multiply this by the gradient of o according to the chain rule. So we get that n.grad = (1 - o.data**2) * o.grad = 0.5.
  • The gradient of o with respect to b according to the chain rule is dn/db * do/dn. Remember that the addition operation simply passes the gradient from the previous step along, because the local derivative is 1. So we get that b.grad = 1 * n.grad = 0.5.
  • Similarly, x1w1x2w2.grad = 1 * n.grad = 0.5.
  • How about x1w1.grad? It’s also the result of an addition operation, hence: x1w1.grad = 1 * x1w1x2w2.grad = 0.5.
  • Similarly, x2w2.grad = 1 * x1w1x2w2.grad = 0.5.
  • For x1.grad we have a multiplication operation, in which the local gradient is the other operand. To this, we apply the chain rule, hence x1.grad = x1w1.grad * w1.data = 0.5 * -3 = -1.5.
  • Similarly, we get that w1.grad = x1w1.grad * x1.data = 0.5 * 2 = 1, and x2.grad = x2w2.grad * w2.data = 0.5 * 1 = 0.5, and w2.grad = x2w2.grad * x2.data = 0.5 * 0 = 0.

Let’s sum it up in code and plot the resulting graph:

o.grad = 1  # do/do
n.grad = (1 - o.data ** 2) * o.grad # do/dn * do/do
b.grad = 1 * n.grad # dn/db * do/dn
x1w1x2w2.grad = 1 * n.grad # dn/dx1w1x2w2 * do/dn
x1w1.grad = 1 * x1w1x2w2.grad # dx1w1x2w2/dx1w1 * do/dx1w1x2w2
x2w2.grad = 1 * x1w1x2w2.grad # dx1w1x2w2/dx2w2 * do/dx1w1x2w2
w1.grad = x1.data * x1w1.grad #dx1w1/dw1 * do/dx1w1
x1.grad = w1.data * x1w1.grad #dx1w1/dx1 * do/dx1w1
w2.grad = x2.data * x2w2.grad #dx2w2/dw2 * do/dx2w2
x2.grad = w2.data * x2w2.grad #dx2w2/dx2 * do/dx2w2
draw_dot(o)

Out:

output of draw_dot — neuron with gradients

Great! We know how to do backpropagation! So is this how they do it in GPT-3 (which has 175B parameters)? Do they write 175B lines of code to set the gradient of each parameter? Well, no. Let’s see how to do it properly and automatically.

Part 6. Implementing Backward Operations

We are now going to add a _backward function for each operation in our Value class. This function will calculate the gradient for each computational node instead of us having to do it manually.

class Value:
    def __init__(self, data, prev=(), op='', label=''):
        self.data = data
        self._prev = set(prev)
        self._op = op
        self.label = label
        self.grad = 0
        self._backward = lambda: None

    def __repr__(self):
        return f'Value(data={self.data})'

    def __add__(self, other):
        out = Value(self.data + other.data, prev=(self, other), op='+')
        def _backward():
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, prev=(self, other), op='*')
        def _backward():
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad = (1.0 - out.data ** 2) * out.grad
        out._backward = _backward
        return out

Let’s go over the code.

  • The default _backward function does nothing. This is the case for leaf nodes, which are not the result of any operation.
  • For addition, the _backward function sets the gradient of both operands according to the chain rule, which is the local derivative (1) times the previous derivative in the chain (the derivative of out). Note that we are assuming that the derivative of out has already been calculated.
  • The case for multiplication is similar, except that the local gradient is the other operand instead of 1 — other.data for self, and self.data for other.
  • For the Tanh operation, we don’t have an other, so we only need to set self.grad. Just like before, we use the local derivative of Tanh and multiply it by the derivative of out, which we assume has been calculated previously.

Let’s use the _backward functions:

o.grad = 1.0
o._backward()
n._backward()
b._backward()
x1w1x2w2._backward()
x2w2._backward()
x1w1._backward()
draw_dot(o)

Out:

I will avoid pasting the graph, as it is identical to the previous case (only this time we used the _backward functions in the correct order to calculate the gradients).

Note that we had to set o.grad = 1.0 manually. Why? Because the initial gradient value of each node is 0. Every _backward function multiplies by the gradient of the node ahead of it in the graph, and if that were left at 0, all gradients would come out 0 (think about it).

But do we have to call the _backward functions manually? Of course not! We can use a loop here, but notice that we need to iterate over the nodes in a specific order: we may only call _backward on a node after the gradients of all the nodes it feeds into (its neighbors along outgoing edges) have been calculated. For this, we need something called topological sorting.

Let’s understand what is topological sorting using an example. Take this graph:

Fig. 3 — Topological sorting (from Wikipedia).

Here are three options to topologically sort this graph (taken from Wikipedia — https://en.wikipedia.org/wiki/Topological_sorting):

  • 5, 7, 3, 11, 8, 2, 9, 10 (visual top-to-bottom, left-to-right)
  • 3, 5, 7, 8, 11, 2, 9, 10 (smallest-numbered available vertex first)
  • 5, 7, 3, 8, 11, 10, 9, 2 (fewest edges first)

Notice that in all sorting options, for each node u that has an outgoing edge to node v, node v will be after u in the topological sorting.

Let’s implement a simple form of topological sorting and use it to apply the _backward functions in our computational graph:

topo = []
visited = set()

def build_topo(v, topo, visited):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child, topo, visited)
        topo.append(v)

The above is a simple implementation of topological sorting. Using recursion we make sure that all of the offspring of a certain node will be added to the list topo before the said node. Instead of calling this function explicitly, we will make the code a bit cleaner and make it part of our Value class, such that only a single function must be called to calculate gradients for the whole computational graph.

class Value:
    def __init__(self, data, prev=(), op='', label=''):
        self.data = data
        self._prev = set(prev)
        self._op = op
        self.label = label
        self.grad = 0
        self._backward = lambda: None

    def __repr__(self):
        return f'Value(data={self.data})'

    def __add__(self, other):
        out = Value(self.data + other.data, prev=(self, other), op='+')
        def _backward():
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, prev=(self, other), op='*')
        def _backward():
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad = (1.0 - out.data ** 2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for node in reversed(topo):
            node._backward()

Now, all we need to do to calculate gradients is:

o.backward()
draw_dot(o)

Out:

Same graph as before

Now, I want to show you a slight problem we have in our current implementation of Value — what happens when a node in the computational graph gets used more than once?

Part 6 (continued). What happens when a node gets used more than once?

Take this example:

a = Value(-2.0, label='a')
b = Value(3.0, label='b')
d = a * b; d.label = 'd'
e = a + b; e.label = 'e'
f = d * e; f.label = 'f'

f.backward()
draw_dot(f)

Out:

output of draw_dot — using a component twice in the computational graph

What should the gradients of a and b be in this case? According to the multiplication node — the gradient of a should be 3 and the gradient of b should be -2 (in the multiplication of two numbers, the local gradient is the other operand). According to the addition node, the gradient of both should be -6. We see that in this case the final values came from the addition node, but only because its _backward happened to run last in the arbitrary topological order, overwriting the multiplication node’s contribution.

Why does this happen? Let’s observe the _backward function of the __add__ operation:

def _backward():
    self.grad = 1.0 * out.grad
    other.grad = 1.0 * out.grad

We see that we set the gradients of self and other, overwriting any values that might have already been calculated for them. In backpropagation, the gradient of a node must be calculated additively — the effects of all the calculations it participates in should add up.

We can fix this situation quite simply by changing the = operator in the _backward functions to += so that the effects add up. Like so:

def _backward():
    self.grad += 1.0 * out.grad
    other.grad += 1.0 * out.grad

After we fix our _backward functions and run the previous code again, we get this computational graph:

output of draw_dot — using a component twice in the computational graph, fixed gradients

We see here the correct gradients for a and b where the effects of the + and * nodes add up, instead of overriding each other.
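Re-running the example with the fixed _backward functions (a quick sketch) confirms the accumulation:

a = Value(-2.0, label='a')
b = Value(3.0, label='b')
d = a * b; e = a + b; f = d * e
f.backward()
print(a.grad, b.grad)  # -3.0 -8.0, i.e. 3 + (-6) and -2 + (-6)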

Part 7. Adding support for constants

What will happen if we run this code?

a = Value(2.0)
a + 2

We get this error — AttributeError: ‘int’ object has no attribute ‘data’. Why? Because the __add__ function of Value is expecting another Value object. We can overcome this quite simply by performing this fix for the __add__ function (and similarly for the __mul__ function):

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, prev=(self, other), op='+')
    def _backward():
        self.grad += 1.0 * out.grad
        other.grad += 1.0 * out.grad
    out._backward = _backward
    return out

We simply check whether other is of type Value. If not, we convert it :)
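Now the snippet from before runs happily (a quick check):

a = Value(2.0)
print(a + 2)  # Value(data=4.0): the int was wrapped in a Value on the fly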

Now what happens when we run this code?

a = Value(2.0)
2 + a

Seems like it should work, right? It turns out it won’t (we get this error — TypeError: unsupported operand type(s) for +: ‘int’ and ‘Value’). Python in this case tries to run 2.__add__(a). The built-in __add__ of int obviously doesn’t know how to handle Value objects, because we made them up. How do we fix this?

For this purpose, Python supports the __radd__ function:

def __radd__(self, other):
    return self + other

The way it works is that if Python fails to evaluate 2.__add__(a) and __radd__ is available on the right operand, it will swap the operands and call a.__radd__(2), which in our implementation simply computes a + 2, and we already saw that this works. We will act accordingly with __mul__ and __rmul__, of course.
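A quick check that both reflected operators behave (assuming __mul__ got the same isinstance treatment as __add__):

a = Value(2.0)
print(2 + a)  # Value(data=4.0), via a.__radd__(2)
print(3 * a)  # Value(data=6.0), via a.__rmul__(3)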

Part 8. Breaking up Tanh into its constituent parts

Currently, our Value class supports the Tanh operation as a single opaque block. What if we wanted to break Tanh into its constituent parts by explicitly including the Tanh calculation in the computational graph?

To achieve this we need to add several operators. Let’s remind ourselves of the Tanh implementation:

def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
    out = Value(t, (self,), 'tanh')
    return out

To break up this operation we need the following operators — exponentiation (exp), power (pow), division, and subtraction.

For division, we will implement the following:

def __truediv__(self, other):
    return self * other ** -1

The operation we just used is mathematically equivalent to self/other, but notice that we depend on the power (**) operator, which we haven’t yet implemented. Let’s do that:

def __pow__(self, other):
    assert isinstance(other, (int, float)), "only supporting pow for int/float"
    out = Value(self.data ** other, (self,), f'**{other}')
    def _backward():
        self.grad += other * (self.data ** (other - 1)) * out.grad
    out._backward = _backward
    return out

Let’s note several things about __pow__. First, we only support int or float inputs. This means we don’t support power with more complex inputs (namely arbitrary Value objects) because that would lead to different derivative rules. Second, we implement the _backward function using the calculus rule where dxⁿ/dx=nxⁿ⁻¹.
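Here is a tiny sanity check of the power rule (a sketch, using the backward machinery from Part 6):

x = Value(3.0)
y = x ** 2
y.backward()
print(x.grad)  # 6.0, since d(x^2)/dx = 2x = 6 at x = 3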

Finally, we will need to support subtraction, and we will simply utilize addition for that as so:

def __sub__(self, other):
    return self + (-other)

For this we need to support negation, so we add:

def __neg__(self):
    return self * -1

Let’s view our whole Value class so far:

class Value:
    def __init__(self, data, prev=(), op='', label=''):
        self.data = data
        self._prev = set(prev)
        self._op = op
        self.label = label
        self.grad = 0
        self._backward = lambda: None

    def __repr__(self):
        return f'Value(data={self.data})'

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, prev=(self, other), op='+')
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __radd__(self, other):
        return self + other

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, prev=(self, other), op='*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __rmul__(self, other):
        return self * other

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1.0 - out.data ** 2) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        x = self.data
        t = math.exp(x)
        out = Value(t, (self,), 'exp')
        def _backward():
            self.grad += t * out.grad
        out._backward = _backward
        return out

    def __truediv__(self, other):
        return self * other ** -1

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only supporting pow for int/float"
        out = Value(self.data ** other, (self,), f'**{other}')
        def _backward():
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for node in reversed(topo):
            node._backward()

Now, we shall break up the Tanh function into its constituent parts, such that the whole calculation will appear in the computational graph:

x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
b = Value(6.8813735870195432, label='b')

x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'

e = (2*n).exp()
o = (e - 1) / (e + 1)

o.backward()
draw_dot(o)

Out:

output of draw_dot — breaking up Tanh

Great! We now have a much larger computational graph and can see all the components of the Tanh function. Notice that the gradients for w1 and w2 are the same as before, thus we know that this graph is operationally identical to the previous (where the Tanh operation was implicit).

What do we learn from this exercise? In backpropagation and neural network training, we can implement an operation at whatever level of abstraction we want. As long as we can define the operation’s forward pass and its local gradient for the backward pass, it doesn’t matter that the operation is itself composed of smaller operations (think of the fact that multiplication can be described by many additions).

Part 9. Comparing Micrograd to Pytorch

Up until now, we developed our own auto-differentiable computation graph class — Value. We validated that it is working as expected and that we can back-propagate through arbitrarily complex graphs as long as they are made of supported building blocks.

Despite this, our implementation is very naive computation-wise and there are better and faster libraries out there that do just this. One example is the popular Pytorch library, developed by Meta. The Pytorch library allows the creation of highly efficient computational graphs which can be parallelized and optimized for different hardware including CPUs, GPUs, and TPUs. It is used extensively in industry and cutting-edge research — mostly for neural networks.

To validate our implementation of Micrograd we will implement the exact same computational graph from before in Pytorch and confirm that the results are identical.

import torch
x1 = torch.Tensor([2.0]).double(); x1.requires_grad = True
x2 = torch.Tensor([0.0]).double(); x2.requires_grad = True
w1 = torch.Tensor([-3.0]).double(); w1.requires_grad = True
w2 = torch.Tensor([1.0]).double(); w2.requires_grad = True
b = torch.Tensor([6.8813735870195432]).double(); b.requires_grad = True

n = x1*w1 + x2*w2 + b
o = torch.tanh(n)

print(o.data.item())
o.backward()

print('----')
print('x2', x2.grad.item())
print('w2', w2.grad.item())
print('x1', x1.grad.item())
print('w1', w1.grad.item())

Out:

0.7071066904050358
----
x2 0.5000001283844369
w2 0.0
x1 -1.5000003851533106
w1 1.0000002567688737

Great! We got the exact same gradients as in our own implementation.

The default case for Pytorch is to work with tensors, which are multi-dimensional arrays. Hence the need to initialize our parameters as 1-element tensors (for example Tensor([2.0]) instead of a plain scalar). We see that we can use arithmetic operations like * and +, and also more complex operations such as torch.tanh(...). Just like in Micrograd, we ran .backward() on our result node, and by doing that ran backpropagation through every node in our graph. We needed to explicitly set requires_grad = True for each leaf node because Pytorch does not calculate gradients for leaf tensors by default.

Part 10. Building a multi-layer perceptron (MLP)

Using our main building block — the Value class, we are now going to build our first neural network! We will write a class that will allow us to create simple MLPs. Let’s first create one and then talk about how we use backpropagation to optimize the weights for a specific task.

import random

class Neuron:
    def __init__(self, nin, layer_idx, neuron_idx):
        self.layer_idx = layer_idx
        self.neuron_idx = neuron_idx
        self.w = [Value(random.uniform(-1, 1), label=f'w^{layer_idx}_({neuron_idx}{i+1})') for i in range(nin)]
        self.b = Value(random.uniform(-1, 1), label=f'b^{layer_idx}_{neuron_idx}')

    def __call__(self, x):
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        out = act.tanh()
        out.label = f'L_{self.layer_idx}{self.neuron_idx}'
        return out

    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout, layer_idx):
        self.neurons = [Neuron(nin, layer_idx, neuron_idx=i+1) for i in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1], layer_idx=i+1) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

Surprisingly, this is all of the code!

  • Class Neuron contains code for a computational unit that receives nin inputs and outputs a weighted sum of them plus a bias term, followed by a Tanh operation.
  • Class Layer aggregates several neurons, such that it takes data of size nin as input and passes it through nout neurons. We do a small check at the end: if the output contains only one neuron, we return it as-is and not in a list.
  • Class MLP is the final abstraction, in which we stack one Layer per entry of nouts and pass the input data x through them in succession.
  • I added some code that did not appear in the original tutorial by Andrej to display the labels of the weights and neurons, so we can make some sense of the drawings we will create later.
  • For each class, we added a parameters function that collects all trainable parameters. This will come in handy later when we train the network (see the sanity check right after this list).
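Before drawing anything, here is a quick sanity check (a sketch) that parameters() collects everything we expect. A 3-input MLP with layer sizes [2, 2, 1] should have (3+1)*2 + (2+1)*2 + (2+1)*1 = 17 trainable parameters:

n = MLP(3, [2, 2, 1])
print(len(n.parameters()))  # 17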

Let’s create an MLP, specifically — this one:

Fig. 4 — A basic neural network. Image by the author.

Let’s just go over the notation here:

  • in_i — the inputs to the network
  • L_ij — the j’th neuron in the i’th layer
  • w^L_(ji) — the weight connecting the i’th neuron of layer L-1 to the j’th neuron of the L’th layer
  • b^L_i — the bias term of the i’th neuron in the L’th layer

Our MLP contains 3 inputs, 2 hidden layers of size 2 each, and one output.

x = [Value(2.0, label='in1'),
     Value(3.0, label='in2'),
     Value(-1.0, label='in3')]
n = MLP(3, [2, 2, 1])
draw_dot(n(x))

Out:

output of draw_dot — a simple MLP

The full computational graph is much more detailed than the simplistic drawing we made of the network, because it contains all of the multiplication, addition, and Tanh operations in the network. If we look closely though, we can see that the labels in the detailed drawing make sense. For example — find the node L_11 in the drawing and backtrack to its inputs. You will see that it gets the sum of in1 * w¹_(11), in2 * w¹_(12), in3 * w¹_(13) and b¹_1 as inputs — just like our simplistic drawing from before (only slightly less organized).

Now comes the question — how do we use this neural network? Can we train it? Of course, the answer is — yes! Let’s define some input data points and their respective labels. Our goal will be to train the network to generate the correct label for each input.

xs = [
    [Value(2.0, label='in11'),
     Value(3.0, label='in12'),
     Value(-1.0, label='in13')],
    [Value(3.0, label='in21'),
     Value(-1.0, label='in22'),
     Value(0.5, label='in23')],
    [Value(0.5, label='in31'),
     Value(1.0, label='in32'),
     Value(1.0, label='in33')],
    [Value(1.0, label='in41'),
     Value(1.0, label='in42'),
     Value(-1.0, label='in43')]
]
ys = [1.0, -1.0, -1.0, 1.0]
ypred = [n(x) for x in xs]
ypred

Out:

[Value(data=0.6159125848021643, label=L_31),
Value(data=0.6974854432825526, label=L_31),
Value(data=0.6258725834364298, label=L_31),
Value(data=0.6165492639086749, label=L_31)]

We defined 4 input samples and a corresponding label for each of them. Then, we ran the MLP we created previously to generate the output for each sample. We see that our initial guesses are quite off (instead of [1, -1, -1, 1] we got different values). In order to train a neural network we need a loss function. This is a mathematical formula that is used to estimate the quality of the network’s outputs. We will use the well-known mean squared error (here we simply sum the squared errors without dividing by the number of samples; the constant factor doesn’t change where the minimum is):

loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
loss

Out: Value(data=5.819476097248748)

In mean squared error, we take the squared difference between each prediction and the corresponding label and sum them all up. The loss object that we created represents this sum. If we backpropagate through loss using backward, we will generate the gradient for each one of the network parameters. This gradient will indicate in which direction each parameter must be nudged in order to minimize loss, which is our goal. Just for fun — let’s draw the graph represented by loss after we call loss.backward().

loss.backward()
draw_dot(loss)

Out:

output of draw_dot — a simple MLP with all forward passes in batch

This graph is insanely packed. It contains in it all 4 forward passes — one for each input! We can see, though, that each one of our network parameters appears here only once, and each has a gradient that now tells us how we should nudge it to minimize the loss. Just as an example, let’s look at the gradient of the first weight of the first neuron of the first layer:

n.layers[0].neurons[0].w[0].grad  # 0.05664531168775873

The gradient is positive. This means that to minimize the loss function we need to go in the negative direction of the gradient, i.e. decrease the value of W¹_(11).

Part 11. Training the MLP

Finally! We are going to train our neural network to fit the dummy data points we showed before. To train our net we need to repeat the forward pass -> loss calculation -> backward pass procedure. We can do it manually, but it makes much more sense to write a simple loop that does this for us:

def training_loop(n_steps, learning_rate):
    for k in range(n_steps):
        # forward pass
        ypred = [n(x) for x in xs]
        loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

        # backward pass
        loss.backward()

        # update: nudge each parameter against its gradient
        for p in n.parameters():
            p.data -= learning_rate * p.grad

        print(k, loss.data)

What is happening here? First, we generate ypred, which is the forward pass for our 4 examples. Then we generate the loss term by which we will calculate the gradients. Then, we call backward to calculate gradients for all parameters of the net. Finally, we iterate through all of the network parameters using the parameters() function of the MLP class and nudge each of them by a small amount (controlled by the specified learning_rate) in the negative direction of the gradient, so as to make the loss smaller.

Let’s just make sure that n.parameters() gives us the expected output:

[Value(data=0.45027414801422605, label=w^1_(11)),
Value(data=-0.8129525057615291, label=w^1_(12)),
Value(data=0.5758542981989572, label=w^1_(13)),
Value(data=-0.8030667986239539, label=b^1_1),
Value(data=0.5557411827449444, label=w^1_(21)),
Value(data=-0.6721847840083467, label=w^1_(22)),
Value(data=1.217737844619453, label=w^1_(23)),
Value(data=0.09905283405117085, label=b^1_2),
Value(data=0.4829801438596318, label=w^2_(11)),
Value(data=0.7369938633453851, label=w^2_(12)),
Value(data=-0.46073515053706376, label=b^2_1),
Value(data=0.27929425462417923, label=w^2_(21)),
Value(data=1.7138391095399366, label=w^2_(22)),
Value(data=0.7105401730722122, label=b^2_2),
Value(data=-2.378848749014902, label=w^3_(11)),
Value(data=-0.8342902851970706, label=w^3_(12)),
Value(data=-1.2437877048443844, label=b^3_1)]

Great! We see here all of the parameters we wish to train, and we don’t see things we don’t wish to train (such as the input data which is assumed to be fixed).

We are ready to run the training loop using training_loop(n_steps=20, learning_rate=0.01)

Out:

0 5.819476097248748
1 5.624553232066405
2 5.339641605929825
3 5.0032420663990465
4 4.6979176273972465
5 4.50430643973165
6 4.41612726370955
7 4.345437217507974
8 4.191726524841689
9 3.8794059504102045
10 3.3966830048541388
11 2.8554032260566666
12 2.4577023357202674
13 2.274646098166827
14 2.176592658900446
15 2.0400253245956486
16 1.8089353320227564
17 1.3966145142070974
18 0.748901313440497
19 0.17987272335400104
After training, let’s look at the predictions again:

ypred = [n(x) for x in xs]
ypred

Out:

[Value(data=0.9312475693757956, label=L_31),
 Value(data=-0.9983175660114463, label=L_31),
 Value(data=-0.895522396196552, label=L_31),
 Value(data=0.9153783867936369, label=L_31)]

Our training procedure works, and our ypred is very close to the ground truth. With a few more training steps we would have gotten even closer.

But wait…there is a bug here! (in the training loop) Did you spot it? Take a moment to think…

The bug is subtle but quite common among new neural network practitioners. Before each backward pass, we forgot to clear the gradient information from the computational graph. Remember that as we calculate the gradients for the different components of the graph we use += instead of = to accumulate gradient information, because a certain node may be used more than once in the graph, and to get its gradient we need to sum up all of the contributions to the gradient.

In the current implementation we forgot to clear the gradient value, so the gradient will get accumulated through subsequent forward passes — which were performed with different network weights! This makes no sense, of course, so a simple fix is in order:

def training_loop(n_steps, learning_rate):
    for k in range(n_steps):
        # forward pass
        ypred = [n(x) for x in xs]
        loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

        # zero out the stale gradients from the previous step
        for p in n.parameters():
            p.grad = 0.0

        # backward pass
        loss.backward()

        # update: nudge each parameter against its gradient
        for p in n.parameters():
            p.data -= learning_rate * p.grad

        print(k, loss.data)

Before the backward pass, we iterate through all network parameters and reset their gradients to 0.
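Incidentally, this is the same job that optimizer.zero_grad() does in a typical Pytorch training loop.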

In this toy example, it didn’t really matter. It even helped the model converge faster, because the problem is very easy and accumulating the gradients was akin to using a massive learning rate (with the fix, I needed several hundred training steps instead of 20 to reach a similar loss). But it is indeed a bug, and it could lead to crazy results in larger networks.

Let us run the training loop again after the fix. We will call training_loop(20, 0.01).

Out:

0 5.108927684153838
1 4.846717266169518
2 4.60037368477449
3 4.383622169945541
4 4.205267359786554
5 4.067195989790671
6 3.965330834751441
7 3.8924147774168065
8 3.840719199945062
9 3.803632051028429
10 3.7761592608770185
11 3.7548059106373577
12 3.7372392224767497
13 3.7219462205564553
14 3.707961320704009
15 3.6946723746238628
16 3.681690631213215
17 3.668766548696825
18 3.655736427317908
19 3.6424890071677454

Unlike before, our training is much slower. This is due to fixing the bug in which the gradients accumulated on each other in each training step instead of resetting. After about 1000 steps our output looks like this:

[Value(data=0.9653433264848924, label=L_31),
Value(data=-0.9720009732525263, label=L_31),
Value(data=-0.9618275470088311, label=L_31),
Value(data=0.9694722216419711, label=L_31)]

Very close to the ground truth.

Let’s now wrap up everything we learned here.

Part 12. Summary

Which questions did we answer today?

  • What are neural networks? They are mathematical expressions that take weights and input data as parameters and generate some output. The quality of this output is measured via a loss function, which we then use to calculate gradients for all of the network weights (using the chain rule). To minimize the loss function we nudge each parameter by a small amount in the negative direction of the gradient. Hopefully, after many steps of this process, we will be able to minimize the loss function enough, and this means that our network will generate outputs similar to the expected results.

We created a very basic neural network in this article, for simplicity’s sake, but these networks can get much larger and perform insanely difficult tasks! Such as creating Python code given a free text description, making medical diagnoses given patient records, and much more!

The incredible thing is that the underlying concepts behind these massive and impressive neural nets are exactly the same as the simple example we showed here today. Congratulations on understanding some pretty important things!

The notebook containing the code that was shown here — https://colab.research.google.com/drive/1lG93TNxmdHFLb7xiZquzTRDF5hFcVQkV?usp=sharing
