Understand Autograd: A Bottom-up Tutorial

Huizi Mao
Oct 21, 2019


You may have wondered how autograd actually works in imperative programming. In this post, I am going to explain it with step-by-step examples. Unlike other tutorials, this post does not borrow a single line of code from PyTorch or MXNet; instead, it builds everything from scratch.

First of all, the term Autograd, or Automatic Differentiation, is not primarily about calculating the derivatives themselves; that is better described as symbolic differentiation or numerical differentiation. A more precise definition of Autograd would be “automatically chaining the gradients”. Recall the chain rule of differentiation:

$$\frac{dL}{dx} = \frac{dL}{dy} \cdot \frac{dy}{dx}$$

That’s it. Autograd calculates dL/dx from dL/dy, provided that the derivative of y(x) (which we call a primitive) is implemented. Assuming you passed calculus in your freshman year, you should already have 80% of Autograd’s idea: chaining the gradients of primitive functions.

Two basic components of an autograd library: backward-able variables and backward-able functions.

Now let’s delve into the programming part. Throughout this tutorial, we always assume scalar functions; vectorization does not alter the mechanism of autograd. We subclass the float type to add a grad attribute and a backprop method. We name this new class float_var, a variable of type float.
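A minimal sketch of what such a class could look like, assuming that a variable with no history simply stores whatever gradient it receives. Only the names float_var, grad, and backprop come from the text above; the default gradient of 1.0 and the initial value of 0.0 are assumptions made for this sketch.

```python
class float_var(float):
    """A float that can also hold a gradient (sketch, not the original listing)."""

    def __init__(self, value):
        # float.__new__ has already stored the numeric value;
        # here we only add the extra attribute.
        self.grad = 0.0

    def backprop(self, grad=1.0):
        # A variable with no history is a leaf: it just records
        # the gradient that flows into it.
        self.grad = grad


x = float_var(3.0)
x.backprop()           # seed with dL/dx = 1.0
print(float(x), x.grad)
```

Because float_var is still a float, it can be passed to any ordinary numeric function; the result of such a call is a plain float again, which is why a wrapper is needed to keep the gradient bookkeeping alive.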

Our first example is for univariate functions. The goal is to trace the call order of the forward functions on the fly (the backward functions will run in the reverse order), and we use Python’s functional programming features to record this simple graph implicitly. For the univariate case, all that is required is to record the last function call. Super easy, isn’t it? Here is the code:
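The original listing is not reproduced in this text, so what follows is a reconstruction from the surrounding description rather than the author's exact code. The names fw_func, bw_func, val_in, val_out, float_var, and types.MethodType appear in the post; the wrapper name make_autograd_func is hypothetical, and float_var is repeated so the snippet runs on its own.

```python
import types


class float_var(float):
    def __init__(self, value):
        self.grad = 0.0

    def backprop(self, grad=1.0):
        self.grad = grad


def make_autograd_func(fw_func, bw_func):
    """Couple a forward and a backward function (float -> float)
    into one function that maps float_var -> float_var."""

    def autograd_func(val_in):
        # val_in is assumed to be a float_var (it must have .backprop).
        val_out = float_var(fw_func(float(val_in)))

        def backprop(self, grad=1.0):
            # Record dL/d(val_out), then apply the chain rule:
            # dL/d(val_in) = dL/d(val_out) * d(val_out)/d(val_in).
            self.grad = grad
            val_in.backprop(grad * bw_func(float(val_in)))

        # Replace the default backprop on this one instance only.
        val_out.backprop = types.MethodType(backprop, val_out)
        return val_out

    return autograd_func


# Example primitive: y = x^2 with derivative 2x.
square = make_autograd_func(lambda x: x * x, lambda x: 2.0 * x)

x = float_var(3.0)
y = square(x)
y.backprop()
print(float(y), x.grad)
```

The only "graph" stored here is the reference to val_in held by each new backprop function, which is enough when every function has a single input.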

If you are already familiar with Python decorators, this function should look familiar. It couples a forward function fw_func and a backward function bw_func (both taking and returning a float) and wraps them into an autograd function (taking and returning a float_var).

The output val_out is registered with a new backprop function that calls bw_func, multiplies the result by the incoming gradient, and passes the product to val_in’s backprop function (recall the chain rule). In this way, all bw_funcs are chained in the reverse of the order in which the fw_funcs were called. The example in the notebook:
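The notebook itself is not included here. As a stand-in, the sketch below chains two primitives and checks the result against the hand-derived derivative. The definitions are repeated so the snippet is self-contained; make_autograd_func is a hypothetical name, and the choice of sin and exp as primitives is an assumption made for illustration.

```python
import math
import types


class float_var(float):
    def __init__(self, value):
        self.grad = 0.0

    def backprop(self, grad=1.0):
        self.grad = grad


def make_autograd_func(fw_func, bw_func):
    def autograd_func(val_in):
        val_out = float_var(fw_func(float(val_in)))

        def backprop(self, grad=1.0):
            self.grad = grad
            val_in.backprop(grad * bw_func(float(val_in)))

        val_out.backprop = types.MethodType(backprop, val_out)
        return val_out

    return autograd_func


# Two primitives: each pairs a function with its derivative.
sin = make_autograd_func(math.sin, math.cos)
exp = make_autograd_func(math.exp, math.exp)

x = float_var(0.5)
y = exp(sin(x))   # forward order: sin, then exp
y.backprop()      # backward order: exp's bw_func, then sin's

# By hand: d/dx exp(sin(x)) = exp(sin(x)) * cos(x)
expected = math.exp(math.sin(0.5)) * math.cos(0.5)
print(x.grad, expected)
```

Calling y.backprop() with no argument seeds the chain with dL/dy = 1, so x.grad ends up holding dy/dx.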

Additional notes:

(They do not affect your understanding of autograd, but may help you understand this example.)

When creating a new backward function, we are actually creating a closure (a function plus the variables it captures from its enclosing scope).
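A small standalone illustration of a closure, unrelated to autograd itself. The names make_scaler, scale, and factor are hypothetical and chosen only for this example.

```python
def make_scaler(factor):
    # `factor` is not a global and not an argument of `scale`;
    # it is captured from the enclosing call to make_scaler.
    def scale(x):
        return factor * x

    return scale


double = make_scaler(2.0)
triple = make_scaler(3.0)

# Each returned function remembers its own `factor`.
print(double(5.0), triple(5.0))

# The captured value is stored on the function object itself.
captured = double.__closure__[0].cell_contents
print(captured)
```

In the autograd wrapper, the captured variables are the input variable and the backward function, which is how each new backprop function knows where to send the gradient next.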

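We are replacing a method defined on the class with a plain function on a single instance. types.MethodType makes this substitution smooth by binding the function to that instance, so that self is passed automatically.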
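A minimal sketch of that substitution on a toy class. The class Var and the function doubled_backprop are hypothetical and exist only to show the mechanics of types.MethodType.

```python
import types


class Var:
    def __init__(self):
        self.grad = 0.0

    def backprop(self, grad=1.0):
        self.grad = grad


def doubled_backprop(self, grad=1.0):
    # A plain function; `self` is only filled in once it is bound.
    self.grad = 2.0 * grad


a, b = Var(), Var()

# Bind the plain function to instance `a` only; `b` keeps the class method.
a.backprop = types.MethodType(doubled_backprop, a)

a.backprop()
b.backprop()
print(a.grad, b.grad)

# Without MethodType, a function stored on an instance is not bound,
# so calling it with no arguments fails for lack of `self`.
c = Var()
c.backprop = doubled_backprop
try:
    c.backprop()
    unbound_failed = False
except TypeError:
    unbound_failed = True
print(unbound_failed)
```

Binding per instance matters here because every output variable needs its own backprop, tied to its own input.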

We will cover multivariate functions in the next part. For the multivariate case, we will start to face problems regarding system design.

