Demystifying AutoGrad in Machine Learning

Dagang Wei
4 min read · Jan 22, 2024


Computation Graph, source: https://sebastianraschka.com/pdf/lecture-notes/stat453ss21/L06_pytorch_slides.pdf

This article is part of the series Demystifying Machine Learning.

What is AutoGrad?

AutoGrad (automatic differentiation, usually implemented as reverse-mode automatic differentiation) is the technique that powers the gradient descent process in machine learning. Autograd automatically computes gradients by applying the chain rule in reverse order through the computation graph. With Autograd, you only need to define the loss function of the model; you don't have to derive and implement its gradient function by hand.

Autograd is supported by major ML frameworks such as PyTorch and TensorFlow. The following is an example in PyTorch:

import torch

# Create some synthetic data
x = torch.tensor([1.0, 2.0, 3.0, 4.0]) # Input features (plain data, no gradient tracking needed)
y_true = torch.tensor([2.0, 4.0, 6.0, 8.0]) # True target values

# Define model parameters (weights and bias)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Forward pass
y_pred = w * x + b

# Calculate loss (Mean Squared Error)
loss = torch.mean((y_pred - y_true)**2)

# Backward pass (Autograd computes gradients automatically)
loss.backward()

# Gradients are computed
print("Gradient of w:", w.grad)
print("Gradient of b:", b.grad)

Why is AutoGrad such a game-changer?

Before AutoGrad, calculating gradients was a tedious, error-prone manual process. Imagine training a complex model with hundreds of layers and countless parameters — manually calculating gradients for each one would be a nightmare! AutoGrad removes this burden, freeing us to focus on designing and building better models.

How does AutoGrad work?

Let me explain how it works with a simple running example:

1. Forward Pass:

  • In the forward pass, you evaluate your neural network or computation graph by passing input data through it. This involves calculating intermediate values and eventually producing the final output.
  • Let's consider a simple example: a computational graph for a linear regression model with one input feature `x`, one weight `w`, and a bias `b`. The output `y` is calculated as `y = wx + b`.

2. Reverse Pass (Backward Pass):

  • In the backward pass, you compute the gradients of the loss function with respect to the model’s parameters (weights and biases). These gradients indicate how much each parameter should be adjusted to minimize the loss.
  • The process starts from the output and works its way backward through the graph.

3. Chain Rule:

  • Backward mode autograd relies on the chain rule of calculus to calculate gradients. It decomposes the gradient computation into a series of smaller steps.
  • For our linear regression example, let's use the squared error loss for a single sample: `L = (y - y_true)²`, where `y_true` is the true target value (averaging this over all samples gives the MSE used in the PyTorch snippet above).

4. Gradient Calculation:

  • Starting from the output `L`, you calculate its gradient with respect to `y`, which is simply `2(y - y_true)`.
  • Then, you move backward to compute the gradients of `L` with respect to `w` and `b` by applying the chain rule (a numeric check follows this list). In this case:
  • Gradient of `L` with respect to `w`: `dL/dw = (dL/dy) * (dy/dw) = 2(y - y_true) * x`
  • Gradient of `L` with respect to `b`: `dL/db = (dL/dy) * (dy/db) = 2(y - y_true)`
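
To make these formulas concrete, the following is a minimal sketch that checks them against PyTorch's autograd. The sample values (`x = 3.0`, `y_true = 6.0`, `w = 1.0`, `b = 0.0`) are assumptions chosen purely for illustration:

import torch

# Assumed illustrative values: y = w*x + b, L = (y - y_true)**2
x, y_true = 3.0, 6.0
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

y = w * x + b            # forward pass: y = 3.0
L = (y - y_true) ** 2    # single-sample squared error: L = 9.0
L.backward()             # reverse pass: autograd fills w.grad and b.grad

# The same gradients derived by hand with the chain rule
dL_dy = 2 * (y.item() - y_true)  # dL/dy = 2(y - y_true) = -6.0
manual_dw = dL_dy * x            # dL/dw = 2(y - y_true) * x = -18.0
manual_db = dL_dy                # dL/db = 2(y - y_true) = -6.0

print(w.grad.item(), manual_dw)  # -18.0 -18.0
print(b.grad.item(), manual_db)  # -6.0 -6.0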

The following code is a simple implementation which demystifies AutoGrad. The code is available in this colab notebook:

class Variable:
    def __init__(self, value, parents=None, op=None):
        self.value = value
        self.gradient = 0.0
        # Each parent is a (Variable, local_gradient_fn) pair: an edge of the
        # computation graph plus the local derivative along that edge.
        self.parents = parents or []
        self.op = op

    def backward(self, grad=1.0):
        # Accumulate the incoming gradient, then push it to each parent
        # via the chain rule: upstream gradient * local gradient.
        self.gradient += grad
        for parent, local_grad in self.parents:
            parent.backward(grad * local_grad(self))

    def __add__(self, other):
        other = other if isinstance(other, Variable) else Variable(other)
        return Variable(
            self.value + other.value,
            parents=[(self, lambda _: 1), (other, lambda _: 1)], op='+')

    def __mul__(self, other):
        other = other if isinstance(other, Variable) else Variable(other)
        return Variable(
            self.value * other.value,
            parents=[
                (self, lambda v: other.value),
                (other, lambda v: self.value)],
            op='*')

    def __sub__(self, other):
        other = other if isinstance(other, Variable) else Variable(other)
        return Variable(
            self.value - other.value,
            parents=[(self, lambda _: 1), (other, lambda _: -1)], op='-')

    def __truediv__(self, other):
        other = other if isinstance(other, Variable) else Variable(other)
        return Variable(
            self.value / other.value,
            parents=[
                (self, lambda v: 1 / other.value),
                (other, lambda v: -self.value / other.value ** 2)],
            op='/')

    def __repr__(self):
        return f"Variable(value={self.value}, grad={self.gradient})"

# Test
x = Variable(5)
y = Variable(2)
z = x * y + x / y # Example function
z.backward() # Backward pass to compute gradients
print(f"x: {x}, gradient: {x.gradient}")
print(f"y: {y}, gradient: {y.gradient}")

Output:

x: Variable(value=5, grad=2.5), gradient: 2.5
y: Variable(value=2, grad=3.75), gradient: 3.75
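
To tie the toy implementation back to the linear regression example, here is a short sketch that reproduces the hand-derived gradients with the `Variable` class (the sample values `x = 3.0`, `y_true = 6.0`, `w = 1.0`, `b = 0.0` are the same illustrative assumptions as above):

# Linear regression with the toy Variable class: y = w*x + b, L = (y - y_true)^2
w = Variable(1.0)
b = Variable(0.0)
x_val, y_true = 3.0, 6.0

y_pred = w * x_val + b   # forward pass builds the computation graph
diff = y_pred - y_true
loss = diff * diff       # squared error (the toy class has no __pow__)
loss.backward()          # reverse pass applies the chain rule

print("dL/dw:", w.gradient)  # -18.0
print("dL/db:", b.gradient)  # -6.0

Note that this recursive `backward()` re-walks the graph once per path to a node (here `diff` is visited twice, once for each factor of `diff * diff`), which stays correct only because gradients are accumulated with `+=`; real frameworks typically traverse the graph a single time in reverse topological order for efficiency.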

Conclusion

AutoGrad liberates developers from manually calculating gradients, significantly simplifying model training and experimentation. This technique is the foundation for training deep learning models.

