# Automatic Differentiation in PyTorch

Autograd is PyTorch’s automatic differentiation package. Thanks to it, we don’t need to worry about partial derivatives, the chain rule, or anything like it.

To illustrate how it works, let’s say we’re trying to fit a simple linear regression with a single feature x, using Mean Squared Error (MSE) as our loss.

We need to create two tensors, one for each parameter our model needs to learn: b and w.

Without PyTorch, we would have to start with our loss, and work the partial derivatives out to compute the gradients manually. Sure, it would be easy enough to do it for this toy problem, but we need something that can scale.
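For reference, here is a sketch of the manual work that autograd spares us. It assumes the model predicts b + w·x for each of N data points; the symbols N, x_i, y_i and ŷ_i are notation introduced here for illustration, chosen to match the code used later in this post:

```latex
\begin{align}
\hat{y}_i &= b + w x_i \\
\text{MSE} &= \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \\
\frac{\partial\,\text{MSE}}{\partial b} &= -\frac{2}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i) \\
\frac{\partial\,\text{MSE}}{\partial w} &= -\frac{2}{N} \sum_{i=1}^{N} x_i \, (y_i - \hat{y}_i)
\end{align}
```

Two parameters are manageable by hand, but every new parameter or layer adds more terms to work out.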

So, how do we do it? PyTorch provides some really handy methods we can use to easily compute the gradients. Let’s check them out!

Unlike a tensor that only holds data, a tensor for a parameter requires the computation of its gradients, so we can update its value. That’s what the requires_grad=True argument is good for. It tells PyTorch to compute gradients for us.

Remember: a tensor for a learnable parameter requires a gradient!

In code, creating tensors for our two parameters looks like this:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
```

# backward

Do you remember the starting point for computing the gradients? It is the loss: we compute its partial derivatives with respect to our parameters.

Hence, we need to invoke the backward() method from the corresponding Python variable: loss.backward().

The code below illustrates it well, assuming we’re making both predictions and computing the loss using nothing but PyTorch tensors (x_train_tensor and y_train_tensor hold our training data):

```python
# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train_tensor

# Step 2 - Computes the loss
# We are using ALL data points, so this is BATCH gradient descent.
# How wrong is our model? That's the error!
error = (y_train_tensor - yhat)
# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()

# Step 3 - Computes gradients for both "b" and "w" parameters
# No more manual computation of gradients!
loss.backward()
```

Which tensors are going to be handled by the backward() method applied to the loss?

• b
• w
• yhat
• error

We have set requires_grad=True for both b and w, so they are obviously included in the list. We use them both to compute yhat, so it will also make it to the list. Then we use yhat to compute the error, which is also added to the list.

Do you see the pattern here? If a tensor in the list is used to compute another tensor, the latter will also be included in the list. Tracking these dependencies is exactly what the dynamic computation graph is doing, as we’ll see shortly.

What about x_train_tensor and y_train_tensor? They are involved in the computation too… but they contain data, and thus they are not created as gradient-requiring tensors. So, backward() does not care about them.
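One way to check this for yourself is to look at the requires_grad attribute of each tensor. The sketch below is self-contained, so it makes up its own training data (an assumption for illustration only, not the data from this post) and runs on the CPU:

```python
import torch

torch.manual_seed(42)

# Hypothetical stand-ins for the training data (not from the original post)
x_train_tensor = torch.rand(100, 1)
y_train_tensor = 1 + 2 * x_train_tensor

# Parameters: the tensors we explicitly ask PyTorch to track
b = torch.randn(1, requires_grad=True, dtype=torch.float)
w = torch.randn(1, requires_grad=True, dtype=torch.float)

yhat = b + w * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()

tensors = [("b", b), ("w", w), ("yhat", yhat), ("error", error),
           ("loss", loss), ("x_train_tensor", x_train_tensor),
           ("y_train_tensor", y_train_tensor)]

# Anything computed from b or w inherits requires_grad=True; the data does not
for name, tensor in tensors:
    print(f"{name:>15}: requires_grad={tensor.requires_grad}, is_leaf={tensor.is_leaf}")
```

The is_leaf attribute tells parameters apart from intermediate results: b and w are leaves we created ourselves, while yhat, error and loss are produced by operations.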

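So, where do the computed gradients end up? We can inspect them by looking at the grad attribute of each parameter:

```python
b.grad, w.grad
```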
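If you want some reassurance that these are the right numbers, you can compare them with the partial derivatives of the MSE worked out by hand. This is a minimal sketch on made-up data (an assumption for illustration); detach() is used only to get a copy of the error values that sits outside the graph:

```python
import torch

torch.manual_seed(42)

# Hypothetical training data, assumed here only for illustration
x_train_tensor = torch.rand(100, 1)
y_train_tensor = 1 + 2 * x_train_tensor + 0.1 * torch.randn(100, 1)

b = torch.randn(1, requires_grad=True, dtype=torch.float)
w = torch.randn(1, requires_grad=True, dtype=torch.float)

# Steps 1 to 3, exactly as before
yhat = b + w * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()
loss.backward()

# Partial derivatives of the MSE, worked out by hand
plain_error = error.detach()
b_grad_manual = -2 * plain_error.mean()
w_grad_manual = -2 * (x_train_tensor * plain_error).mean()

print(b.grad.item(), b_grad_manual.item())
print(w.grad.item(), w_grad_manual.item())
```

Both pairs should agree up to floating-point rounding.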

OK, we got gradients, but there is one more thing to pay attention to: by default, PyTorch accumulates the gradients. How to handle that?

# zero_

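Every time we use the gradients to update the parameters, we need to zero them afterward. That’s what zero_() is good for: in PyTorch, a method that ends with an underscore modifies its tensor in place.

```python
# This code will be placed after Step 4 (updating the parameters)
b.grad.zero_()
w.grad.zero_()
```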
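To see the accumulation with your own eyes, here is a tiny sketch using a single hand-picked parameter and data point (both are assumptions chosen to keep the arithmetic obvious):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([3.0])  # plain data, no gradient needed

# d(w * x)/dw = x = 3, so every backward() call adds 3 to w.grad
(w * x).sum().backward()
first = w.grad.clone()

(w * x).sum().backward()
second = w.grad.clone()

# Trailing underscore: the gradient tensor is zeroed in place
w.grad.zero_()
after_zero = w.grad.clone()

print(first, second, after_zero)
```

The second call does not replace the gradient, it adds to it. That is why forgetting zero_() inside a training loop makes the updates grow larger and larger.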

So, we can definitely ditch the manual computation of gradients and use both backward() and zero_() methods instead.

That’s it? Well, pretty much… but there is always a catch, and this time it has to do with the update of the parameters.

# Updating Parameters

To update the parameters, we first need a learning rate:

```python
lr = 0.1
```

And then use it to perform the updates:

```python
# Attempt at Step 4
b -= lr * b.grad
w -= lr * w.grad
```

But we cannot simply perform an update like this! Why not?! It is a case of “too much of a good thing”. The culprit is PyTorch’s ability to build a dynamic computation graph from every Python operation that involves any gradient-computing tensor or its dependencies. Our update is one of those operations, so PyTorch tries to track it as well.
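Here is a small sketch of what goes wrong, using a single parameter with a hand-made gradient (an assumption to keep the example short). The exact wording of the error message may differ between PyTorch versions:

```python
import torch

lr = 0.1
b = torch.tensor([1.0], requires_grad=True)
b.sum().backward()  # d(sum(b))/db = 1, so b.grad holds 1.0

# Attempt 1: in-place update of a tensor that requires gradients
try:
    b -= lr * b.grad
    in_place_failed = False
except RuntimeError as exc:
    in_place_failed = True
    print("In-place update rejected:", exc)

# Attempt 2: reassignment builds a brand-new tensor on top of the old one
b_new = b - lr * b.grad
print(b_new.is_leaf, b_new.grad_fn)  # a node in the graph, not a parameter
```

The in-place version is refused outright. The reassigned version runs, but the result is the output of a subtraction tracked by the graph rather than a fresh parameter, so it would not collect its own gradient on the next backward() call.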

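We need to tell PyTorch to step back and let us update the parameters without messing up its dynamic computation graph. That’s what torch.no_grad() is good for: inside its block, operations on tensors are not tracked. This time, the update will work as expected: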

```python
# Step 4, for real
with torch.no_grad():
    b -= lr * b.grad
    w -= lr * w.grad
```
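A quick way to convince yourself that the parameter survives the update intact is to check its attributes afterward. The sketch below again uses a single hand-picked parameter (an assumption for illustration):

```python
import torch

lr = 0.1
b = torch.tensor([1.0], requires_grad=True)
(2 * b).sum().backward()  # d(sum(2 * b))/db = 2, so b.grad holds 2.0

with torch.no_grad():
    b -= lr * b.grad   # plain arithmetic: 1.0 - 0.1 * 2.0
    untracked = b * 3  # results created in here are not tracked either

print(b)
print(b.requires_grad, b.is_leaf, untracked.requires_grad)
```

The value of b changes, but it is still a leaf tensor that requires gradients, so the next forward pass can build a fresh graph on top of it.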

Mission accomplished! We updated our parameters b and w using PyTorch’s automatic differentiation package, autograd.

Well, we updated them once. To actually train a model, we need to place this code inside a loop. Putting it all together, and adding a loop to it, the code should look like this:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

lr = 0.1

for epoch in range(200):
    # Step 1 - Computes our model's predicted output - forward pass
    yhat = b + w * x_train_tensor

    # Step 2 - Computes the loss
    # We are using ALL data points, so this is BATCH gradient descent.
    # How wrong is our model? That's the error!
    error = (y_train_tensor - yhat)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()

    # Step 3 - Computes gradients for both "b" and "w" parameters
    # No more manual computation of gradients!
    loss.backward()

    # Step 4, for real
    with torch.no_grad():
        b -= lr * b.grad
        w -= lr * w.grad

    # Zeroes the gradients after the update, so they do not accumulate
    b.grad.zero_()
    w.grad.zero_()
```
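The loop above expects x_train_tensor and y_train_tensor to exist already. If you want something you can run from top to bottom, here is a sketch that wraps the same steps in a function and feeds it synthetic data. The train() helper, the data-generating values (b = 1, w = 2, a bit of noise) and the higher number of epochs are all assumptions made for this example, not part of the original code:

```python
import torch


def train(x, y, lr=0.1, n_epochs=1000, seed=42):
    """Fits y ~ b + w * x with batch gradient descent, using autograd."""
    torch.manual_seed(seed)
    b = torch.randn(1, requires_grad=True, dtype=torch.float, device=x.device)
    w = torch.randn(1, requires_grad=True, dtype=torch.float, device=x.device)

    for epoch in range(n_epochs):
        yhat = b + w * x                 # Step 1 - forward pass
        loss = ((y - yhat) ** 2).mean()  # Step 2 - MSE loss
        loss.backward()                  # Step 3 - gradients for b and w

        with torch.no_grad():            # Step 4 - update outside the graph
            b -= lr * b.grad
            w -= lr * w.grad

        b.grad.zero_()                   # zero the gradients for the next epoch
        w.grad.zero_()

    return b.item(), w.item()


# Synthetic data: an assumption for this sketch, generated with b = 1 and w = 2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch.manual_seed(13)
x_train_tensor = torch.rand(100, 1, device=device)
y_train_tensor = 1 + 2 * x_train_tensor + 0.1 * torch.randn(100, 1, device=device)

b_hat, w_hat = train(x_train_tensor, y_train_tensor)
print(b_hat, w_hat)
```

The learned values should land close to the ones used to generate the data, give or take the noise.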

That was autograd in action! Now it is time to take a peek at the…

# Dynamic Computation Graph

I want you to see the graph for yourself!

The PyTorchViz package and its make_dot(variable) function allow us to easily visualize a graph associated with a given Python variable involved in the gradient computation.

So, let’s stick with the bare minimum: two (gradient computing) tensors for our parameters (b and w) and the predictions (yhat) — these are Steps 0 and 1.

```python
from torchviz import make_dot

make_dot(yhat)
```

Running the code above renders the graph for yhat.

Let’s take a closer look at its components:

• blue boxes ((1)s): these boxes correspond to the tensors we use as parameters, the ones we’re asking PyTorch to compute gradients for
• gray box (MulBackward0): a Python operation that involves a gradient-computing tensor or its dependencies
• green box (AddBackward0): the same as the gray box, except that it is the starting point for the computation of gradients (assuming the backward() method is called from the variable used to visualize the graph) — they are computed from the bottom-up in a graph

Now, take a closer look at the green box at the bottom of the graph: two arrows are pointing to it since it is adding up two variables, b and w*x. Seems obvious, right?

Then, look at the gray box (MulBackward0) of the same graph: it is performing a multiplication, namely, w*x. But there is only one arrow pointing to it! The arrow comes from the blue box that corresponds to our parameter w.

Why don’t we have a box for our data (x)?

The answer is: we do not compute gradients for it!

So, even though there are more tensors involved in the operations performed by the computation graph, it only shows gradient-computing tensors and their dependencies.
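If you do not have PyTorchViz at hand, you can get a rough text version of the same picture by walking the graph yourself. This sketch relies on the grad_fn and next_functions attributes that PyTorch attaches to tracked tensors; the describe() helper and the tiny data tensor are assumptions made for illustration:

```python
import torch

torch.manual_seed(42)
x_train_tensor = torch.rand(5, 1)  # hypothetical data, not tracked

b = torch.randn(1, requires_grad=True, dtype=torch.float)
w = torch.randn(1, requires_grad=True, dtype=torch.float)
yhat = b + w * x_train_tensor


def describe(fn, depth=0, lines=None):
    """Collects the names of the graph nodes reachable from fn."""
    lines = [] if lines is None else lines
    name = type(fn).__name__ if fn is not None else "None (untracked input)"
    lines.append("  " * depth + name)
    if fn is not None:
        for next_fn, _ in fn.next_functions:
            describe(next_fn, depth + 1, lines)
    return lines


graph_lines = describe(yhat.grad_fn)
print("\n".join(graph_lines))
```

The addition node has two tracked inputs, while the multiplication node has only one: the slot for the data shows up as None, which is the text equivalent of the missing box for x.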

What would happen to the computation graph if we set requires_grad to False for our parameter b?

```python
# New Step 0
b_nograd = torch.randn(1, requires_grad=False, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

# New Step 1
yhat = b_nograd + w * x_train_tensor

make_dot(yhat)
```

Unsurprisingly, the blue box corresponding to the parameter b is no more!

Simple enough: no gradients, no graph!
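Taking this one step further: if none of the tensors involved requires gradients, there is no graph at all. A minimal sketch, where w_nograd and the small data tensor are hypothetical names added for illustration:

```python
import torch

torch.manual_seed(42)
x_train_tensor = torch.rand(5, 1)  # hypothetical data

b_nograd = torch.randn(1, requires_grad=False, dtype=torch.float)
w_nograd = torch.randn(1, requires_grad=False, dtype=torch.float)
w = torch.randn(1, requires_grad=True, dtype=torch.float)

partly_tracked = b_nograd + w * x_train_tensor
untracked = b_nograd + w_nograd * x_train_tensor

print(partly_tracked.grad_fn)  # still a graph node, because w is tracked
print(untracked.grad_fn)       # None: nothing here requires gradients
```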

The best thing about the dynamic computation graph is the fact that you can make it as complex as you want it. You can even use control flow statements (e.g., if statements) to control the flow of the gradients.

The code below shows an example of this. And yes, I do know that the computation itself is complete nonsense.

```python
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

yhat = b + w * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()

# this makes no sense!!
if loss > 0:
    yhat2 = w * x_train_tensor
    error2 = y_train_tensor - yhat2

# neither does this :-)
loss += error2.mean()

make_dot(loss)
```

Even though the computation is nonsensical, you can clearly see the effect of adding a control flow statement like if loss > 0: it branches the computation graph in two parts. The right branch performs the computation inside the if statement, which gets added to the result of the left branch in the end. Cool, right?
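The same idea can be checked numerically. In the sketch below, a toy function (gradient_at() is a hypothetical helper, just as nonsensical as the example above) takes a different branch depending on its input, and the gradient follows whichever branch was actually executed:

```python
import torch


def gradient_at(x_value):
    """Returns dy/dw for a toy function whose graph depends on the data."""
    w = torch.tensor(2.0, requires_grad=True)
    x = torch.tensor(x_value)
    y = w * x
    if y > 0:       # ordinary Python control flow, evaluated at run time
        y = y ** 2  # this branch adds one more node to the graph
    y.backward()
    return w.grad.item()


grad_positive = gradient_at(1.0)   # y = (w * x) ** 2, so dy/dw = 2 * w * x ** 2
grad_negative = gradient_at(-1.0)  # y = w * x, so dy/dw = x
print(grad_positive, grad_negative)
```

Since the graph is rebuilt on every forward pass, no special syntax is needed for the if statement: PyTorch simply records the operations that ran.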

# To be continued…

Don’t miss my talk at ODSC Europe 2020: “PyTorch 101: building a model step-by-step.”

The content of this post was adapted from my book “Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide”. Learn more about it at http://leanpub.com/pytorch.

Daniel is a data scientist, developer, and author of “Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide”.

He has been teaching machine learning and distributed computing technologies at Data Science Retreat, the longest-running Berlin-based bootcamp, for more than three years, helping more than 150 students advance their careers.

His professional background includes 20 years of experience working for companies in several industries: banking, government, fintech, retail and mobility.

