# Unraveling Automatic Differentiation

Gradients and derivatives are at the heart of machine learning algorithms; the reason ML just “works” is gradients. The underlying idea is to minimize an objective (loss) function that depends on the inputs and the parameters of the model, and this is done beautifully and elegantly using multivariate calculus and the concept of gradients.

Anybody who has completed a basic course in calculus would find calculating derivatives and gradients of a differentiable function a very mechanical and straightforward process, thanks to the chain rule. But how does a computer calculate derivatives for an arbitrary differentiable function?

The straightforward approach of manually working out the derivative will not work, because it is specific to the function being differentiated and does not generalize. Let’s go back to the first-principles definition of the derivative:

*f*′(*x*) = lim (*h* → 0) [*f*(*x* + *h*) − *f*(*x*)] / *h*

Approximating this limit numerically with a small but finite *h* (the finite-difference method) gives an estimate of the derivative, but its accuracy depends heavily on the choice of *h*: too large and the truncation error dominates, too small and floating-point round-off dominates. It is a very naive method to compute the derivative of a function at a point.
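The trade-off is easy to see empirically. Below is a minimal Python sketch (the function name `numerical_derivative` is ours, just for illustration) that estimates the derivative of sin at *x* = 1 with a forward difference for several step sizes:

```python
import math

def numerical_derivative(f, x, h):
    # forward-difference approximation: (f(x + h) - f(x)) / h
    return (f(x + h) - f(x)) / h

# the true derivative of sin at x = 1.0 is cos(1.0)
for h in (1e-2, 1e-6, 1e-12):
    approx = numerical_derivative(math.sin, 1.0, h)
    error = abs(approx - math.cos(1.0))
    print(f"h = {h:g}: estimate = {approx:.10f}, error = {error:.2e}")
```

Shrinking *h* first improves the estimate, but past a point the round-off error in computing *f*(*x* + *h*) − *f*(*x*) takes over and the estimate degrades again.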

One of the most important outcomes of 20th-century mathematical research was Automatic Differentiation (AD), also known as algorithmic differentiation. Most ML/DL libraries use AD to calculate gradients and derivatives. It has two modes: forward mode and reverse mode. Let’s first understand the forward mode, as it is very intuitive and directly exploits the chain rule of partial differentiation.

**Forward Mode AD**

A very clever way to implement Forward mode AD is through Dual numbers. Dual numbers are very similar to complex numbers and have two components.

In complex numbers we have i² = −1; likewise in dual numbers we have ε² = 0, but counter-intuitively, ε itself is not equal to 0. ε can be understood as an infinitesimally small number: small enough that its square vanishes, yet not zero itself. Otherwise, the arithmetic of dual numbers follows the same componentwise pattern as complex-number arithmetic, with ε² = 0 taking the place of i² = −1.

Now, when using dual numbers for AD, we set the real part to the value of the function at a point, and the epsilon (dual) part to the derivative of the function at that point.

Let *f*(*x*) = *x*, the identity function. Evaluated on a dual number, *f*(*x* + *εx*′) = *x* + *εx*′: both the value and the derivative pass through unchanged.

(*x*+*εx*′)+(*y*+*εy*′) = (*x*+*y*)+*ε*(*x*′+*y*′)

(*x*+*εx*′)(*y*+*εy*′) = (*xy*)+*ε*(*xy*′+*x*′*y*)

*f*(*g*(*x*+*εx*′)) = *f*(*g*(*x*)+*εg*′(*x*)*x*′) = *f*(*g*(*x*))+*εf*′(*g*(*x*))*g*′(*x*)*x*′ (Chain Rule)

It is clear from the above examples that dual-number arithmetic obeys the chain rule, in accordance with the convention that the real part of a dual number carries the value of the function and the epsilon part carries its derivative.
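As a concrete sketch, here is a minimal dual-number class in Python (a toy implementation of our own, not ForwardDiff.jl) supporting addition, multiplication, and sin, which is enough to differentiate small composite functions:

```python
import math

class Dual:
    """A dual number val + eps*der: val is the function value, der its derivative."""
    def __init__(self, val, der=0.0):
        self.val = val   # real part: value of the function
        self.der = der   # epsilon part: derivative of the function

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (x + eps x') + (y + eps y') = (x + y) + eps (x' + y')
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (x + eps x')(y + eps y') = xy + eps (x y' + x' y), since eps^2 = 0
        return Dual(self.val * other.val,
                    self.val * other.der + self.der * other.val)
    __rmul__ = __mul__

def d_sin(x):
    # chain rule: sin(g)' = cos(g) * g'
    return Dual(math.sin(x.val), math.cos(x.val) * x.der)

# derivative of f(x) = x * sin(x) at x = 2 is sin(2) + 2*cos(2)
x = Dual(2.0, 1.0)        # seed the input's derivative with 1
y = x * d_sin(x)
print(y.val, y.der)
```

Seeding the input with derivative 1 and reading off the epsilon part of the output recovers the ordinary derivative.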

Let us now consider a differentiable function, and obtain its gradient through Forward Mode AD. Let

*y* = *f*(*x1*, *x2*) = ln(*x1*) + *x1·x2* − sin(*x2*)

Any function can be written as a composition of basic mathematical operations such as addition, subtraction, multiplication, division and functions such as square root, trigonometric functions, logarithmic and exponential functions. We define the derivative of these basic functions as rules, and use these rules to obtain derivatives of their composite functions.

So if we break down this given function as a composition of basic functions, we obtain a computation graph with the output of each basic function/operation as nodes.

Here *v−1* and *v0* are the input nodes, and the rest are intermediate functions that lead to the desired output function.

The left column of the table shows the computation of the value of the function at *x1* = 2 and *x2* = 5. The relation between intermediate nodes is clear from the graph. The right column shows the partial derivative of each intermediate function with respect to *x1*. Since *v−1* = *x1*, the partial derivative of *v−1* with respect to *x1* is 1, and since *v0* = *x2*, and *x1* and *x2* are independent, the partial derivative of *v0* with respect to *x1* is zero.

From there on, the rest of the functions and their partial derivatives with respect to *x1* can be obtained using the chain rule, and eventually we obtain the partial derivative of *f* with respect to *x1*. The computed derivative is exact, up to floating-point precision — there is no approximation error from a finite step size.
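To make the trace concrete, here is a hand-written forward-mode pass for this function in Python (the variable names `v_m1`, `v0`, … mirror the nodes of the computation graph; the helper `f_forward` is ours, for illustration only):

```python
import math

def f_forward(x1, x2, dx1, dx2):
    """Evaluate y = ln(x1) + x1*x2 - sin(x2) along with its
    directional derivative for the seed (dx1, dx2)."""
    v_m1, dv_m1 = x1, dx1                              # v-1 = x1
    v0,   dv0   = x2, dx2                              # v0  = x2
    v1,   dv1   = math.log(v_m1), dv_m1 / v_m1         # v1 = ln(v-1)
    v2,   dv2   = v_m1 * v0, dv_m1 * v0 + v_m1 * dv0   # v2 = v-1 * v0
    v3,   dv3   = math.sin(v0), math.cos(v0) * dv0     # v3 = sin(v0)
    v4,   dv4   = v1 + v2, dv1 + dv2                   # v4 = v1 + v2
    v5,   dv5   = v4 - v3, dv4 - dv3                   # v5 = v4 - v3
    return v5, dv5                                     # y = v5

# seed (1, 0) gives dy/dx1; seed (0, 1) gives dy/dx2 — one pass per input
y, dy_dx1 = f_forward(2.0, 5.0, 1.0, 0.0)
_, dy_dx2 = f_forward(2.0, 5.0, 0.0, 1.0)
print(y, dy_dx1, dy_dx2)   # dy/dx1 = 1/x1 + x2 = 0.5 + 5 = 5.5
```

Each forward pass carries one directional derivative alongside the values, which is exactly why *n* inputs require *n* passes.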

So if we have a function with *n* input variables, we have to perform *n* forward passes to obtain all the partial derivatives. Since both the function values and the partial derivatives flow from input to output, this mode is called forward mode.

Julia has a package for forward mode AD called ForwardDiff.jl, which defines a `Dual` type with the same properties described above. It uses a set of differentiation rules for basic functions to obtain the partial derivatives of any differentiable function via the chain rule.

**Reverse Mode AD**

In the reverse mode, gradients are propagated backwards from the output using a generalized backpropagation algorithm. This is done by complementing each intermediate variable *v* with an adjoint, defined as *v̄* = ∂*y*/∂*v*.

In reverse mode AD, derivatives are computed in the second phase of a two-phase process:

- In the first phase, the original function code is run forward, populating the intermediate variables *vi* and recording the dependencies in the computational graph.
- In the second phase, derivatives are calculated by propagating adjoints in reverse, from the outputs to the inputs.

Let’s use the same function as before to understand reverse mode AD. The computation graph is also the same as before. We shall calculate the gradient of the function at *x1* = 2 and *x2* = 5.

*y* = *f*(*x1*, *x2*) = ln(*x1*) + *x1·x2* − sin(*x2*)

The function is broken down into basic functions to form a computation graph.

In the first phase, we complete one forward pass to compute all intermediate variables *v*, and hence the output of the function *f*, at *x1* = 2 and *x2* = 5. In the second phase, we start with the output variable.

*v̄5* = *ȳ* = ∂*y*/∂*y* = 1, since *v5* = *y*. We then backpropagate to *v̄4*, *v̄3*, and so on until we reach the input variables, as shown in the right column of the table above. Again, the differentiation rules for basic functions and operations are applied. We arrive at the same value of ∂*y*/∂*x1* through reverse mode AD as through forward mode AD.

The beauty of this approach is that we obtain the partial derivatives of the function with respect to all input variables in just one forward pass and one backward pass. So for an input vector of *n* dimensions and a scalar output function, a single forward-and-backward pass of reverse mode AD is sufficient to obtain all the partial derivatives, whereas forward mode AD would require one forward pass per input variable, *n* in total.
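A minimal hand-written sketch of the two phases in Python (the helper `f_reverse` is ours; a real implementation would record a tape or graph instead of hard-coding the adjoint steps):

```python
import math

def f_reverse(x1, x2):
    """Reverse mode AD for y = ln(x1) + x1*x2 - sin(x2), written out by hand."""
    # Phase 1: forward pass, populating intermediate variables
    v1 = math.log(x1)
    v2 = x1 * x2
    v3 = math.sin(x2)
    v4 = v1 + v2
    v5 = v4 - v3
    y = v5

    # Phase 2: propagate adjoints v_bar = dy/dv from output to inputs
    v5_bar = 1.0                                    # dy/dy = 1
    v4_bar = v5_bar                                 # v5 = v4 - v3
    v3_bar = -v5_bar
    v1_bar = v4_bar                                 # v4 = v1 + v2
    v2_bar = v4_bar
    x2_bar = v2_bar * x1 + v3_bar * math.cos(x2)    # v2 = x1*x2, v3 = sin(x2)
    x1_bar = v2_bar * x2 + v1_bar / x1              # v2 = x1*x2, v1 = ln(x1)
    return y, x1_bar, x2_bar

y, dy_dx1, dy_dx2 = f_reverse(2.0, 5.0)
print(y, dy_dx1, dy_dx2)   # both partials from one forward + one backward pass
```

Note that a single backward sweep produces both ∂*y*/∂*x1* and ∂*y*/∂*x2*, whereas the forward-mode version needed a separate pass per input.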

If the output function is *m*-dimensional, then we need *m* forward-and-backward passes of reverse mode AD to obtain the gradient of each output dimension with respect to the inputs. The name reverse mode is justified because the function is computed from input to output, but the gradients are computed in the reverse direction, i.e., from output to input.
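The trade-off can be sketched with a toy function of *n* = 2 inputs and *m* = 2 outputs (all names and derivative rules below are written out by hand, purely for illustration): each forward-mode pass yields one column of the Jacobian, while each reverse-mode pass yields one row.

```python
import math

# y1 = x1 * x2,  y2 = x1 + sin(x2); Jacobian = [[x2, x1], [1, cos(x2)]]

def forward_pass(x1, x2, dx1, dx2):
    # one forward-mode pass: directional derivative of BOTH outputs (one column)
    dy1 = dx1 * x2 + x1 * dx2           # d(x1 * x2)
    dy2 = dx1 + math.cos(x2) * dx2      # d(x1 + sin(x2))
    return dy1, dy2

def reverse_pass(x1, x2, y1_bar, y2_bar):
    # one reverse-mode pass: adjoints of BOTH inputs for one output seed (one row)
    x1_bar = y1_bar * x2 + y2_bar * 1.0
    x2_bar = y1_bar * x1 + y2_bar * math.cos(x2)
    return x1_bar, x2_bar

x1, x2 = 2.0, 5.0
col1 = forward_pass(x1, x2, 1.0, 0.0)   # n = 2 forward passes -> two columns
col2 = forward_pass(x1, x2, 0.0, 1.0)
row1 = reverse_pass(x1, x2, 1.0, 0.0)   # m = 2 reverse passes -> two rows
row2 = reverse_pass(x1, x2, 0.0, 1.0)
print(col1, col2, row1, row2)           # both assemble the same Jacobian
```

This is why reverse mode wins when *n* ≫ *m* (the typical case in ML: many parameters, one scalar loss), and forward mode wins when *m* ≫ *n*.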

Julia has reverse mode AD implemented in the package ReverseDiff.jl, which backpropagates adjoints as described above to compute the gradients.

Automatic differentiation is a very powerful technique that is widely used in machine learning. We would not have so many ML/DL libraries if we could not compute gradients efficiently. PyTorch and TensorFlow are also built on the principles of automatic differentiation.
