Gradients and Gradient Tape — Part 1
Gradients — Introduction
Mathematically, at the core of Neural Networks lies a challenging curve fitting problem. Curve fitting is an optimization problem: our objective is to find the parameters of the curve that best fits the dataset. This is one of the most important steps in any Neural Network, and failing to find a good fit can have a huge impact on the end result.
The topic we’ll be exploring is quite vast, so I’ve divided it into parts. This first part covers the basics of gradient computation, followed by some simple examples of TensorFlow’s Gradient Tape.
Before diving into the specifics of Gradient Tape, let us build an intuitive understanding of gradients.
In simple terms, the gradient is the slope.
The figure above shows points on the 2D plane for the line y = 0.5*x + 2.
Here the slope of the line (its gradient) is 0.5.
For curves, the gradient at any point is the derivative of the function f(x) evaluated at that particular x. The direction of the slope depends on where it is computed. For example, in the figure below,
at x = -2 the gradient is negative (it points in the -y direction), while at x = 2 the gradient is positive (it points in the +y direction).
Now that we have a basic idea about gradients, let us see how we can use Tensorflow’s Gradient Tape to calculate gradient.
Introduction to Gradient Tape
Gradient Tape is TensorFlow’s tool for automatic differentiation (autodiff), one of the core functionalities of TensorFlow.
Automatic differentiation refers to a set of techniques for evaluating gradients. It is used extensively in modern machine learning. Libraries like TensorFlow, PyTorch and several others make it easy to calculate the derivatives of arbitrary functions. In this blog, we will focus only on TensorFlow’s implementation of autodiff via Gradient Tape.
How is this technique useful in AI?
Autodiff can be used to compute the partial derivative of a function at a particular point, which comes in very handy while solving curve fitting problems. As for Neural Networks, backpropagation, the procedure used to compute the gradients needed to fit the parameters of a Neural Network, is a special case of automatic differentiation. We will see how to implement it using Gradient Tape in Part 2 of this blog. For now, we will look at some simple uses of Gradient Tape to understand how it works.
Getting Started with Gradient Tape
Example 1: Find the gradient of y=x³ at x=5 and x=7
Let us start by importing the necessary packages. For our example we require numpy and tensorflow.
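The original import cell isn’t reproduced here; a minimal version of it, assuming nothing beyond numpy and tensorflow, looks like this:

```python
# Packages used throughout the examples
# (numpy is imported for convenience; the snippets below only need tensorflow)
import numpy as np
import tensorflow as tf
```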
Next, we write a function for computing y=x³ and use Gradient Tape to compute the gradient dy/dx. TensorFlow “records” the relevant operations executed inside tf.GradientTape onto a tape (in the code below it is represented by the variable t), and then uses that tape to compute the gradients of the recorded computation with reverse-mode differentiation. We won’t go into the details of reverse-mode differentiation here, but I have included a blog on it in the Reference section. Feel free to check it out!
Note that watch tells the tape to record the operations performed on the tensor x. It acts like a hook and can be used to control which values are monitored. After this we use t.gradient to calculate the gradient of the function at a particular point. The snippet below shows the function and the call that gets the gradient of y=x³ at x=5.
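The exact snippet from the notebook isn’t reproduced here; the sketch below is my reconstruction, and the name compute_gradient is a placeholder of mine:

```python
def compute_gradient(x):
    """Return dy/dx for y = x**3 at the given tensor x."""
    with tf.GradientTape() as t:
        t.watch(x)            # record operations involving x on the tape
        y = x ** 3            # forward computation: y = x^3
    return t.gradient(y, x)   # reverse-mode gradient dy/dx

# Gradient of y = x^3 at x = 5
result = compute_gradient(tf.constant(5.0))
print(result)

# Gradient of y = x^3 at x = 7, i.e. 3 * 7^2 = 147
print(compute_gradient(tf.constant(7.0)))
```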
On printing result, we get a tensor holding the value 75.0.
To validate the result manually, we compute dy/dx for y=x³, which is 3x². Substituting x=5 into 3x², we get 75.
Example 2: Finding the intermediate gradient and implementing Chain Rule in Partial Differentiation
A simple case of the Chain Rule for partial differentiation can be stated mathematically as follows:
For z=f(y) and y=g(x),
dz/dx=(dz/dy)*(dy/dx)
For our example, we will use the following equations,
z=y²
y=x²
Our objective is to compute dz/dx at x=3
We know dz/dy=2y…(i) and dy/dx=2x…(ii)
Substituting y=x² in (i), we get dz/dy=2x²…(iii)
Applying Chain Rule,
dz/dx=(dz/dy)*(dy/dx)…(iv)
Replacing dz/dy and dy/dx using equations (iii) and (ii) we get,
dz/dx=2x²*2x=4x³…(v)
At x=3, the value of dz/dx is 108
The reference section contains the link to a blog for more details on Chain Rule. We will start coding now.
Let us start by writing a function for the gradient calculation. The code is very similar to Example 1, with one extra line for each of the two equations; a sketch is shown below.
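As before, this is my reconstruction rather than the notebook’s exact code, and the name compute_chained_gradient is a placeholder of mine:

```python
def compute_chained_gradient(x):
    """Return dz/dx for z = y**2 with y = x**2 at the given tensor x."""
    with tf.GradientTape() as t:
        t.watch(x)
        y = x ** 2            # intermediate equation: y = x^2
        z = y ** 2            # final equation: z = y^2
    return t.gradient(z, x)   # dz/dx, computed via the chain rule

dz_dx = compute_chained_gradient(tf.constant(3.0))
print(dz_dx)
```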
As you can see above, two lines compute the intermediate equation y=x² and the final equation z=y², and the gradient dz/dx is then obtained with t.gradient.
The printed result is 108.0, which matches the value we computed manually.
Example 3: Computing intermediate values for Example 2
In some cases we might need to monitor the intermediate gradient dy/dx along with dz/dx. However, a Gradient Tape by default expires after a single call to gradient. To make sure the tape can be used multiple times, we pass the additional parameter persistent and set it to True.
Let us first see what happens if we don’t use persistent=True. The function below tries to return the value of dy/dx along with dz/dx. Notice that persistent hasn’t been set yet.
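A sketch of that function, again with placeholder names of mine, might look like this:

```python
def compute_gradients_non_persistent(x):
    """Try to return both dz/dx and dy/dx from a single-use tape."""
    with tf.GradientTape() as t:   # persistent defaults to False
        t.watch(x)
        y = x ** 2
        z = y ** 2
    dz_dx = t.gradient(z, x)       # first call works
    dy_dx = t.gradient(y, x)       # second call fails: the tape has expired
    return dz_dx, dy_dx

compute_gradients_non_persistent(tf.constant(3.0))
```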
On running this snippet, a RuntimeError pops up, telling us that a non-persistent GradientTape can only be used to compute one set of gradients.
Let us modify the function to set the persistent parameter to True.
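The persistent version, sketched under the same assumptions as before:

```python
def compute_gradients_persistent(x):
    """Return both dz/dx and dy/dx using a reusable (persistent) tape."""
    with tf.GradientTape(persistent=True) as t:
        t.watch(x)
        y = x ** 2
        z = y ** 2
    dz_dx = t.gradient(z, x)   # the tape survives this first call
    dy_dx = t.gradient(y, x)   # so a second call is now allowed
    del t                      # release the tape's resources once done
    return dz_dx, dy_dx
```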
After running this, the error from the previous snippet is gone. We can print the intermediate values as follows,
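(Again using the placeholder function name from the sketch above; the expected values follow from dy/dx = 2x and dz/dx = 4x³.)

```python
dz_dx, dy_dx = compute_gradients_persistent(tf.constant(3.0))
print(dy_dx)  # dy/dx = 2x   -> 6.0 at x = 3
print(dz_dx)  # dz/dx = 4x^3 -> 108.0 at x = 3
```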
The notebook for the examples is available at https://github.com/MohanaRC/GradCam_Tensorflow_2/blob/main/Getting%20Started%20with%20Gradient%20Tape.ipynb
References:
- TensorFlow Advanced Specialization by Coursera and DeepLearning.AI
- Autodiff: https://www.tensorflow.org/guide/autodiff
- Reverse mode differentiation: https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation
- Chain rule: https://tutorial.math.lamar.edu/classes/calciii/chainrule.aspx