Auto Differentiation with TensorFlow
Let's unravel the mechanism of auto differentiation!
If you are new to neural networks, much of this post may be hard to follow, so I would suggest reading Neural Network 101 before jumping into this.
We know that in deep learning we typically have a loss function that gets differentiated with respect to our model's parameters (weights and biases), over and over, until the parameters converge to good values. But what is the underlying mechanism that powers all these computations?
It's Calculus for you.
But this isn't a Calculus 101 blog. Instead, we will look into automatic differentiation, a foundation on which deep learning frameworks are built. Deep learning models are trained with gradient-based techniques, and auto differentiation makes it easy to get the gradients even for complex, deep models.
Why do we need auto differentiation?
For instance, for the function below we can easily calculate the partial derivatives by hand, that is, the gradients with respect to both w1 and w2.
def f(w1, w2):
    return 3 * w1**2 + 2 * w1 * w2
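Working those partials out by hand gives df/dw1 = 6*w1 + 2*w2 and df/dw2 = 2*w1. A quick sanity check of the hand-derived formulas, using the values w1 = 5, w2 = 3 that come up again later in this post:

```python
def f(w1, w2):
    return 3 * w1**2 + 2 * w1 * w2

# Hand-derived partial derivatives of f
def df_dw1(w1, w2):
    return 6 * w1 + 2 * w2  # d/dw1 of 3*w1**2 + 2*w1*w2

def df_dw2(w1, w2):
    return 2 * w1           # d/dw2 of 3*w1**2 + 2*w1*w2

print(df_dw1(5, 3), df_dw2(5, 3))  # prints: 36 10
```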
Here our function is really simple, but that won't be the case when we construct a deep neural network, where the function is many times more complex. To be more precise, in math terms, we would have to apply the chain rule thousands of times.
Also, the intermediate results computed during the forward pass have to be kept around, and for a complete neural network they are huge in number; it is almost impossible to keep track of them by hand, yet we need them back when computing the gradients.
Alright, that's a lot of jargon; let's take a moment and see what those terms really mean to us.
Deconstructing the jargon
Well, the first time we run our parameters through the loss function and compute its value, recording every intermediate operation along the way, we call this process the forward pass.
The forward pass, then, is responsible for computing the loss function from our parameters. But we know that a neural network has to optimize its parameters to achieve the best results, that is, a minimized loss.
But how do we find the values that will help the neural network find the best parameters and minimize the loss?
Gradients.
We get the gradients by running backpropagation, also called the backward pass. First we perform a forward pass, which records the operations and intermediate results, and then backpropagation uses the chain rule on those records to compute the gradients for us.
But what do all of these things have to do with auto differentiation?
Auto differentiation keeps track of these computations for us, so during backpropagation it only has to replay the records to compute the gradients. This is how, with nothing more than the recorded partial derivatives, we can compute the gradients of the trainable variables (weights and biases) while keeping a record of thousands of derivatives and gradients.
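To make the record-and-replay idea concrete, here is a toy sketch of reverse-mode auto differentiation in plain Python. This is only an illustration, not how TensorFlow is actually implemented: each operation records its inputs and local derivatives during the forward pass, and the backward pass walks those records in reverse, applying the chain rule.

```python
# Toy reverse-mode auto differentiation (illustration only)
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # (parent_var, local_derivative) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, seed=1.0):
        # Chain rule: propagate the incoming gradient to each parent,
        # scaled by the local derivative recorded in the forward pass
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

w1, w2 = Var(5.0), Var(3.0)
three, two = Var(3.0), Var(2.0)
z = three * w1 * w1 + two * w1 * w2  # f(w1, w2) = 3*w1**2 + 2*w1*w2
z.backward()
print(w1.grad, w2.grad)  # prints: 36.0 10.0
```

Note how w1's gradient accumulates contributions from every path it appears on, exactly what the chain rule demands.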
Alright, enough theory; let's jump into some code and wrap this up. Let's use the above equation for our experimentation, and we're going to use TensorFlow for this.
w1, w2 = 5, 3  # params
eps = 1e-6  # a tiny step for the finite-difference approximation
# Approximating the partial derivatives by hand
(f(w1 + eps, w2) - f(w1, w2)) / eps  # df/dw1, approximately 36
(f(w1, w2 + eps) - f(w1, w2)) / eps  # df/dw2, approximately 10
Calling f() this way once per parameter would be a tedious process for a large neural network. So let's prettify this by using TensorFlow's GradientTape, which uses the auto differentiation mechanism.
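The GradientTape snippet that produces the output shown below was omitted; a minimal version, assuming the same f and the parameter values w1 = 5, w2 = 3 from earlier, would look like this:

```python
import tensorflow as tf

def f(w1, w2):
    return 3 * w1**2 + 2 * w1 * w2

w1, w2 = tf.Variable(5.), tf.Variable(3.)

with tf.GradientTape() as tape:
    z = f(w1, w2)  # the tape records every operation involving w1 and w2

gradients = tape.gradient(z, [w1, w2])  # one backward pass for both
print(gradients)
```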
[<tf.Tensor: id=828234, shape=(), dtype=float32, numpy=36.0>,
 <tf.Tensor: id=828229, shape=(), dtype=float32, numpy=10.0>]
How does this work?
- At first, we define two variables, w1 and w2, as tf.Variables.
- tf.GradientTape() will automatically record every operation that involves a variable (by default only trainable tf.Variables are watched, not tf.constant tensors).
- Then we ask the tape to compute the gradients of the result (loss) z with regard to both variables [w1 and w2].
The result is not just accurate, its precision is limited only by floating-point errors (unlike the finite-difference approximation above). One important thing to keep in mind: the gradient() method goes through the recorded computations only once, in reverse order (the backward pass), no matter how many variables there are.
We can also pause recording by opening a tape.stop_recording() block inside the tf.GradientTape() block.
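A small sketch of pausing the tape, reusing the same variables and function as before:

```python
import tensorflow as tf

w1, w2 = tf.Variable(5.), tf.Variable(3.)

with tf.GradientTape() as tape:
    z = 3 * w1**2 + 2 * w1 * w2
    with tape.stop_recording():
        # Operations in this block are not recorded on the tape,
        # so they will not contribute any gradients
        z_logged = z + 1.0

grads = tape.gradient(z, [w1, w2])  # still [36.0, 10.0]
```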
# The first call returns the gradient, and the tape is erased
dz_dw1 = tape.gradient(z, w1)  # works
dz_dw2 = tape.gradient(z, w2)  # raises RuntimeError: the tape was already used
The tape is automatically erased immediately after we call its gradient() method. But if we need to call gradient() more than once, we can use the persistent parameter.
With persistent=True, the tape survives multiple gradient() calls, so it is up to us to delete it (del tape) once we are done with it, to free its resources.
# Setting persistent to True
with tf.GradientTape(persistent = True) as tape:
    tape.watch(x)
tape.watch() comes in handy when, for example, we implement a custom regularization loss that penalizes activations that vary a lot while the inputs vary only a little; such a loss is based not on tf.Variables but most probably on tf.constant tensors, which the tape does not watch by default.
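For example, to get gradients with respect to tf.constant tensors, we have to ask the tape to watch them explicitly:

```python
import tensorflow as tf

c1, c2 = tf.constant(5.), tf.constant(3.)

with tf.GradientTape() as tape:
    # Constants are not watched automatically, so ask the tape to
    tape.watch(c1)
    tape.watch(c2)
    z = 3 * c1**2 + 2 * c1 * c2

grads = tape.gradient(z, [c1, c2])  # [36.0, 10.0]
```

Without the watch() calls, tape.gradient() would return None for both constants.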
You can learn more about this in the TensorFlow documentation, and I have attached some links below too. If anything seems off, feel free to point it out in the comments.
Until then,
Next time.
- Intro to auto differentiation: https://www.tensorflow.org/guide/autodiff
- Reverse mode auto diff from scratch: https://sidsite.com/posts/autodiff/