AI and Calculus: The Vanishing Gradient
Intro:
In our Calculus class we have learned about the basic derivatives, integrals, etc. On the 2018 AP Calculus BC exam, we had to write an equation for the tangent line to the internal temperature of a potato at time t minutes…definitely a great use for calculus (laughs). Calculus has many applications, and one of the greatest is in Artificial Intelligence. Let’s take a look at key calculus concepts and apply them to the real world!
AI and Calculus:
Gradient descent is an algorithm that finds the local minimums (or, run in reverse, the local maximums) of a function. Neural Networks (NNs) compute their gradients with backpropagation, which works backward through the layers of the model and uses the chain rule to find the derivative of the error all the way back to the initial layer. The NN’s parameters, its weights and biases, are then updated using the gradient descent algorithm. This algorithm is used in AI/ML, engineering, and industrial fields.
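To picture the chain rule at work, here is a tiny sketch (with made-up derivative values, not from an actual model) of how the error signal travels backward through a stack of layers:

```python
# Toy illustration of backpropagation's chain rule: for layers y = f3(f2(f1(x))),
# the derivative reaching the first layer is the product of each layer's local derivative.
# The numbers below are hypothetical, chosen only to show the multiplication.
local_derivatives = [0.9, 0.7, 0.8]    # assumed derivative of each layer at the current point

gradient_at_first_layer = 1.0
for d in reversed(local_derivatives):  # walk backward through the layers
    gradient_at_first_layer *= d       # chain rule: multiply the derivatives together

print(gradient_at_first_layer)         # 0.504 -> the error signal that updates the first layer
```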
The Vanishing Gradient Problem:
During backpropagation, the gradient can shrink exponentially toward 0 as the derivatives of more and more layers are multiplied together. When that happens, the initial layer’s weights and biases barely change; only the outer layers get meaningful updates. If each layer’s derivative is less than 1, the product keeps shrinking with every additional layer, so the updates reaching the early layers become insignificant. This makes it hard for us to train the model, because the initial layers stop learning, resulting in low validation accuracy even at the best epoch number.
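Here is a quick numerical sketch (with an assumed per-layer derivative, not taken from an actual model) of how fast that product collapses:

```python
# Rough sketch of the vanishing gradient: if every layer's derivative is less than 1,
# the product shrinks exponentially as the network gets deeper.
layer_derivative = 0.25   # assumption: the maximum derivative of a sigmoid activation
for depth in (5, 10, 20, 50):
    print(depth, layer_derivative ** depth)
# 5  -> ~9.8e-04
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13
# 50 -> ~7.9e-31   (essentially 0, so the initial layers stop updating)
```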
Zeno’s Paradox:
Zeno’s Paradox states that
“if a person wants to travel from Point A to Point B, they must first walk half that distance in a finite amount of time. Thereafter, they must walk half the remaining distance again in another finite amount of time. Then, they must walk half of the remaining distance again, and so on. By continually halving the remaining distance, the person will walk an infinite interval of distances and still remain slightly away from their final destination” (Michael Mo)
The vanishing gradient problem is a real-life representation of Zeno’s paradox: both scenarios keep approaching a limit but never actually reach it.
Calculus:
Do not fear calculus! It is a helping hand for us to write the gradient descent algorithm. Let’s now learn the key calculus concepts used in the gradient descent algorithm. We will then explore how to find a solution to that nasty vanishing gradient.
Gradient descent works on functions that are differentiable, and it is guaranteed to find the global minimum when the function is also concave up (convex). The algorithm finds the next point by taking the gradient at the current point, scaling it by a ratio (the learning rate) small enough to cause convergence, and adding or subtracting that value from the current point, with the end goal of maximizing or minimizing the function. That was a lot of words; take a look at the pseudocode below to get a mental picture of how it works!
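The original pseudocode is not reproduced here, so below is a minimal Python sketch of the same loop, written for a single-variable function whose derivative we already know:

```python
# Sketch of gradient descent for a single-variable function (illustrative, not the
# article's original pseudocode).
def gradient_descent(df, start, learning_rate=0.1, steps=100, tolerance=1e-8):
    """Minimize a function given its derivative df, starting from `start`."""
    x = start
    for _ in range(steps):
        step = learning_rate * df(x)   # scale the gradient by a small ratio
        x = x - step                   # move against the gradient to go downhill
        if abs(step) < tolerance:      # stop once the updates become insignificant
            break
    return x

# Example: minimize f(x) = (x - 3)^2, whose derivative is 2(x - 3); the minimum is at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), start=0.0))   # prints a value very close to 3.0
```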
Gradient
The gradient of a multivariate function is a vector of its partial derivatives. We use partial derivatives because each one gives us the derivative of the function at a point in one certain direction (along a specific axis).
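For a two-variable function it looks like this (the example function is illustrative, not one from the article):

```latex
\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \; \frac{\partial f}{\partial y} \right),
\qquad \text{e.g. } f(x, y) = x^2 + y^2 \;\Rightarrow\; \nabla f(x, y) = (2x, \; 2y).
```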
Differentiable
A function is differentiable at a point if it is continuous there and the derivative from the left equals the derivative from the right. For a function to be continuous at a point, the limit of the function from the left and from the right must equal the function’s value at that point.
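Written as a limit (the standard definition, included here for reference):

```latex
f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}
\quad \text{exists only if the left-hand and right-hand limits agree.}
```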
Concavity
To check whether our function is concave up, we take the 2nd derivative of the function and see where it is greater than 0. The function is concave up on the interval where the 2nd derivative stays positive.
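As a quick illustrative example (the article’s original function is not shown here):

```latex
f(x) = x^3 - 3x \;\Rightarrow\; f''(x) = 6x, \qquad
f''(x) > 0 \text{ for } x > 0, \text{ so } f \text{ is concave up on } (0, \infty).
```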
Partial Derivative
The Gradient Descent algorithm uses partial derivatives in its equation, so let’s take a look at what a partial derivative is. With multivariate functions, a partial derivative finds the rate of change with respect to only one of the independent variables, holding the others constant.
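For example (again an illustrative function, not one from the article), we differentiate with respect to one variable while treating the other as a constant:

```latex
f(x, y) = x^2 y \;\Rightarrow\;
\frac{\partial f}{\partial x} = 2xy, \qquad
\frac{\partial f}{\partial y} = x^2.
```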
The Solution to the Vanishing Gradient Problem:
Now that we have learned how the Gradient Descent algorithm uses these calculus concepts, we still need to discuss a solution to the vanishing gradient. One solution we are going to focus on is the Rectified Linear Unit (ReLU).
ReLU is a non-saturating activation function used in AI models; it outputs values in the range 0 to positive infinity. Its derivative is either 0 or 1, and the value of 1 means the partial derivative passes through without being shrunk any further. This keeps the derivative values larger overall and helps ensure that the gradient does not vanish as we multiply the many layers together. Unfortunately, when a neuron keeps outputting 0, that neuron is dead: the dying ReLU problem. There are many other solutions besides ReLU that help with the vanishing gradient problem, such as Leaky ReLU (LReLU), Residual Networks, and Batch Normalization.
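Here is a small sketch (illustrative, assuming NumPy) of ReLU and its derivative, next to the sigmoid derivative that causes the shrinking in the first place:

```python
import numpy as np

def relu(x):
    # ReLU: 0 for negative inputs, x for positive inputs (range 0 to +infinity)
    return np.maximum(0.0, x)

def relu_derivative(x):
    # Derivative is 0 or 1; the 1s let gradients pass through without shrinking
    return (x > 0).astype(float)

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)              # never larger than 0.25, so it shrinks the gradient

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x))                        # [0.  0.  0.5 2. ]
print(relu_derivative(x))             # [0. 0. 1. 1.]
print(sigmoid_derivative(x))          # all values <= 0.25
```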
We will now look at some code that I used to build a model using ReLU. The following code is a Convolutional Neural Network (CNN) model, which is more complex than the NN we were discussing earlier. Unlike a NN with multiple hidden layers, the CNN works with 3-dimensional inputs described by width, height, and depth. ReLU is used in the hidden layers of the code and helps solve the vanishing gradient problem. You can see other activation functions also being used, like SoftMax, which is used in the output layer to output the distribution of probability over the labeled classes.
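The original code listing is not reproduced here, so the snippet below is a minimal sketch of what such a model can look like, assuming TensorFlow/Keras, 32x32 RGB street images, and 10 labeled classes:

```python
# Illustrative CNN sketch with ReLU hidden layers and a SoftMax output layer.
# Assumptions (not from the article): TensorFlow/Keras, 32x32x3 inputs, 10 classes.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),       # ReLU hidden layer
    layers.Dense(10, activation="softmax"),    # SoftMax output: probabilities over classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```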
Here is the output after feeding some street-image data into the model above: we find a validation accuracy of around 85.17%. Adjusting the hidden layers and the types of activation functions used will help increase the final validation accuracy at the best epoch number.
Conclusion:
In this article, we did an in-depth analysis of Gradient Descent and the calculus involved in the algorithm. We also discussed how it is used in NNs/CNNs and different solutions for increasing validation accuracy at the best epoch value.
I wrote this article after creating various AI models and noticing something hindering the validation accuracy. I researched what could cause the reduced values and came across the Vanishing Gradient Problem. The vanishing gradient, and AI/ML in general, is heavy in math, which was intimidating at first, but after learning a few basic calculus concepts (with the help of my teacher Mr. Wernau), you can start to see how it all relates together. It is pretty fascinating that what you learned in class is actively being used in the real world! :)
For Further Learning: