Understanding The Mathematical Concept Behind Gradient Descent
In machine learning, the basic routine is as follows: a data-set is fed to a machine learning program, the program derives a function that models that data-set, and the program can then start to make predictions based on its derived function.
It turns out that one of the major goals of the machine learning program while trying to come up with that function is to find a way to optimize that function to model the given data-set as best as possible.
There are several known methods that can be used during the optimization process. One such method, which has proven to be very popular, is Gradient Descent.
Before explaining the concept of Gradient Descent and the mathematical formulas required to implement it, I would like to talk a little about some related concepts and terminology that many readers of this post may already be familiar with.
Understanding What Optimization Is
Let’s say we have a data-set that consists of a set of inputs (x) and a corresponding set of outputs (y) as shown below. The goal of a machine learning program is to create an optimized function (f) that represents the relationship between the inputs and the corresponding outputs in our data-set, so that when an input from the data-set is provided to the function, the function can predict a corresponding output (z).
If that function is found, then our machine learning program can always predict output values (z) for any input values (x), even if they did not exist in our original data-set. Mathematically we could define that function as follows:

z = f(x)

Where f is the function, x is the input and z is the predicted output from our function.
So the optimization process, as we mentioned earlier, always tries to ensure that the predicted value (z) matches the actual output value (y) from the data-set.
If there is a difference between the actual output(y) and the predicted output(z) then that difference is referred to as the error(e) in the predicted value.
From the table above, we can see that the data-set is a linear data-set; by linear, I mean that as the input values increase, the output values increase as well. If we created a plot of the (x, y) values from the given data-set, then the z values would be represented by the line that passes through the points, as illustrated in the graph below.
Consequently, z can be seen as that function in our machine learning program and can be represented mathematically as follows:

z = f(x) = mx + b

In the equation above, m is the slope of the line and b is the intercept of that same line on the y axis of the graph. Both m and b are the parameters of the linear function f(x) that defines the relationship between the input and predicted output values of the data-set.
So optimizing that function involves iteratively changing m and b so that the predicted values (z) match the actual values (y) as closely as possible, or so that the error (e) is close or equal to 0, or so that the line passing through the points on the graph above fits them properly.
Considering our linear function as defined by our linear data-set earlier:

z = mx + b

And the error being:

e = z − y

We can also call the function (z − y), which is equal to the error (e), our cost function (j), and usually the cost function is written as the square of the error (e):

j = e² = (z − y)²
The reason why we square the difference (e) is that e can take a negative value; squaring it always produces a positive value.
Now you should note that the cost function as stated above only refers to one (z, y) pair from the data-set, and usually our machine learning function works with several pairs of (z, y). So to find the cost function (j) for the whole data-set, we need to take the average value of the cost function over all the (z, y) pairs derived from the data-set, which is referred to as the Mean Squared Error.

The cost function for the whole data-set is therefore defined as:

j = (1/n) Σ (z − y)²

where n is the number of (z, y) pairs and the sum runs over all pairs in the data-set.
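To make this concrete, here is a small Python sketch of the Mean Squared Error. The data-set below is a hypothetical linear one (y = 2x), standing in for the table above:

```python
# Hypothetical linear data-set: y = 2x (slope m = 2, intercept b = 0).
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

def predict(m, b, xi):
    # z = m*x + b, the predicted output for one input.
    return m * xi + b

def cost(m, b, x, y):
    # Mean Squared Error: average of (z - y)^2 over the whole data-set.
    n = len(x)
    return sum((predict(m, b, xi) - yi) ** 2 for xi, yi in zip(x, y)) / n

print(cost(2, 0, x, y))  # a perfect fit gives a cost of 0
print(cost(1, 0, x, y))  # an imperfect slope gives a positive cost
```

Note that the cost is 0 only when every prediction matches its actual output exactly.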
As we said earlier, optimization tries to get the value of the error (e) close to 0; likewise, we can say optimization tries to get the value of the cost function close to 0.
And to use known terms, we refer to this optimization process as : Minimizing The Cost Function.
How Do We Minimize The Cost Function (Gradient Descent Explanation)
In the meantime, and especially for simplicity’s sake, let’s assume that our linear function as derived by our machine learning program only requires one parameter (m), the slope (meaning our intercept b = 0).

Then our function would be re-written as follows:

z = f(x) = mx
For this case, optimization would involve altering the slope (m) of our linear graph so that our cost function (j) is minimized, or in other words, so that the line fits the (x, y) data-set plotted on the graph as well as possible.
Obviously this would turn out to be an iterative process, and could take a long time to complete, especially if we had a lot of data in our data-set.
Indeed one question that comes to mind when trying to alter the value of the parameter(slope) is :
How do we know what new value we should assign to the parameter?
The answer to this question was derived after it was realized that if you started by setting the slope (m) to, say, 0 and continuously increased it, you could build up a data-set of (m, j) pairs that, when plotted on a graph with the value of our cost function (j) on the y axis and m on the x axis, would look like the figure below:
Looking at this plot, we can see that j is at its minimum value at point a, where the curve bottoms out.
And we could mathematically define the computation in each iterative optimization step as follows:

m = m − Δm  (Equation-7)

Then we continue computing until j reaches its minimum value at point a on the graph.
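The sweep described above can be sketched in Python. Again the data-set is a hypothetical one (y = 2x): sweeping m upwards from 0 and recording each (m, j) pair locates the minimum of the bowl-shaped curve:

```python
# Sweep candidate slopes m and record the cost j for each, reproducing
# the bowl-shaped (m, j) curve described above. Hypothetical data: y = 2x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

def cost(m):
    # Mean Squared Error for a given slope m (intercept b = 0).
    n = len(x)
    return sum((m * xi - yi) ** 2 for xi, yi in zip(x, y)) / n

curve = [(m / 10, cost(m / 10)) for m in range(0, 41)]  # m from 0.0 to 4.0
best_m, best_j = min(curve, key=lambda p: p[1])
print(best_m, best_j)  # the minimum of the curve is at m = 2.0, where j = 0
```

Of course, brute-force sweeping like this is exactly what gradient descent lets us avoid.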
Solving the problem of knowing what the value of (Δm) should be is where the almighty Gradient Descent comes into play.
Gradient Descent suggests that, looking at the graph of (m, j), if you took the gradient (the slope of a line tangent to the curve at a specified point) at a point on the curve, say at point d, then setting the value of (Δm) in Equation-7 above to the gradient (g) at that point (d) would take j closer to its minimum.
To support the statement above further: we know that the gradient (g) simply means the slope of the curve at that point. Substituting the value of g for (Δm) in Equation-7:

m = m − g

would result in increasing the value of m, since the value of g is negative at that point (d) and subtracting a negative quantity adds to m. From the diagram above, setting m to its new value would bring the new value of j closer to its minimum.
Furthermore, if we were to start at a point, say e, on the graph as shown below, the gradient (g) of the curve at that point would be positive, which would result in decreasing the value of m, and setting m to its new value would likewise bring the new value of j closer to its minimum.
Knowing what values we should assign to (Δm) is not the only problem that Gradient Descent solves; it also helps to speed up the iterative process in an optimized and precise way.
To explain this further, let’s say we decided to set a fixed value for (Δm). If we chose a large value, we could move j towards its minimum at a faster rate but step past the minimum; if we chose a small value, we could move j towards its minimum at a very slow rate, and could still step past the minimum. In either case we might end up in a never-ending loop of trying to get j to its minimum.
But by using Gradient Descent to compute Δm, we notice that at points further away from the minimum, say at point d, we would have larger values for Δm (because the slope at that point is very steep), which gets j closer to its minimum faster (but never past it). Then, as j gets closer to its minimum, say at point b, Δm would take smaller values (because the slope at that point is less steep). Once j reaches its minimum, Δm becomes 0, because at point a the slope is 0; at that point we can stop executing gradient descent. So we would never have a case where we jump past the minimum value whenever a new value of Δm is derived.
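This shrinking-step behaviour can be sketched in a few lines of Python, assuming the gradient formula 2(z − y)x that is derived later in the post, a small manually chosen scaling factor, and a hypothetical data-set y = 2x. The printed step sizes (Δm) get smaller as j approaches its minimum:

```python
# Sketch: the update m <- m - delta_m, with the gradient 2*(z - y)*x
# (averaged over the data-set) scaled by a small factor. Watch the
# magnitude of delta_m shrink as j approaches its minimum.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
scale = 0.01  # small, manually chosen scaling factor (an assumption)
m = 0.0       # start far from the true slope (2.0)
deltas = []

for step in range(5):
    grad = sum(2 * (m * xi - yi) * xi for xi, yi in zip(x, y)) / len(x)
    delta_m = scale * grad
    m = m - delta_m
    deltas.append(abs(delta_m))
    print(step, round(delta_m, 4), round(m, 4))
```

Each successive |Δm| is smaller than the last, and m approaches (but never overshoots) the true slope of 2.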
So we can see from the illustration above how Gradient Descent helps to find the best possible values for our parameters during the optimization process.
Now, going back to the formula for Gradient Descent, recall that it was an iterative process, and each time we tried to alter the slope m we did the following:

m = m − Δm, where Δm = g (the gradient)
When we talk about getting the gradient at any point on the cost function curve, we are simply asking: what is the derivative of j at that point on the curve?
According to calculus, we write the partial derivative of a function f with respect to m as:

∂f/∂m

Since we are talking about the cost function (j), we can say the gradient is the partial derivative of the cost function j with respect to the parameter m:

g = ∂j/∂m

Thus our optimization formula using gradient descent can be written as:

m = m − ∂j/∂m
And using calculus, the partial derivative of the cost function (j) with respect to our parameter (m) in this case can be derived by applying the Power Rule and the Chain Rule.
To explain further, the Chain Rule states that if A is a function of B, and B is a function of C, then the derivative of A with respect to C is equal to the derivative of A with respect to B multiplied by the derivative of B with respect to C, which can be expressed as:

dA/dC = (dA/dB) × (dB/dC)
We can apply this rule to the case where our cost function (j) is a function of the error(e) and e is a function of the slope (m).
Thus, the partial derivative of j with respect to m would be:

∂j/∂m = (∂j/∂e) × (∂e/∂m)
Now, we could argue that the error (e) is a function of m, x, b and y, so why do we take the derivative of e with respect to m alone? That’s because our optimization process only requires changing m to get j to its minimum value, so all other variables such as x, y and b can be treated as constants. This type of differentiation is what we call partial differentiation.
Considering that:

e = z − y = mx − y (with b = 0)

Then using the power rule, the partial derivative of e with respect to m is:

∂e/∂m = x

And the partial derivative of j with respect to e (where j = e²) is:

∂j/∂e = 2e

Then the partial derivative of j with respect to m, using the chain rule, becomes:

∂j/∂m = (∂j/∂e) × (∂e/∂m) = 2e × x = 2(z − y)x
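We can sanity-check this result numerically. The sketch below compares the chain-rule formula 2(z − y)x against a central finite-difference estimate of the derivative of j, for a single hypothetical (x, y) pair:

```python
# Numerical check of the chain-rule result dj/dm = 2*(z - y)*x for a
# single hypothetical (x, y) pair, with z = m*x (intercept b assumed 0).
xi, yi = 3.0, 6.0
m = 1.5

def j(m):
    e = m * xi - yi  # error e = z - y
    return e ** 2    # cost j = e^2

analytic = 2 * (m * xi - yi) * xi          # chain rule: (dj/de) * (de/dm)
h = 1e-6
numeric = (j(m + h) - j(m - h)) / (2 * h)  # central finite difference
print(analytic, numeric)  # the two values agree
```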
If we recall correctly, the partial derivative of j with respect to m plays the role of the change in m (Δm) in Equation-7 above. So by substituting the value of the partial derivative of j with respect to m into that equation, we would have:

m = m − 2(z − y)x
Also, let’s not forget that we initially ignored the intercept (b) when considering the parameters that we could alter to minimize the cost function (j). If we consider this second parameter, then our optimization function would be as follows:

m = m − Δm
b = b − Δb
As usual, just like we derived the change in m (Δm) to be the partial derivative of j with respect to m, we can apply the same rule to the intercept b and say the change in b (Δb) can be derived by computing the partial derivative of j with respect to b.
Our optimization formula would then be:

m = m − ∂j/∂m
b = b − ∂j/∂b
By applying the chain rule to get the partial derivative of j with respect to b:

∂j/∂b = (∂j/∂e) × (∂e/∂b)  (Equation-22)
considering that:

e = z − y = mx + b − y
with m, x and y treated as constants, the partial derivative of e with respect to b using the power rule becomes:

∂e/∂b = 1
Substituting this into Equation-22 above, the partial derivative of j with respect to b would be:

∂j/∂b = 2e × 1 = 2(z − y)
So, a newer version of our optimization function, considering all the parameters of our linear function (m and b), would translate to:

m = m − 2(z − y)x
b = b − 2(z − y)
Finally, there is one more variable I want to introduce that is usually used when implementing gradient descent: the learning rate (α). This is just a number that is manually chosen to scale the change in m or b, which helps nudge our cost function j towards its minimum value.
If we were to add that to our optimization function above, then we would have the following:

m = m − α × 2(z − y)x
b = b − α × 2(z − y)
And from experience, it turns out that we can disregard the factor of 2 in the equations above, since it can be absorbed into the learning rate (α). So, finally, our equation for gradient descent becomes:

m = m − α(z − y)x
b = b − α(z − y)
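Putting it all together, here is a minimal Python sketch of these final update rules for both parameters, averaged over the data-set. The data below is hypothetical (generated by y = 2x + 3), and the values of α and the iteration count are arbitrary choices:

```python
# Minimal gradient-descent fit of both parameters, using the final
# update rules m <- m - alpha*(z - y)*x and b <- b - alpha*(z - y),
# averaged over a hypothetical data-set generated by y = 2x + 3.
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]
alpha = 0.02  # manually chosen learning rate
m, b = 0.0, 0.0

for _ in range(20000):
    n = len(x)
    grad_m = sum((m * xi + b - yi) * xi for xi, yi in zip(x, y)) / n
    grad_b = sum((m * xi + b - yi) for xi, yi in zip(x, y)) / n
    m -= alpha * grad_m
    b -= alpha * grad_b

print(round(m, 3), round(b, 3))  # recovers m close to 2 and b close to 3
```

The loop recovers the slope and intercept that generated the data, which is exactly the optimization behaviour derived above.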
This wraps up our discussion of the mathematical explanation of gradient descent as applied to linear problems. We first gave a brief description of how a machine learning program works, then discussed some basic concepts related to optimizing a machine learning function, followed by a gentle introduction to gradient descent. Finally, we walked through a step-by-step derivation of the mathematical formula for gradient descent using known rules of calculus.
If you find this post useful, don’t forget to clap back and watch out for more interesting posts from Axum Labs