Mathematics Behind Gradient Descent

iManassa · Published in Geek Culture · Nov 4, 2021 · 9 min read

Understand how gradient descent algorithm works in machine learning.

Photo by Ales Krivec on Unsplash

If you are a machine learning enthusiast, you have probably come across the term gradient descent, or even used it in some of your models. You probably know that it is an optimization algorithm and may have applied it through the sklearn library. But do you really know what is happening under the hood?

This article will discuss the mathematical intuition behind the gradient descent algorithm, its various types and how it is used in machine learning.

Introduction:

Gradient Descent is an optimization algorithm used to find the values of the parameters of a model (linear regression, logistic regression, etc.) that minimize a cost function. In very simple terms, it helps us find the best fit line. How does it do this? Before going into that, we need to know what a few terms mean.

  • Derivatives

To understand what a derivative is, we must first know what a slope is. A slope is defined as the ratio of the vertical change to the horizontal change between any two points on a line (Slope = Δy/Δx). It describes both the direction of the line and how steep it is.

Source: Image by author

Now we know how to find the slope of a line given two points on it. But how do we find the slope at a single point on a curve? There is no second point to measure against. This is where the derivative comes into play. With derivatives, we use a small difference and then have it shrink towards zero!

Let the function be f(x). To find its derivative, we use the slope formula: x changes from x to x + Δx, and y changes from f(x) to f(x + Δx). After simplifying the resulting expression, we let Δx shrink towards zero. This gives us the derivative at a point.

Source: Image by author

For example, let the function f(x) = x²

Then, to find the derivative or slope, we follow the below steps:

  1. f(x) = x² and f(x + Δx) = (x + Δx)² = x² + 2x Δx + (Δx)²
  2. Slope formula is (f(x + Δx) − f(x))/Δx
  3. Substituting the equations in point 1 in the slope equation, the formula of the slope changes to (x² + 2x Δx + (Δx)² − x²)/Δx
  4. After simplifying the equation in point 3, we get 2x + Δx
  5. Then, as Δx shrinks towards 0, slope = 2x
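
The shrinking-Δx steps above are easy to check numerically. The sketch below is a minimal, plain-Python illustration (the sample point x = 3 is an arbitrary choice):

```python
# A minimal sketch of the shrinking-Δx idea for f(x) = x**2.
def f(x):
    return x ** 2

x = 3.0
for dx in [1.0, 0.1, 0.01, 0.001]:
    slope = (f(x + dx) - f(x)) / dx   # the slope formula from step 2
    print(f"Δx = {dx}: slope ≈ {slope}")
# The printed slope approaches 2*x = 6.0 as Δx shrinks, matching the derivative 2x.
```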

This way, we can find the derivative, or slope, at any point on a curve. There are also standard rules (such as the power, product and chain rules) that make finding derivatives quicker.

  • Partial Derivatives:

In mathematics, the partial derivative of a function of several variables is its derivative with respect to one of those variables, with the other variables held constant. In simple words, if a function has two or more variables, we differentiate with respect to one variable while treating the rest as constants.

For example, if the function is f(x, y) = x² + y², then the partial derivative of f with respect to x is 2x + 0 (since y is treated as a constant), and the partial derivative of f with respect to y is 0 + 2y (since x is treated as a constant). That is,

If f(x, y) = x² + y², then f’ₓ = 2x and f’ᵧ = 2y

The partial derivative is also written using the symbol ∂ (for example, ∂f/∂x).
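
The same "hold the other variable constant" idea can be verified numerically. Here is a minimal sketch at an arbitrary sample point (x = 2, y = 3), chosen purely for illustration:

```python
# Numerical check of the partial derivatives of f(x, y) = x**2 + y**2.
def f(x, y):
    return x ** 2 + y ** 2

x, y, h = 2.0, 3.0, 1e-6
df_dx = (f(x + h, y) - f(x, y)) / h   # vary x, hold y constant -> ≈ 2x = 4
df_dy = (f(x, y + h) - f(x, y)) / h   # vary y, hold x constant -> ≈ 2y = 6
print(df_dx, df_dy)
```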

  • Cost function

Let me explain this with the help of linear regression. For simplicity, let us take the problem as follows: we have to predict the mark of a student given the number of hours they studied. When we plot this for a fictional dataset, we get a scatter plot that looks something like this:

Source: Image by author

We can draw many lines through these data points, but we need the single line that fits them best. To find it, we introduce a term called the cost function, which is nothing but an error function. For each data point, we subtract the predicted value from the actual value and square the difference (so that negative and positive errors do not cancel each other out). These squared differences are then averaged, and we get the most commonly used cost function, the mean squared error (MSE). Its formula is given below:

Mean Squared Error: MSE = (1/n) · Σ (yᵢ − ŷᵢ)²

Here yᵢ represents the actual value and ŷᵢ the predicted value. The predicted value is calculated by the formula ŷ = mx + b, where m is the slope of the line and b is the intercept. So the mean squared error now becomes:

Mean Squared Error: MSE = (1/n) · Σ (yᵢ − (m·xᵢ + b))²

Once we find the line that gives the least cost, that will be the best fit line for this problem.

Source: Image by author
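
To make the cost function concrete, here is a minimal sketch that scores two candidate lines on a small made-up hours/marks dataset (the numbers are purely illustrative, not taken from any real data):

```python
import numpy as np

# MSE cost of a candidate line y = m*x + b on made-up hours/marks data.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
marks = np.array([35.0, 48.0, 62.0, 71.0, 85.0])

def mse(m, b, x, y):
    y_pred = m * x + b                   # predicted mark for each student
    return np.mean((y - y_pred) ** 2)    # average of the squared errors

print(mse(10.0, 25.0, hours, marks))     # cost of one candidate line
print(mse(12.0, 23.0, hours, marks))     # a different line gives a different cost
```

The line with the lower MSE is the better fit of the two; gradient descent is the procedure that searches for the m and b with the lowest MSE of all.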

How does Gradient Descent work?

Now that we know what derivatives, partial derivatives and cost functions are, we can learn about the concept of gradient descent.

Gradient descent is a first-order, iterative optimization algorithm for finding a local or global minimum of a differentiable function. We know that we have to find the line that gives us the least cost, and this is done using gradient descent. But how does this happen?

Source: Gradient Descent

For this, we first make a 3D plot of the coefficients, slope (m) and intercept (b), against the MSE (cost), as seen above. Since the gradient descent algorithm is an iterative approach, we start with random values of m and b and then change them so that the cost function becomes smaller and smaller until we reach a local/global minimum.

  1. First, let m = 0 and b = 0. For these values, we get a high MSE (according to the plot above, approximately 1000).
  2. Then we take a step (a step here refers to a change in the coefficients m and b) that changes m and b so that the MSE reduces (to approximately 900). This is done iteratively until we reach the local/global minimum.
  3. Once we reach the local/global minimum, we use these values of m and b in our prediction function (y = mx + b).

Now we will see how the values of m and b are changed so that the MSE decreases at each step. For this, let us plot the graph of the MSE against the intercept (b).

Source: Image by author

In the graph above, the blue star is the starting point (value) of the MSE and the red dot is the local/global minimum of the function. We have to get from the blue star to the red dot, and one option is to take fixed-size steps. But as the graph shows, this approach does not always reach the minimum, so gradient descent may never converge. Instead, we can reduce the size of the step each time, as shown in the graph below.

Source: Image by author

We see from the graph above that the step size keeps reducing each time we take a step, and so the gradient descent finally converges to the local/global minimum. But now the question is: how do we take steps of reducing size?

For this, we have to find the slope (the tangent) at each point so that we know which direction to go in. This slope is nothing but the derivative at that particular point. Since there are two coefficients, slope (m) and intercept (b), we have to find the partial derivatives of the MSE (cost) with respect to both m and b. We have seen how to find partial derivatives in the sections above, so the end result is:

Partial derivatives of MSE:
∂MSE/∂m = −(2/n) · Σ xᵢ · (yᵢ − (m·xᵢ + b))
∂MSE/∂b = −(2/n) · Σ (yᵢ − (m·xᵢ + b))
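
Written as code, the two partial derivatives can be bundled into one gradient function. This is only a sketch, using the same m, b, x, y names as the formulas above:

```python
import numpy as np

# Gradient of the MSE with respect to the slope m and intercept b.
def gradients(m, b, x, y):
    n = len(x)
    y_pred = m * x + b
    dm = -(2 / n) * np.sum(x * (y - y_pred))   # ∂MSE/∂m
    db = -(2 / n) * np.sum(y - y_pred)         # ∂MSE/∂b
    return dm, db
```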

Along with this, we have another parameter, the learning rate, which decides the size of the step we take.

Learning rate (α):

The learning rate (α) is a hyperparameter that determines how large the steps are, that is, how much the coefficients can change at each update. Typically, the user sets the learning rate for the gradient descent algorithm, and it stays constant for the entire run. An optimal value of the learning rate leads to faster convergence within fewer steps. If the rate is much higher than the optimal value, the algorithm struggles to converge to the local/global minimum and the cost tends to increase over time. If the value is too low, the algorithm still converges to the minimum, but it takes a lot of time. So it is necessary to find a good value for the learning rate. The effect of learning rates on convergence is visually represented below.

Source: Effect of learning rates on convergence
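
This effect is easy to reproduce on a toy one-dimensional cost such as f(b) = b², whose derivative is 2b (an illustrative stand-in, not the MSE itself):

```python
# Toy demonstration of the learning rate on the 1-D cost f(b) = b**2.
def run(alpha, b=10.0, steps=5):
    path = [b]
    for _ in range(steps):
        b = b - alpha * 2 * b      # update rule: b = b - α * df/db
        path.append(round(b, 3))
    return path

print("too small (α = 0.01):", run(0.01))   # creeps towards 0 very slowly
print("reasonable (α = 0.3):", run(0.3))    # converges quickly
print("too large (α = 1.1):", run(1.1))     # overshoots and diverges
```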

Once we calculate the derivatives and decide on the learning rate, the next step is to use these two values to update the coefficients m and b. This is done with the following formulae:

  • m = m − learning rate * ∂MSE/∂m = m − α * ∂MSE/∂m
  • b = b − learning rate * ∂MSE/∂b = b − α * ∂MSE/∂b

Overview

We have covered all the necessary concepts related to how gradient descent works. So let us see an overview of the steps involved in the gradient descent algorithm.

STEP 1: Take some random values for the coefficients m and b and calculate the MSE (cost function).

STEP 2: Calculate the partial derivatives of MSE with respect to m and b.

STEP 3: Set a value for the learning rate, and update m and b using the following formulae.

  • m = m − learning rate * ∂MSE/∂m = m − α * ∂MSE/∂m
  • b = b − learning rate * ∂MSE/∂b = b − α * ∂MSE/∂b

STEP 4: Use these values of m and b to calculate the new MSE.

STEP 5: Repeat steps 2, 3 and 4 until the changes in m and b do not significantly reduce the MSE (cost).
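
Putting the five steps together, here is a compact sketch of the whole loop on the same made-up hours/marks data used earlier (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

# A compact sketch of STEPS 1-5 for simple linear regression.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
marks = np.array([35.0, 48.0, 62.0, 71.0, 85.0])
n = len(hours)

m, b = 0.0, 0.0      # STEP 1: starting coefficients (here simply zero)
alpha = 0.05         # STEP 3: learning rate, chosen by hand

for _ in range(1000):                                   # STEP 5: repeat
    y_pred = m * hours + b
    dm = -(2 / n) * np.sum(hours * (marks - y_pred))    # STEP 2: ∂MSE/∂m
    db = -(2 / n) * np.sum(marks - y_pred)              #         ∂MSE/∂b
    m, b = m - alpha * dm, b - alpha * db               # STEP 3: update m and b
    # STEP 4: the next pass recomputes the predictions and hence the new cost

mse = np.mean((marks - (m * hours + b)) ** 2)
print(m, b, mse)     # m and b now describe (approximately) the best fit line
```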

Types of Gradient Descent

There are three variants of the gradient descent algorithm described above. Let me tell you the major differences between them.

  • Batch Gradient Descent:

Let there be ‘n’ observations in a dataset. Using all these ‘n’ observations to update the coefficient values m and b is called batch gradient descent. It requires the entire dataset to be available in memory to the algorithm.

  • Stochastic Gradient Descent (SGD):

SGD, in contrast, updates the values of m and b for each observation in the dataset. These frequent updates of the coefficients provide a good rate of improvement. However, over a full pass of the data they are computationally more expensive than batch gradient descent, and the path to the minimum is noisier.

  • Mini-Batch Gradient Descent:

Mini-batch gradient descent is a combination of SGD and batch gradient descent. It splits the dataset into small batches, and the coefficients are updated at the end of each batch.
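
To make the difference concrete, here is a sketch comparing how many observations feed each coefficient update in the three variants. The data, learning rate, and batch size are illustrative assumptions, not prescriptions:

```python
import numpy as np

# Illustrative data: y is roughly 3*x + 5 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(0, 1, 100)

def grad(m, b, xs, ys):
    pred = m * xs + b
    return -2 * np.mean(xs * (ys - pred)), -2 * np.mean(ys - pred)

alpha, m, b = 0.01, 0.0, 0.0

# Batch: one update per pass, computed from all 100 observations at once.
dm, db = grad(m, b, x, y)
m, b = m - alpha * dm, b - alpha * db

# Stochastic (SGD): one update per single observation.
for i in rng.permutation(len(x)):
    dm, db = grad(m, b, x[i:i+1], y[i:i+1])
    m, b = m - alpha * dm, b - alpha * db

# Mini-batch: one update per small batch (here 20 observations).
for start in range(0, len(x), 20):
    dm, db = grad(m, b, x[start:start+20], y[start:start+20])
    m, b = m - alpha * dm, b - alpha * db
```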

Conclusion

By the end of this article, you will have learnt about the mathematical intuition behind the gradient descent algorithm, how it works in machine learning, and its three variants.

I hope it has increased your understanding of gradient descent and given you a clear picture of what happens under the hood.

Thank you for reading and if you found the article useful, please do share it!
