Mathematics Behind Gradient Descent, Simply Explained
So far, we have discussed linear regression and gradient descent in previous articles, where we got a simple overview of the concepts and a practical tutorial to understand how they work. In this article, we will look at the mathematics behind gradient descent and how an “optimizer” finds the global minimum point. If the term “optimizer” is new to you, it is simply the routine that searches for the global minimum of the cost function, which in linear regression corresponds to the coefficients of the best-fit line. By the way, similar concepts are used in deep learning algorithms. Let’s take a look at how it all works.
When using mean squared error (MSE) to determine the coefficients of the best-fit line, our main task is to find the point where the MSE is minimum. In other words, the global minimum is the point where the slope of the cost curve is equal to zero. If we can take the derivative of the MSE (our cost function) with respect to the best-fit line coefficients (slope and intercept), we are almost done.
The equation of the fit line is y = mx + b (where m is the slope of the line and b is the intercept on the y-axis). Back to the derivative rules, we can apply them to our MSE equation. First, let’s express the term (ŷ − y) in the cost function J as the error E to simplify the equation:

J(m, b) = (1/n) Σᵢ (ŷᵢ − yᵢ)² = (1/n) Σᵢ Eᵢ², where Eᵢ = ŷᵢ − yᵢ = (m·xᵢ + b) − yᵢ
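As a quick illustration, here is a minimal NumPy sketch of this cost function; the helper names predict and mse are my own, not from the original article.

```python
import numpy as np

def predict(x, m, b):
    """Predicted y values of the fit line y = m*x + b."""
    return m * x + b

def mse(x, y, m, b):
    """Mean squared error J(m, b) = (1/n) * sum((y_hat - y)**2)."""
    error = predict(x, m, b) - y  # E_i = y_hat_i - y_i
    return np.mean(error ** 2)
```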
Using the chain rule, we can find the derivative of J with respect to m as follows:

∂J/∂m = (1/n) Σᵢ ∂(Eᵢ²)/∂Eᵢ · ∂Eᵢ/∂m
The two components of the equation can be calculated as follows:

∂(Eᵢ²)/∂Eᵢ = 2Eᵢ and ∂Eᵢ/∂m = xᵢ (since Eᵢ = m·xᵢ + b − yᵢ)
Back to the derivative of the cost function with respect to the slope (m): substituting the two components, we find that it equals a constant multiplied by the errors (the differences between the predicted y values and the actual data points), each weighted by its xᵢ:

∂J/∂m = (2/n) Σᵢ (ŷᵢ − yᵢ)·xᵢ
This equation can be read as how much the cost function changes as the slope changes, so if we change the slope gradually, stepping against this gradient until the error stops decreasing, we reach the global minimum point.
m = m + ∆m
b = b + ∆b
Similarly, the derivative of the cost function with respect to the intercept (b) can be found the same way; the only difference is that ∂Eᵢ/∂b = 1, so:

∂J/∂b = (2/n) Σᵢ (ŷᵢ − yᵢ)
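To make the two gradient formulas concrete, here is a short sketch continuing the hypothetical predict helper from the block above (it assumes numpy is already imported as np):

```python
def gradients(x, y, m, b):
    """Return (dJ/dm, dJ/db) for the MSE cost on data (x, y)."""
    n = len(x)
    error = predict(x, m, b) - y           # E_i = y_hat_i - y_i
    dj_dm = (2.0 / n) * np.sum(error * x)  # dJ/dm = (2/n) * sum(E_i * x_i)
    dj_db = (2.0 / n) * np.sum(error)      # dJ/db = (2/n) * sum(E_i)
    return dj_dm, dj_db
```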
The amount by which the variables (m and b) move is scaled by the “learning rate”, a small value chosen during the fitting process: not so small that convergence towards the global minimum becomes slow, and not so large that the steps overshoot and the global minimum point is never reached.
The final update equations that determine the best-fit line coefficients are as follows, where λ is the learning rate:

m = m − λ · ∂J/∂m = m − λ · (2/n) Σᵢ (ŷᵢ − yᵢ)·xᵢ
b = b − λ · ∂J/∂b = b − λ · (2/n) Σᵢ (ŷᵢ − yᵢ)
Now we can define a learning rate (λ), move slowly towards the minimum-error point, and get our best-fit line.
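Putting the pieces together, a minimal gradient descent loop might look like the sketch below, reusing the gradients helper from above; the starting values, learning rate, and iteration count are illustrative assumptions, not values from the article.

```python
def fit(x, y, lr=0.01, n_iters=10_000):
    """Fit y = m*x + b by gradient descent on the MSE cost."""
    m, b = 0.0, 0.0  # arbitrary starting point
    for _ in range(n_iters):
        dj_dm, dj_db = gradients(x, y, m, b)
        m -= lr * dj_dm  # m = m + delta_m, with delta_m = -lr * dJ/dm
        b -= lr * dj_db  # b = b + delta_b, with delta_b = -lr * dJ/db
    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
m, b = fit(x, y)  # converges towards the least-squares solution
```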
A final note that must be mentioned here: the value of the slope can also be determined analytically, by setting the derivatives to zero and solving, which gives the familiar least-squares equation:

m = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
and hence the intercept can be found as follows:

b = ȳ − m·x̄
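For comparison, here is a sketch of that closed-form solution; it can be used to sanity-check the result of the hypothetical fit function above (for the sample data, both should land near m = 1.94, b = 0.15).

```python
def fit_closed_form(x, y):
    """Least-squares coefficients m and b computed analytically."""
    x_mean, y_mean = np.mean(x), np.mean(y)
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b = y_mean - m * x_mean  # b = y_bar - m * x_bar
    return m, b
```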
But the code behind linear regression libraries, or any optimizer in deep learning, typically uses a learning rate and moves step by step until it reaches the global minimum point.