Linear Regression : OLS vs Gradient Descent

Kunal Chowdhury
6 min read · May 1, 2022


From the time I first learned Linear Regression, I have always wondered about the various methods for arriving at the best fit line. I have come across many articles that explain each of these methods in detail, but not the intuition behind them; maybe I'm googling wrong.

So, after many attempts to untangle this mystery, I have finally come to understand the intuition behind these methods. I have tried to keep it as simple as possible, without involving much of the math. Let’s get started!!!

Objective of Linear Regression

Let us consider a data set with 5 points: [(1,2), (2,3), (4,7), (5,5), (7,11)]. When plotted on a graph, these values would look something like this (without the line, of course).

Fig 1 : Plot of X vs Y

Now, our objective is to find a line y = mx + b (read b = c in Fig. 2) that describes the linear relationship between X and Y to a reasonable accuracy. However, we have two unknowns here: 'm' and 'b'.

Note that in algebra we write the equation of a line as y = mx + b, but in statistics it is usually written as h(x) = theta0 + theta1*x, where h(x) is the predicted value as a function of x.
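To make the notation concrete, here is a minimal Python sketch (variable and function names are my own, not from the article) that defines the example data set and the hypothesis function h(x):

```python
# The five example points used throughout this article.
data = [(1, 2), (2, 3), (4, 7), (5, 5), (7, 11)]
xs = [x for x, _ in data]
ys = [y for _, y in data]

def h(x, theta0, theta1):
    """Hypothesis: h(x) = theta0 + theta1 * x
    (same as y = mx + b, with b = theta0 and m = theta1)."""
    return theta0 + theta1 * x
```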

Fig 2 : Plot showing error in prediction

Taking a look at the line above, we can see that such a line cannot exactly predict the values of Y; rather, there will be a certain error for each point (marked in yellow). This is called the 'residual error', or simply the 'residual'. For example, if we input x = 4 into the equation of the line, we might get y = 5.8, whereas the data set tells us the correct y-value is 7.

Ordinary Least Squares Method (OLS)

Concluding from the above, we can say that if we find a line y = mx + b such that the total residual error over all points in the data set is as small as possible, that line will best represent the linear relationship between the variables. However, points can lie on either side of the line, making the residual positive for some points and negative for others, so the errors could cancel each other out. To eliminate this possibility, we square the difference between the observed and predicted values.

Fig 3 : Quadratic Polynomial showing the ‘Residual Sum of Squares’

This is how we arrive at the term 'sum of squares', and since minimizing these squared residuals gives us the 'best fit' line, the method is called the 'Least Squares Method'.

There are two ways to apply the least squares method.

  1. The first way is to take the partial derivatives of the quadratic expression in Fig. 3 with respect to 'm' and then with respect to 'b' (using the chain rule), set them equal to zero, and solve for 'm' and 'b'.
  2. The second approach is to directly apply the least squares formulas given below.
Fig 4 : Direct approach to Least Square
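Fig 4 is not reproduced here, but the direct least squares formulas it refers to are conventionally written as follows, where n is the number of data points:

```latex
m = \frac{n\sum x_i y_i \;-\; \sum x_i \sum y_i}{n\sum x_i^2 \;-\; \left(\sum x_i\right)^2},
\qquad
b = \frac{\sum y_i \;-\; m\sum x_i}{n}
```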

The links in the bibliography at the end of this article explain the above two approaches in much more detail.
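As a quick sanity check, here is a minimal sketch (plain Python, no libraries) that applies those direct formulas to the five example points; it recovers roughly m ≈ 1.39 and b ≈ 0.33:

```python
data = [(1, 2), (2, 3), (4, 7), (5, 5), (7, 11)]
n = len(data)
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_xy = sum(x * y for x, y in data)
sum_x2 = sum(x * x for x, _ in data)

# Direct (closed-form) least squares estimates of slope and intercept.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n
print(round(m, 2), round(b, 2))  # prints approximately 1.39 and 0.33
```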

Why Gradient Descent?

Those who know a little about the Gradient Descent approach (referred to as GD from here on) might wonder: if we can get the slope and intercept so easily from the provided data set, why bother with GD at all? That is a fair question, because from the example we saw, the line y = 1.39x + 0.33 fits every point in the data set reasonably well. Well, there are a couple of reasons.

  1. Overfitting : Overfitting happens when a model tries to fit the training values of 'y' too closely, noise included, and therefore produces large errors on unseen data. While the method of least squares often gives optimal estimates of the unknown parameters, it is very sensitive to unusual data points in the data used to fit the model. This can be mitigated using regularized variants such as LASSO or Ridge regression.
  2. Outliers : This is related to the previous point. An outlier will also be accommodated by the OLS equation, dragging the line towards it and diluting the contribution of the many valid points.
  3. Normality assumption : One of the assumptions of Linear Regression is that the residuals are normally distributed. With the OLS method, the test statistics can give unreliable results if this assumption does not hold.
  4. Computational cost : The closed-form OLS solution requires inverting an n×n matrix, where 'n' is the number of features, which costs on the order of O(n³) and becomes heavy as the number of features grows (a small sketch of this matrix computation follows this list).
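To show what that matrix computation looks like in practice, here is a minimal NumPy sketch of the normal equation (XᵀX)θ = Xᵀy; the variable names are my own:

```python
import numpy as np

# Design matrix: a column of ones (for the intercept) plus the feature column.
x = np.array([1, 2, 4, 5, 7], dtype=float)
y = np.array([2, 3, 7, 5, 11], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) theta = X^T y for theta = [intercept, slope].
# Forming and solving this system is the step whose cost grows roughly
# cubically with the number of features, as noted above.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [0.33, 1.39]
```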

Thus Gradient Descent

Because of the anomalies listed above, the 'best fit' line produced by the OLS method can become unreliable. The Gradient Descent approach avoids most of these shortcomings.

In the GD approach, we take the same partial derivatives with respect to 'm' and 'b', but instead of setting them equal to zero, we update 'm' and 'b' iteratively, using a learning rate, until we converge on the best possible values.

Fig 5 : Cost function of Gradient Descent

Here, we define the sum of squared residuals as the 'cost function', so our objective is to minimize the cost function.
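Fig 5 is not reproduced here, but the cost function it refers to is conventionally written as the mean of the squared residuals (some texts drop the 1/n factor or use 1/2n; the minimizing 'm' and 'b' are the same):

```latex
J(m, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2
```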

First, we take the partial derivative with respect to 'm' and then with respect to 'b'.

Fig 6 : Partial Derivatives w.r.t slope and intercept

Now let's calculate the gradient of the error with respect to both 'm' and 'b':

Fig 7 : Error terms of the partial derivatives
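Figs 6 and 7 are likewise not reproduced here; for the cost function written above, the partial derivatives work out to the following (the factor of 2 is sometimes absorbed into the learning rate):

```latex
\frac{\partial J}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \bigl( y_i - (m x_i + b) \bigr),
\qquad
\frac{\partial J}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)
```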

Learning Rate (α)

Now, we introduce a term called the 'learning rate' into the partial derivative equations (Fig 6), together with the error terms we derived in Fig 7.

Fig 8 : ∂m = Change of ‘m’ w.r.t error

As mentioned earlier, the objective of GD in Linear Regression is to arrive at the optimal 'm' and 'b' iteratively, using the '∂m' given in the figure above (Fig 8). This can also be represented as below.

Fig 9 : Gradient Descent equation for convergence on ‘m’ and ‘b’
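In other words, each iteration nudges 'm' and 'b' a small step against the gradient, scaled by the learning rate α:

```latex
m \leftarrow m - \alpha \, \frac{\partial J}{\partial m},
\qquad
b \leftarrow b - \alpha \, \frac{\partial J}{\partial b}
```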

Now, we start with an initial value of 'm' and use '∂m' to iterate towards the optimal 'm'; the same is done for the intercept 'b'. The learning rate is a configurable hyper-parameter with a small positive value, often in the range between 0.0 and 1.0. A learning rate that is too large can cause the model to converge too quickly to a sub-optimal solution (or even diverge), whereas a learning rate that is too small can make the process painfully slow or cause it to get stuck.

Fig 10: Effect of different learning rates on Convergence
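Putting the pieces together, here is a minimal gradient descent sketch on the same five points (the initial values, learning rate and iteration count are illustrative choices of mine); it converges to roughly the same m ≈ 1.39 and b ≈ 0.33 that OLS gave us:

```python
data = [(1, 2), (2, 3), (4, 7), (5, 5), (7, 11)]
n = len(data)

m, b = 0.0, 0.0   # initial guesses for slope and intercept
alpha = 0.01      # learning rate (small positive value)

for _ in range(10_000):
    # Gradients of the mean squared error with respect to m and b.
    grad_m = -(2 / n) * sum(x * (y - (m * x + b)) for x, y in data)
    grad_b = -(2 / n) * sum(y - (m * x + b) for x, y in data)
    # Step a small amount against each gradient.
    m -= alpha * grad_m
    b -= alpha * grad_b

print(round(m, 2), round(b, 2))  # approximately 1.39 and 0.33
```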

Conclusion

The point of this article was not to elaborate on the gradient descent or OLS methods. I have simply attempted to demystify the difference between the two principal methods of arriving at a linear relationship between the dependent and independent variables. There are many articles online that delve into the details of each method; through this article, I hope I have answered the question of when to use each approach.

Bibliography

https://www.mathsisfun.com/data/least-squares-regression.html

https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/

https://towardsdatascience.com/https-medium-com-dashingaditya-rakhecha-understanding-learning-rate-dd5da26bb6de

