Linear Regression: Ordinary Least Squares in a nutshell

Ilia Gyrdymov
6 min read · Jun 28, 2022


Hi everyone!

In my last article, I explained the basics of Linear Regression using the ml_algo library. If you haven’t read it yet or are new to the subject, please get familiar with that post first.

I mentioned that there are several ways to find the coefficients of the predicting line. All of them are based on a concept called “Ordinary Least Squares”.

To understand the concept, let’s use the example from my previous article.

We had a small dataset:

Let’s write it as a system of linear equations:

Remember, our data is biased away from the origin:

This means we have to account for the bias term by adding a new column to the feature matrix and a new value at the beginning of the coefficients matrix:

The term “1” in the first column of the feature matrix follows from the equation of the line:
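y = k * x + b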

where:

  • “b” is our bias which we denoted as “w_0_0”
  • “k” is our coefficient which we denoted as “w_1_0”
  • “y” is our outcome

We can rewrite the equation in the following way:

Please pay attention to the bias term: next to it you can see “1”. It is the value the bias is multiplied by, and it is always 1. That’s why we filled the first column of matrix X with ones.

Let’s rewrite the system of linear equations from earlier, now taking the bias term into account:
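In general form, with x_1 … x_n standing for the feature values and y_1 … y_n for the corresponding prices, the system with the bias column looks like this:

\[
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}
\begin{bmatrix} w_{0,0} \\ w_{1,0} \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
\]

The column of ones multiplies the bias w_0_0, and the column of feature values multiplies w_1_0.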

Next, we should come up with an algorithm for finding the coefficients w_0_0 and w_1_0; they are the last piece of the puzzle called the “predicting line equation”. How can we find them?

We can, for instance, make some best guess about the coefficients, substitute the unknown terms w_0_0 and w_1_0 with it, and then compare the actual outcomes with the predicted ones.

Let 10 and 20 be our best guess for the w_0_0 and w_1_0 coefficients. Let’s substitute the variables w_0_0 and w_1_0 with these values:
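With this guess, every predicted outcome is simply 10 * 1 + 20 * x, where x is the feature value of the corresponding row of the dataset.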

As we can see, we are far, far away from the actual outcomes.

It seems that our best guess isn’t the best at all. What a pity! Let’s repeat the above steps with some more random guesses for the coefficients.

After hundreds of days of fruitless attempts, we realised that there must be a smarter way to find the coefficients. Instead of just visually evaluating the difference between the predicted and actual outcomes, let’s subtract one from the other and treat the result of the subtraction as “how far our prediction is from the actual value”, or simply “an error”:

So we can infer a common pattern for every line in the system above:
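error = y - (w_0_0 * 1 + w_1_0 * x)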

Where

  • y is our actual outcome (the price)
  • w_0_0 is our bias coefficient
  • w_1_0 is our feature coefficient

To be honest, it doesn’t matter whether the error is negative or positive; we are only interested in the absolute value of the error. Let’s take this into account and rewrite the pattern:
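error = |y - (w_0_0 * 1 + w_1_0 * x)|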

Let’s do a substitution:
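error = |y - ŷ|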

Where

  • “y” with a hat is our predicted outcome

Hmm, it looks like the equation of some 3D surface: the error as a function of the two coefficients w_0_0 and w_1_0. Let’s draw it:

It reminds me of a sheet of paper folded in half.

Apparently, we should find such w_0_0 and w_1_0 that make our “z” coordinate (which is our error) equal to 0; this is the “folding” line. Quite logical, isn’t it? We need to minimize the error, which means minimizing the “z” coordinate.

How can we do the minimization effectively? From a math perspective, we can find the extremum point (a minimum in our case) where the tangent line (or, since we have a surface, the tangent plane) touches the curve. To find that point, we take the derivative of the function and look for where it equals zero, because at an extremum the derivative is always zero.

Why a minimum point? Because we are only concerned with finding the lowest possible error.
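In our case the error is a function of the two coefficients w_0_0 and w_1_0, so “the derivative equals zero” means that both partial derivatives vanish at the minimum point: ∂error/∂w_0_0 = 0 and ∂error/∂w_1_0 = 0.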

Let’s take the 2d projection of our 3d surface and try to draw the tangent line:

The pink line is our tangent line. Wait a minute, is it the only possible tangent line? What if we draw the following:

It seems that we can draw an infinite number of tangent lines at the minimum point of this function, which means that the derivative isn’t defined there. It’s better to choose another function.

First, let’s ask ourselves: does it really matter how we eliminate negative values? Does it change anything? Apparently not. We can simply use the squared error instead of the absolute one:
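error = (y - ŷ)²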

The effect is the same, but the function now has a derivative at every point. Let’s look at the plot:

From a geometrical perspective, we can draw only one tangent line at the global minimum point. Great, the squared error is our best choice.

It’s much easier now to find the proper w_0_0 and w_1_0: we have a direction to move in, since we have an error function to minimize.

Congratulations, we’ve just inferred the target function for Ordinary Least Squares! Let’s look at the function again and memorize it:
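error = (y - ŷ)² → min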

This is one of the cornerstones of Statistics and Machine Learning. The main goal of Ordinary Least Squares is to minimize the squared difference between the “y” and “y with a hat” terms.

You may ask: how do we apply this function to an actual dataset? To use it with multiple data points, we find the squared difference for every point of the dataset and sum all the results up. Let’s use 10 and 20 as the values for w_0_0 and w_1_0:
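In general, for n data points the quantity we compute is

\[
\sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 = \sum_{i=1}^{n} \bigl(y_i - (10 \cdot 1 + 20 \cdot x_i)\bigr)^2,
\]

where x_i and y_i are the feature value and the actual price of the i-th data point.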

Once we find the minimal possible squared error, we have achieved our target and can use the corresponding w_0_0 and w_1_0 coefficients for prediction.

So far the error is 368123020300, which is way too much. We should definitely improve the values of our w_0_0 and w_1_0 coefficients. There are several ways to find the best values; one of them is the Closed-Form solution for our squared error.
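As a quick illustration, here is a minimal Dart sketch of this sum-of-squared-errors computation. The feature and price values below are made up for demonstration (they are not the dataset from this series), so the printed number will not match the 368123020300 above:

void main() {
  // Made-up feature values and actual prices, for illustration only.
  final features = [35.0, 48.0, 62.0, 80.0];
  final prices = [110000.0, 150000.0, 200000.0, 255000.0];

  // Our “best guess” coefficients from the article.
  const w00 = 10.0; // bias, w_0_0
  const w10 = 20.0; // feature coefficient, w_1_0

  var sumOfSquaredErrors = 0.0;
  for (var i = 0; i < features.length; i++) {
    final predicted = w00 * 1 + w10 * features[i];
    final error = prices[i] - predicted;
    sumOfSquaredErrors += error * error;
  }

  // Ordinary Least Squares looks for the w00 and w10 that make
  // this number as small as possible.
  print('Sum of squared errors: $sumOfSquaredErrors');
}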

That’s pretty much it,

Thank you for reading!

