The Math behind Linear Regression

Samarendra Dash · Analytics Vidhya · Mar 21, 2020

In this article we will take a deep dive into the mathematics of Linear Regression. I will try to explain the equations behind it thoroughly without resorting to matrix calculus, and hopefully by the end we will be able to code our very own Linear Regression class in Python. As prerequisites you only need basic differentiation and matrix multiplication, and you will be fine to follow along. Let’s start then.

What is Linear Regression, anyway?

To put it simply, Linear Regression is the process of finding a linear relationship between the predictor variables (x) and the dependent variable (y). One thing to note is that when I say linear I mean an equation of the 1st degree. So a predictor with p columns will have an equation of the form
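$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$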

Finding a solution:

Let’s start with the simplest case. One data point and one predictor. Then,
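$$y = \beta_0 + \beta_1 x_1$$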

We can define a predictor x₀ whose value is always one. It will help us simplify the equation as we move forward. So now the equation becomes
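$$y = \beta_0 x_0 + \beta_1 x_1$$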

(Equation 1)

If we define a row matrix X
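$$X = \begin{bmatrix} x_0 & x_1 \end{bmatrix}$$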

and column matrix β
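$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$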

then doing Xβ will give
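$$X\beta = x_0\beta_0 + x_1\beta_1$$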

It looks pretty much like the RHS of our Equation 1, right? So what if we define Y = [y]? Then we can write
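$$Y = X\beta$$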

(Equation 2)

Now let’s keep this aside and focus on Equation 1. We know this equation is not exactly true: unless someone gives us points that lie perfectly on a straight line, the equality can never hold. So what is the next best option? We can minimize the difference/error from the original value. We will be using squared error as our error metric. (There are many good reasons for choosing this as an error metric; I will attach some links if you want to read about it.)

So our error function will be,
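$$e(\beta_0, \beta_1) = \big[\, y - (x_0\beta_0 + x_1\beta_1) \,\big]^2$$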

Notice how the error is a function of β₀ and β₁. That’s because the data points are already given; the only thing we can change is the coefficients. Our target is to minimize the error. If you remember from calculus, we differentiate a function and set the derivative equal to 0 to find the minimum points. Since we have two variables here, we take partial derivatives w.r.t. β₀ and β₁.
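$$\frac{\partial e}{\partial \beta_0} = -2\,\big[\, y - (x_0\beta_0 + x_1\beta_1) \,\big]\, x_0$$

$$\frac{\partial e}{\partial \beta_1} = -2\,\big[\, y - (x_0\beta_0 + x_1\beta_1) \,\big]\, x_1$$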

This is a fairly simple differentiation if you are familiar with calculus. I want you to notice one thing here: see how inside the brackets x₀ and x₁ appear in row-major order, while the x₀ and x₁ multiplied outside are arranged in column-major order. That rings a bell, right? Yes! The transpose of a matrix.

We know the expression [y − (x₀β₀ + x₁β₁)] is the same as Y − Xβ. If we define the column vector de/dβ as
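$$\frac{de}{d\beta} = \begin{bmatrix} \partial e / \partial \beta_0 \\ \partial e / \partial \beta_1 \end{bmatrix}$$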

then we can write both equations in matrix form as
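$$\frac{de}{d\beta} = -2\, X^{\mathsf T} (Y - X\beta)$$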

(Equation 3)

Okay! But what if we had two data points? Then we define the error as the sum of the errors at the two points. I won’t explain these steps in much detail, but they are very similar to what we did before.
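Writing xᵢⱼ for the value of predictor j at data point i (with xᵢ₀ = 1 as before), the error and its partial derivatives become

$$e = \big[\, y_1 - (x_{10}\beta_0 + x_{11}\beta_1) \,\big]^2 + \big[\, y_2 - (x_{20}\beta_0 + x_{21}\beta_1) \,\big]^2$$

$$\frac{\partial e}{\partial \beta_0} = -2\,\big[\, y_1 - (x_{10}\beta_0 + x_{11}\beta_1) \,\big]\, x_{10} \;-\; 2\,\big[\, y_2 - (x_{20}\beta_0 + x_{21}\beta_1) \,\big]\, x_{20}$$

$$\frac{\partial e}{\partial \beta_1} = -2\,\big[\, y_1 - (x_{10}\beta_0 + x_{11}\beta_1) \,\big]\, x_{11} \;-\; 2\,\big[\, y_2 - (x_{20}\beta_0 + x_{21}\beta_1) \,\big]\, x_{21}$$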

It looks a bit complicated. But we do have a simple equation above, in the form of Equation 3. What if we just extend the definitions?

Y becomes a column vector
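$$Y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$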

And X becomes a matrix
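$$X = \begin{bmatrix} x_{10} & x_{11} \\ x_{20} & x_{21} \end{bmatrix}$$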

The transpose operation is still valid on X. If we evaluate the RHS of Equation 3 we have
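$$-2\, X^{\mathsf T} (Y - X\beta) = -2 \begin{bmatrix} x_{10} & x_{20} \\ x_{11} & x_{21} \end{bmatrix} \begin{bmatrix} y_1 - (x_{10}\beta_0 + x_{11}\beta_1) \\ y_2 - (x_{20}\beta_0 + x_{21}\beta_1) \end{bmatrix} = \begin{bmatrix} \partial e / \partial \beta_0 \\ \partial e / \partial \beta_1 \end{bmatrix}$$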

Pretty cool, right? We could also have seen this directly by observing that the outer x terms are arranged exactly the same way as they appear in Xᵀ.

So we have verified that the matrix equation still works when the number of data points increases to 2. In fact, this holds for any number of predictors and data points. We have the derivative as
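$$\frac{de}{d\beta} = -2\, X^{\mathsf T} (Y - X\beta)$$

where X is now the n × (p+1) matrix of predictors (including the x₀ = 1 column), Y is the n × 1 vector of targets, and β is the (p+1) × 1 vector of coefficients.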

Now we can set the derivative equal to 0 to find the minimum. But what does it mean to equate a matrix to 0? Simple: with one entry (ordinary algebra) we would write x = 0; with two entries, say x and y, we would write [x y] = [0 0].

Since de/dβ has p+1 entries (one coefficient for each predictor plus β₀), the zero/null vector O will be
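$$O = \begin{bmatrix} 0 & 0 & \dots & 0 \end{bmatrix}^{\mathsf T} \quad \text{(p+1 zeros)}$$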

Cool, we are getting there. Now, solving de/dβ = O we have
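$$-2\, X^{\mathsf T} (Y - X\beta) = O$$

$$X^{\mathsf T} Y - X^{\mathsf T} X \beta = O$$

$$X^{\mathsf T} X \beta = X^{\mathsf T} Y$$

$$\beta = (X^{\mathsf T} X)^{-1} X^{\mathsf T} Y$$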

And… that’s it. We have our desired formula. It is actually fairly easy to implement all of this in Python, thanks to the NumPy library. Let’s get it coded.

Code:
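Here is a minimal sketch of how such a class might look with NumPy, directly using the formula β = (XᵀX)⁻¹XᵀY we just derived. The names (LinearRegressor, fit, predict) follow the scikit-learn convention; the full version in the repo linked at the end of the article may differ in detail.

```python
import numpy as np

class LinearRegressor:
    """Ordinary least squares via the normal equation beta = (X'X)^-1 X'Y."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        # Prepend the x0 = 1 column so that beta_0 acts as the intercept.
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        # The formula derived above: beta = (X'X)^-1 X'Y.
        self.beta_ = np.linalg.inv(X.T @ X) @ X.T @ y
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        return (X @ self.beta_).ravel()
```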

We can test the performance and compare it with scikit-learn’s LinearRegression to check how it performs. We will use Mean Squared Error as our performance metric. (And in a sense we should, because that is exactly what we were trying to minimize.)

Here is the code to do that
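This is a sketch of what that comparison might look like. It assumes the Boston housing data (load_boston shipped with scikit-learn when this article was written, though it has since been removed from the library) and evaluates both models on the same data; the exact dataset and split behind the output below are not shown here, so your numbers may differ.

```python
from sklearn.datasets import load_boston  # removed in scikit-learn >= 1.2
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = load_boston(return_X_y=True)

# Our normal-equation implementation from above.
my_model = LinearRegressor()
my_model.fit(X, y)
print("MSE of LinearRegressor", mean_squared_error(y, my_model.predict(X)))

# scikit-learn's implementation for comparison.
sk_model = LinearRegression()
sk_model.fit(X, y)
print("MSE of Sklearn Implementation", mean_squared_error(y, sk_model.predict(X)))
```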

Output:

MSE of LinearRegressor 27.13405546067299
MSE of Sklearn Implementation 27.134055460672986

And voilà! The results are identical, which is expected because this is essentially how ordinary least squares is solved in the scikit-learn library. But before you go and use your regression class everywhere, here are some points you should remember.

Caveats:

Although this method is fast and accurate, it is not without problems. There are two main drawbacks.

  1. Remember when I said the simplest case is one data point and one predictor? That is not actually valid. Think about it: in 2D we need at least 2 points to determine a line, and in 3D we need at least 3 points for a plane. So if X has p predictors then we need at least p+1 points (as there are p predictor axes and one y axis). The rule of thumb is n > p.
  2. Secondly, if we have two columns that are perfectly correlated, say xᵢ = c·xⱼ (where c is some constant), then our method won’t work. Remember the (XᵀX)⁻¹ term we had in the equation? It can’t be evaluated if we have two correlated columns like this. The full explanation is a bit involved; a simple intuitive explanation would be,

“Since xᵢ = c·xⱼ, xᵢ can be predicted from xⱼ. So the X matrix technically contains only p−1 independent predictors, even though we defined it with p columns. XᵀX then behaves the way 0 does in ordinary arithmetic (read about singular matrices), and just as 0⁻¹, i.e. 1/0, can’t be evaluated, neither can its inverse.”

Actually, this is true even when a predictor column can be written as a linear combination of several other columns, i.e. if we have
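$$x_i = c_1 x_{j_1} + c_2 x_{j_2} + \dots + c_k x_{j_k}$$

(for some other columns x_{j₁}, …, x_{jₖ} and constants c₁, …, cₖ)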

then too the inverse can’t be evaluated. This is not a big problem as long as you remove any such correlated columns before running the regression. (The short NumPy snippet below shows the failure in action.)
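For a quick illustration, here is a tiny NumPy snippet in which the third column of X is exactly twice the second, so XᵀX is singular and the inversion fails:

```python
import numpy as np

# Two perfectly correlated predictor columns: the third column is 2x the second.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, 2 * x1])

try:
    np.linalg.inv(X.T @ X)
except np.linalg.LinAlgError as err:
    print("Cannot invert X'X:", err)  # prints "Singular matrix"
```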

These are the reasons why we sometimes need to use Gradient Descent instead. We may discuss that in a future article. 🙂

You can find the code for this article in my GitHub repo by following this link.

https://github.com/Samarendra109/ML-Models/blob/master/linear_model/LinearRegressor.py

Thank you for reading this article. This is my first article on Medium, and I aim to write more articles on Machine Learning. If you liked the article, give it a clap. Do leave your thoughts in the comments; suggestions and advice will be greatly appreciated.
