Multiple Linear Regression from Scratch using Python

Jair Neto

Published in Analytics Vidhya

5 min read, Aug 26, 2021

In the previous post, you learned how to implement a simple linear regression from scratch using only NumPy. In today’s post, I will show how to implement a multiple linear regression from scratch, again using only NumPy.

Multiple Linear Regression

In simple linear regression, we want to predict the dependent variable ‘y’ using only one explanatory variable ‘x’, as in the equation below.

y = ax + b

‘y’ is the dependent variable, ‘x’ is the explanatory variable, ‘a’ is the slope of the line, and ‘b’ is the intercept, in other words the value of ‘y’ when ‘x’ is zero. In a linear regression, we want to find the values of ‘a’ and ‘b’ that minimize the prediction errors.

Multiple Linear Regression is an extension of linear regression used when you have more than one explanatory variable to predict the dependent variable.

Fig1. Multiple linear regression formula

Where, for each of the n observations (i = 1, ..., n):

Y = the dependent variable.
Xs = the explanatory variables.
ß0 = the y-intercept (constant term).
The other ßs are the slope coefficients for each explanatory variable.
The ßs are also called weights.

Notice that simple linear regression is a special case of multiple regression in which all the ß terms from ß2 to ßp are zero.
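Since the figure is not reproduced here, the definitions above can be written out as the usual textbook form of the model. The error term ε is an assumption on my part (it is part of the standard formula, but I cannot confirm it appears in Fig1), and β stands for the ß used in the text.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i,
\qquad i = 1, \dots, n
```

Setting β2 to βp to zero leaves y = β0 + β1·x, which is the simple regression y = ax + b with b = β0 and a = β1.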

As an example, let’s assume that you want to sell your car and want to estimate how much it is worth. You know that factors such as model year, horsepower, and mileage influence the car price. In that case, you could create a multiple linear regression like the one below.

Fig2. A Multiple Linear Regression example.
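As a rough illustration of the car example, the sketch below evaluates such a formula in Python. The function name `estimate_price` and the coefficient values are hypothetical and made up for illustration only; real values would come from fitting the model to data.

```python
def estimate_price(model_year, horsepower, mileage, betas):
    """Evaluate price = b0 + b1*model_year + b2*horsepower + b3*mileage."""
    b0, b1, b2, b3 = betas
    return b0 + b1 * model_year + b2 * horsepower + b3 * mileage


# Made-up coefficients (NOT fitted to any real data): intercept, then one
# slope per explanatory variable.
example_betas = (-1_990_000.0, 1_000.0, 50.0, -0.05)

price = estimate_price(model_year=2015, horsepower=120, mileage=60_000,
                       betas=example_betas)
print(price)
```

Each slope says how much the estimated price changes when that variable increases by one unit while the others stay fixed.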

But how can we find the best values of the ßs?

This part is similar to simple linear regression. We want to minimize the cost function using gradient descent. If you don’t know what those terms are, you can learn about them in my Medium post.

From my previous post, you know that the cost function is the function below.

Fig3. The simple linear regression error function, where n is the number of observations in the data, ȳ are the predicted values, and y are the actual values.

Our goal was to find the values of ‘a’ and ‘b’ that minimize the value of the cost function. The derivatives for the simple linear regression were:
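Here is a minimal NumPy sketch of such a cost function, assuming it is the mean squared error. The exact form in Fig3 (for example a 1/n versus a 1/(2n) factor) is an assumption, and the name `mean_squared_error` is mine.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)


# Perfect predictions give a cost of zero; any error makes it positive.
print(mean_squared_error([1, 2, 3], [1, 2, 3]))
print(mean_squared_error([1, 2, 3], [2, 2, 5]))
```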

Fig4. Partial derivatives of a simple linear regression
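For reference, if we assume the cost is the mean squared error with a 1/n factor (the constant in the original figure may differ), the partial derivatives work out as follows.

```latex
\begin{align*}
E(a, b) &= \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2 \\
\frac{\partial E}{\partial a} &= -\frac{2}{n} \sum_{i=1}^{n} x_i \bigl( y_i - (a x_i + b) \bigr) \\
\frac{\partial E}{\partial b} &= -\frac{2}{n} \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)
\end{align*}
```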

For the multiple linear regression, the process is the same, but now we add an X0 = 1 to the equation so we can generalize the derivative of the cost function. So the multiple linear regression formula becomes:

Fig5. Complete multiple linear regression formula

The derivative of this function is

Fig6. The partial derivative of multiple linear regression.
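Written out under the same assumption of a mean squared error cost with a 1/n factor, and using ȳ for the predicted values as in the Fig3 caption, adding X0 = 1 lets a single expression cover every ß, including the intercept.

```latex
\begin{align*}
\bar{y}_i &= \sum_{j=0}^{p} \beta_j x_{ij}, \qquad x_{i0} = 1 \\
\frac{\partial E}{\partial \beta_j} &= -\frac{2}{n} \sum_{i=1}^{n} x_{ij} \, (y_i - \bar{y}_i),
\qquad j = 0, \dots, p
\end{align*}
```

For p = 1 this reduces to the two partial derivatives of the simple linear regression: j = 0 gives the derivative with respect to the intercept and j = 1 the derivative with respect to the slope.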

To update the weights, we just need to multiply the derivative by a learning rate and subtract from the previous weights.

Fig7. Formula to update the ßs, where α is the learning rate.

It’s important that we update all the ßs simultaneously.

This is an iterative process. Could we make it more efficient by using matrices?
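To make the simultaneous update concrete, here is a loop-based sketch of a single gradient descent step in plain Python. It assumes the mean squared error gradient with a 2/n factor and that each row of X already starts with X0 = 1; the name `gradient_step` is hypothetical.

```python
def gradient_step(X, y, betas, learning_rate):
    """One gradient descent step using explicit loops (no NumPy)."""
    n = len(X)
    # Every prediction is computed with the OLD betas.
    predictions = [sum(b * x for b, x in zip(betas, row)) for row in X]

    gradients = []
    for j in range(len(betas)):
        grad_j = (-2.0 / n) * sum(
            row[j] * (y_i - pred)
            for row, y_i, pred in zip(X, y, predictions)
        )
        gradients.append(grad_j)

    # Build a new list, so no beta is overwritten before all gradients
    # have been computed. This is what "simultaneous" means here.
    return [b - learning_rate * g for b, g in zip(betas, gradients)]


X = [[1.0, 1.0], [1.0, 2.0]]   # first column is X0 = 1
y = [2.0, 4.0]
new_betas = gradient_step(X, y, betas=[0.0, 0.0], learning_rate=0.1)
print(new_betas)
```

If we instead overwrote ß0 and then used the new ß0 to compute the gradient for ß1, we would be mixing values from two different iterations.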

Vectorized Multiple Linear Regression

In Python, we can use vectorization to implement the multiple linear regression and the gradient descent. We can transform the ys, ßs, and Xs into matrices like the image below.

With those matrices, we can compute the predicted y using the formula below:

Fig9. Vectorized formula to get the predicted values

The derivative is given by the formula below:

Finally, to get the updated weights we have the equation below:

Fig11. Vectorized formula to update the weights
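The three vectorized formulas can be sketched in NumPy as below. This assumes X has shape (n, p + 1) with a column of ones included, the weights have shape (p + 1,), and the gradient carries a 2/n factor; the constant and the name `vectorized_step` are assumptions, since the figures are not reproduced here.

```python
import numpy as np


def vectorized_step(X, y, weights, learning_rate):
    """One gradient descent step expressed with matrix operations."""
    y_pred = X @ weights                              # predicted values
    gradient = (2.0 / len(y)) * X.T @ (y_pred - y)    # all partial derivatives at once
    new_weights = weights - learning_rate * gradient  # simultaneous update
    return new_weights, gradient


X = np.array([[1.0, 1.0], [1.0, 2.0]])   # first column is the column of ones
y = np.array([2.0, 4.0])
weights, gradient = vectorized_step(X, y, np.zeros(2), learning_rate=0.1)
print(weights, gradient)
```

Because the whole weight vector is replaced in one assignment, the simultaneous update comes for free.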

Let’s write the code

Fig12. Multiple linear regression function declaration

First, a piece of advice for people who are learning data science but do not have a software engineering background: always document your code.

So we start with a function called fit_linear_regression that will receive the Xs, the Ys, the learning rate, and epsilon. Epsilon works as a threshold: we stop when the norm of the partial derivative is less than epsilon.

Fig13. Multiple Linear Regression in Python

In Step 1 we insert a column of ones into the X NumPy array to account for the y-intercept.
In Step 2 we initialize the ßs, which I call weights here. The weights are a NumPy array with one entry per column of X, including the column of ones added in Step 1.
In Step 3 we update the weights until the norm of the partial derivative is less than epsilon.
In Step 3.1 we get the predicted values as in figure 9 and the partial derivative as in figure 10.
In Step 3.2 we get the norm.
In Step 3.3 we update the weights as in figure 11.
The if statement in lines 41 and 42 warns us when the learning rate is too high and the function diverges.
The function returns the adjusted weights.
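The steps above can be sketched as follows. This is my own reconstruction from the description, not the code shown in Fig13. The default values, the `max_iter` safeguard, the 2/n factor in the gradient, and the way divergence is detected are all assumptions. The column of ones is appended as the last column, so that the last weight is the y-intercept.

```python
import warnings

import numpy as np


def fit_linear_regression(X, y, learning_rate=0.01, epsilon=1e-6, max_iter=100_000):
    """Fit a multiple linear regression with gradient descent.

    Returns a 1-D array of weights: one slope per column of X,
    followed by the y-intercept as the last value.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]

    # Step 1: append a column of ones for the y-intercept.
    X = np.column_stack([X, np.ones(n)])

    # Step 2: initialize one weight per column of X.
    weights = np.zeros(X.shape[1])

    # Step 3: update the weights until the gradient norm is below epsilon.
    for _ in range(max_iter):  # max_iter is an assumed safeguard
        # Step 3.1: predicted values and partial derivatives.
        y_pred = X @ weights
        gradient = (2.0 / n) * X.T @ (y_pred - y)

        # Step 3.2: norm of the gradient.
        norm = np.linalg.norm(gradient)
        if not np.isfinite(norm):
            warnings.warn("Gradient descent diverged; try a smaller learning rate.")
            break
        if norm < epsilon:
            break

        # Step 3.3: update all the weights at once.
        weights = weights - learning_rate * gradient

    return weights
```

Checking the norm of the gradient rather than the error itself means the loop stops when the weights have stopped changing meaningfully, even if the data cannot be fitted perfectly.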

Now that we have the correct weights, how do we predict values?

Making predictions

To make predictions, we take the dot product between the weights array, excluding its last value (the y-intercept), and the transposed Xs values, and then add the y-intercept to the result.

We have now implemented our multiple linear regression from scratch, but how does it compare with sklearn?
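A minimal sketch of that prediction step, assuming the weights are a 1-D array whose last value is the y-intercept and X has one row per observation. The function name `predict` is hypothetical.

```python
import numpy as np


def predict(X, weights):
    """Return one prediction per row of X."""
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    slopes, intercept = weights[:-1], weights[-1]
    # (p,) @ (p, n) gives one value per observation; then add the intercept.
    return slopes @ X.T + intercept


predictions = predict([[1, 1], [2, 0]], weights=[2.0, 3.0, 1.0])
print(predictions)
```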

Comparing with sklearn

Fig 15. MSE from sklearn function and from our function

The first is the Mean Squared Error from the sklearn model and the second is the MSE from our function.

As we can see, they are similar. Our function’s MSE is just 0.004 greater than sklearn’s.

Conclusions

In this post, you have learned:

What multiple linear regression is.
How we can fit a multiple linear regression model.
A vectorized multiple linear regression formula.
How to implement your own multiple linear regression using only Python and NumPy.

You can see the code used to write this post in this Colab notebook.