Multiple Linear Regression from scratch using only numpy

Debidutta Dash · Published in Analytics Vidhya · Feb 9, 2021
Image source: https://morioh.com/p/0d9b2bedf683

Linear regression is the starter algorithm when it comes to machine learning. With the help of libraries like scikit-learn, implementing multiple linear regression takes hardly two or three lines of code. The dataset could also be handled easily with pandas, but I have deliberately avoided that approach. In this article I explain how to implement the algorithm in Python using the classical matrix approach, translating it straight from the pages of the book to code! The only package I use is numpy, because our dataset is essentially a set of matrices, and numpy has many functions to handle them efficiently; without it we would have to write a large number of loops, and that might still not be as efficient. So let’s get started right away!

Linear regression with more than one input is called multiple linear regression or multivariate regression. In this implementation I have used the Real estate dataset, which has several features, and the algorithm tries to predict the price, which is the target variable.

We first have to make a few imports, which will be used throughout our code.
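A minimal sketch of the imports such a script might use — numpy for all the matrix math and the standard-library csv module for reading the raw file; the seed call is my own addition so the random split later is reproducible:

```python
import csv          # standard library, used only to read the raw CSV file
import numpy as np  # all matrix operations are done with numpy

np.random.seed(42)  # assumption: fix the seed so the random train/test split is reproducible
```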

As promised, I won’t be using pandas, so we have to take the pain of writing a function for loading the dataset ourselves.
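One possible sketch of such a loader, assuming the Kaggle CSV has been saved locally as `Real estate.csv` (the file name and the helper name `load_dataset` are my own):

```python
def load_dataset(path):
    """Read the CSV file and return (headers, data), where data is a float numpy array."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        headers = next(reader)                              # first row holds the column names
        rows = [list(map(float, row)) for row in reader if row]
    return headers, np.array(rows)

headers, data = load_dataset('Real estate.csv')
print('HEADERS')
print(headers)
print('Dataset Size')
print(f'{data.shape[0]} X {data.shape[1]}')
```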

HEADERS
['No', 'X1 transaction date', 'X2 house age', 'X3 distance to the nearest MRT station', 'X4 number of convenience stores', 'X5 latitude', 'X6 longitude', 'Y house price of unit area']
Dataset Size
414 X 8

From this dataset, let us select our features and target variables.
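One way to carve out the feature matrix and target vector, assuming the column order shown in the headers above (columns 2–6 hold the five features we keep; column 7 is the price):

```python
# Drop column 0 ('No') and column 1 ('X1 transaction date'); keep the remaining five features.
X = data[:, 2:7]
# The last column is the target: 'Y house price of unit area'.
Y = data[:, 7]

print('Size of X')
print(X.shape)   # (414, 5)
print('Size of Y')
print(Y.shape)   # (414,)
```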

Size of X
(414, 5)
Size of Y
(414,)

Having set up our data, let us get an overview of how the algorithm works and what exactly we have to code, without diving into the details of the proofs behind all the math. I will try my best to relate the algorithm to the real estate dataset we are using, for easier understanding.

Getting started

At this stage, we have N data samples, which we have divided into a feature matrix and a target matrix. We have a total of six features in our dataset (‘X1 transaction date’, ‘X2 house age’, ‘X3 distance to the nearest MRT station’, ‘X4 number of convenience stores’, ‘X5 latitude’, ‘X6 longitude’), excluding the first column (‘No’) and the target column (‘Y house price of unit area’). We have to find a relation that generates the target Y and formulate it as an equation which is a function of the different features. For convenience, I have also excluded the first feature (‘X1 transaction date’), because it is not available in a proper format in the dataset. Here we have N = 414 samples, which will be reduced further when we split the data into training and testing sets. So at this point N = 414 and the number of features (let’s say p) = 5.

For a single feature we have the linear equation of the form

y = 𝛽0 + 𝛽1x,

which on extending to multiple features for a single sample takes the form

𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + … + 𝛽𝑝𝑥𝑝

For N samples we can generalize it as

𝑦1 = 𝛽0 + 𝛽1𝑥11 + 𝛽2𝑥12 + … + 𝛽𝑝𝑥1𝑝

𝑦2 = 𝛽0 + 𝛽1𝑥21 + 𝛽2𝑥22 + … + 𝛽𝑝𝑥2𝑝

⋮

𝑦𝑖 = 𝛽0 + 𝛽1𝑥𝑖1 + 𝛽2𝑥𝑖2 + … + 𝛽𝑝𝑥𝑖𝑝

⋮

𝑦𝑁 = 𝛽0 + 𝛽1𝑥𝑁1 + 𝛽2𝑥𝑁2 + … + 𝛽𝑝𝑥𝑁𝑝

If we represent the above equations in matrix form, we have X, 𝛽 and Y matrices of order (N X p), (1 X (p+1)) and (N X 1) respectively.

Representing this system of equations in matrix form, we have

Y = X * transpose(𝛽)

But to perform this matrix multiplication, we have to make X of order (N X (p+1)). We observe from the above equations that the x0 term is 1 in every equation. The dataset we are working on so far has N = 414 and p = 5, which makes X of order (414 X 5). For the matrix multiplication we have to make it of order (414 X 6) by first adding a column of ones (the bias, or x0, term). Also, our Y should be a column vector (414 X 1), whereas it is currently in the form of a row vector.
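A sketch of that preparation step with numpy (hedged; `np.hstack` and `reshape` are one straightforward way to do it):

```python
# Prepend a column of ones so the first coefficient acts as the intercept (bias / x0) term.
ones = np.ones((X.shape[0], 1))
X = np.hstack((ones, X))     # shape becomes (414, 6)

# Reshape Y from a flat array into a column vector.
Y = Y.reshape(-1, 1)         # shape becomes (414, 1)

print('Size of X')
print(X.shape)
print('Size of Y')
print(Y.shape)
```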

Size of X
(414, 6)
Size of Y
(414, 1)

Let’s also have a look at our X and Y matrices to check if everything is fine.

X head
array([[  1.     ,  32.     ,  84.87882,  10.     ,  24.98298, 121.54024],
       [  1.     ,  19.5    , 306.5947 ,   9.     ,  24.98034, 121.53951],
       [  1.     ,  13.3    , 561.9845 ,   5.     ,  24.98746, 121.54391],
       [  1.     ,  13.3    , 561.9845 ,   5.     ,  24.98746, 121.54391],
       [  1.     ,   5.     , 390.5684 ,   5.     ,  24.97937, 121.54245]])

Y head
array([[37.9],
       [42.2],
       [47.3],
       [54.8],
       [43.1]])

So far so good!

Splitting the dataset randomly

Now we follow the conventional approach of splitting the dataset into train and test sets, with the training set making up 70% of the data.
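One simple way to do a 70/30 random split with nothing but numpy — the helper name and the use of `np.random.permutation` are my own choices, not necessarily the original implementation:

```python
def train_test_split(X, Y, train_fraction=0.7):
    """Shuffle the row indices and split X and Y into training and testing sets."""
    n_samples = X.shape[0]
    indices = np.random.permutation(n_samples)          # random ordering of the rows
    n_train = int(round(train_fraction * n_samples))    # 70% of 414 -> 290
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

print('TRAINING SET')
print('X_train.shape:', X_train.shape)
print('Y_train.shape:', Y_train.shape)
print('\nTESTING SET')
print('X_test.shape:', X_test.shape)
print('Y_test.shape:', Y_test.shape)
```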

Let’s now have a look at the shape of our training and testing sets.

TRAINING SET
X_train.shape: (290, 6)
Y_train.shape: (290, 1)

TESTING SET
X_test.shape: (124, 6)
Y_test.shape: (124, 1)

We are going to apply the algorithm to our training set only and check the performance of the algorithm using the testing set. So, our N reduces from 414 to 290 now (70%).

From our matrix equation we already have the X matrix and the Y matrix ready, and our goal is to find the 𝛽 matrix (or more precisely the coefficients of the features, but from now on let us call the 𝛽 transpose matrix simply 𝛽) such that the Y obtained from the matrix multiplication (Y = X𝛽) is closest to the actual Y matrix. In other words, we have to find the beta (𝛽) matrix such that the error is minimized.

To do this we have two approaches:

  1. Gradient Descent method
  2. Normal Equation method

In this article we are going to follow the Normal Equation method. The Normal Equation is a method of finding the optimum beta (𝛽) without iteration.

[Comparison of the Gradient Descent and Normal Equation methods, taken from Andrew Ng’s Machine Learning course on Coursera]

In the Normal Equation method, the optimum beta is given as:

𝛽 = (XᵀX)⁻¹ XᵀY

The mathematical proof of the Normal Equation requires some knowledge of linear algebra, but those interested can follow the derivation linked in the references below.

Beta value and prediction

Next we implement a function to find the value of 𝛽 and start predicting with our calculated 𝛽.
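A minimal sketch of both steps, evaluating the normal equation with `np.linalg.inv` (using `np.linalg.pinv` would be a more robust choice if XᵀX were ill-conditioned); the helper names are my own:

```python
def fit_beta(X, Y):
    """Solve the normal equation: beta = (X^T X)^(-1) X^T Y."""
    return np.linalg.inv(X.T @ X) @ X.T @ Y

def predict(X, beta):
    """Return the predictions Y_hat = X @ beta."""
    return X @ beta

beta = fit_beta(X_train, Y_train)   # shape (6, 1): intercept plus 5 feature coefficients
Y_pred = predict(X_test, beta)
print(Y_pred.shape)                 # (124, 1)
```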

(124, 1)

We can see that our code has so far predicted the value of Y for all the test samples. It’s time for us to check the accuracy of our model.

Checking the accuracy
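The three metrics reported below can be computed directly with numpy, roughly as follows (these are the standard definitions of MAE, RMSE and R²; the function names are my own):

```python
def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def root_mean_square_error(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r_square(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

print('Mean Absolute Error: ', mean_absolute_error(Y_test, Y_pred))
print('Root Mean Square Error:', root_mean_square_error(Y_test, Y_pred))
print('R square:', r_square(Y_test, Y_pred))
```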

Mean Absolute Error:  5.484897442452742
Root Mean Square Error: 7.038888303432659
R square: 0.6927055239131656

So, the model looks reasonable, but there is still scope for improvement. I leave that to you! :-)

References and Credits:

  1. Dataset: https://www.kaggle.com/quantbruce/real-estate-price-prediction
  2. Derivation of the Normal Equation (the math deep dive): https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression
  3. Image used from: https://morioh.com/p/0d9b2bedf683
  4. Thanks to my friend Ashish Sureka (LinkedIn: https://www.linkedin.com/in/ashish-sureka/), who helped me with the concepts and code.
