Linear Regression on Boston Housing Prices

Navin Nishanth K S · Published in Analytics Vidhya · 6 min read · Oct 12, 2020

What is Linear Regression?

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression.

For more than one explanatory variable, the process is called multiple linear regression.

Source : Wikipedia

Basically, we are fitting a line to the data. The prediction from that line is called Ypred:

Ypred = b1x + b0

where b1 is the slope of the line and b0 is the intercept.

Boston Dataset :

This dataset was taken from the StatLib library and is maintained by Carnegie Mellon University. It concerns housing prices in the city of Boston. The dataset has 506 instances with 13 features.

Without further ado, let’s jump straight into this:

I’m loading the dataset from sklearn’s datasets module. First we load the libraries, then import the dataset.
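A minimal sketch of this step (the load_boston helper was removed in scikit-learn 1.2, so this assumes an older version, matching the article’s date):

```python
import pandas as pd
from sklearn.datasets import load_boston

# Load the bunch object and put the feature matrix into a dataframe.
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
```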

Taking a look at the data by using the df.head() function:

Features present in the Boston Dataset

Now, we gotta add the target column, which comes from the dataset’s .target attribute.

After adding the target column
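A sketch of that step:

```python
# Append the target (median home value) as a new column.
df['target'] = boston.target
```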

After this, we need to find which columns can be categorical and which are numerical before jumping into df.describe(). For this we use df.info() and take a look at how the columns originally look.

The columns are all floats.

Now, we gotta find which columns can be changed into categorical (object) by taking a look at the dataframe. In my view, CHAS and RAD can be categorical, so I’m converting both into objects.

used astype to change the datatype
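A sketch of the conversion:

```python
# Cast the two columns to the object dtype so pandas treats them as
# categorical rather than numeric.
df['CHAS'] = df['CHAS'].astype('object')
df['RAD'] = df['RAD'].astype('object')
```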

After doing this, we can now jump into finding the five-point summary of the dataframe.

The df.describe() output, including the 50th percentile (the median)

It is time for us to jump into the EDA part. As I mentioned earlier, I have a separate article on how to do EDA.

We start our EDA by checking for null values, and we find that there are no null values in the dataset. Nulls are checked using the isnull() function.

I chained .sum() to get the totals
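A sketch of the check:

```python
# isnull() flags missing cells; chaining .sum() counts them per column.
# Every count is zero for this dataset.
df.isnull().sum()
```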

Since we don’t have any null values, we can go forward with outlier treatment. To check for outliers we use boxplots and see how the outliers look in this dataset.

There are outliers; I’m attaching only one column’s plot so that this article doesn’t get too big.
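A sketch of one such plot (the column shown, CRIM, is my assumption):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One boxplot per column reveals the outliers; CRIM is shown here.
sns.boxplot(x=df['CRIM'])
plt.show()
```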

Outlier treatment: I treated the outliers using the IQR method; the only difference is that I replaced them with the 0.99 and 0.01 percentile values.

Replaced the values using lambda functions
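A sketch of the capping step, assuming the IQR fences flag the outliers and the 0.01/0.99 percentiles supply the replacement values:

```python
# Cap each numeric feature: values beyond the IQR fences are replaced
# with the 1st/99th percentile of that column, via a lambda.
num_cols = df.select_dtypes(include='number').columns.drop('target')

for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lo_cap, hi_cap = df[col].quantile([0.01, 0.99])
    df[col] = df[col].apply(
        lambda x: hi_cap if x > upper else (lo_cap if x < lower else x)
    )
```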

Now checking the graph again :

We have reduced the outliers to a great extent.

Now, we gotta check the correlations. It’s pair plot and heatmap time!

So sns.pairplot gives the following :

We are also gonna take the heatmap for the correlation :
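A sketch of both plots (the figure size and color map are my choices):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pair plot of the numeric features, then a heatmap of their correlations.
sns.pairplot(df)
plt.show()

plt.figure(figsize=(12, 8))
sns.heatmap(df.select_dtypes(include='number').corr(),
            annot=True, cmap='coolwarm')
plt.show()
```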

We can see that there is a high positive correlation between the target and RM, and a strong negative one with LSTAT. Now we gotta jump into building our base model.

Train test split: We have to split the data into training data and test data so that the model is trained on one part and evaluated on the other. We have to pass the input and output variables to ensure that the split occurs properly.

I have taken the target as the output; for the input I dropped three columns (the target plus the categorical CHAS and RAD), since I’m just going forward with the numerical features.
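A sketch of the split (the test size and random state are my assumptions):

```python
from sklearn.model_selection import train_test_split

# Inputs: numeric features only; output: the target column.
X = df.drop(columns=['target', 'CHAS', 'RAD'])
y = df['target']
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3,
                                                random_state=1)
```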

Adding a constant to xtrain: we gotta add a constant column because statsmodels’ OLS does not include an intercept by default.

The xtrain actually has 11 columns; with the constant column added, it becomes 12.
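A sketch of that step:

```python
import statsmodels.api as sm

# statsmodels' OLS fits an intercept only if the design matrix contains
# a constant column, so we add one explicitly.
xtrain_const = sm.add_constant(xtrain)
```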

Now, building our OLS Model.

The mistake we often make here is getting the argument order wrong (the target comes first, then the features) and leaving out fit().
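A sketch of the correct call:

```python
# The target (endog) comes first, then the features (exog),
# and fit() must be called to actually estimate the coefficients.
ols_model = sm.OLS(ytrain, xtrain_const).fit()
print(ols_model.summary())
```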

After this, the model summary table is displayed.

Notice the R-squared value

We get an R-squared of 71%, and we need to improve on this model. We are also gonna look at the LinearRegression way of doing it.

This is how we do it without using OLS, the scikit-learn LinearRegression way:
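A sketch of the scikit-learn route:

```python
from sklearn.linear_model import LinearRegression

# LinearRegression fits the intercept itself, so no constant is needed.
lr = LinearRegression()
lr.fit(xtrain, ytrain)
print(lr.score(xtest, ytest))  # R-squared on the held-out data
```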

Improving the model: Now, this hit me. I really wanted to improve it by using StandardScaler, then get_dummies, and then RFECV for feature selection, to ensure that my model quality improves.

Use of RFECV :

This gives the rank of each column for model building. A 1 in the array means we need that feature. To see which columns those are, we are gonna make it into a dataframe and then work on it.
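A sketch of the selection step (the cv value is my assumption):

```python
from sklearn.feature_selection import RFECV

# Recursive feature elimination with cross-validation, using a linear
# estimator; ranking_ assigns 1 to every feature worth keeping.
rfecv = RFECV(estimator=LinearRegression(), cv=5)
rfecv.fit(xtrain, ytrain)
print(rfecv.ranking_)
```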

So we have checked for 1s; now we are gonna mask them and make a separate dataframe.

I have made use of the useful columns and am now gonna get dummies for the categorical one
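A sketch of the masking step:

```python
# support_ is a boolean mask that is True exactly where ranking_ == 1;
# use it to pull the selected columns into a separate dataframe.
selected = xtrain.columns[rfecv.support_]
c1 = df[selected]
```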

Using get_dummies on CHAS: We are gonna take CHAS, get dummies for the column, and then concat c1 and c2 so that they make a complete dataframe.

Using concat, we are concatenating the two dataframes
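A sketch of the encoding and concat (drop_first is my assumption, to avoid the dummy-variable trap):

```python
import pandas as pd

# One-hot encode CHAS into c2, then concatenate column-wise with c1.
c2 = pd.get_dummies(df['CHAS'], prefix='CHAS', drop_first=True)
combined = pd.concat([c1, c2], axis=1)
```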

Then we are gonna do the train test split again and redo the model building.

But once again, the model didn’t improve much; it gave roughly the same 71% as before. So I wanted to use StandardScaler to scale the data and get better performance.

Scaling the data: Using the standard scaling method, I scale the data, and then we build a model on the scaled data.

We can see the scaled data; now we are gonna use boxcox from scipy.stats to normalize the data. Then I build the model again with the scaled data.
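A sketch of both steps; the article doesn’t show which variable Box-Cox was applied to, so transforming the target is my assumption (Box-Cox needs strictly positive values, which the scaled features would violate):

```python
from sklearn.preprocessing import StandardScaler
from scipy.stats import boxcox

# Standardize the features to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(combined),
                        columns=combined.columns, index=combined.index)

# Box-Cox transform the (strictly positive) target to reduce skew.
y_bc, lam = boxcox(df['target'])
```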

The model performance improved to an R-squared of 75% following the scaling of the data:

This model can be further improved using feature-engineering techniques, and I’ll write a separate article regarding that.

The GitHub link for the whole thing: https://github.com/navinniish/Linear-Regression

Thanks for reading.
