Linear Regression on Boston Housing Prices
What is Linear Regression?
In statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression.
For more than one explanatory variable, the process is called multiple linear regression.
Source : Wikipedia
Essentially, we are fitting a line to the data. That fitted line gives the predicted value, Ypred:
Ypred = b1x + b0
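To make the equation concrete, here is a minimal sketch that estimates b1 and b0 with NumPy's least-squares fit (the data points are my own illustration, chosen to lie near the line y = 2x + 1):

```python
import numpy as np

# Hypothetical points lying near the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 9.1])

# Least-squares estimates of the slope (b1) and intercept (b0)
b1, b0 = np.polyfit(x, y, deg=1)

# The fitted line: Ypred = b1*x + b0
y_pred = b1 * x + b0
print(b1, b0)
```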
Boston Dataset :
This dataset was taken from the StatLib library and is maintained by Carnegie Mellon University. It concerns housing prices in the city of Boston and has 506 instances with 13 features.
Without further ado, let’s jump straight in:
I’m loading the dataset from sklearn datasets by using the following code :
Let’s take a look at the data using the df.head() function:
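A sketch of the loading step follows. Note that load_boston was removed in scikit-learn 1.2 for ethical reasons, so on newer versions the same data can be fetched from OpenML instead; the fallback below is my own workaround, not part of the original walkthrough:

```python
import pandas as pd

try:
    # scikit-learn < 1.2 ships the dataset directly
    from sklearn.datasets import load_boston
    boston = load_boston()
    df = pd.DataFrame(boston.data, columns=boston.feature_names)
    df["MEDV"] = boston.target  # attach the target column
except ImportError:
    # load_boston was removed in scikit-learn 1.2; fetch from OpenML instead
    from sklearn.datasets import fetch_openml
    boston = fetch_openml(name="boston", version=1, as_frame=True)
    df = boston.frame  # already includes the MEDV target column

print(df.head())
```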
Next, we need the target column, which is obtained through the dataset’s .target attribute.
After this, we need to decide which columns should be categorical and which numerical before jumping into df.describe(). For this we use df.info() and take a look at how the columns originally look.
Now we check which columns can be converted to categorical (object) by inspecting the dataframe. In my view, CHAS and RAD can be treated as categorical, so I’m converting both to objects.
After doing this, we can find the five-number summary of the dataframe with df.describe().
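A minimal sketch of the dtype conversion and the summary step, using a tiny hypothetical frame in place of the full dataset (the values are illustrative only):

```python
import pandas as pd

# Tiny stand-in for the Boston frame (values are illustrative only)
df = pd.DataFrame({
    "RM":   [6.5, 5.9, 7.1, 6.2],
    "CHAS": [0, 1, 0, 0],
    "RAD":  [1, 2, 24, 5],
})

# Treat CHAS and RAD as categorical by casting them to the object dtype
df["CHAS"] = df["CHAS"].astype("object")
df["RAD"] = df["RAD"].astype("object")

df.info()            # confirms the new dtypes
print(df.describe()) # five-number summary of the remaining numeric columns
```

Note that df.describe() only includes numeric columns by default, so the freshly converted object columns drop out of the summary.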
It is time to jump into the EDA part. As I mentioned earlier, I’m attaching my article on how to do EDA.
We start our EDA by checking for null values using the isnull() function, and it turns out there are no null values in the dataset.
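The null check looks like this (sketched on a tiny hypothetical frame):

```python
import pandas as pd

# Small stand-in frame; the real Boston data has no missing values either
df = pd.DataFrame({"RM": [6.5, 5.9, 7.1], "LSTAT": [4.98, 9.14, 4.03]})

# Count missing values per column; all zeros here
print(df.isnull().sum())
```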
Since we don’t have any null values, we can move on to outlier treatment. To check for outliers, we use boxplots and see how the outliers are distributed in this dataset.
Outlier treatment : I have treated the outliers with the IQR method; the only difference is that instead of the fences themselves, I replaced the outliers with the 0.99 and 0.01 percentile values.
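A sketch of this treatment on a synthetic series (the injected outlier values are my own illustration): outliers are flagged with the usual 1.5 × IQR fences, but replaced with the 0.01 / 0.99 percentiles rather than the fence values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(50, 5, 500))
s.iloc[:3] = [200.0, -100.0, 180.0]  # inject a few obvious outliers

# Flag outliers with the usual IQR fences...
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# ...but replace them with the 0.01 / 0.99 percentile values
p01, p99 = s.quantile(0.01), s.quantile(0.99)
treated = s.mask(s < low_fence, p01).mask(s > high_fence, p99)
print(treated.min(), treated.max())
```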
Now checking the graph again :
Now we check the correlations. It’s pair plot and heatmap time!
So sns.pairplot gives the following :
We are also gonna take the heatmap for the correlation :
We can see a high positive correlation between the target and RM, and a strong negative correlation with LSTAT. Now we can jump into building our base model.
Train test split : We split the data into a training set and a test set so that the model is fitted on one portion and evaluated on the other. We pass the input and output variables to ensure that the split occurs properly.
Adding a constant to x_train. We add a constant column because statsmodels’ OLS does not include an intercept by default.
Now, building our OLS Model.
After this, the model summary is displayed.
We get an R-squared of 71%, and we want to improve on this. We’ll also look at the scikit-learn LinearRegression way of doing it.
This is how we do it without statsmodels’ OLS: the scikit-learn LinearRegression way.
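A minimal sketch of the same fit with scikit-learn, again on synthetic data of my own making; note that LinearRegression fits the intercept itself, so no constant needs to be added:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 300)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# LinearRegression fits the intercept itself -- no add_constant needed
lr = LinearRegression()
lr.fit(x_train, y_train)
print("test R^2:", lr.score(x_test, y_test))
```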
Improving the model : Now, this hit me. I wanted to improve the model by applying StandardScaler, get_dummies, and RFECV for feature selection, to raise the model quality.
Use of RFECV :
This gives a ranking of the columns useful for model building. A 1 in the array means the column should be kept. To see which columns those are, we put the ranking into a dataframe and work on it.
So we have checked for 1s; now we use them as a mask and build a separate dataframe.
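A sketch of RFECV on a toy frame (my own construction, where only RM and LSTAT actually drive the target, so the selector should rank them 1):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
cols = ["RM", "LSTAT", "NOX", "AGE", "DIS"]
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=cols)
# Only RM and LSTAT actually drive the target in this toy setup
y = 5 * X["RM"] - 3 * X["LSTAT"] + rng.normal(0, 0.5, 200)

selector = RFECV(LinearRegression(), cv=5)
selector.fit(X, y)

# ranking_ gives 1 for every column worth keeping; support_ is the
# same information as a boolean mask
print(dict(zip(cols, selector.ranking_)))

# Mask the 1s and collect the kept columns into a separate dataframe
selected = pd.DataFrame({"column": cols, "keep": selector.support_})
kept = selected.loc[selected["keep"], "column"].tolist()
print(kept)
```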
Using get_dummies on CHAS : We take CHAS, create dummy columns for it, and then concat c1 and c2 so that they form a complete dataframe.
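The dummy-and-concat step looks like this; c1 and c2 here are my own minimal stand-ins for the numeric frame and the dummy frame, and drop_first is my choice to avoid the redundant dummy column:

```python
import pandas as pd

df = pd.DataFrame({"RM": [6.5, 5.9, 7.1, 6.2],
                   "CHAS": ["0", "1", "0", "1"]})

# c1: the numeric columns, c2: dummy columns derived from CHAS
c1 = df.drop(columns=["CHAS"])
c2 = pd.get_dummies(df["CHAS"], prefix="CHAS", drop_first=True)

# Concatenate side by side into one modelling-ready frame
full = pd.concat([c1, c2], axis=1)
print(full.head())
```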
Then we split this with train_test_split and do the model building again.
But once again, the model didn’t improve much; it stayed at the same 71%. So I wanted to use StandardScaler to scale the data for better performance.
Scaling the data : Using StandardScaler, I scale the data, and then we build a model on the scaled data.
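The scaling step, sketched on a small synthetic frame of my own (each column comes out with zero mean and unit variance):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
df = pd.DataFrame({"RM": rng.normal(6.3, 0.7, 100),
                   "TAX": rng.normal(400, 170, 100)})

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.describe().loc[["mean", "std"]])
```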
We can see the scaled data. Next we use boxcox from scipy.stats to normalize the data, and then I build the model again with the transformed data.
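A sketch of the Box-Cox step on a right-skewed synthetic feature (my own stand-in). One caveat worth noting: boxcox requires strictly positive input, so it has to be applied to the raw feature before standard scaling, since scaled data contains negative values:

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(6)
x = rng.lognormal(mean=2.0, sigma=0.6, size=500)  # right-skewed, strictly positive

# boxcox requires strictly positive input, so it is applied to the raw
# (unscaled) feature; standard scaling can follow afterwards
transformed, lam = boxcox(x)
print("skew before:", skew(x), "after:", skew(transformed))
```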
The model performance improved to 75% following the scaling of the data:
This can be improved further with feature-engineering techniques, and I’ll write a separate article about that.
The GitHub repo for the whole project is at https://github.com/navinniish/Linear-Regression
Thanks for reading.