Predicting House Prices Using Linear Regression

Gerald Muriuki
Published in Africa Creates
Oct 11, 2017

I set out to use linear regression to predict housing prices in Iowa. I will highlight how I went about it, what worked for me, what didn't, and what I learnt in the process.

First, what is the problem?

The problem is to build a model that will predict house prices with a high degree of predictive accuracy given the available data. More about it here. “With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.”

Where I got the data

The dataset contains the prices and features of residential houses sold in Ames, Iowa between 2006 and 2010. It was obtained from here.

Environment and tools

Jupyter Notebook, NumPy, pandas, seaborn, Matplotlib, SciPy and scikit-learn.

First things first, import the Python libraries and the dataset.

pearsonr (the Pearson correlation coefficient) is a measure of the linear correlation between two variables X and Y. It takes a value between +1 and −1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
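
The imports below are a minimal sketch of what the rest of the walkthrough assumes; the exact cell in the original notebook may differ slightly.

```python
# Data handling, plotting and statistics libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import pearsonr

# Quick illustration of pearsonr: it returns the coefficient and a p-value
r, p_value = pearsonr([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
print(r, p_value)
```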

Load the train and test files

Let's drop the 'Id' column in the data, as it is not needed for prediction.
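
Here is a sketch of those two steps, assuming the Kaggle train.csv and test.csv files sit in the working directory:

```python
# Load the training and test files (paths are an assumption)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Keep the Ids for the submission file, then drop the column from both sets
train_ID = train['Id']
test_ID = test['Id']
train = train.drop('Id', axis=1)
test = test.drop('Id', axis=1)
```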

Let’s see what the dataset looks like

The training data has 1460 observations and 80 explanatory variables, and the test file has 1459 observations and 79 explanatory variables. The test data does not include the target variable, 'SalePrice'.
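
A quick check along these lines, continuing from the DataFrames loaded above:

```python
# Dimensions after dropping the Id column
print(train.shape)   # (1460, 80): 79 features plus SalePrice
print(test.shape)    # (1459, 79): no SalePrice column
train.head()
```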

Here is the data description file (data_description.txt):

  • SalePrice: the property's sale price in dollars. This is the target variable that you're trying to predict.
  • LotArea: Lot size in square feet
  • Neighborhood: Physical locations within Ames city limits
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • TotalBsmtSF: Total square feet of basement area
  • GrLivArea: Above grade (ground) living area square feet

etc.

I dived into exploring the data, starting with a descriptive statistics summary of the target variable (SalePrice).

This gives us the count, mean, standard deviation, 25th, 50th and 75th percentiles, minimum and maximum.
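
A one-liner along these lines produces that summary:

```python
# Descriptive statistics of the target variable
train['SalePrice'].describe()
```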

I went ahead and plotted the distribution of 'SalePrice' together with a normal probability (QQ) plot, which is used to identify substantive departures from normality, including outliers, skewness and kurtosis.
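
A sketch of those plots and the accompanying statistics (distplot was the seaborn API at the time this was written; the exact calls are an assumption):

```python
# Histogram of SalePrice with a fitted normal curve
sns.distplot(train['SalePrice'], fit=stats.norm)
plt.show()

# QQ-plot against a theoretical normal distribution
stats.probplot(train['SalePrice'], plot=plt)
plt.show()

# Skewness and kurtosis of the raw target
print("Skewness:", train['SalePrice'].skew())
print("Kurtosis:", train['SalePrice'].kurt())
```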

Distribution of the SalePrice

SalePrice deviates from the normal distribution and is positively skewed.

skewness and kurtosis

SalePrice also does not align with the diagonal line in the normal probability plot, which represents a normal distribution. A normal probability plot of a normal distribution should look fairly straight, at least when the few large and small values are ignored.

I went on to use a logarithmic transformation to make the highly skewed distribution less skewed.
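
A minimal sketch of the transformation; whether the notebook used log or log(1 + x) is an assumption here:

```python
# log(1 + x) transform of the target to reduce positive skew
train['SalePrice'] = np.log1p(train['SalePrice'])

print("Skewness:", train['SalePrice'].skew())
print("Kurtosis:", train['SalePrice'].kurt())
```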

Reduced skewness and kurtosis after log transformation

Correlation Analysis

Here I explored the correlation between the features and the target by plotting a correlation matrix
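
A sketch of the heatmap, computed on the numeric columns of the training data:

```python
# Correlation matrix of the numeric features, drawn as a heatmap
corrmat = train.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(12, 9))
sns.heatmap(corrmat, vmax=0.9, square=True)
plt.show()
```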

correlation matrix

From the figure above I could only tell that OverallQual, TotalBsmtSF, GarageCars and GarageArea have a high positive correlation with SalePrice. Unfortunately, the colors are not very clear. Below is a better representation with values.
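
One way to produce that annotated view is to zoom in on the features most correlated with SalePrice; the choice of the top ten is an assumption:

```python
# Heatmap of the ten features most correlated with SalePrice, with values
k = 10
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = train[cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='.2f', square=True,
            xticklabels=cols.values, yticklabels=cols.values)
plt.show()
```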

From this we can tell which features (OverallQual, GrLivArea, GarageCars, GarageArea and TotalBsmtSF) are highly positively correlated with SalePrice. Here is a correlation guide.

GarageCars and GarageArea are strongly correlated, and TotalBsmtSF and 1stFlrSF have a similarly high correlation. I assumed that the number of cars a garage can hold depends strongly on the garage area. Looking at the data, homes with 0 GarageCars also had 0 GarageArea, indicating they don't have a garage.

Here are some scatter plots to visualize the relationship of some of these features with SalePrice.
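
A sketch of such plots for a few of the highly correlated features (the exact selection is an assumption):

```python
# Scatter plots of selected features against SalePrice
for feature in ['GrLivArea', 'TotalBsmtSF', 'GarageArea']:
    plt.scatter(train[feature], train['SalePrice'], alpha=0.5)
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.show()
```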

For more scatter plots here is the notebook.

Dealing with outliers

I removed the GrLivArea outliers, as recommended by the author of the data: “I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these five unusual observations) before assigning it to students.”
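
A sketch of that filter; the exact condition used in the notebook is an assumption:

```python
# Remove the handful of houses with more than 4000 sq ft of living area
train = train.drop(train[train['GrLivArea'] > 4000].index)
```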

Imputing Missing Values

Let’s have a look at the training data (train.csv)
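
A quick way to see which columns have missing entries, continuing from the train DataFrame above:

```python
# Count missing values per column, largest first
missing = train.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])
```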

To understand what the missing values mean, I looked at the data documentation, which helped me transform other features to reflect the assumptions I made. For example, a GarageArea of zero indicates there is no garage, so GarageCars should be set to 0 as well. I also filled some missing entries with the mode and converted some numerical features to categorical features. One other key feature I added was the total surface area, TotalSF, obtained by adding together TotalBsmtSF, 1stFlrSF and 2ndFlrSF.
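
The sketch below illustrates the kind of imputation and feature engineering described; the specific columns are examples only, and the notebook handles many more:

```python
# No garage: fill the related numeric features with 0
for col in ['GarageCars', 'GarageArea']:
    train[col] = train[col].fillna(0)

# Fill a categorical feature with its most common value (mode)
train['Electrical'] = train['Electrical'].fillna(train['Electrical'].mode()[0])

# Some numeric codes are really categories, e.g. the month sold
train['MoSold'] = train['MoSold'].astype(str)

# New feature: total surface area of the house
train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']
```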

I also transformed the highly skewed features using the Box-Cox transformation, which is a way to transform non-normal variables into a normal shape; there were 59 such skewed features. Normality is an important assumption for many statistical techniques, and since the data isn't normal, applying a Box-Cox transformation also increases the ability to run a broader number of tests. I then went on to add dummy variables for the categorical features.
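
A sketch of the Box-Cox step and the dummy encoding; the 0.75 skewness threshold and the fixed lambda of 0.15 are assumptions:

```python
from scipy.stats import skew
from scipy.special import boxcox1p

# Work on the explanatory variables only; the target stays log-transformed
numeric_feats = train.dtypes[train.dtypes != 'object'].index.drop('SalePrice')

# Measure skewness and keep the clearly skewed features
skewness = train[numeric_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewness[abs(skewness) > 0.75].index

# Box-Cox transform of 1 + x with a fixed lambda
lam = 0.15
for feat in skewed_feats:
    train[feat] = boxcox1p(train[feat], lam)

# One-hot encode the categorical features
train = pd.get_dummies(train)
```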

Linear regression modelling

I used LASSO (least absolute shrinkage and selection operator) and gradient boosting regression models, training each on the dataset and making predictions separately.

LASSO is a regression model that does variable selection and regularization. The LASSO model uses a parameter that penalizes fitting too many variables. It allows the shrinkage of variable coefficients to 0, which essentially results in those variables having no effect in the model, thereby reducing dimensionality. Since there are quite a few explanatory variables, reducing the number of variables may increase interpretability and prediction accuracy.
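
A minimal sketch of such a model with scikit-learn; the alpha value is an assumption and would normally be tuned, and the robust scaling step reduces the influence of any remaining outliers:

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# LASSO with robust feature scaling; alpha controls the penalty strength
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
```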

Gradient boosting models (GBMs) are among the most popular algorithms on Kaggle. A variant of GBMs known as XGBoost has been a clear favorite in many recent competitions. The algorithm works well right out of the box. It is a type of ensemble model, like the random forest, in which multiple decision trees are built and optimized over some cost function. Its popularity and ability to score well in competitions are reasons enough to use this type of model for the house price prediction problem.
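
A sketch of a gradient boosting regressor in scikit-learn; the hyperparameters below are illustrative rather than the ones used in the notebook:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Gradient boosting with a robust (Huber) loss
gboost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state=5)
```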

I used cross validation and RMSLE (root mean squared logarithmic error) to see how well each model performs.
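
Because SalePrice has already been log-transformed, the RMSE on the transformed target is equivalent to the RMSLE on the original prices. Below is a sketch of the scoring helper, where X is the training feature matrix and y the log SalePrice (names assumed):

```python
from sklearn.model_selection import KFold, cross_val_score

def rmsle_cv(model, X, y, n_folds=5):
    # k-fold cross-validated RMSE of the (log) target, i.e. the RMSLE
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, X, y,
                                    scoring='neg_mean_squared_error', cv=kf))
    return rmse

# Usage:
# print(rmsle_cv(lasso, X, y).mean())
# print(rmsle_cv(gboost, X, y).mean())
```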

Here are the scores I received from my two models

Lasso and GBoost score

I went ahead and stacked the two models together. It has been shown by several research papers, and here, that combining models in this way, known as ensembling, can improve accuracy.
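
The simplest form of this is averaging the two models' predictions; the class below is a sketch of that idea (a fuller stacking approach trains a meta-model on out-of-fold predictions):

```python
from sklearn.base import BaseEstimator, RegressorMixin, clone

class AveragingModels(BaseEstimator, RegressorMixin):
    """Fit several base models and average their predictions."""

    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # Clone so the originals are left untouched, then fit each copy
        self.models_ = [clone(m) for m in self.models]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        # Column-stack the predictions and take the row-wise mean
        predictions = np.column_stack([m.predict(X) for m in self.models_])
        return predictions.mean(axis=1)
```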

Average base model score

Finally, I trained the stacked regressor and used it to make predictions.
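
A sketch of that final step, assuming X and y are the processed training features and log target, and X_test is the test set processed in the same way:

```python
# Fit the averaged model and predict; np.expm1 undoes the earlier log1p
averaged = AveragingModels(models=[lasso, gboost])
averaged.fit(X, y)
predictions = np.expm1(averaged.predict(X_test))

# Kaggle submission file
submission = pd.DataFrame({'Id': test_ID, 'SalePrice': predictions})
submission.to_csv('submission.csv', index=False)
```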

RMSLE of the stacked regressor

Conclusion

After submitting to the Kaggle competition, I placed in the top 12%. Not bad! Here is my notebook and the leaderboard. More can definitely be done.

Thanks for reading! I appreciate you. Follow @itsmuriuki.

Back to code, back to more learning.

Thanks Oluwole for going through the draft.
