Prediction Model: Building a House Price Model

Han Man
Feb 17, 2017 · 13 min read

Having lived in NYC for three years… if there is one thing that I had been taking for granted, it’s that housing and rental prices continue to rise.

Via 57 West in Manhattan

Since the housing crisis of 2009, housing prices have recovered remarkably well, especially in major housing markets. However, in the 4th quarter of 2016, I was surprised to read that Manhattan housing prices had posted their largest year-over-year drop in four years. In fact, median resale prices for condos and co-ops fell 6.3%, marking the first decline since Q1 of 2015. The decline has been partly attributed to political uncertainty at home and abroad, with Brexit and the 2016 election (RIP democracy).

Housing prices in major markets have been growing steadily… http://www.economist.com/blogs/graphicdetail/2016/08/daily-chart-20

An interesting question came to mind: does this mean that housing prices are plateauing again? Understanding what goes into predicting housing prices is a good first step to answering this question. To approach this problem, I needed to work with a house price data set.

Kaggle.com is a site dedicated to data analysis and filled with all kinds of competitions, challenges, and data sets to explore. Often, this data is posted by the very companies that need help analyzing it. On the site, there was a competition running that challenged data scientists to create a home price prediction model.

This competition involved data about home characteristics and housing prices in Ames, Iowa. It was a bit further from Manhattan than I had hoped, but it was a good place to start.

My goal was to:

1) Create an effective price prediction model

2) Validate the model’s prediction accuracy

3) Identify the important home price attributes which feed the model’s predictive power.

Data Cleaning and Processing

The first step was to import the data and begin to understand what I was working with. I used the following packages in Python to conduct this analysis:

#Declarations:
import pandas as pd
import numpy as np
import re
from datetime import datetime
from scipy import stats
import random
import sklearn
from sklearn import datasets, linear_model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from math import sqrt
from IPython.core.display import Image, HTML
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
%matplotlib inline

I imported the data into pandas:

#Import my Data
#From Kaggle Competition: House Prices: Advanced Regression Techniques
#This is the "training" data set I downloaded from the Kaggle Competition
#https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
data=pd.read_csv("train.csv", na_filter=False)
data.describe()
data.info()
data.head(5)
features= ["Id", "LotArea", "Utilities", "Neighborhood", "BldgType", "HouseStyle", "OverallQual",
"OverallCond", "YearBuilt", "YearRemodAdd", "RoofStyle", "RoofMatl", "GrLivArea",
"FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "MoSold", "YrSold", "SalePrice"]
numfeat= ["Id", "LotArea", "OverallQual",
"OverallCond", "YearBuilt", "YearRemodAdd", "GrLivArea",
"FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "MoSold", "YrSold", "SalePrice"]
catfeat= ["Id", "Utilities", "Neighborhood", "BldgType", "HouseStyle", "RoofStyle", "RoofMatl"]Rdata=data.loc[:, features]
Rdata.info()
Rdata.head(5)
Rdata.tail(5)
#check if IDs are unique
data.columns
Column headers for the housing data set.

The dataset contained the attributes listed above, with 1460 entries. The data is remarkably clean, with no null values to strip out. This is the benefit of starting with a Kaggle dataset: I can focus on the data analysis and interpretation instead of the cleaning. Using the data types, I was able to identify the columns of numerical and categorical data. The last column, SalePrice, is what we are trying to predict.
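As a quick sketch, the same numeric/categorical split can be read straight from the column dtypes (using the data frame loaded above) instead of listing the columns by hand:

# Sketch: let pandas split the columns by dtype
# (with na_filter=False, columns containing "NA" strings stay as object dtype)
numeric_cols = data.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = data.select_dtypes(include=['object']).columns.tolist()
print(len(numeric_cols), "numeric columns;", len(categorical_cols), "categorical columns")

Before building any models on this data, I wanted to visualize it: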

#Quick check, are there numerical relationships we can spot among the numerical data
plotfeats= ["LotArea", "OverallQual",
    "OverallCond", "YearBuilt", "YearRemodAdd", "GrLivArea",
    "FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "MoSold", "YrSold"]
for f in plotfeats:
    #ax = plt.gca()
    ax=plt.subplots(figsize=(6,3))
    ax=sns.regplot(x=Rdata[f], y=Rdata['SalePrice'])
    plt.show()
sns.heatmap(Rdata.corr())

I plotted each numeric variable versus sales price to visualize the distribution of the data and the relationship between the variable and sales price. Then I looked at the correlation between the variables using a heatmap to see which variables are most correlated with sales price.

Heatmap of the correlation between numerical attributes.

Looking across the row for sales price (the last row), there are clearly some variables which have a stronger relationship with sales price, such as Overall Quality and Living Area. Curiously, Overall Condition and Kitchen Above Ground are negatively correlated with price.
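The same ranking can be pulled out numerically by sorting the correlations against SalePrice (a quick sketch using the Rdata frame and numfeat list defined above):

# Rank the numeric features by their correlation with SalePrice
corr_with_price = Rdata[numfeat].corr()['SalePrice'].drop(['SalePrice', 'Id'])
print(corr_with_price.sort_values(ascending=False))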

Then, I visualized the categorical data using violin plots. These plots are useful because, like box plots, they show the range of the data, but they also include an additional component (the width of the violin) that depicts the frequency of the data:

#Let's see what type of categorical data we have
catdata=Rdata.loc[:, catfeat]
cat_values={}
for n in catfeat[1:len(catfeat)]:
    print(n)
    print(pd.value_counts(catdata[n]))
    ax = plt.subplots(figsize=(7, 2.5))
    plt.xticks(rotation='vertical')
    ax=sns.violinplot(x=n, y="SalePrice", data=Rdata, linewidth=1)
    plt.show()
Utility Type versus Sale Price
Neighborhood versus Sale Price
Building Type versus Sale Price
House Style versus Sale Price
Roof Style versus Sale Price
Roof Material versus Sale Price

Modeling the Data

Now that we know what we are working with, we can begin to construct a model to predict sale price. Since the data is so clean and a few of the variables appear to have a strong correlation to sale price, I wanted to start by implementing a linear regression model. First, I needed to turn my categorical data into dummy columns. This involves creating a column for each category level within the categorical columns and assigning each entry a 1 or 0 depending on whether it belongs to that level. Using dummy variables lets us isolate the contribution of each category level, because we can fit an individual beta value to each one when we fit a linear regression model to the feature set. This also makes the model betas easier to interpret afterwards. Pandas has a prebuilt function, get_dummies, that takes care of the dummification process:

#Now I want to turn my categorical variables into dummy variables before I do any regressions
Rdatadum=Rdata
categories = catfeat[1:len(catfeat)]
for category in categories:
    series = Rdatadum[category]
    dummies = pd.get_dummies(series, prefix=category)
    Rdatadum = pd.concat([Rdatadum, dummies], axis=1)
print(Rdatadum.columns)
Rdatadum.head()
Total list of columns for our data after creating dummy variables.

The total list of columns is shown on the left. For example, you can see that there is an individual column created for each neighborhood. Before we can fit a model, we will have to remove the original category columns such as the Neighborhood column.

removefeats= ["Id", "Utilities", "Neighborhood", "BldgType", "HouseStyle", "RoofStyle", "RoofMatl", 'SalePrice']X = Rdatadum.drop(removefeats, axis=1)
y = Rdatadum['SalePrice']
lr = linear_model.LinearRegression()
lr_model = lr.fit(X, y)
y_pred = lr_model.predict(X)
lr_r2 = r2_score(y, y_pred)
bx=plt.subplots(figsize=(12,5))
bx= sns.barplot(x=0, y=1, data=pd.DataFrame(zip(X.columns, lr_model.coef_)))
plt.xticks(rotation='vertical')
plt.xlabel("Model Coefficient Types")
plt.ylabel("Model Coefficient Values")
plt.show()
print "R squared: ", (lr_r2)
print "Average Coefficients: ", (abs(lr_model.coef_).mean())
print "Root Mean Squared Error: ", sqrt(mean_squared_error(y, y_pred))
ax = sns.regplot(y, y_pred)

I used a Linear Regression model within the SKLearn package to fit the data. As a baseline, I included all of the attributes in a simple un-cross-validated model. This is a plot of the real versus predicted Sale Price.

Sale Price versus predicted Sale Price in the first Linear Regression model: LR1
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(lr_model, X, y, n_jobs=1, cv=5))
0.8191485593927712

The cross-validated score for the baseline model is 81.9%, a pretty good predictor so far. Let's see how we can improve the model. To reduce overfitting, there are two common regularization techniques: Lasso and Ridge (or Tikhonov) regularization. Both techniques shrink the beta values of the regression model, trading off a lower in-sample R-squared for better out-of-sample predictive power. SKLearn has two built-in estimators, LassoCV and RidgeCV, which use cross-validation to pick the regularization strength (alpha) when fitting regularized linear regression models.
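As a rough sketch of what that alpha search amounts to (the grid of alphas below is purely illustrative, and X and y are the feature matrix and target defined above), the idea is simply to try each candidate penalty strength and keep the one with the best cross-validated score:

# Illustrative sketch of the alpha search behind the CV estimators (not the exact internals)
alphas = [0.01, 0.1, 1, 10, 100]
cv_scores = [np.mean(cross_val_score(Ridge(alpha=a), X, y, cv=5)) for a in alphas]
best_alpha = alphas[int(np.argmax(cv_scores))]
print("Best alpha by 5-fold CV:", best_alpha)

RidgeCV and LassoCV automate this search, as in the fits below.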

XR=X
yR=y
rcv = linear_model.RidgeCV(alphas=
    (.001, .001, .01, .1, .5, 1,2,3,4,5,6,7,8,9,10),
    cv=5,
)
rcv_model = rcv.fit(XR, yR)
print("Alpha Value: ", rcv_model.alpha_)
y_predR = rcv_model.predict(XR)
lr_r2 = r2_score(yR, y_predR)
#lr_r2 = rcv.score(XR, yR)
print("R squared: ", lr_r2)
print("Average Coefficients: ", abs(rcv_model.coef_).mean())
print("Root Mean Squared Error: ", sqrt(mean_squared_error(yR, y_predR)))
ax = sns.regplot(x=yR, y=y_predR)
Sale Price versus predicted Sale Price in the first RidgeCV model: RCV1
XL=X
yL=y
lcv = linear_model.LassoCV(alphas=
    (.001, .001, .01, .1, .5, 1,10,25,30,35,40,41,42,43,44,45,46,47,48,49,50,60,100), cv=5)
lcv_model = lcv.fit(XL, yL)
print("Alpha Value: ", lcv_model.alpha_)
y_predL = lcv_model.predict(XL)
lr_r2 = r2_score(yL, y_predL)
print("R squared: ", lr_r2)
print("Average Coefficients: ", abs(lcv_model.coef_).mean())
print("Root Mean Squared Error: ", sqrt(mean_squared_error(yL, y_predL)))
bx=plt.subplots(figsize=(12,5))
bx= sns.barplot(x=0, y=1, data=pd.DataFrame(list(zip(XL.columns, lcv_model.coef_))))
plt.xticks(rotation='vertical')
plt.xlabel("Model Coefficient Types")
plt.ylabel("Model Coefficient Values")
plt.show()
ax = sns.regplot(x=yL, y=y_predL)
plt.ylabel("Predicted Sale Price")
plt.show()
Sale Price versus predicted Sale Price in the first LassoCV model: LCV1
np.mean(cross_val_score(rcv_model, XR, yR, n_jobs=1, cv=5))
np.mean(cross_val_score(lcv_model, XL, yL, n_jobs=1, cv=5))
Ridge CV Model Score: 0.822158080616
Lasso CV Model Score: 0.822819369116

Comparing these two regularized models with the original linear regression model, the cross-validation score improved from 81.9% to ~82.2%. This improvement was achieved by constraining the beta values of the model: the average absolute beta value fell from ~24k in the unregularized model to ~14.5k and ~16.2k in the regularized models. Plotting the beta coefficients for LR1 next to LCV1 illustrates this point clearly:

Model beta values for the first linear regression model: LR1
Model beta values for the first LassoCV model: LCV1

The top bar chart shows the beta values for LR1. The LassoCV regularization method constricts the beta values and tunes some of them all the way down to zero. (Note the scale change: the betas in LCV1 top out around 10,000, while the top chart reaches 20,000.)
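One way to see that shrinkage numerically, rather than reading it off the charts (a quick check using the lr_model and lcv_model objects fit above):

# Quick numerical check of the shrinkage (uses lr_model and lcv_model fit above)
print("Coefficients driven to exactly zero by LassoCV:", int(np.sum(lcv_model.coef_ == 0)))
print("Mean |beta|, plain linear regression:", abs(lr_model.coef_).mean())
print("Mean |beta|, LassoCV:", abs(lcv_model.coef_).mean())

These regularization methods have helped improve the model. To improve it further, I needed to perform some feature selection and feature engineering.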

I tried altering the feature set using three rounds of feature selection and engineering:

1. LotArea is a value that I expected to correlate well with Sale Price: the larger the lot, the pricier the house should be, all else being equal, right? The relationship, however, was not clean:
Lot Area versus Sale Price.

It appeared that the linear fit was being skewed by a few outliers. The data itself did not appear to be linear. I decided to remove all points of LotArea>50000 and use a log transformation on the LotArea variable.
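One quick sanity check on that choice (a sketch using the Rdata frame from above) is to compare the skewness of LotArea before and after the transform:

# Compare skewness of LotArea before and after the log transform (illustrative check)
print("Skew of LotArea:     ", Rdata['LotArea'].skew())
print("Skew of log(LotArea):", np.log(Rdata['LotArea']).skew())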

I also decided to prune out variables that had a low correlation with Sale Price, using the heat map I produced earlier as a guide: Overall Condition, Half Bath, Month Sold, and Year Sold. I also removed the Utilities categorical variable, because all entries but one fell into the "AllPub" utilities category.

#Let's try some feature engineering to choose the right sample to work on
#RO "remove outliers" for Lot Area
X_RO=X[X['LotArea']<50000]
y2 = Rdatadum[Rdatadum['LotArea']<50000]['SalePrice']
#Convert Lot Area to Log of Lot Area
X_RO['LogLotArea']=X_RO['LotArea'].apply(np.log)
#And try to remove features-
#numerical features that have low correlation to sale price
#And categorical features that do not have big differences in sale price across categories OR
#do not have significant distribution of samples in all categories
removefeats=['OverallCond' , 'HalfBath', 'MoSold', 'YrSold', 'Utilities_AllPub',
    'Utilities_NoSeWa', 'LotArea']
X_ROnew=X_RO.drop(removefeats, axis=1)
X_ROnew.head()

2. In addition to everything from (1), I tried dropping the LotArea variable altogether.

X3R=X2R.drop('LogLotArea', axis=1)
y3R=y2R

3. I brought the log of LotArea variable back, but removed a couple of other numerical variables that had low correlation with Sale Price: Year Built, Year Remodel Added, Bedroom Above Ground, and Kitchen Above Ground.

#Let's try removing more features that have low correlation, and bring back Log of Lot Area
removemorefeats=['YearBuilt', 'YearRemodAdd', 'BedroomAbvGr', 'KitchenAbvGr']
X4R=X2.drop(removemorefeats, axis=1)
y4R=y2

After running the models with the feature engineering and selection described above, I obtained the following cross validation scores:

print "LR model after round (1) of feature selection/engineering: " , np.mean(cross_val_score(lr_model2, X2, y2, n_jobs=1, cv=5))
print "RidgeCV model after round (1) of feature selection/engineering: " ,np.mean(cross_val_score(rcv_model2, X2R, y2R, n_jobs=1, cv=5))
print "LassoCV model after round (1) of feature selection/engineering: ", np.mean(cross_val_score(lcv_model2, X2L, y2L, n_jobs=1, cv=5))
print "RidgeCV model after round (2) of feature selection/engineering: ", np.mean(cross_val_score(rcv_model3, X3R, y3R, n_jobs=1, cv=5))
print "RidgeCV model after round (3) of feature selection/engineering: ", np.mean(cross_val_score(rcv_model4, X3R, y3R, n_jobs=1, cv=5))

LR model after round (1) of feature selection/eng: 0.838734879997
RidgeCV model after round (1) of feature selection/eng: 0.837822313779
LassoCV model after round (1) of feature selection/eng: 0.838073016076
RidgeCV model after round (2) of feature selection/eng: 0.834906701028
RidgeCV model after round (3) of feature selection/eng: 0.835601728922

The best cross-validation score obtained was 83.9%, from the plain linear regression model after round (1).

Drawing Conclusions from the Data

We can examine the characteristics of this model:

#Let's examine the best performing model I have so far
vals=[[a[0],a[1]] for a in zip(X2.columns, lr_model2.coef_)]
val=pd.DataFrame(vals, columns=['Feature', 'Beta Value'])
val['Absolute Value']=abs(val['Beta Value'])
val.sort_values('Absolute Value', ascending=False)
Top beta values for our best performing model.

The top beta values for our model are shown on the left. Surprisingly, Roof Material has an outsized impact on the outcome of our model; apparently, these specific house characteristics are crucial to creating an accurate model. Neighborhood also appears to be a very impactful category. When we examine the violin plots of neighborhood versus sale price, it stands out that NoRidge, NridgeHt, and StoneBr have the widest range of prices, with by far the largest maximum prices. That is reflected in those three neighborhoods having the highest beta values compared with the rest of the neighborhoods.

Violin plots of type of Neighborhood versus Sale Price.
Number of sales by Neighborhood.

In terms of the number of sales in each neighborhood, NAmes, CollgCr, and OldTown make up the top three.

Now, the moment of truth: running our model against the test data set. Another great thing about using Kaggle is that it provides a test data set to validate the model against, and it brings a competitive edge to the work, because your score can be compared against others trying to build a strong model of their own. An important step to remember is that all test data needs to be transformed in the same way as the training data, since the model has been tuned to accept data in the same form as the training set. For example, we must take the log of LotArea, since our model has been trained on this input.

test=pd.read_csv("test.csv", na_filter=False)
test.columns
Rtest=test.loc[:, features]
Rtest.columns
Rtestdum=Rtest
categories = catfeat[1:len(catfeat)]
for category in categories:
    series = Rtestdum[category]
    dummies = pd.get_dummies(series, prefix=category)
    Rtestdum = pd.concat([Rtestdum, dummies], axis=1)
#print(Rtestdum.columns)
Rtestdum.head()
# print(Rtestdum['HouseStyle'].value_counts())
# print(Rdatadum['HouseStyle'].value_counts())
# print(Rtestdum['RoofMatl'].value_counts())
# print(Rdatadum['RoofMatl'].value_counts())
removefeats= ["Id", "Utilities", "Neighborhood", "BldgType", "HouseStyle", "RoofStyle", "RoofMatl", 'SalePrice']
Xtest = Rtestdum.drop(removefeats, axis=1)
Xtest['LogLotArea']=Xtest['LotArea'].apply(np.log)
removefeats2=['OverallCond' , 'HalfBath', 'MoSold', 'YrSold', 'Utilities_AllPub', 'Utilities_NA', 'LotArea']
Xtest=Xtest.drop(removefeats2, axis=1)
Xtest['HouseStyle_2.5Fin']=Xtest['OverallQual']*0
Xtest['RoofMatl_Membran']=Xtest['OverallQual']*0
Xtest['RoofMatl_Metal']=Xtest['OverallQual']*0
Xtest['RoofMatl_ClyTile']=Xtest['OverallQual']*0
Xtest['RoofMatl_Roll']=Xtest['OverallQual']*0
Xtest['Utilities_NoSeWa']=Xtest['OverallQual']*0
Xtest=Xtest.reindex(columns=list(X2.columns))
ytest=lcv_model2.predict(Xtest)
sns.distplot(ytest)
yy=pd.DataFrame(ytest)
yy['SalePrice']=yy.pop(0)
yy['Id']=Rtestdum['Id']
yy.set_index('Id', inplace=True)
yy.to_csv('submission3.csv')
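The zero columns added by hand above cover category levels (certain roof materials, house styles, and utilities) that appear in the training data but never in the test set. Assuming X2 is the round-1 training design matrix from above, the same alignment can be sketched in a single reindex call:

# Sketch: align test columns to the training design matrix in one step,
# filling categories unseen in the test data with zeros
Xtest_aligned = Xtest.reindex(columns=X2.columns, fill_value=0)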

The distribution of predictions on the test set looks like this:

It was time to submit this data to the official Kaggle leaderboard:

Conclusion

This was an incredible learning experience, iterating on building a predictive Sale Price model. The feature engineering and selection did improve model performance, though the improvement over the first baseline model was not very substantial.

Examining our final model coefficients, I observed that the categorical features with the largest effect on housing price predictive power are roof style, roof material, and neighborhood. Because the data was not normalized before going into the model, however, it is difficult to compare beta values across numerical features. A good next step will be to try these models again with data normalization to see whether both model performance and interpretability improve. The takeaways from this specific model, however, are confined to the Ames, Iowa housing market. While I may not be able to directly apply this model to other geographies, at least I have a better understanding of the variables I need to gather in order to build a useful model.
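A minimal sketch of that normalization step, assuming the round-1 matrices X2 and y2 from above and an illustrative (untuned) alpha grid:

# Possible next step: standardize the features before the regularized fit
# so the beta values become comparable across features
from sklearn.preprocessing import StandardScaler

scaled_ridge = Pipeline([
    ('scale', StandardScaler()),
    ('ridge', linear_model.RidgeCV(alphas=(0.1, 1.0, 10.0), cv=5)),
])
print("Scaled RidgeCV score:", np.mean(cross_val_score(scaled_ridge, X2, y2, cv=5)))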

The work here is only part 1 of this housing sale price project. The hope for the next steps is to build a model tuned to Manhattan or New York housing prices, so that we can dive further into predicting future housing prices and understanding when, or if, a housing crash looks likely on the horizon. Stay tuned.
