Predicting Housing Prices using Cross Validation and Grid Search in Regression Models

Yohan Jeong
11 min read · May 5, 2020

Analysis and prediction for the housing market prices using Cross Validation and Grid Search in several regression models


In this article, I analyze the factors related to housing prices in Melbourne and predict housing prices using several machine learning techniques: Linear Regression, Ridge Regression, K-Nearest Neighbors (hereafter, KNN), and Decision Tree. Using Cross Validation and Grid Search, I find the optimal values of the hyperparameters in each model and compare the results to find the best machine learning model for predicting housing prices in Melbourne.

The entire code for this project and the data are here.

The Data Set

The data for this analysis is the Melbourne Housing Market dataset from Kaggle. It has 34,857 rows and 21 columns. The columns are as follows:

import pandas as pd

df = pd.read_csv('...\Melbourne_housing_FULL.csv')
df.columns.to_list()
['Suburb','Address','Rooms','Type','Method','SellerG','Date','Distance','Postcode','Bedroom2','Bathroom','Car','Landsize','BuildingArea','YearBuilt','CouncilArea','Latitude','Longitude','Regionname','Propertycount','Price']

Data Pre-Processing

This section briefly explains how missing values and outliers are dealt with.

Missing Values

Using the missingno library in Python, we can check for missing values in the data set.

import missingno as msno
msno.bar(df)
  • According to the correlation coefficients, Rooms is a good proxy for Bathroom and Car. After creating a categorical feature indicating whether the level of Rooms is high, medium, or low for each house, the medians of Bathroom and Car are calculated within each group of houses sharing the same Rooms level. The calculated medians are then used to fill the missing values in Bathroom and Car (a code sketch follows this list).
  • The rows having missing values in Price are dropped.
  • The rest of the features having missing values are dropped.
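
The sketch below illustrates the imputation and target-row cleanup described above. The Rooms bin edges (2 or fewer rooms as low, 3 to 4 as medium, 5 or more as high) and the helper column name RoomsLevel are my own assumptions for illustration, not values taken from the original code.

import pandas as pd

# Hypothetical Rooms levels: <=2 rooms = low, 3-4 = medium, 5+ = high (assumed bins)
df['RoomsLevel'] = pd.cut(df['Rooms'], bins=[0, 2, 4, float('inf')],
                          labels=['low', 'medium', 'high'])

# Fill missing Bathroom and Car values with the median of houses in the same Rooms level
for col in ['Bathroom', 'Car']:
    df[col] = df[col].fillna(df.groupby('RoomsLevel')[col].transform('median'))

# Drop rows with a missing target and remove the helper column;
# the remaining features that still contain missing values are then dropped, as described above
df = df.dropna(subset=['Price'])
df = df.drop(columns=['RoomsLevel'])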

Outliers

For outliers, I employ the Interquartile Range (IQR) method. This technique flags data points that fall more than 1.5 times the interquartile range above the third quartile (Q3) or below the first quartile (Q1), and drops those entries from the analysis.

The Q1 and Q3 for each numerical feature and the target Price are as follows:

  • Price — $635,000 and $1,295,000
  • Rooms — 2 rooms and 4 rooms
  • Distance — 6.4 kilometers and 14 kilometers
  • Bathroom — 1 and 2 bathrooms
  • Car — 1 and 2 spots

For Price, the upper whisker (Q3 + 1.5*IQR) is $2,285,000 and the lower whisker (Q1 - 1.5*IQR) is -$355,000. The distribution of Price after removing outliers is as follows:

import seaborn as sns
import matplotlib.pyplot as plt
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3-Q1
Lower_Whisker = Q1 - 1.5*IQR
Upper_Whisker = Q3 + 1.5*IQR
df = df[(df['Price']>Lower_Whisker)&(df['Price']<Upper_Whisker)]
plt.figure(figsize=(10,5))
sns.distplot(df['Price'],hist=True, kde=False, color='blue')
plt.ylabel('Counts')
The Distribution of Price After Removing Outliers

Exploratory Data Analysis (EDA)

Numerical Features

We have four numerical features in our analysis: Rooms, Bathroom, Car, and Distance. The easiest way to see the relationships between the numerical features and the target at a glance is to draw scatter plots. Because many data points overlap, I use hexbin plots: a hexbin plot splits the plotting area into hexagonal bins, and the color of each hexbin denotes the number of data points it contains. The darker a hexbin, the more data points fall inside it.

Let’s see the hexbin scatter plots to show the relationships of the numerical features with Price.

import matplotlib.image as mpimg

JG1 = sns.jointplot('Rooms', 'Price', data=df, kind='hex', color='g')
JG2 = sns.jointplot('Bathroom', 'Price', data=df, kind='hex', color='b')
JG3 = sns.jointplot('Car', 'Price', data=df, kind='hex', color='r')
JG4 = sns.jointplot('Distance', 'Price', data=df, kind='hex', color='orange')
JG1.savefig('JG1.png')
plt.close(JG1.fig)
JG2.savefig('JG2.png')
plt.close(JG2.fig)
JG3.savefig('JG3.png')
plt.close(JG3.fig)
JG4.savefig('JG4.png')
plt.close(JG4.fig)
f, ax = plt.subplots(2,2,figsize=(20,16))
ax[0,0].imshow(mpimg.imread('JG1.png'))
ax[0,1].imshow(mpimg.imread('JG2.png'))
ax[1,0].imshow(mpimg.imread('JG3.png'))
ax[1,1].imshow(mpimg.imread('JG4.png'))
[ax.set_axis_off() for ax in ax.ravel()]
plt.tight_layout()

The plots above clearly show the positive relationships of Rooms, Bathroom, and Car with Price, and the negative relationship of Distance with Price.

Categorical Features

The categorical features are Regionname and Type.

Regionname

Regionname has seven unique values: Northern Metropolitan, Southern Metropolitan, Western Metropolitan, Eastern Metropolitan, South-Eastern Metropolitan, Northern Victoria, and Eastern Victoria. These names are the Electoral Regions of Victoria, which divide the state into eight regions; our data contains no houses located in Western Victoria.

Let’s see the box plot between Regionname and Price.

plt.figure(figsize=(12,6))
sns.boxplot('Regionname', 'Price', data=df, width=0.3, palette="Set2")
plt.xticks(rotation=45)
df['Regionname'].value_counts()

In order to use this feature in the analysis, I create dummies for this feature and merge them into the data set.

regionname = pd.get_dummies(df['Regionname'],drop_first=True)
df = pd.merge(df, regionname, left_index=True, right_index=True)
df.drop('Regionname', axis=1, inplace=True)

Type

Our data set divides the housing types into three categories:

  • h — house, cottage, villa, semi, and terrace
  • u — unit and duplex
  • t — townhouse

plt.figure(figsize=(10,5))
sns.boxplot('Type', 'Price', data=df, width=0.3, palette="Set2")
df['Type'].value_counts()

About 60% of the observations are of type h, and about 25% are of type u. In the box plot, type h shows the largest variance, and both the most expensive and the cheapest houses belong to type h.

Let’s create dummies for Type as well and combine them into the data set.

house_type = pd.get_dummies(df['Type'], drop_first=True)
df = pd.merge(df,house_type, left_index=True, right_index=True)
df.drop('Type', axis=1, inplace=True)

Predictive Modeling

The regression models used in this analysis are Linear Regression, Ridge Regression, K-Nearest Neighbors, and Decision Tree.

Basic Predictions

To test the prediction performance of each regression model, I first split the data into a training set and a testing set. I fit the models on the training set and then predict housing prices for the testing set. The testing set is 30% of the entire data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_validate

X = df.drop('Price', axis=1)
y = df['Price']
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

In order to measure the performance of predictions for each model, I use two performance metrics: R² (Coefficient of Determination) and MSE (Mean Squared Error).

  • R²

The first metric is the coefficient of determination, usually written as R². The coefficient of determination is the ratio of the variance of the target explained or predicted by a model to the total variance of the target. It typically ranges from 0 to 1, and the closer the value is to 1, the better the model explains or predicts the variance of the target.

  • MSE

The second metric is the Mean Squared Error (MSE), the average of the squared differences between the estimated or predicted values and the actual values of the target. It is always non-negative, and a lower MSE indicates more accurate predictions. In this analysis, I use the square root of this metric (RMSE).
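
To make the two metrics concrete, here is a small sketch that computes R² and RMSE by hand with NumPy and checks them against scikit-learn. The toy price arrays are made up purely for illustration.

import numpy as np
from sklearn import metrics

# Toy actual and predicted prices, made up purely for illustration
y_true = np.array([500000, 750000, 1200000, 900000])
y_pred = np.array([520000, 700000, 1150000, 950000])

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE = square root of the mean of the squared errors
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2_manual, metrics.r2_score(y_true, y_pred))                       # identical values
print(rmse_manual, np.sqrt(metrics.mean_squared_error(y_true, y_pred)))  # identical values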

For convenience, let's create a function that calculates R² and RMSE and compares the distributions of the actual and predicted values for each model, along with a helper function that draws a bar chart over the features.

def Predictive_Model(estimator):
    estimator.fit(train_X, train_y)
    prediction = estimator.predict(test_X)
    print('R_squared:', metrics.r2_score(test_y, prediction))
    print('Square Root of MSE:', np.sqrt(metrics.mean_squared_error(test_y, prediction)))
    plt.figure(figsize=(10,5))
    sns.distplot(test_y, hist=True, kde=False)
    sns.distplot(prediction, hist=True, kde=False)
    plt.legend(labels=['Actual Values of Price', 'Predicted Values of Price'])
    plt.xlim(0,)

def FeatureBar(model_Features, Title, yLabel):
    plt.figure(figsize=(10,5))
    plt.bar(df.columns[df.columns!='Price'].values, model_Features)
    plt.xticks(rotation=45)
    plt.title(Title)
    plt.ylabel(yLabel)

Linear Regression

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
Predictive_Model(lr)

The R² of the linear regression is 0.6146, which means that about 61% of the variance of the housing prices in the testing set can be predicted by the model. The RMSE is 264,465, which means that, roughly speaking, a typical prediction for the testing set misses the actual price by about $264,465.

Ridge Regression

from sklearn.linear_model import Ridge
rr = Ridge(alpha=100)
Predictive_Model(rr)

The result above is obtained with the regularization parameter(alpha) equal to 100. The R² of this model is 0.6133, and the RMSE is 264,920.

K-Nearest Neighbors(KNN)

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5)
Predictive_Model(knn)

The result above is obtained with the number of the neighbors equal to 5. The R² of this model is 0.7053, and the RMSE is 231,250.

Decision Tree

from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(max_depth=15, random_state=0)
Predictive_Model(dt)

In the decision tree model, the max depth is one of the hyperparameters that controls over-fitting. The deeper the tree, the more branches it has and the larger it grows. With more branches, predictions on the training set become more accurate, but the variance of the predictions on the testing set increases. Therefore, setting the max depth appropriately is important to avoid over-fitting. In the example above, max depth is set to 15. The R² of this model is 0.6920, and the RMSE is 236,424.
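
The over-fitting pattern is easy to see by comparing training and testing R² at a few depths, using the split defined earlier. The depth values below are arbitrary and chosen only for illustration.

# Compare training vs. testing R^2 for a few (arbitrary) tree depths
for depth in [5, 10, 15, 20, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(train_X, train_y)
    print(depth,
          round(tree.score(train_X, train_y), 3),  # training R^2 keeps increasing with depth
          round(tree.score(test_X, test_y), 3))    # testing R^2 peaks and then declines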

Performance Summary

regressor = ['Linear Regression', 'Ridge Regression', 'KNN', 'Decision Tree']
models = [LinearRegression(), Ridge(alpha=100), KNeighborsRegressor(n_neighbors=5), DecisionTreeRegressor(max_depth=15, random_state=0)]
R_squared = []
RMSE = []
for m in models:
    m.fit(train_X, train_y)
    prediction_m = m.predict(test_X)
    r2 = metrics.r2_score(test_y, prediction_m)
    rmse = np.sqrt(metrics.mean_squared_error(test_y, prediction_m))
    R_squared.append(r2)
    RMSE.append(rmse)
basic_result = pd.DataFrame({'R squared':R_squared,'RMSE':RMSE}, index=regressor)
basic_result

In the table above, KNN seems to be the best model for predicting housing prices in Melbourne. However, it is too early to draw a conclusion, since there is more to consider. First, we used only one specific split into training and testing sets; second, for each model we chose only one specific value for each hyperparameter. To get a result that is robust to these issues, we need to go through the cross validation and grid search process as well.

Cross Validation and Grid Search

Cross Validation (CV) is a re-sampling procedure used when the amount of data is limited. It randomly splits the entire data into K folds, fits a model on K-1 folds, validates the model on the remaining fold, and evaluates the performance with a metric. CV repeats this process until each of the K folds has been used as the validation set. The average of the K scores is the final performance score for the model.
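
Conceptually, this is what a K-fold CV loop does under the hood. The sketch below uses K = 5 and a shuffled split purely for illustration; the analysis below relies on sklearn's cross_validate with cv=10.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# A bare-bones K-fold CV loop (K = 5 here just for illustration)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression()
    model.fit(X.iloc[train_idx], y.iloc[train_idx])        # fit on K-1 folds
    pred = model.predict(X.iloc[val_idx])                  # validate on the held-out fold
    scores.append(metrics.r2_score(y.iloc[val_idx], pred))

print(np.mean(scores))  # the average of the K scores is the CV performance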

Grid search is the process of tuning hyperparameters to find their optimal values for a model. Prediction results can vary with the specific parameter values, so the grid search technique evaluates every candidate value in a predefined grid to find the one that gives the best predictions for the model.
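
In essence, a grid search is just a loop over candidate parameter values that scores each candidate with CV and keeps the best one. The sketch below uses an example grid of alphas for Ridge; GridSearchCV, used in the following sections, automates exactly this loop.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

# Score each candidate alpha with 10-fold CV and keep the best one (example grid)
candidates = [0.01, 0.1, 1, 10, 100]
cv_scores = {a: cross_val_score(Ridge(alpha=a), X, y, cv=10, scoring='r2').mean()
             for a in candidates}

best_alpha = max(cv_scores, key=cv_scores.get)
print(best_alpha, cv_scores[best_alpha])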

Linear Regression

scoring = {'R_squared':'r2', 'Square Root of MSE':'neg_mean_squared_error'}

def CrossVal(estimator):
    scores = cross_validate(estimator, X, y, cv=10, scoring=scoring)
    r2 = scores['test_R_squared'].mean()
    mse = abs(scores['test_Square Root of MSE'].mean())
    print('R_squared:', r2)
    print('Square Root of MSE:', np.sqrt(mse))

CrossVal(LinearRegression())

R_squared: 0.5918115585795747
Square Root of MSE: 269131.0885647736

Since linear regression has no hyperparameter in our analysis, only CV is performed here. The number of folds in the CV is set to 10. The average R² is 0.5918 and the RMSE is 269131.

Ridge Regression

The regularization parameter of the ridge regression is called alpha in sklearn. Since GridSearchCV in sklearn already performs cross validation internally, a separate cross_validate step is omitted. The grid for alpha is [0.01, 0.1, 1, 10, 100, 1000, 10000].

from sklearn.model_selection import GridSearchCV

def GridSearch(estimator, Features, Target, param_grid):
    for key, value in scoring.items():
        grid = GridSearchCV(estimator, param_grid, cv=10, scoring=value)
        grid.fit(Features, Target)
        print(key)
        print('The Best Parameter:', grid.best_params_)
        if grid.best_score_ > 0:
            print('The Score:', grid.best_score_)
        else:
            print('The Score:', np.sqrt(abs(grid.best_score_)))
        print()

param_grid = {'alpha':[0.01, 0.1, 1, 10, 100, 1000, 10000]}
GridSearch(Ridge(), X, y, param_grid)

R_squared
The Best Parameter: {'alpha': 10}
The Score: 0.5918404945235951

Square Root of MSE
The Best Parameter: {'alpha': 10}
The Score: 269125.20208461734

The result shows that the best value of alpha is 10. Under alpha = 10, the R² is 0.5918 and the RMSE is 269125.

K-Nearest Neighbors

The KNN hyperparameter we tune in this analysis is the number of nearest neighbors (n_neighbors). The grid covers the integers from 5 to 25.

param_grid = dict(n_neighbors=np.arange(5,26))
GridSearch(KNeighborsRegressor(), X, y, param_grid)

R_squared
The Best Parameter: {'n_neighbors': 16}
The Score: 0.6973921821195777

Square Root of MSE
The Best Parameter: {'n_neighbors': 16}
The Score: 232900.0204190322

The optimal value of n_neighbors is 16. The R² is 0.6974 and the RMSE is 232900. We can see why 16 is the optimal value for n_neighbors in our analysis by looking at the validation curve.

from sklearn.model_selection import validation_curve

def ValidationCurve(estimator, Features, Target, param_name, Name_of_HyperParameter, param_range):
    train_score, test_score = validation_curve(estimator, Features, Target,
                                               param_name=param_name, param_range=param_range,
                                               cv=10, scoring='r2')
    Rsquared_train = train_score.mean(axis=1)
    Rsquared_test = test_score.mean(axis=1)

    plt.figure(figsize=(10,5))
    plt.plot(param_range, Rsquared_train, color='r', linestyle='-', marker='o', label='Training Set')
    plt.plot(param_range, Rsquared_test, color='b', linestyle='-', marker='x', label='Testing Set')
    plt.legend(labels=['Training Set', 'Testing Set'])
    plt.xlabel(Name_of_HyperParameter)
    plt.ylabel('R_squared')

ValidationCurve(KNeighborsRegressor(), X, y, 'n_neighbors', 'K-Neighbors', np.arange(5,26))

Decision Tree

The decision tree model has several hyperparameters that could be tuned. In our analysis, only max_depth is tuned, and the range of max_depth to be checked is the integers from 2 to 14.

param_grid = dict(max_depth=np.arange(2,15))
GridSearch(DecisionTreeRegressor(), X, y, param_grid)

R_squared
The Best Parameter: {'max_depth': 9}
The Score: 0.6844562874572124

Square Root of MSE
The Best Parameter: {'max_depth': 9}
The Score: 237708.76352194021

The result indicates that the optimal value for the max_depth is 9. The R² is 0.6845 and the RMSE is 237708 under max_depth=9. This is confirmed in the validation curve as well.

ValidationCurve(DecisionTreeRegressor(), X, y, 'max_depth', 'Maximum Depth', np.arange(4,15))

Cross Validation Summary

The table and the graphs below show the R² score for each round of testing in CV. Since cv is set to 10, we have 10 rounds of testing.

lr_scores = cross_validate(LinearRegression(), X, y, cv=10, scoring='r2')
rr_scores = cross_validate(Ridge(alpha=10), X, y, cv=10, scoring='r2')
knn_scores = cross_validate(KNeighborsRegressor(n_neighbors=16), X, y, cv=10, scoring='r2')
dt_scores = cross_validate(DecisionTreeRegressor(max_depth=9, random_state=0), X, y, cv=10, scoring='r2')
lr_test_score = lr_scores.get('test_score')
rr_test_score = rr_scores.get('test_score')
knn_test_score = knn_scores.get('test_score')
dt_test_score = dt_scores.get('test_score')
box= pd.DataFrame({'Linear Regression':lr_test_score, 'Ridge Regression':rr_test_score, 'K-Nearest Neighbors':knn_test_score, 'Decision Tree':dt_test_score})
box.index = box.index + 1
box.loc['Mean'] = box.mean()
box

According to the table, the best machine learning model in our analysis is KNN, since its mean score across the rounds is the highest.

f, ax = plt.subplots(1,2, figsize=(12,5))
sns.boxplot(data=box.drop(box.tail(1).index), width=0.3, palette="Set2", ax=ax[0])
ax[0].set_ylabel('R squared')
sns.lineplot(data=box.drop(box.tail(1).index), palette="Set2", ax=ax[1])
ax[1].set_xticks(np.arange(1,11,1))
ax[1].set_xlabel('K-th Fold')

The box and line plots above show the distributions and the changes of the scores for each model. The decision tree model as well as KNN shows a good performance in our analysis. Linear and ridge regressions do not show a significant difference in their performance.

