House Price Prediction

Michell Payano Perez · Published in Analytics Vidhya · 6 min read · Oct 11, 2021

Performing a few experiments with tree-based models to predict house prices, using Bayesian optimization for hyperparameter tuning

Photo by Blake Wheeler on Unsplash

The aim of this article is to compare the results of three experiments on the dataset of the Kaggle House Prices challenge. In each experiment we will employ the same six tree-based models but different data reduction methods. The main reason for performing these experiments is to observe how the performance of the regressors changes when we use different feature subsets.

For each experiment, we are going to perform hyperparameter tuning using Bayesian optimization, which, in contrast to Grid Search, does not perform an exhaustive search over all the specified parameter values; instead, it evaluates a fixed number of parameter settings sampled from the specified search space. This method is well known for choosing the next combination of hyperparameters based on information from the previous ones.
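As a minimal illustration of the difference (with made-up ranges, not the search spaces used later in the article), a grid search enumerates every value you list, while BayesSearchCV samples n_iter settings from the declared dimensions, guided by the results of the previous ones:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from skopt import BayesSearchCV
from skopt.space import Integer, Real

# Grid search: every listed combination is evaluated (3 x 3 = 9 fits per CV split)
grid = GridSearchCV(DecisionTreeRegressor(),
                    param_grid={"max_depth": [10, 100, 800],
                                "max_features": [0.5, 0.75, 1.0]},
                    cv=3)

# Bayesian search: 20 settings sampled from the ranges, each new one
# chosen using the scores obtained by the previous ones
bayes = BayesSearchCV(DecisionTreeRegressor(),
                      search_spaces={"max_depth": Integer(10, 800),
                                     "max_features": Real(0.5, 1.0)},
                      n_iter=20, cv=3)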

For the sake of simplicity, we will only see some extracts of the code step by step before moving on to the plots and tables with the results of each experiment. However, the full code can be found on GitHub.

1. Step one: After importing the corresponding libraries, the missing values were handled by either replacing or dropping them.

# According to the dataset description, NaN in these columns means the feature is absent,
# so replace it with the string 'None'.
with_NA_type = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                'BsmtFinType2', 'FireplaceQu', 'GarageType', 'GarageFinish',
                'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
for i in with_NA_type:
    train.loc[:, i] = train.loc[:, i].fillna('None')
# Replace NaN in LotFrontage with the median of the corresponding neighborhood
train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))
# Since there are only a few NaN in these columns, we can drop those rows
train = train.dropna(subset=["Electrical", "MasVnrArea", "MasVnrType"])
# Houses without a garage: fill the garage area and year built with 0
train.loc[:, ['GarageArea', 'GarageYrBlt']] = train[['GarageArea', 'GarageYrBlt']].fillna(0).values
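As a quick sanity check, not shown in the original extract, we can verify that no missing values remain after these steps:

# Count remaining missing values per column; this should print an empty Series
remaining = train.isna().sum()
print(remaining[remaining > 0])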

2. Step two: After taking care of the missing values, the dataset is split into a training and a test set. Then, the ColumnTransformer class from scikit-learn is used to apply transformers (which depend on the experiment) to the categorical and numerical columns.

train_x = train.iloc[:, :-1]
train_y = pd.DataFrame(train['SalePrice'])
numerical_ix = train_x.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = train_x.select_dtypes(include=['object', 'bool']).columns
X_train, X_test, y_train, y_test = train_test_split(train_x, train_y, test_size=0.2, random_state=0)

# Transformers used in the first experiment:
t1 = [('cat', OneHotEncoder(handle_unknown="ignore"), categorical_ix),
      ('num', StandardScaler(), numerical_ix)]
col_transform1 = ColumnTransformer(transformers=t1)
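As a quick check, which is not part of the original extract, fitting this transformer on the training set shows how many columns the one-hot encoding and scaling produce:

# Inspect the dimensionality after one-hot encoding and scaling
X_train_prep = col_transform1.fit_transform(X_train)
print(X_train_prep.shape)  # (n_samples, n_encoded_features)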

3. Step three: Use BayesSearchCV from scikit-optimize to tune the hyperparameters of each model. Before that, you will see how I define the function “select_model”, which returns the pipeline that is going to be used as the estimator of BayesSearchCV, as in the example below:

# The following function, "select_model", returns the pipeline used in each estimation.
def select_model(model_name, col_transform, selector=False, dim_red=False):
    '''
    model_name: the regressor to use
    col_transform: the ColumnTransformer to apply
    selector: if True, introduce SelectFromModel into the pipeline
    dim_red: if True, introduce TruncatedSVD into the pipeline
    '''
    if selector and dim_red:
        param_list = [('prep', col_transform),
                      ('select', SelectFromModel(model_name, max_features=1, threshold=-np.inf)),
                      ('reduct', TruncatedSVD()),
                      ('model', model_name)]
    elif selector:
        param_list = [('prep', col_transform),
                      ('select', SelectFromModel(model_name, max_features=1, threshold=-np.inf)),
                      ('model', model_name)]
    elif dim_red:
        param_list = [('prep', col_transform),
                      ('reduct', TruncatedSVD()),
                      ('model', model_name)]
    else:
        param_list = [('prep', col_transform),
                      ('model', model_name)]

    pipe = Pipeline(steps=param_list)
    return pipe

# Hyperparameter tuning with BayesSearchCV:
params = {"model__max_depth": Integer(10, 800),
          "model__max_features": Real(0.5, 1),
          "reduct__n_components": Integer(2, 40)}
result_dt = BayesSearchCV(estimator=select_model(DecisionTreeRegressor(), col_transform1, dim_red=True),
                          search_spaces=params, cv=3, n_iter=300,
                          scoring='neg_root_mean_squared_error', iid=False,
                          return_train_score=True)
result_dt.fit(X_train, y_train.values.ravel())
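Once the search is fitted, the BayesSearchCV object exposes the usual scikit-learn attributes, which is how the best hyperparameters and scores reported later are obtained; a small illustrative snippet:

# Inspect the outcome of the Bayesian search
print(result_dt.best_params_)     # best hyperparameter combination found
print(-result_dt.best_score_)     # best cross-validated RMSE (scores are negative RMSE)
print(result_dt.best_estimator_)  # the refitted pipeline with those hyperparameters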

Now let’s move on to explain each experiment and its results in detail.

First experiment

The first experiment consisted of applying a linear dimensionality reduction method. The aim of dimension reduction is to reduce the number of input features in order to improve the performance of the models.

Since our dataset contains many categorical variables that are going to be one-hot encoded, we need a dimension reduction technique suitable for sparse data, so we will use truncated singular value decomposition (SVD). Just as we tune the parameters of the regressors, we will also search for the best number of components.
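To see why TruncatedSVD fits this setting, here is a minimal, self-contained sketch (with a made-up toy column, not the article's data) showing that it can be applied directly to the sparse output of OneHotEncoder, whereas PCA would require densifying the matrix first:

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

demo = pd.DataFrame({"Neighborhood": ["NAmes", "CollgCr", "OldTown", "NAmes"]})
sparse_X = OneHotEncoder().fit_transform(demo)   # scipy sparse matrix
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(sparse_X)            # dense array with 2 components
print(reduced.shape)                             # (4, 2)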

The number of parameter settings sampled (the number of iterations) for this and the following experiments is 300. The RMSE for each iteration is shown below: you can see the RMSE obtained for each iteration, i.e. each combination of hyperparameters, on both the validation and the training set, and the red vertical line indicates the iteration with the best set of hyperparameters.

RMSE for each iteration (Figure created by the author)
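One way such a per-iteration plot can be produced from a fitted search object, here result_dt, is by reading its cv_results_; this is only a sketch and not necessarily the author's exact plotting code:

import matplotlib.pyplot as plt
import numpy as np

# Per-iteration RMSE from a fitted BayesSearchCV object
cv = result_dt.cv_results_
val_rmse = -np.array(cv["mean_test_score"])     # scores are negative RMSE
train_rmse = -np.array(cv["mean_train_score"])  # available because return_train_score=True

plt.plot(val_rmse, label="validation")
plt.plot(train_rmse, label="training")
plt.axvline(result_dt.best_index_, color="red", label="best iteration")
plt.xlabel("Iteration")
plt.ylabel("RMSE")
plt.legend()
plt.show()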

Looking at the table below, we see that the model with the lowest RMSE on the test set is the Extra Trees Regressor (EXT); however, comparing the RMSE on the training, validation and test sets, this model suffers from high variance. We could say that Adaptive Boosting (ADA) does not perform as well as EXT, but it is the one that suffers the least from overfitting.

cv_result = [result_dt, result_rf, result_extra, result_gbr, result_ada, result_xgb]
models = ["DT", "RF", "EXT", "GBR", "ADA", "XGB"]
results_1 = pd.DataFrame()
for i, j in zip(cv_result, models):
    results_1.loc[j, 'RMSE train'] = mean_squared_error(y_train, i.predict(X_train), squared=False)
    results_1.loc[j, 'RMSE Val'] = (i.best_score_) * (-1)
    results_1.loc[j, 'RMSE test'] = mean_squared_error(y_test, i.predict(X_test), squared=False)
# Difference between the RMSE of the training set and the test set, to see how far apart they are
results_1['Dif'] = results_1['RMSE train'] - results_1['RMSE test']
results_1['Best params'] = 0
results_1['Best params'] = results_1['Best params'].astype(object)
# Best hyperparameters selected by the Bayesian search
for i, j in zip(cv_result, models):
    results_1.at[j, 'Best params'] = list(i.best_params_.items())
results_1
Table with the results of the first experiment

Second Experiment

For the second experiment, instead of using dimension reduction, we will select the most important features from the entire dataset, based on the feature_importances_ attribute of each model. To do so, we will use the method available in scikit-learn: SelectFromModel. For more details about this technique and others, please refer to the article Feature Selection: How should we perform it?
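Within the select_model function from step three, this corresponds to passing selector=True, so that a SelectFromModel step is added to the pipeline, and tuning how many features it keeps. A sketch of what the search could look like for the decision tree (the ranges and the variable name result_dt_fs are illustrative, not taken from the article):

# Tune the regressor together with the number of features kept by
# SelectFromModel (the "select" step in the pipeline)
params_fs = {"model__max_depth": Integer(10, 800),
             "model__max_features": Real(0.5, 1),
             "select__max_features": Integer(5, 60)}   # how many top features to keep

result_dt_fs = BayesSearchCV(estimator=select_model(DecisionTreeRegressor(), col_transform1, selector=True),
                             search_spaces=params_fs, cv=3, n_iter=300,
                             scoring='neg_root_mean_squared_error',
                             return_train_score=True)
result_dt_fs.fit(X_train, y_train.values.ravel())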

RMSE for each iteration (Figure created by the author)

In this case, the models that performed best are Gradient Boosting (GBR) and Extreme Gradient Boosting (XGB). It is also interesting to note that ADA is underfitting, since the RMSE on the training set is greater than the RMSE on the test set, although this is not the case for the validation set.

cv_result = [result_dt, result_rf, result_extra, result_gbr, result_ada, result_xgb]
models = ["DT", "RF", "EXT", "GBR", "ADA", "XGB"]
results_2 = pd.DataFrame()
for i, j in zip(cv_result, models):
    results_2.loc[j, 'RMSE train'] = mean_squared_error(y_train, i.predict(X_train), squared=False)
    results_2.loc[j, 'RMSE Val'] = (i.best_score_) * (-1)
    results_2.loc[j, 'RMSE test'] = mean_squared_error(y_test, i.predict(X_test), squared=False)
# Difference between the RMSE of the training set and the test set, to see how far apart they are
results_2['Dif'] = results_2['RMSE train'] - results_2['RMSE test']
results_2['Best params'] = 0
results_2['Best params'] = results_2['Best params'].astype(object)
# Best hyperparameters selected by the Bayesian search
for i, j in zip(cv_result, models):
    results_2.at[j, 'Best params'] = list(i.best_params_.items())
results_2
Table with the results of the second experiment

Third Experiment

Last but not least, the third experiment consists of applying feature selection followed by dimension reduction. In other words, we will combine the methods used in the second and first experiments, respectively, in order to assess their impact together.

It is important to note that, since feature selection is performed before dimension reduction in this experiment, the number of selected features must be greater than the number of SVD components. Whenever this does not hold, the fit fails and the RMSE is recorded as 100,000, which is achieved by assigning a value to the error_score argument of BayesSearchCV, as sketched below.
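A sketch of how this could look for the third experiment (the ranges and the variable names are illustrative, and I am assuming the error score is passed as a negative value so that, once the negative-RMSE scores are negated, failed combinations show up as an RMSE of 100,000):

# Combined search space: feature selection ("select") followed by TruncatedSVD ("reduct").
# When n_components exceeds the number of selected features the fit fails, and
# error_score records that combination instead of raising an exception.
params_3 = {"model__max_depth": Integer(10, 800),
            "model__max_features": Real(0.5, 1),
            "select__max_features": Integer(5, 60),
            "reduct__n_components": Integer(2, 40)}
result_dt_3 = BayesSearchCV(estimator=select_model(DecisionTreeRegressor(), col_transform1,
                                                   selector=True, dim_red=True),
                            search_spaces=params_3, cv=3, n_iter=300,
                            scoring='neg_root_mean_squared_error',
                            error_score=-100000,   # shows up as RMSE = 100,000 after negation
                            return_train_score=True)
result_dt_3.fit(X_train, y_train.values.ravel())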

RMSE for each iteration (Figure created by the author)

As a result of this experiment, the models that overfitted the least are ADA and XGB, even though GBR had the best performance among all models. However, this performance came at the cost of higher variance.

cv_result = [result_dt, result_rf, result_extra, result_gbr, result_ada, result_xgb]
models = ["DT", "RF", "EXT", "GBR", "ADA", "XGB"]
results_3 = pd.DataFrame()
for i, j in zip(cv_result, models):
    results_3.loc[j, 'RMSE train'] = mean_squared_error(y_train, i.predict(X_train), squared=False)
    results_3.loc[j, 'RMSE Val'] = (i.best_score_) * (-1)
    results_3.loc[j, 'RMSE test'] = mean_squared_error(y_test, i.predict(X_test), squared=False)
# Difference between the RMSE of the training set and the test set, to see how far apart they are
results_3['Dif'] = results_3['RMSE train'] - results_3['RMSE test']
results_3['Best params'] = 0
results_3['Best params'] = results_3['Best params'].astype(object)
# Best hyperparameters selected by the Bayesian search
for i, j in zip(cv_result, models):
    results_3.at[j, 'Best params'] = list(i.best_params_.items())
results_3
Table with the results of the third experiment

Comparing all these models, we could say that Extreme Gradient Boosting from the third experiment performed the best, combining one of the lowest test RMSEs with relatively little overfitting. I hope this article inspired you to explore further methods and techniques for your projects; remember that the entire code is available here for reference.
