Tama Handika
13 min read · Nov 3, 2017

Practicing Regression Techniques on House Prices Dataset-Part 2

This post is a continuation of my earlier post, which you can find here.

We explored parametric algorithms in the last post. Let's continue with non-parametric algorithms.

Training Models : SVM with RBF kernel

You can check the complete code in training/TrainSVM.py

SVM (Support Vector Machine) works for both classification and regression problems. The general objective of SVM is to draw a hyperplane (we can imagine it more or less as a "line of best fit") that gives the largest margin to the training observations (the closest ones are called "support vectors").

A larger margin means the model will generalize better on unseen data: it will have higher bias and lower variance. We can control the margin via the C parameter (also known as the soft margin). Increasing the value of C will increase the variance and lower the bias of the model.

One advantage of using SVM is its flexibility, thanks to the kernel trick. We can think of a kernel as a function that computes the similarity between a pair of samples. We will use the non-linear RBF kernel in this post.

One of the most important parameters for an RBF SVM is gamma. This parameter controls how far the influence of a single training sample reaches. A large gamma leads to a tighter decision boundary around individual samples, hence higher variance (and lower bias).

The default value of C is 1, so I tried some values around 1 at first. However, GridSearchCV kept choosing the largest C in the input list, so I kept increasing the value of C to improve the predictive performance of the SVM.

We will use the 72 features chosen by Lasso. Let's try to optimize C and gamma. These are the final parameter candidates after increasing the value of C on each iteration:

parameters = {
    'C': [125000, 150000, 175000],
    'kernel': ['rbf'],
    'gamma': ['auto', 0.01, 0.1]
}
grid_obj = GridSearchCV(model, parameters, cv=10, scoring='r2')
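For reference, here is a self-contained sketch of how the whole search might be wired up (X_train and y_train are assumed names for the Lasso-filtered training data and the target; the repo may organize this differently):

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

model = SVR()  # the base estimator passed to GridSearchCV
grid_obj = GridSearchCV(model, parameters, cv=10, scoring='r2')
grid_obj.fit(X_train, y_train)

print(grid_obj.best_estimator_)  # the SVR printed below
print(grid_obj.best_score_)      # best cross-validated R^2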

These are the best parameters chosen by GridSearchCV:

SVR(C=150000, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.01,
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

And this is the result from the model above :

Best GridSearchCV R^2 Score : 0.860728192269
Train MAE : 8013.55062267
Train R^2 : 0.9278630562
Test R^2 : 0.907566762948
Test MAE : 14829.6005909

It seems the SVM is working well. A large C (150000) and a small gamma (0.01) mean the model has high variance, so it's more prone to overfitting. However, the gap between the train and test R² is reasonably small (approximately 0.02), so there is no indication of overfitting. The R² on the test set is also the highest so far.

Just for comparison, this is the result I got in the first few iterations, where the C parameter was much smaller, combined with the same value of gamma:

Best GridSearchCV R^2 Score : 0.0472278732524
Train MAE : 51504.2482383
Train R^2 : 0.0605744315867
Test R^2 : 0.100912362848
Test MAE : 50073.7048117

The R² scores on both the train and test sets are very low, so this model is suffering from underfitting. Intuitively, tuning the C and gamma parameters serves a similar purpose to tuning the alpha parameter of L1 regularization earlier: it's all about the bias-variance trade-off.

Training Models : Decision Tree Regressor

You can check the complete code in training/TrainDecTree.py

As the name suggests, the Decision Tree algorithm "grows" a binary tree ("yes" or "no") based on a series of "questions". The "questions" are generated from the features (e.g. is OverallQual < 5?).

The algorithm splits the data on the feature that gives the largest information gain. The information gain is computed with a chosen metric, for example MSE or MAE.

In practice, we need to "prune" the tree most of the time because it will grow very deep. Intuitively, the tree ends up asking too many overly specific "questions": it fits the training set very well but fails to generalize to unseen data.

There are two types of pruning: pre-pruning and post-pruning. Pre-pruning stops growing the Decision Tree once a certain condition (parameter) is met. Post-pruning, on the other hand, prunes the tree after it is fully grown. Sklearn's implementation of Decision Tree only supports pre-pruning at the time of this writing.

parameters = {
    'criterion': ['mse', 'mae'],
    'min_samples_leaf': [5, 10, 15, 20, 25],
    'max_depth': [6, 9, 12, 15, 20],
    'presort': [True],
    'random_state': [1]
}
grid_obj = GridSearchCV(model, parameters, cv=10, scoring='r2')

We do pre-pruning by defining two conditions :

  1. max_depth is the maximum depth of the tree (duh!), excluding the root.
  2. min_samples_leaf is the minimum number of samples/observations in the leaf.

A larger value of max_depth will increase the variance. Likewise, a smaller value of min_samples_leaf will increase the variance.

Let’s take a look at the chosen parameters :

DecisionTreeRegressor(criterion='mae', max_depth=12, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=20,
min_samples_split=2, min_weight_fraction_leaf=0.0, presort=True,
random_state=0, splitter='best')

The result is not bad, but not particularly good compared to the other models :

Best GridSearchCV Score : 0.750726929887
Train MAE : 20434.3076264
Train R^2 : 0.794288771776
Test R^2 : 0.817249933667
Test MAE : 22062.2174658

One advantage of a Decision Tree is that we can print the tree, so the model doesn't behave like a "black box".

The printed tree shows that it actually has a depth of 10 (despite max_depth=12), with a minimum of 20 samples at the rightmost leaf (min_samples_leaf=20).
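If you want to reproduce a printout of the fitted tree yourself, one option is sklearn's export_graphviz (a sketch of mine, not necessarily how the repo renders it; best_tree and X_train are assumed names):

from sklearn.tree import export_graphviz

# write the fitted tree to a .dot file that graphviz can render as an image
export_graphviz(best_tree, out_file='tree.dot',
                feature_names=X_train.columns, filled=True)

# the actual depth can also be checked directly
print(best_tree.tree_.max_depth)  # 10 here, even though max_depth=12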

Training Models : Random Forest Regressor

You can check the complete code in training/TrainRF.py

Random Forest is an ensemble algorithm that combines multiple Decision Trees. For a regression problem, the algorithm averages the predictions of the individual trees (each tree is grown using a split criterion such as MSE or MAE).

Random Forest tends to build a more robust model that generalizes better on unseen data and is less prone to overfitting than a single Decision Tree.

This is a simplified view of the inner workings of Random Forest:

  1. Draw random samples from the training set with replacement (the bootstrap parameter)
  2. Build a Decision Tree on the drawn samples, considering a random subset of features (without replacement) at each node
  3. Repeat the two steps above
  4. Average the predictions of the individual trees

As with the Decision Tree, we will pre-prune each individual tree via the max_depth and min_samples_leaf parameters:

parameters = {
    'n_estimators': [100],
    'min_samples_leaf': [3, 5, 10, 20],
    'max_depth': [9, 12, 15],
    'n_jobs': [-1],
    'random_state': [1]
}
grid_obj = GridSearchCV(model, parameters, cv=10, scoring='r2')

n_estimators is the number of trees in the forest.
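The grid above is fed to a RandomForestRegressor. Here is a self-contained sketch of how that might look (X_train and y_train are assumed names; oob_score=True is what produces the out-of-bag score reported below):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

model = RandomForestRegressor(oob_score=True)  # keep out-of-bag samples for scoring
grid_obj = GridSearchCV(model, parameters, cv=10, scoring='r2')
grid_obj.fit(X_train, y_train)

best_rf = grid_obj.best_estimator_
print(best_rf.oob_score_)  # out-of-bag estimate of R^2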

These are the best parameters chosen by GridSearchCV:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=12,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=5, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
oob_score=True, random_state=1, verbose=0, warm_start=False)

The result is better than a single Decision Tree :

Best GridSearchCV Score : 0.824849224673
OOB Score : 0.828991092574
Train MAE : 11612.9739268
Train R^2 : 0.9258381402
Test R^2 : 0.896588642158
Test MAE : 16868.2702119

Random Forest can also be used for feature selection via the feature_importances_ attribute. One advantage of using Random Forest is that it will down-weight redundant features once a similar feature has already been chosen, so it seems to be a good choice for this dataset.
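Inspecting the importances of the fitted forest is a one-liner (a sketch; best_rf and X_train are the assumed names from the sketch above):

import pandas as pd

# pair each importance with its column name and show the top 10
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))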

Training Models : XGBoost Regressor

You can check the complete code in training/TrainXGB.py

eXtreme Gradient Boosting (XGBoost) is an advanced boosting algorithm with a very good reputation on Kaggle. There is a saying: "if your predictive performance is stuck, try XGBoost!"

Boosting is a method that uses an ensemble (group) of weak learners (typically Decision Trees). The method focuses on the mistakes made by the weak learners, letting them learn from those mistakes and improve the overall predictive performance over time.

The XGBoost package is not available (yet) in the sklearn library, but it exposes a sklearn-compatible interface. In this post, we will run GridSearchCV on several of its parameters (a minimal sketch of the full search follows the parameter descriptions below):

parameters = {
    'n_estimators': [100],
    'max_depth': [3, 6, 9],
    'subsample': [0.8, 0.9, 1],
    'reg_alpha': [0, 0.1, 0.3, 1],
    'reg_lambda': [0, 1],
    'nthread': [-1],
    'seed': [1]
}
grid_obj = GridSearchCV(model, parameters, cv=10, scoring='r2')
  1. n_estimators is the number of trees
  2. max_depth is the maximum depth allowed for a single Decision Tree
  3. subsample is the fraction of training samples randomly drawn for each tree, so 1 means each tree uses all observations
  4. reg_alpha is the L1 regularization strength
  5. reg_lambda is the L2 regularization strength
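As promised, here is a self-contained sketch of the search (XGBRegressor is the sklearn-style wrapper shipped with the separate xgboost package; X_train and y_train are assumed names):

from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

model = XGBRegressor()  # the wrapper makes XGBoost usable inside GridSearchCV
grid_obj = GridSearchCV(model, parameters, cv=10, scoring='r2')
grid_obj.fit(X_train, y_train)

print(grid_obj.best_estimator_)  # the XGBRegressor printed below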

These are the best parameters chosen by GridSearchCV:

XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=9,
min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
objective='reg:linear', reg_alpha=1, reg_lambda=1,
scale_pos_weight=1, seed=1, silent=True, subsample=0.8)

Let’s see this sophisticated model in action :

Best GridSearchCV Score : 0.870888887778
Train MAE : 1798.93234188
Train R^2 : 0.999086657521
Test R^2 : 0.920996938653
Test MAE : 14229.4014207

It seems this model is famous for a reason: this is the highest R² we have achieved so far on the test set. It seems to suffer from a little overfitting though, since the train set R² is almost perfect (0.999) and the gap between the train and test MAE is significant: roughly 1.8k vs 14.2k.

So I tried combining some of the best parameters from above with an additional parameter to tune: gamma. This parameter might be able to reduce overfitting, since it specifies the minimum loss reduction required to make a split in each tree.

parameters = {
    'n_estimators': [100],
    'max_depth': [6, 9, 12],
    'subsample': [0.8, 0.9, 1],
    'reg_alpha': [1],
    'reg_lambda': [1],
    'gamma': [0, 0.1, 0.3, 1],
    'nthread': [-1],
    'seed': [1]
}

Unfortunately, GridSearchCV picked the same set of parameters as before:

XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=9,
min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
objective='reg:linear', reg_alpha=1, reg_lambda=1,
scale_pos_weight=1, seed=1, silent=True, subsample=0.8)

Hence it returns exactly the same scores as above. We will try to optimize this model again in a later iteration (with the help of feature selection to reduce the overfitting). Hopefully the model will then perform a little worse on the train set, but better on the test set.

Second Iteration : Feature Engineering

You can check the complete code in DataPrepration.py

Our best R² score so far with the basic features is 0.92, from XGBoost. SVM took second place with an R² of 0.9.

Let’s try to add some new features. First, we will create new features by combining several existing features :

combine["OverallGrade"] = combine["OverallQual"] * combine["OverallCond"]
combine["GarageGrade"] = combine["GarageQual"] * combine["GarageCond"]
combine["ExterGrade"] = combine["ExterQual"] * combine["ExterCond"]
combine["KitchenScore"] = combine["KitchenAbvGr"] * combine["KitchenQual"]
combine["GarageScore"] = combine["GarageArea"] * combine["GarageQual"]
combine["TotalBath"] = combine["BsmtFullBath"] + (0.5 * combine["BsmtHalfBath"]) + \
combine["FullBath"] + (0.5 * combine["HalfBath"])
combine["AllSF"] = combine["GrLivArea"] + combine["TotalBsmtSF"]
combine["AllFlrsSF"] = combine["1stFlrSF"] + combine["2ndFlrSF"]
combine["AllPorchSF"] = combine["OpenPorchSF"] + combine["EnclosedPorch"] + \
combine["3SsnPorch"] + combine["ScreenPorch"]

Then we will re-rank the features by their Pearson correlation coefficient with the target, as we did in the first part of this post (this time including the new features):

OverallQual    0.819
AllSF          0.817
AllFlrsSF      0.729
GrLivArea      0.719
ExterQual      0.681
GarageCars     0.680
TotalBath      0.673
KitchenQual    0.667
GarageScore    0.657

Some of the new features we created, such as TotalBath and GarageScore, seem to correlate reasonably well with the target variable.

After that, let’s engineer a new feature based on the top features above. We will generate a square root, 2* and 3* polynomial for each of them :

combine["OverallQual-s2"] = combine["OverallQual"] ** 2
combine["OverallQual-s3"] = combine["OverallQual"] ** 3
combine["OverallQual-Sq"] = np.sqrt(combine["OverallQual"])
combine["AllSF-2"] = combine["AllSF"] ** 2
combine["AllSF-3"] = combine["AllSF"] ** 3
combine["AllSF-Sq"] = np.sqrt(combine["AllSF"])
combine["AllFlrsSF-2"] = combine["AllFlrsSF"] ** 2
combine["AllFlrsSF-3"] = combine["AllFlrsSF"] ** 3
combine["AllFlrsSF-Sq"] = np.sqrt(combine["AllFlrsSF"])
combine["GrLivArea-2"] = combine["GrLivArea"] ** 2
combine["GrLivArea-3"] = combine["GrLivArea"] ** 3
combine["GrLivArea-Sq"] = np.sqrt(combine["GrLivArea"])
combine["ExterQual-2"] = combine["ExterQual"] ** 2
combine["ExterQual-3"] = combine["ExterQual"] ** 3
combine["ExterQual-Sq"] = np.sqrt(combine["ExterQual"])
combine["GarageCars-2"] = combine["GarageCars"] ** 2
combine["GarageCars-3"] = combine["GarageCars"] ** 3
combine["GarageCars-Sq"] = np.sqrt(combine["GarageCars"])
combine["TotalBath-2"] = combine["TotalBath"] ** 2
combine["TotalBath-3"] = combine["TotalBath"] ** 3
combine["TotalBath-Sq"] = np.sqrt(combine["TotalBath"])
combine["KitchenQual-2"] = combine["KitchenQual"] ** 2
combine["KitchenQual-3"] = combine["KitchenQual"] ** 3
combine["KitchenQual-Sq"] = np.sqrt(combine["KitchenQual"])
combine["GarageScore-2"] = combine["GarageScore"] ** 2
combine["GarageScore-3"] = combine["GarageScore"] ** 3
combine["GarageScore-Sq"] = np.sqrt(combine["GarageScore"])

We could try PolynomialFeatures as an alternative. This class generates polynomial and (optionally) interaction features. I tried generating degree-2 polynomials of all the features (reduced by Lasso) using this class, and the Linear Regression model suffered from overfitting because it became too complicated (too much variance).
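For the curious, that experiment looked roughly like this (a sketch under my own naming; X_reduced stands for the Lasso-filtered feature matrix):

from sklearn.preprocessing import PolynomialFeatures

# degree-2 terms (squares and pairwise interactions) for every input column
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_reduced)
print(X_poly.shape)  # the feature count explodes, which is what caused the overfitting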

Second Iteration : Remove Outliers

The next thing we are going to do is remove the outliers. Recall that the most expensive house sold for $755,000, a huge gap from the lower quartile of prices: 755,000 - 129,975 = 625,025.

We will check the relationship between the target variable and one of the most important features, GrLivArea:

scatter_liv_area()  # helper from the repo that plots GrLivArea against SalePrice
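If you want to reproduce the plot without the repo's helper, a rough equivalent would be (my own sketch, assuming df_train still holds GrLivArea and SalePrice):

import matplotlib.pyplot as plt

plt.scatter(df_train['GrLivArea'], df_train['SalePrice'], alpha=0.5)
plt.xlabel('GrLivArea')
plt.ylabel('SalePrice')
plt.show()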

We can see that the two observations at the bottom right are outliers: they have a large living area but sold at a low price. In practice, we might need to confirm such data points with other departments/teams; for example, there might have been a human error when preparing the data.

The most expensive house looks reasonable in this graph (the top-right dot). However, I found out that the author of this dataset recommends removing all observations with more than 4000 square feet of living area. Hence, we will drop these 4 observations:

# make sure we need to drop 4 observations
>>> df_train['Id'][df_train.GrLivArea > 4000]
523     524
691     692
1182    1183
1298    1299
# drop the observations above
>>> df_train = df_train[df_train.GrLivArea < 4000]

Our final df_train will have 1455 observations with 238 features. Let's fit our new dataset to some models and compare the results.

I also decided to remove a single observation because it's the only one with a missing Electrical value (not null, but an "NA" string).

I found an interesting tip at this point: log transform the skewed numerical features (above a certain threshold) to lessen the impact of outliers. In this project, we will log transform the features whose absolute skewness is greater than 0.5:

from scipy.stats import skew

skewness = combine[numerical_features].apply(lambda x: skew(x))
skewness = skewness[abs(skewness) > 0.5]
skewed_features = skewness.index
combine[skewed_features] = np.log1p(combine[skewed_features])

Second Iteration : Lasso

Let’s do L1 regularization again, with our new features :

model = LassoCV(
    cv=10,
    max_iter=50000,
    n_jobs=-1,
    random_state=1
)

The result :

Best Alpha : 66.8843545179
Train MAE : 13838.2377563
Train R^2 : 0.925993418504
Test R^2 : 0.916634410474
Test MAE : 15469.2533307

This iteration has a better R² score on both the train and test sets, with less regularization strength (alpha). The improvement is not large (a 0.018 difference in test R²), but it's a good indicator that our new features are working well.

As discussed in part 1, people tend to use smaller values of alpha (the default value of alpha is 1). So I ran a grid search over a much smaller range of alphas (0.0001 to 1), and the model returned this result:

Best Alpha : 0.6
Train MAE : 12925.4936661
Train R^2 : 0.935888198983
Test R^2 : 0.901017189141
Test MAE : 16575.9871503

The result shows that a large difference in alpha (66.8 vs 0.6) doesn't significantly affect the outcome for this dataset (a 0.015 difference in test R²). LassoCV did a great job of finding the alpha automatically.
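For completeness, the small-alpha search described above might look like this (a sketch; the exact alpha grid is my guess, and X_train / y_train are assumed names):

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

parameters = {
    'alpha': [0.0001, 0.001, 0.01, 0.1, 0.3, 0.6, 1.0],
    'max_iter': [50000],
    'random_state': [1]
}
grid_obj = GridSearchCV(Lasso(), parameters, cv=10, scoring='r2')
grid_obj.fit(X_train, y_train)
print(grid_obj.best_params_)  # alpha = 0.6 was the winner in this project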

Second Iteration : SVM

The second iteration of SVM, with the features filtered by Lasso earlier:

parameters = {
    'C': [225000, 250000, 275000, 300000],
    'kernel': ['rbf'],
    'gamma': ['auto', 0.01, 0.1],
}
grid_obj = GridSearchCV(model, parameters, cv=10, scoring='r2')

Parameters chosen by GridSearchCV :

SVR(C=275000, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

The chosen C is larger than in the previous iteration, but with a smaller gamma ('auto' means 1/n_features = 1/238 ≈ 0.0042). Let's check the result:

Best GridSearchCV R^2 Score : 0.906964160403
Train MAE : 5534.178792
Train R^2 : 0.965740040734
Test R^2 : 0.939173671075
Test MAE : 12831.3393298

The test R² improved significantly, by approximately 0.03. Recall that the best possible R² is 1 and we had already reached approximately 0.9 earlier, so this is the best R² score on the test set so far.

Last but not least, let's fit the new training data to the XGBoost model. Recall that this model suffered from overfitting, so let's try using the features filtered by Lasso to reduce it:

parameters = {
    'n_estimators': [100],
    'max_depth': [3, 6, 9],
    'subsample': [0.8, 0.9, 1],
    'reg_alpha': [0, 0.1, 0.3, 1],
    'reg_lambda': [0, 1],
    'nthread': [-1],
    'seed': [1]
}
grid_obj = GridSearchCV(model, parameters, cv=10, scoring='r2')

This is the chosen combination of parameters :

XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
objective='reg:linear', reg_alpha=1, reg_lambda=0,
scale_pos_weight=1, seed=1, silent=True, subsample=0.8)

The maximum tree depth has gone down from 9 to 3. This is great news for us, because the model suffered from overfitting in the first iteration (a large gap between the train and test sets).

Best GridSearchCV Score : 0.905190288583
Train MAE : 9744.92443218
Train R^2 : 0.970448412695
Test R^2 : 0.933626444213
Test MAE : 14021.5986899

The result confirms that the model generalizes better: the train R² has gone down (from an almost perfect 0.999) and the test R² has gone up a little (by about 0.01).

Conclusion

First things first, let’s summarize what we have explored :

  1. The importance of feature engineering and outlier removal on model performance.
    Checked. Creating new features and removing 4 outliers increased the performance of Lasso, SVM & XGBoost.
  2. The benefit of regularization (L1, L2 and ElasticNet) in reducing overfitting.
    Checked. Regularization is a no-brainer for this dataset (and for most data, I presume). We compared L1 and L2 regularization on this dataset and found that L1 works better here (it produces a sparser model). I think this is related to the one-hot encoding we used during data preparation, which gives us a lot of binary features that can be zeroed out.
  3. The benefit of feature selection (by Lasso/L1) in reducing overfitting.
    Checked. Reducing the number of features we pass to the Linear Regression model successfully prevented overfitting.

Note that while the best parameters and results I showed here are real (you can run the source code on your machine), the process described in this post is a simplification of what I actually did while practicing with this dataset. For example, I did more than 2 iterations, because I tried feature engineering and outlier removal in separate iterations.

There are many ways we could try to improve the score, for example :

  1. Try different method to handle missing data
  2. Try different method of feature selection
  3. Try more parameters in GridSearchCV (yes, I’m talking to you XGBoost)

And so on.

Even though I said that the main goal of this project is to practice several machine learning theories, I instinctively paid attention to the score. The best test-set R² we got so far is 0.939, from SVM. This project ranked 669 of 2111 on Kaggle, with a root mean squared logarithmic error of 0.12296 at the time of this writing. The leaderboard is very competitive, so a little improvement might increase your rank significantly.

Thank you for reading this article. Please post your suggestions in the comments, and don't hesitate to point out my mistake(s); I will update this post to correct them.