Philadelphia Housing Data Part-III: Machine Learning

Nikhilesh A. Vaidya
16 min read · Apr 15, 2018


(III). Machine Learning: Modeling

In Part-I we studied the features of the Philadelphia housing data, performed some data pruning and dealt with the NaNs. Then in Part-II we converted the categorical variables into dummy variables and applied various techniques to decrease the skewness of the skewed numerical variables. Thus, by the end of Part-II, our data set was ready for the machine learning algorithms. Using Scikit-Learn (Ridge, Lasso, Random Forests) and XGBoost (Gradient Boosting Trees), we shall perform regression analysis and predict house prices on the Philadelphia housing data set. [In Part-IV, I shall use Keras-TensorFlow (Deep Neural Networks) to train and test on this data set.]

Note: Each of these algorithms (especially Random Forests and XGBoost) can take a very long time to run and use a lot of memory.

(III-A). Finding optimum parameters for each algorithm:

Below, we select the optimum parameters for each algorithm using k-fold cross-validation. To prevent the algorithms from snooping on the actual test data, we restrict training and evaluation to the original training set (X_train and y_train; see (II-C-5) in Part-II). We use Scikit-Learn's cross-validation to repeatedly and randomly split the training set into a 'training' part and a 'validation/test' part: the algorithm learns on the 'training' folds and makes predictions on the held-out 'validation/test' fold. During this parameter-finding phase, the learning and testing never involve the actual test set (X_test and y_test; see (II-C-5)). We do not want our algorithms to look at the test set until we are ready for the final prediction. This is extremely important, so as to avoid data snooping [here is a good video which explains how to avoid data snooping]. Once we find the optimum parameters, we can use the test set [see III-B below].

Because in this parameter-tuning stage we are only using X_train and y_train, let's redefine them so that we do not have to worry about the '_train' suffix:

X = X_train
y = y_train
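
To make the cross-validation mechanics concrete, here is a minimal sketch of what a 10-fold split of the training set looks like (KFold is just one of the splitters that cross_val_score can use internally; the shuffle and random_state settings here are illustrative assumptions):

from sklearn.model_selection import KFold

# Split the training data into 10 folds; each fold serves once as the
# validation set while the remaining 9 folds are used for fitting.
kf = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
    print("Fold {}: {} training rows, {} validation rows".format(
        fold, len(X_tr), len(X_val)))
# Note: the held-out X_test / y_test never appear in this loop.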

(III-A-1). Linear Regression with L-2 regularization (Ridge Regression): Here we are looking for the optimum alpha parameter. The Ridge algorithm penalizes the L-2 norm of the coefficients in the linear model, thus reducing the model's complexity. This helps us avoid overfitting (overfitting means a high training score, because the model has learned the training data very well, but a much lower test score; it is the test score that we ultimately care about). A higher alpha puts a stronger constraint on the model by shrinking the less important coefficients toward zero, thereby decreasing the model's complexity. In this way the alpha parameter regularizes the model by explicitly restricting it so as to avoid overfitting. Below I use the cross_val_score function to perform cross-validation and select the optimum alpha. To scale the data to a mean of 0 and variance of 1, I use StandardScaler. While scaling the data, it is very important that the fit method is applied only on the training set. The general code snippet below highlights this concept.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Perform 'fit' on the training set only
scaler.fit(X_train)
# Transform/scale both the training and the test set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In this work, scaling is done using the Pipeline class. Pipeline chains together the scaler and the algorithm/model. The Pipeline object is passed to the ‘cross_val_score’ function. The ‘cross_val_score’ function splits the original training set using cross-validation into train and validation sets. Pipeline performs the ‘fit’ method for StandardScaler on this train set. Pipeline also builds the model on the train set, leaving the validation set for scoring.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import time  # to keep track of the computational time

start_time = time.time()
####################################################################
# USING RIDGE REGRESSION (L-2 penalty):
# I. For coarse alpha value selection:
alphas = [0.01, 0.1, 1, 10]  # default alpha = 1
best_r2_initial = 0
for i in alphas:
    scaler = StandardScaler()  # to make the mean = 0 and variance = 1
    ridge = Ridge(alpha=i, random_state=42)  # "42: the answer to the
    # ultimate question of life, the Universe, and everything!"
    pipe = Pipeline([('transformer', scaler), ('estimator', ridge)])
    r2_score = cross_val_score(pipe, X, y, cv=10, scoring='r2')
    # cv=10 for 10-fold cross-validation
    mean_r2 = r2_score.mean()

    # If we get a better r2 score, store that score and the
    # corresponding alpha value
    if mean_r2 > best_r2_initial:
        best_r2_initial = mean_r2
        best_alpha_initial = i

print("\nThe best r2 score is {:.3f} corresponding to best alpha {}".format(
    best_r2_initial, best_alpha_initial))
print("\n-------------------------------------------------------\n")

# II. For finer alpha value selection:
alpha = best_alpha_initial
alphas = [alpha*0.3, alpha*0.4, alpha*0.5, alpha*0.6, alpha*0.7, alpha*0.8,
          alpha*0.9, alpha*1.05, alpha*1.1, alpha*1.2, alpha*1.3, alpha*1.4]
best_r2 = 0
best_rmse = np.inf
for i in alphas:
    scaler = StandardScaler()
    ridge = Ridge(alpha=i, random_state=42)
    pipe = Pipeline([('transformer', scaler), ('estimator', ridge)])

    # Using two scoring metrics to evaluate alpha: R^2 and
    # RMSE (Root Mean Squared Error)
    r2_score = cross_val_score(pipe, X, y, cv=10, scoring='r2')
    mean_r2 = r2_score.mean()  # mean R^2

    error_score = cross_val_score(pipe, X, y, cv=10,
                                  scoring='neg_mean_squared_error')
    rmse_score = np.sqrt(-error_score)  # the output 'error_score' is
    # negative, thus using -error_score
    mean_rmse = rmse_score.mean()  # mean RMSE

    print('alpha: {:.5f}'.format(i))
    print("R^2 score for k=10 fold validation: {}".format(
        np.around(mean_r2, decimals=3)))
    print("Root mean squared error for k=10 fold validation: {}".format(
        np.around(mean_rmse, decimals=3)))
    print('')

    # If we get a better r2 and rmse score, store that score and the
    # alpha value
    if mean_r2 > best_r2 and mean_rmse < best_rmse:
        best_r2 = mean_r2
        best_rmse = mean_rmse
        best_alpha = i

print("\n-------------------------------------------------------\n")
print("The best score is {:.3f}".format(best_r2) +
      " and RMSE score is {:.3f}".format(best_rmse) +
      " corresponding to the best alpha={:.3f}.".format(best_alpha))
print('\n-------------------------------------------------------\n')

# IMPORTANT FEATURES SELECTED BY THE RIDGE:
ridge = Ridge(alpha=best_alpha, random_state=42)
ridge.fit(X, y)
coefs = pd.Series(ridge.coef_, index=X.columns)
print("Ridge picked " + str(sum(coefs != 0)) + " features and eliminated "
      "the other " + str(sum(coefs == 0)) + " features")
imp_coefs = pd.concat([coefs.sort_values().head(20),
                       coefs.sort_values().tail(20)])

# Plot important coefficients
plt.figure(figsize=(9, 17))
imp_coefs.plot(kind="barh")
plt.title("Coefficients in the Ridge Model")
plt.ylabel("Features")
plt.xlabel("Coefficients")
plt.show()
####################################################################
print('\n---------------------------------------------------------')
seconds = time.time() - start_time
print("--- %s seconds ---" % seconds)
m, s = divmod(seconds, 60)
h, m = divmod(m, 60)
print("%d:%02d:%02d" % (h, m, s))

RESULT:
The best score is 0.912 and RMSE is 0.100 corresponding to the best alpha 3.0. Ridge picked 755 features and eliminated the other 27 features.

Below are the coefficients (features) selected by the Ridge model.

Figure III-1

(III-A-2). Linear Regression with L-1 regularization (LASSO):

The LASSO (Least Absolute Shrinkage and Selection Operator) algorithm penalizes the L-1 norm of the coefficients in the linear model. This is a more restrictive model: it can set coefficients exactly to zero, effectively performing feature selection. The code is similar to the Ridge code above; simply substitute

from sklearn.linear_model import Lasso
lasso = Lasso(alpha=i, max_iter=50000) # very low alpha value may
# need more iterations
pipe = Pipeline([('transformer', scaler), ('estimator', lasso)])

in place of the corresponding expressions in the Ridge code above.
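
For completeness, a minimal sketch of the finer alpha loop for Lasso might look like this (it reuses the same X, y, alphas list and scoring setup as in the Ridge code above; it is just the Ridge loop with the estimator swapped):

from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

for i in alphas:
    scaler = StandardScaler()
    lasso = Lasso(alpha=i, max_iter=50000)  # very low alpha values may
                                            # need more iterations
    pipe = Pipeline([('transformer', scaler), ('estimator', lasso)])
    mean_r2 = cross_val_score(pipe, X, y, cv=10, scoring='r2').mean()
    mean_rmse = np.sqrt(-cross_val_score(
        pipe, X, y, cv=10, scoring='neg_mean_squared_error')).mean()
    print('alpha: {:.5f}  R^2: {:.3f}  RMSE: {:.3f}'.format(
        i, mean_r2, mean_rmse))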

RESULT:
The best score is 0.921 and RMSE is 0.096, corresponding to the best alpha 0.0001. Lasso picked 212 features and eliminated the other 774 features.

First, we observe that the score and RMSE from the Lasso model are a bit better than those from the Ridge model. We also see that Lasso used far fewer features (212) to build the model than Ridge did (755 features). Below are the coefficients (features) selected by Lasso.

Figure III-2

Comparing the above coefficients selected by Lasso with those from Ridge (Fig. III-1), we note that both Lasso and Ridge picked the year-built values from 2012 to 2017 as important in their learning. On the other hand, Ridge chose 'total_area' as the most important feature, while Lasso ignored it.

(III-A-3). Random Forest Regressor:

For our next algorithm, we use an ensemble method in which a large collection of decision trees is used to avoid overfitting. The main parameters in RandomForestRegressor are the number of trees (n_estimators), max_features and the tree depth. As a rule of thumb, the more trees you have the better your results will be, especially in terms of reducing overfitting, although after a certain number of trees you hit diminishing returns. Here I use n_estimators=5000, but you may use 1000 trees if you do not have a powerful CPU and plenty of RAM, as random forest learning is computationally intensive. The next parameter, max_features, determines how random each tree is: you want the (decision) trees in the forest to not all look the same, because the more alike the trees are, the higher the chance of overfitting. For regression, it is a good rule of thumb to use the default value here, i.e. max_features = n_features (the total number of features in the data), so we won't worry about changing this parameter. The final parameter that we are concerned with is the depth. This parameter is used for pre-pruning the trees, which can help reduce overfitting; but we do not want to prune the trees too much, or we won't get good results. Hence we try to find the optimum depth for the trees. A final note: tree-based models (such as random forests and gradient boosted trees) are invariant to the scaling of the data. Consequently, we don't need to scale our data set when using these algorithms, so we won't be using StandardScaler and Pipeline here.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
####################################################################
# USING RANDOM FOREST REGRESSOR:
depth = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
best_r2 = 0
best_rmse = np.inf
for i in depth:
    forest = RandomForestRegressor(n_estimators=5000, max_depth=i,
                                   random_state=42, n_jobs=-1)
    # n_jobs=-1 uses all the available CPU cores

    r2_score_forest = cross_val_score(forest, X, y, cv=10, scoring='r2')
    mean_r2 = r2_score_forest.mean()

    error_score_forest = cross_val_score(forest, X, y, cv=10,
                                         scoring='neg_mean_squared_error')
    rmse_score = np.sqrt(-error_score_forest)
    mean_rmse = rmse_score.mean()

    print('depth:', i)
    print("R^2 score for k=10 fold validation: {}".format(
        np.around(mean_r2, decimals=3)))
    print("Root mean squared error for k=10 fold validation: {}".format(
        np.around(mean_rmse, decimals=3)))
    print('')

    # If we get a better r2 and rmse score, store that score and the
    # depth value
    if mean_r2 > best_r2 and mean_rmse < best_rmse:
        best_r2 = mean_r2
        best_rmse = mean_rmse
        best_depth = i

# VISUALIZING THE IMPORTANT FEATURES SELECTED BY RANDOM-FOREST:
'''
The Random Forest attribute 'feature_importances_' contains a rating
(between 0 and 1) of how important each feature is for the decisions the
trees make. A rating of 0 means that the feature is not used in decision
making at all. All feature importances sum to 1.
'''
# Rebuild a random-forest model based on the best depth from above:
forest = RandomForestRegressor(n_estimators=5000, max_depth=best_depth,
                               random_state=42, n_jobs=-1)
forest.fit(X, y)
important_features = pd.Series(data=forest.feature_importances_,
                               index=X.columns)
important_features.sort_values(ascending=False, inplace=True)
selected_features = important_features[0:5]  # selecting only 5 features,
# as most of the feature importances are zero
# The code for plotting is similar to the one shown above for Ridge
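
Since the plotting code is only referenced above, here is a minimal sketch of what it might look like for the Random Forest importances (mirroring the Ridge plot; the figure size is an arbitrary choice):

import matplotlib.pyplot as plt

# Plot the top feature importances selected by the Random Forest
plt.figure(figsize=(9, 5))
selected_features.sort_values().plot(kind="barh")
plt.title("Feature importances in the Random Forest model")
plt.xlabel("Importance")
plt.ylabel("Features")
plt.show()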

RESULT:

The best score is 0.9998 and RMSE is 0.011, corresponding to the best depth of 15.

Below are the features selected by Random Forest Regressor.

Figure III-3

From the above figure, we see that the Random Forest Regressor relied on only 3 features in its decision making! These three features are also highly correlated with 'market_value' (see Fig. II-1).
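
As a quick sanity check of that observation, one could compute the correlations directly on the training data (a sketch; it assumes y is the, possibly log-transformed, 'market_value' target from Part-II):

# Correlation of the top Random Forest features with the target
top_features = selected_features.index[:3]
print(X[top_features].corrwith(y).sort_values(ascending=False))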

The next ensemble model we use is gradient boosted regression trees. This model combines many shallow, simple trees, and more trees are added iteratively to improve performance. Each new tree tries to correct the mistakes of the previous trees, with the parameter 'learning_rate' controlling how strongly each new tree corrects them. In this essay, instead of using Scikit-Learn's version of boosted trees (GradientBoostingRegressor), I shall use the XGBoost version. For a large data set like ours, the XGBoost regressor is the better choice: it is faster and it has a lot of parameters that we can tune to get better results and avoid overfitting. I shall use XGBRegressor, the Scikit-Learn wrapper interface for XGBoost. One of the best attributes of XGBoost is its support for regularization. In that vein, I first split the XGBRegressor experiments into two parts: one with L-1 regularization (like Lasso) and one with L-2 regularization (like Ridge). We have already encountered these regularizations above while performing linear regression. As this is a tree model, we do not need to scale our data. I use the GridSearchCV class to perform a grid search over the given parameter(s) and, at the same time, perform cross-validation; the results are stored in the attributes of the fitted GridSearchCV object. We follow a parameter-tuning approach similar to the ones described here and here (a generic skeleton of a single tuning step is sketched after the list below):

  • We start with a relatively high learning rate of 0.2 (the default learning rate is 0.1). A higher 'learning_rate' and a lower 'n_estimators' allow the algorithm to learn faster on the data set. (I shall fine-tune and decrease the 'learning_rate' in the last step; a smaller 'learning_rate' helps the algorithm avoid overfitting.) I shall keep the other parameters ['max_depth', 'min_child_weight', 'subsample', 'colsample_bytree', 'gamma' and 'alpha' (for L-1)/'lambda' (for L-2)] at their default values and tune 'n_estimators' around its default value of 100.
  • Tune 'max_depth' and 'min_child_weight'. We already saw the parameter 'max_depth' in Random Forests. One of the differences between Random Forests and Gradient Boosted trees is the depth of each tree: Random Forests usually require a greater tree depth, while the Gradient Boosted algorithm uses many shallower trees. The parameter 'min_child_weight' is the minimum (sum of instance) weight required to create a new node in a tree. A low 'min_child_weight' allows more splits and hence more complex trees, which may lead to overfitting. Thus we tune the two parameters together to find a good balance between added complexity (which may improve the score) and overfitting.
  • We can control the fraction of samples used during the learning phase with the parameters 'subsample' and 'colsample_bytree'. The default value for 'subsample' is 1 (i.e. all the rows are used during the learning phase) and the default for 'colsample_bytree' is also 1 (i.e. all the columns/features are used).
  • Next we tune the parameter 'gamma'. It controls the complexity of the model and hence works like a regularization parameter. The default value is 0 (i.e. no regularization). As we are already using another regularization parameter (alpha/lambda), tuning 'gamma' may not be that crucial.
  • After that, we tune alpha (in the case of L-1 regularization) or lambda (for L-2 regularization). These are the same parameters that we encountered above while working with the Lasso and Ridge models.
  • Finally, we lower and tune 'learning_rate' and increase and tune 'n_estimators' again. These two parameters are interconnected: lowering 'learning_rate' requires more trees to build a model of similar complexity. In contrast to Random Forests, using a high 'n_estimators' without proper tuning can result in overfitting. A higher 'learning_rate' lets each tree make stronger corrections to the mistakes of the previous trees, increasing model complexity, which again may cause overfitting.
  • I use the default booster 'gbtree' and the default objective 'reg:linear'.
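
Every step in the list above follows the same GridSearchCV pattern, so the code in (III-A-4) is essentially the skeleton below repeated with different parameter grids (the helper function and its names are placeholders of mine, not part of the original code):

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

def tune_step(param_grid, fixed_params):
    # One tuning step: grid-search 'param_grid' while keeping the
    # parameters already tuned in earlier steps ('fixed_params') fixed.
    model = xgb.XGBRegressor(random_state=42, n_jobs=-1, **fixed_params)
    search = GridSearchCV(model, param_grid=param_grid, cv=10,
                          scoring='r2', return_train_score=True)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)
    return search.best_params_

# Example: step (1), tuning 'n_estimators' at a high learning rate
best_so_far = {'learning_rate': 0.2, 'reg_lambda': 0}
best_so_far.update(tune_step({'n_estimators': [50, 100, 300, 500]},
                             best_so_far))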

(III-A-4). XGBoost XGBRegressor with L-1 penalty (like LASSO):

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
####################################################################
# USING XGBOOST REGRESSOR:
# L-1 REGULARIZATION [like LASSO] (therefore, reg_lambda = 0)

# (1) Changing the parameter 'n_estimators':
n_estimators = [50, 100, 300, 500]  # default = 100 trees
gbt_L1_estimator = xgb.XGBRegressor(learning_rate=0.2, silent=False,
                                    random_state=42, reg_lambda=0, n_jobs=-1)
# 'silent=False' so we can see the progress during computation
# parameter dictionary
params_estimators = {"n_estimators": n_estimators}
cv_estimator = GridSearchCV(gbt_L1_estimator, param_grid=params_estimators,
                            verbose=3, cv=10, scoring='r2',
                            return_train_score=True)  # verbose=3 for more messages
cv_estimator.fit(X, y)
'''
Using the attributes of the object 'cv_estimator' to extract the results,
get the best parameters selected during the grid search and the best
(R^2) score
'''
results_estimator = pd.DataFrame(cv_estimator.cv_results_)
print(results_estimator)
print("\n")
best_parameter_estimator = cv_estimator.best_params_
print(best_parameter_estimator)
print("\n")
print("Best R^2 score: {}".format(cv_estimator.best_score_))
print("\n_______________________________________________________\n")

# (2) Changing the parameters 'max_depth' and 'min_child_weight':
max_depth = [2, 3, 4, 5, 6, 7, 8]           # default = 3
min_child_wt = [0.5, 1.0, 2.0, 3, 4, 5, 6]  # default = 1
gbt_L1_depth_wt = xgb.XGBRegressor(learning_rate=0.2,
                                   **best_parameter_estimator, silent=False,
                                   random_state=42, reg_lambda=0, n_jobs=-1)
params_depth_wt = {"max_depth": max_depth, "min_child_weight": min_child_wt}
cv_depth_wt = GridSearchCV(gbt_L1_depth_wt, param_grid=params_depth_wt,
                           verbose=3, cv=10, scoring='r2',
                           return_train_score=True)
cv_depth_wt.fit(X, y)
results_depth_wt = pd.DataFrame(cv_depth_wt.cv_results_)
print(results_depth_wt)
print("\n")
best_parameter_depth_wt = cv_depth_wt.best_params_
print(best_parameter_depth_wt)
print("\n")
print("Best R^2 score: {}".format(cv_depth_wt.best_score_))
print("\n______________________________________________________\n")

# (3) Changing the parameters 'subsample' and 'colsample_bytree':
subsample = [0.5, 0.8, 1.0]  # default = 1
colsample = [0.5, 0.8, 1.0]  # default = 1
gbt_L1_sample = xgb.XGBRegressor(learning_rate=0.2,
                                 **best_parameter_estimator,
                                 **best_parameter_depth_wt, silent=False,
                                 random_state=42, reg_lambda=0, n_jobs=-1)
params_sample = {"subsample": subsample, "colsample_bytree": colsample}
cv_sample = GridSearchCV(gbt_L1_sample, param_grid=params_sample, verbose=3,
                         cv=10, scoring='r2', return_train_score=True)
cv_sample.fit(X, y)
results_sample = pd.DataFrame(cv_sample.cv_results_)
print(results_sample)
print("\n")
best_parameter_sample = cv_sample.best_params_
print(best_parameter_sample)
print("\n")
print("Best R^2 score: {}".format(cv_sample.best_score_))
print("\n______________________________________________________\n")

# (4) Changing the parameter 'gamma':
gamma = [0, 0.01, 0.1]  # default = 0
gbt_L1_gamma = xgb.XGBRegressor(learning_rate=0.2,
                                **best_parameter_estimator,
                                **best_parameter_depth_wt,
                                **best_parameter_sample, silent=False,
                                random_state=42, reg_lambda=0, n_jobs=-1)
params_gamma = {"gamma": gamma}
cv_gamma = GridSearchCV(gbt_L1_gamma, param_grid=params_gamma, verbose=3,
                        cv=10, scoring='r2', return_train_score=True)
cv_gamma.fit(X, y)
results_gamma = pd.DataFrame(cv_gamma.cv_results_)
print(results_gamma)
print("\n")
best_parameter_gamma = cv_gamma.best_params_
print(best_parameter_gamma)
print("\n")
print("Best R^2 score: {}".format(cv_gamma.best_score_))
print("\n______________________________________________________\n")

# (5) Changing the parameter 'reg_alpha' (for L-1 regularization):
alpha = [0.0001, 0.001, 0.01, 0.1, 1]  # default = 0 (Lasso default = 1)
gbt_L1_alpha = xgb.XGBRegressor(learning_rate=0.2,
                                **best_parameter_estimator,
                                **best_parameter_depth_wt,
                                **best_parameter_sample,
                                **best_parameter_gamma, silent=False,
                                random_state=42, reg_lambda=0, n_jobs=-1)
params_alpha = {"reg_alpha": alpha}
cv_alpha = GridSearchCV(gbt_L1_alpha, param_grid=params_alpha, verbose=3,
                        cv=10, scoring='r2', return_train_score=True)
cv_alpha.fit(X, y)
results_alpha = pd.DataFrame(cv_alpha.cv_results_)
print(results_alpha)
print("\n")
best_parameter_alpha = cv_alpha.best_params_
print(best_parameter_alpha)
print("\n")
print("Best R^2 score: {}".format(cv_alpha.best_score_))
print("\n______________________________________________________\n")

# Lowering the learning rate and adding more trees.
# (6) Changing the parameters 'learning_rate' and 'n_estimators':
learning_rate = [0.001, 0.01, 0.1]    # default = 0.1
n_estimators = [100, 300, 500, 1000]  # default = 100
gbt_L1_rate = xgb.XGBRegressor(**best_parameter_depth_wt,
                               **best_parameter_sample,
                               **best_parameter_gamma,
                               **best_parameter_alpha, silent=False,
                               random_state=42, reg_lambda=0, n_jobs=-1)
params_rate_estimator = {"learning_rate": learning_rate,
                         "n_estimators": n_estimators}
cv_rate_estimator = GridSearchCV(gbt_L1_rate,
                                 param_grid=params_rate_estimator, verbose=3,
                                 cv=10, scoring='r2',
                                 return_train_score=True)
cv_rate_estimator.fit(X, y)
results_rate_estimator = pd.DataFrame(cv_rate_estimator.cv_results_)
print(results_rate_estimator)
print("\n")
best_parameter_rate_estimator = cv_rate_estimator.best_params_
print(best_parameter_rate_estimator)
print("\n")
print("Best R^2 score: {}".format(cv_rate_estimator.best_score_))
print("\n_______________________________________________________\n")

# Best parameters chosen:
print(best_parameter_depth_wt)
print(best_parameter_sample)
print(best_parameter_gamma)
print(best_parameter_alpha)
print(best_parameter_rate_estimator)
print("\nBest final score: {}".format(cv_rate_estimator.best_score_))
print("\n_______________________________________________________\n")

# VISUALIZING THE IMPORTANT FEATURES SELECTED BY XGBOOST:
# The corresponding attribute in the scikit-learn wrapper is the same:
# 'feature_importances_'
# Rebuilding the XGBoost model based on the best parameters from above:
gbt_L1 = xgb.XGBRegressor(**best_parameter_depth_wt,
                          **best_parameter_sample,
                          **best_parameter_gamma,
                          **best_parameter_alpha,
                          **best_parameter_rate_estimator,
                          reg_lambda=0,  # keep reg_lambda = 0 for the L-1 setup
                          random_state=0)
gbt_L1.fit(X, y)
important_features = pd.Series(data=gbt_L1.feature_importances_,
                               index=X.columns)
important_features.sort_values(ascending=False, inplace=True)
selected_features = important_features[0:20]

The results for L-1 regularization are:

RESULTS:

Best final score: 0.9996

Best Parameters: {"n_estimators": 1000, "learning_rate": 0.1, "max_depth": 6, "min_child_weight": 5, "subsample": 1.0, "colsample_bytree": 1.0, "reg_alpha": 0.1, "gamma": 0}

Below are the features selected by XGBoost Regressor with Lasso (L-1) regularization.

Figure III-4

(III-A-5). XGBoost XGBRegressor with L-2 penalty (like Ridge):

The code for XGBRegressor with L-2 regularization follows the same format as the code above, except that here 'reg_alpha' = 0 and we tune 'reg_lambda' instead (over the same range of values as above). Everything else stays the same as in the L-1 case; a sketch of the modified step (5) is shown below.
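
For instance, step (5) of the grid search changes along the following lines (a sketch; the best_parameter_* dictionaries here come from the earlier steps of the L-2 run):

# L-2 REGULARIZATION [like Ridge] (therefore, reg_alpha = 0)
lambdas = [0.0001, 0.001, 0.01, 0.1, 1]
gbt_L2_lambda = xgb.XGBRegressor(learning_rate=0.2,
                                 **best_parameter_estimator,
                                 **best_parameter_depth_wt,
                                 **best_parameter_sample,
                                 **best_parameter_gamma, silent=False,
                                 random_state=42, reg_alpha=0, n_jobs=-1)
params_lambda = {"reg_lambda": lambdas}
cv_lambda = GridSearchCV(gbt_L2_lambda, param_grid=params_lambda, verbose=3,
                         cv=10, scoring='r2', return_train_score=True)
cv_lambda.fit(X, y)
print(cv_lambda.best_params_, cv_lambda.best_score_)

The results with L-2 regularization are: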

RESULTS:

Best final score: 0.9999

Best Parameters: {"n_estimators": 1000, "learning_rate": 0.1, "max_depth": 6, "min_child_weight": 5, "subsample": 1.0, "colsample_bytree": 1.0, "reg_lambda": 0.1, "gamma": 0}

Looking at the best-parameter dictionaries for the L-1 and L-2 cases, we see that in both cases the values chosen for 'subsample', 'colsample_bytree' and 'gamma' are their respective defaults!

Below are the features selected by XGBoost Regressor with Ridge (L-2) regularization.

Figure III-5

Comparing Figures III-4 and III-5, we see that the XGBoost Regressor selected almost the same features with L-1 and with L-2 regularization. Furthermore, these selected features are also highly correlated with 'market_value' (see Fig. II-1). Thus it seems that XGBoost (and Random Forests) relied on more strongly correlated features than the important coefficients selected by the Ridge and Lasso models. Moreover, the top three features in Figures III-4 and III-5 are the same as the features selected by the Random Forest Regressor (see Fig. III-3). This may be one of the reasons why all three tree-based models produced much better results than Ridge and Lasso.

So, in general, the ensemble models (Random Forests and XGBoost) produced better results, and XGBoost spread its attention over more features than Random Forests, producing a slightly better, more fine-grained result.

(III-B). Making predictions on the test set:

From the above evaluations, we select three models, Random Forests and XGBoost (with both L-1 and L-2 regularization), to make predictions on the original test set, as these models produced much better results than the Ridge and Lasso models. Using the best parameters found above, we rebuild the three models; each model now learns on the full X_train and y_train data. Predictions are then made on the original test set (X_test and y_test from (II-C-5) in Part-II), and we evaluate the results by looking at the R² score of each model.

(III-B-1). Using Random Forests Regressor:

forest = RandomForestRegressor(n_estimators=5000, max_depth=15,
                               random_state=0)
forest.fit(X_train, y_train)  # fitting/learning on the training set
score_test = forest.score(X_test, y_test)  # R² score on the test set
print("\nFinal R² score on the TEST SET is: {}\n".format(score_test))

Final score on the TEST SET is: 0.99889

(III-B-2). Using XGBoost XGBRegressor with L-1 penalty (like LASSO):

xgb_lasso = xgb.XGBRegressor(booster='gbtree', n_estimators=1000,
                             learning_rate=0.1, max_depth=6,
                             min_child_weight=5, gamma=0, subsample=1.0,
                             colsample_bytree=1.0, objective='reg:linear',
                             reg_alpha=0.1, reg_lambda=0, silent=False,
                             random_state=0)
xgb_lasso.fit(X_train, y_train)
score_test = xgb_lasso.score(X_test, y_test)
print("\nFinal R² score on the TEST SET is: {}\n".format(score_test))

Final score on the TEST SET is: 0.99961

(III-B-3). Using XGBoost XGBRegressor with L-2 penalty (like Ridge):

xgb_ridge = xgb.XGBRegressor(booster='gbtree', n_estimators=1000,
                             learning_rate=0.1, max_depth=6,
                             min_child_weight=5, gamma=0, subsample=1.0,
                             colsample_bytree=1.0, objective='reg:linear',
                             reg_alpha=0, reg_lambda=0.1, silent=False,
                             random_state=0)
xgb_ridge.fit(X_train, y_train)
score_test = xgb_ridge.score(X_test, y_test)
print("\nFinal R² score on the TEST SET is: {}\n".format(score_test))

Final score on the TEST set is: 0.99960

From these evaluations we see, once again, that all three models produce very similar results. Their scores are very close to 1, meaning that we get almost perfect predictions! Therefore, we can use any of the three models to make predictions on future Philadelphia housing data.
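
As a sketch of what a prediction on future data would look like (new_X is a hypothetical DataFrame that has gone through exactly the same preprocessing as in Parts I and II; np.expm1 is applied on the assumption that the target was log-transformed with np.log1p during the skew-reduction step of Part-II):

# Predict market values for new, preprocessed Philadelphia listings
log_predictions = xgb_lasso.predict(new_X)    # model trained above
predicted_values = np.expm1(log_predictions)  # undo the log1p transform
print(predicted_values[:5])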

In the next part (Part-IV), I shall use a Keras-TensorFlow Deep Learning Neural Network model to make predictions on this same data set, following a procedure similar to the one highlighted in this section.
