Tama Handika
Nov 3, 2017

Practicing Regression Techniques on House Prices Dataset-Part 1

Hello All,

In this post, I will tell you about my adventure in learning Machine Learning. We will explore the House Prices dataset from Kaggle (source). This is a regression problem where we need to predict house prices based on 79 features (variables).

I will explain what I did and why I did it, step by step, in this article. My goal in exploring this dataset is to get hands-on experience with Machine Learning and to prove some theories & concepts I have learned but never experienced myself. The Kaggle rank & score is my second priority (although getting a top rank would be cool).

Note that I’m still learning myself, and I’m not a practicing Data Scientist (yet). This journal is targeted at people who are just getting started in the Machine Learning field. I’m open to any tips and suggestions.

You can download the complete source code here

This post will explore the following concepts :

  1. The importance of feature engineering and outliers removal on model performance
  2. The benefit of regularization (L1, L2 and ElasticNet) to reduce overfitting
  3. The benefit of feature selection (by Lasso/L1) to reduce overfitting
  4. And more details. Just keep reading :)

I will conclude the result of this “experiment” in the Conclusion.

Outline

  1. Data Exploration
  2. Data Preparation (handling missing values, feature engineering, etc)
  3. Training Models (Parametric & Non-Parametric)
  4. Conclusion

We will repeat the 2nd & 3rd steps to prove the concepts above. For example, we will prove the importance of feature engineering by comparing the model’s performance before and after doing feature engineering.

Because of the length of this post (which I didn’t plan to be this long), I will split it into 2 parts. This first part (this post) covers everything from Data Exploration to training the Parametric Models (the 1st iteration). Non-parametric models and more iterations will be in part 2.

  1. Data Exploration

First, let’s check the number of observations and features in this dataset :

>>> df_train = pd.read_csv('your_path')
>>> df_train.shape
(1460, 81)

We can see there are 1460 observations and 81 columns, i.e. 79 features (excluding Id and SalePrice). Our target variable is the rightmost column, called SalePrice.

Let’s take a peek at our beloved dataset :

df_train.head()

Check some summary statistics for each feature :

df_train.describe()

Let’s zoom in on the target variable :

SalePrice
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000

There are some basic facts about the target variable that we can derive from the describe method above :

  1. It has a wide range : 755000 - 34900 = 720100
  2. It has a large standard deviation
  3. The gap between median and mean is not large: 180921–163000 = 17921
  4. The maximum price has a large gap compared to the 3rd quartile (75%) price :
    755000 - 214000 = 541000
    We will check this particular observation (the most expensive one) later to decide whether it is an outlier or not.

And we can check the data type of each feature by using this command :

df_train.info()

Using the above commands is sometimes enough to give us a basic intuition about the dataset if we are working with a small number of features (such as the Titanic dataset).

However, this dataset has 79 features, so I prefer using a spreadsheet to scan through them. I created a spreadsheet and filled in the following fields for each feature :

  1. Data Type (Numerical, Categorical or Ordinal)
  2. Expectation of feature importance (Low, Med, High)

We can see that this dataset has 33 numerical, 31 categorical, and 15 ordinal features. Knowing the data type of these features will come in handy when we do Standardization and OneHotEncoding later.
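If you prefer to sanity-check these counts in code, a rough sketch (my own, not part of the project code) could look like the snippet below. Note that pandas can only separate numeric from string columns; splitting ordinal from categorical features still needs the manual judgement recorded in the spreadsheet.

# Rough dtype check on the df_train frame loaded earlier: numeric vs. string only
numeric_cols = df_train.select_dtypes(include='number').columns
object_cols = df_train.select_dtypes(include='object').columns
print(len(numeric_cols), 'numeric columns,', len(object_cols), 'string columns')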

I added the expected importance of each feature just to sharpen my judgement and understanding of each feature with respect to the target variable (SalePrice).

We can also see that a lot of variables are highly correlated with each other (e.g. GarageQual and GarageCond). Intuitively, I think this dataset is a good candidate for practicing regularization.

This is the spreadsheet I created for this dataset

Data Exploration : Visualization

Let’s create some graphs to see how the data is distributed. The feature that I expect to affect SalePrice the most is OverallQual. Let’s see the relationship between the two :

boxplot_custom_by_sales('OverallQual')

We can see the expected positive correlation between OverallQual and SalePrice, with some outliers. We will deal with some extreme outliers later in this post.
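The helper itself lives in DataExploration.py; as a rough idea of what such a function might look like (a sketch of mine, assuming seaborn and matplotlib, not the author’s exact implementation):

import matplotlib.pyplot as plt
import seaborn as sns

def boxplot_custom_by_sales(feature):
    # Box plot of SalePrice for each value of the given feature
    # (uses the df_train frame loaded earlier)
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=feature, y='SalePrice', data=df_train)
    plt.title('SalePrice by ' + feature)
    plt.show()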

Next, let’s check the importance of SaleCondition (the name of this feature drew my attention). I expect houses sold under abnormal conditions to be cheaper than those sold normally :

pointplot_custom_by_sales('SaleCondition')

SaleCondition seems to affect SalePrice relatively strongly. Houses with a Partial SaleCondition tend to have higher prices than houses in other conditions.

There are a lot of features we could visualize, so I created some basic helper methods that take a feature name as a parameter for flexibility. If you download the source code, you can check DataExploration.py and call those methods based on your needs.

This article would be extremely long if I wrote down everything I explored, including all the graphs. I kept exploring the data until I felt comfortable with it. I also used the spreadsheet to do some simple analysis because this is a relatively small dataset (e.g. how many houses in our observations have a swimming pool?).

You can generate the graphs by executing the code shown in the captions.

Data Exploration : Univariate Analysis

Univariate Analysis is useful for checking which features correlate the most with the target variable. We will use the Pearson method to calculate the correlation coefficient, which ranges between -1 and 1.

A negative coefficient means the feature is negatively correlated with the target variable. In DataExploration.py, I use the corr method from the pandas library and visualize the result with a heatmap :

gen_top10_corr_matrix()
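The helper is defined in DataExploration.py; the idea behind it could be sketched roughly like this (my own version, assuming seaborn, not the author’s exact code):

import matplotlib.pyplot as plt
import seaborn as sns

# Pick the 10 features most correlated (Pearson) with SalePrice and plot
# the correlation matrix of just those features as a heatmap
corr = df_train.select_dtypes(include='number').corr()
top10 = corr['SalePrice'].abs().sort_values(ascending=False).head(10).index
sns.heatmap(df_train[top10].corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()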

As expected, OverallQual lives up to its name. GrLivArea also has a significant correlation with the target variable. On the other hand, we can see that GarageArea and GarageCars are highly correlated with each other (as expected, more cars need more garage area). We will use this information later.

Having too many highly correlated features can decrease the performance of the model. One strategy is to reduce the redundant features by using Random Forest. One advantage of using Random Forest is that it will punish a redundant feature once a similar feature has already been chosen.

For example, if GarageArea has been chosen, GarageCars will have very low feature importance.

Although Random Forest would probably work well on this dataset, I will focus on the Lasso method in this post, because I have already tried using Random Forest as a feature selection method on another dataset.

Note that my priority is to explore and prove the concepts of Machine Learning, not to get the best Kaggle score (although a better score would be great).

I might come back and compare Random Forest with Lasso next time.

2. Data Preparation

We will do 6 things in this part :

  1. Handle missing data
  2. Handle Outliers (Later, on the second iteration)
  3. Converting Categorical and Ordinal data to Int
  4. Standardize data
  5. Feature Engineering (Later, on the second iteration)
  6. Split data to training and testing set (Cross Validation)

We will leave out feature engineering in the first iteration to see whether it improves our model’s performance. We will also handle the outliers later.

In practice, it might be wise to remove outliers and do feature engineering right from the start to save time.

Data Preparation : Handle Missing Data

We have some options to handle missing data, such as :

  1. Remove the column entirely
  2. Remove the observation (row)
  3. Impute the missing value by mean, mode, or other advanced methods
  4. Flag the missing value, for example ‘NA’ or -1

The first option is a good candidate if most of the values are missing (so imputing the mean/mode would not make sense). The second option is fine if only a small fraction of the feature’s values are missing. However, we should drop features or observations with care, because we might lose important information for our model.

The third option is a very popular choice, since we can be creative with the imputation method. The fourth option is a safe choice that does not distort the variance of the training data; I have seen some people use it for features with a lot of missing values.
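As a concrete illustration of options 3 and 4 with pandas (a sketch of mine; the column names below do have missing values in this dataset, but these are not necessarily the choices made in the project):

# df : a copy of the raw training frame (illustration only)
# Option 3: impute with the mean (numerical) or the mode (categorical)
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
df['MasVnrType'] = df['MasVnrType'].fillna(df['MasVnrType'].mode()[0])

# Option 4: flag the missing values with an explicit 'NA' category
df['FireplaceQu'] = df['FireplaceQu'].fillna('NA')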

I use a combination of those 4 options in this project. These are some features with missing values :
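A quick way to produce such a list (a sketch, not the exact code used in the project):

# Count missing values per feature and show only the features that have any
missing = df_train.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))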

If we look at these features closely, we will see that some of them are relatively redundant (highly correlated) with other, non-missing features. Hence I chose to drop these features, because there are other features we can use in their place. For example, we could use GarageCond instead of GarageYrBlt.

I also decided to remove a single observation (the 2nd option) because it is the only one with a missing Electrical value (not null, but a literal “NA” value).
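A sketch of how these two drops might look in pandas; the column list is illustrative only, and the actual choices are in DataPrepration.py:

# Drop features that are mostly missing or redundant with non-missing features
# (illustrative subset only)
combine = combine.drop(['GarageYrBlt', 'PoolQC', 'MiscFeature'], axis=1)

# Remove the single observation with the missing Electrical value; depending on
# how the CSV was read, that value may be the string 'NA' or an actual NaN
combine = combine[(combine['Electrical'] != 'NA') & combine['Electrical'].notnull()]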

Handling missing data on a Kaggle dataset is a little bit complicated because the missing data in the training set differ from those in the prediction set. In this post, I will talk about the training set only. You can see the complete code in DataPrepration.py (I use the other 3 options for the prediction set).

Retaining those features by using different imputing techniques might yield a better result. However, in this post we will focus on the effect of Feature Engineering.

Data Preparation : Converting Categorical and Ordinal data to Int

It is considered a best practice to convert categorical and ordinal features to numbers/integers. Ordinal features are features whose values have a rank/order (e.g. GarageQual).

We can use LabelEncoder to conveniently convert ordinal string data to integers :

label_encoder = LabelEncoder()
combine['GarageQual'] = label_encoder.fit_transform(combine['GarageQual'].values)

For categorical features that have no clear rank in their values (e.g. SaleCondition), we shouldn’t use LabelEncoder, because the model might infer that a certain value is larger than another based on the assigned integers. For example, if “Partial” (a value of SaleCondition) is encoded as 0 and “Normal” (another value of SaleCondition) as 1, the model might think “Normal” is larger than “Partial”.

For the categorical data, we should “split” the values into new binary features (also known as OneHotEncoding) :

SaleCondition = pd.get_dummies(combine['SaleCondition'], prefix='SaleCondition')

The code above will generate new features based on the SaleCondition values, such as SaleCondition_Partial and SaleCondition_Normal (both of them binary features: 1 if true and 0 otherwise).
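To actually use these dummy columns, they are typically concatenated back onto the frame and the original column dropped; a minimal sketch (not necessarily exactly how the project code does it):

# Attach the new binary columns and drop the original categorical column
combine = pd.concat([combine, SaleCondition], axis=1)
combine = combine.drop('SaleCondition', axis=1)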

Data Preparation : Standardize Data

Standardizing or normalizing the training data is important because most machine learning models converge much faster if the features are on the same scale. Decision Trees and Random Forests are among the few models where we don’t need to worry about feature scaling.

Standardization centers a feature’s mean at 0 and scales its standard deviation to 1 (it does not change the shape of the distribution). For a value x of a feature with mean μ and standard deviation σ, the standardized value is z = (x − μ) / σ.

We compute the mean and standard deviation of a particular feature and apply this formula to each observation/value. Conveniently, we can use sklearn’s StandardScaler :

numerical_features = combine.select_dtypes(include=[np.float64, np.int64]).columns.values
stdsc = StandardScaler()
df_train[numerical_features] = stdsc.fit_transform(df_train[numerical_features])

Our final training data has 1459 observations and 201 features. Having a large number of features (relative to the number of observations), many of which are highly correlated, means regularization will probably work well on this training data.

Data Preparation : Split Training & Testing Data

The final thing we need to do before training our models is to split our dataset into a training and a testing set. We need to do this so we can estimate the predictive power of our model on unseen data (the testing set).

Some algorithms also have hyperparameters that we need to set and test manually. Hence, we would normally need another set, called the validation set.

For example, we could put 60% of our data into the training set, 20% into the validation set, and the remaining 20% into the test set. However, splitting our data this way reduces the number of observations our model can learn from.

A solution to this problem is the Cross Validation procedure. It removes the need for a separate validation set, so in this example the model will be able to learn from 80% of our data.

We will use k-fold Cross Validation. The training set is split into k folds, and the following procedure is used (for example, with k = 5) :

  1. The model is trained on 4 of the folds
  2. The remaining fold, which was not used for training, acts as the “test set” for this iteration
  3. Move on to the next fold as the “test set”
  4. Repeat the procedure above until every fold has been used as the test set once
  5. Compute the average score (via the scoring parameter)

k-fold Cross Validation is computationally expensive because the model is trained k times. However, for a small dataset such as this house prices dataset, my i3 laptop can run 10-fold cross validation reasonably fast for most algorithms.
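To make the procedure above concrete, here is a minimal manual k-fold sketch using sklearn’s KFold. X and y are placeholders for the prepared features and target; the actual training code below relies on cross_val_score instead of a hand-written loop.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=1)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Steps 1-2: train on 4 folds, evaluate on the held-out fold
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X.iloc[test_idx])
    fold_scores.append(mean_squared_error(y.iloc[test_idx], preds))
    # Steps 3-4: the loop moves the held-out fold until each has been used once

# Step 5: average the score across the folds
print('Average CV MSE :', np.mean(fold_scores))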

We will follow the k-fold Cross Validation procedure later when training our models. For now, let’s prepare the data :

X_train, X_test, y_train, y_test = train_test_split(
    df_train, df_target, test_size=0.2, random_state=1)

The code above puts 80% of our data into the training set (X_train and y_train) and the remaining 20% into the test set. The random_state parameter is important for reproducibility.

3. Training Models

It’s time to start training the models to predict the house prices. We will explore some well-known algorithms and compare their predictive power.

We will try both parametric and non-parametric algorithms. Parametric means the algorithm has a fixed number of parameters. One of the simplest examples of a parametric algorithm is Linear Regression: it always has a fixed number of parameters (a weight coefficient for each feature), regardless of the number of observations.

In contrast, non-parametric algorithms don’t have a fixed number of parameters; their complexity grows with the number of observations. For example, the depth of a Decision Tree grows with the size of the training data (assuming we don’t prune it).

Parametric Algorithms :

  1. Linear Regression
  2. Linear Regression with L1 Regularization (Lasso)
  3. Linear Regression with L2 Regularization (Ridge)
  4. Linear Regression with L1 & L2 Regularization (ElasticNet)

Non-Parametric Algorithms :

  1. SVM with RBF kernel (non linear)
  2. Decision Tree Regressor
  3. Random Forest Regressor (ensembling)
  4. XGBoost Regressor (boosting)

Let’s start training those models, one by one.

Training Models : Linear Regression

You can check the complete code in training/TrainLinReg.py

Linear Regression is an algorithm that uses a linear equation to draw a “line of best fit” through our training data. The goal of Linear Regression is to describe the relationship between each feature and the target variable by learning a weight coefficient for each feature.

Let’s try to fit a basic Linear Regression model directly on our training data :

model = LinearRegression(n_jobs=-1)
model = model.fit(X_train, y_train)

The n_jobs parameter is used for better performance: the value -1 tells the model to use all available processor cores for the computation. This parameter is available in other models as well.

Let’s do 10-fold cross validation and check the average mean squared error :

scores_lin_reg = cross_val_score(model, X_train, y_train, cv=10, scoring='mean_squared_error')

The model returns the following result :

Train R² : 0.915017392561
Train MAE : 15345.1233933
CV MSE (Mean) : -3.99818567166e+29, STD : 1.15033594621e+30
Test R² : -1.11363248899e+19
Test MAE : 1.86831400007e+13

R² is known as the “Coefficient of Determination”. It gives us an idea of how much of the variation in the target variable is explained by the fitted model. The best possible score is 1.
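For reference, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean. A tiny sanity-check of that formula against sklearn’s r2_score (my own illustration, with made-up numbers):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot)       # manual R^2
print(r2_score(y_true, y_pred))  # same value from sklearn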

The result is a clear indication of overfitting (in fact, this looks like very extreme overfitting). The model has a reasonably high R² score on the training data (0.915). However, the 10-fold CV and test results are broken (look at the 1.86e+13 mean absolute error).

This is a good chance to prove the benefit of 2 things :

  1. Regularizations (L1, L2 or both)
  2. Feature Selection (by Lasso/Random Forest/etc)

We will do both on the next part.

Training Models : Lasso/L1 Regularization

You can check the complete code in training/TrainLasso.py

Lasso is Linear Regression with L1 regularization. We can specify the regularization strength via the alpha parameter. L1 regularization leads to a sparse solution: some features will have 0 coefficients (hence are not used). This is the reason Lasso can be used for feature selection.

To make our lives easier, sklearn provides LassoCV with built-in k-fold cross validation. This means the class will try a grid of alpha values in the background, similar in spirit to GridSearchCV (we will talk about GridSearchCV later).

We can set the alphas parameter (a list of alpha values) and the model will expose the chosen alpha via the alpha_ attribute. Conveniently, we don’t even have to pass the alphas parameter: the model is smart enough to generate a grid and pick a good alpha_ for us.

model = LassoCV(
    cv=10,
    max_iter=50000,
    n_jobs=-1,
    random_state=1
)

The max_iter parameter is the maximum number of iterations the algorithm will perform if it has not converged. The model will give us a warning if the number of iterations reaches max_iter before the algorithm converges.

The model returns the following result :

Best Alpha : 166.606822805
Train MAE : 16602.4195964
Train R^2 : 0.89049288923
Test MAE : 17409.4759498
Test R^2 : 0.89815991391

The L1 regularization successfully reduces the overfitting. The model works reasonably well on the test set, and the R² difference between the training set and the test set is very small (approximately 0.008).

The weight coefficient of each feature is stored in the coef_ attribute. I use this attribute to do feature selection by removing the features that have a 0 coefficient. It turns out Lasso only uses 72 out of the 201 features (the remaining 129 features have 0 weight).
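Extracting those features from the fitted LassoCV model can be done roughly like this (my reconstruction; the actual code is in training/TrainLasso.py). The resulting list is the filtered_features used in the next section.

# Keep only the columns whose Lasso coefficient is non-zero
filtered_features = X_train.columns[model.coef_ != 0]
print(len(filtered_features), 'features kept out of', X_train.shape[1])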

Next, let’s train Linear Regression with the reduced features from Lasso.

I wasn’t really sure about the chosen alpha at first (despite the good score) because it seems a little bit extreme. So I will try manually setting the alpha to much smaller values later, in the 2nd iteration. Stay tuned.

Training Models : Linear Regression with Feature Selection (by Lasso)

You can check the complete code in training/TrainLinReg.py

The basic Linear Regression suffered from extreme overfitting with the full set of features. Let’s re-run the model with the reduced features from Lasso :

X_train = X_train[filtered_features]
X_test = X_test[filtered_features]
model = LinearRegression(n_jobs=-1)
model = model.fit(X_train, y_train)

Let’s take a look at the result :

Train R^2 : 0.906467558636
Train MAE : 16226.9548914
CV MSE (Mean) : -1339325477.56, STD : 1437247128.23
Test R^2 : 0.893009775964
Test MAE : 17749.2327002

The overfitting is gone. This shows that performing feature selection can reduce overfitting. Intuitively, reducing the number of features simplifies the model, so it generalizes better to unseen data (the test set).

Training Models : Ridge/L2 Regularization

You can check the complete code in training/TrainRidge.py

The L1 regularization is working great (at least compared to the model without regularization). However, I’m curious to see the effect of L2 regularization. It is a less extreme form of regularization compared to L1: the result will not be as sparse, since L2 shrinks coefficients toward zero but rarely sets them exactly to zero.
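Once both models are fitted (RidgeCV is fitted just below), one way to see this difference is to count the zero or near-zero coefficients in each. A small sketch, where lasso_model and ridge_model are assumed to be the fitted LassoCV and RidgeCV estimators:

import numpy as np

# Lasso sets many coefficients exactly to zero; Ridge usually does not
print('Lasso zero coefficients :', np.sum(lasso_model.coef_ == 0))
print('Ridge near-zero coefficients :', np.sum(np.isclose(ridge_model.coef_, 0)))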

We can use RidgeCV, similar to the LassoCV we used earlier :

model = RidgeCV(cv=10)

model = model.fit(X_train, y_train)

Note that, similar to LassoCV, RidgeCV is also capable of automatically choosing the best alpha for us.

Best Alpha : 10.0
Train MAE : 16955.7932398
Train R^2 : 0.876719148189
Test R^2 : 0.891734627496
Test MAE : 18227.3300117

The chosen alpha is not as extreme as the one for L1 regularization (we got an alpha of 166.6 earlier), and it works as intended to reduce overfitting.

However, L1 regularization seems like a better choice for this dataset because it gives us slightly better R² and MAE scores on the test set. Let’s confirm this by training ElasticNet.

Training Models : ElasticNet/L1 & L2 Regularization

You can check the complete code in training/TrainElasticNet.py

ElasticNet combines L1 & L2 regularization. This is controlled via the l1_ratio parameter, which falls between 0 and 1: 0 means the model uses only L2 regularization (identical to Ridge), and 1 means it uses only L1 regularization (identical to Lasso). If the ratio is strictly between 0 and 1, the model combines L1 & L2 regularization.

Similar to RidgeCV and LassoCV, sklearn also provides a built-in ElasticNetCV for this model.

model = ElasticNetCV(
    cv=10,
    l1_ratio=[.1, .5, .7, .9, .95, .99, 1]
)

The model will choose the best l1_ratio from the list we passed above. Let’s look at the result :

Best Alpha : 166.606822805
Best L1 Ratio : 1.0
Train MAE : 16602.4195964
Train R^2 : 0.89049288923
Test MAE : 17409.4759498
Test R^2 : 0.89815991391

The best L1 ratio is 1. This means the model is not using L2 regularization at all. Hence, the chosen alpha and the scores are identical to LassoCV’s.

Click here for part 2