Predicting House Prices using Machine Learning

Simple tutorial on how to conduct Exploratory Data Analysis

Shahwaiz
6 min read · Feb 5, 2019

We will predict the sale prices of homes in Ames, Iowa, based on 79 descriptive features. The training set has 1460 instances. You can learn more about this project here on Kaggle.

We will approach the problem by dividing the 79 features into numerical and categorical features, depending on the data type of each feature. We will then identify the features that play a vital role in determining the sale price (target variable) of the house. Along the way, we will explore our data and check for any data quality issues, such as outliers or missing values, that could adversely affect our model. Toward the end, we will use the chosen features to train our machine learning model. We will use Pandas for data analysis and Seaborn for data visualization. You can view my code here.

I am going to write the column names verbatim from the training set. Some column descriptions can be deduced from the name itself, such as GarageYrBlt, which contains the year the garage was built. Others, such as GrLivArea, which gives the above-ground living area in square feet, are harder to work out from the abbreviation alone. You can view the full description of each column here.

Numerical Features

We are going to deal with the numerical features first. trainingSet.select_dtypes(include='number').isna().any() gives us the list of all numerical columns and tells us which of them have missing values. Three of the 37 numerical columns have missing values: LotFrontage, MasVnrArea, and GarageYrBlt.
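A minimal sketch of this step (the file name train.csv is an assumption; adjust it to wherever you saved the Kaggle data):

import pandas as pd

# Load the Kaggle training data (file path is an assumption)
trainingSet = pd.read_csv('train.csv')

# Keep only the numerical columns and count missing values per column
numerical = trainingSet.select_dtypes(include='number')
missing_counts = numerical.isna().sum()

# Show only the numerical columns that actually contain missing values
print(missing_counts[missing_counts > 0])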

But how do we know which numerical features influence the sale price of the house? One approach is to create an annotated heatmap. This lets us easily see how strongly each variable is correlated with the others. Each cell contains the correlation coefficient, which tells us the strength of the linear relationship between two variables.
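Here is a minimal sketch of how such a heatmap can be produced with Seaborn (the figure size and color map are my own choices, not necessarily the original settings):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numerical columns only
corr = trainingSet.select_dtypes(include='number').corr()

# Annotated heatmap: each cell shows the correlation coefficient
plt.figure(figsize=(18, 14))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()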

Heatmap containing 37 numerical features

We are interested in finding which features play a significant role in determining the sale price of the house. We are going to set a threshold and include every numerical feature whose correlation coefficient with SalePrice is greater than that threshold. With a threshold of 0.45, we get the following features: OverallQual, YearBuilt, YearRemodAdd, MasVnrArea, TotalBsmtSF, 1stFlrSF, GrLivArea, FullBath, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea.
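One way to apply the threshold programmatically, sketched with the same 0.45 cutoff as above:

# Correlation of every numerical feature with the target variable
corr_with_target = trainingSet.select_dtypes(include='number').corr()['SalePrice']

# Keep the features whose correlation with SalePrice exceeds the threshold
threshold = 0.45
strong_features = corr_with_target[corr_with_target > threshold].drop('SalePrice')
print(strong_features.sort_values(ascending=False))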

Since LotFrontage falls below our threshold, we drop it and ignore the NA values in that column. MasVnrArea and GarageYrBlt have 8 and 81 missing values, respectively. We can replace the missing values in MasVnrArea with the median of that column. This is called imputation: replacing missing values with an estimate based on the feature values that are present. trainingSet.MasVnrArea.fillna(trainingSet.MasVnrArea.median(), inplace=True) replaces the 8 null values with the median of the MasVnrArea column. For GarageYrBlt, we could replace each missing value with the corresponding YearBuilt value; but since the two variables are strongly correlated (with a correlation coefficient of 0.83), we choose to drop GarageYrBlt instead. This leads us to the notion of redundant features.
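A short sketch of both options described above:

# Median imputation for the 8 missing MasVnrArea values
trainingSet['MasVnrArea'].fillna(trainingSet['MasVnrArea'].median(), inplace=True)

# One option for GarageYrBlt: fill each missing value with the corresponding YearBuilt
# trainingSet['GarageYrBlt'].fillna(trainingSet['YearBuilt'], inplace=True)

# The option taken here: drop GarageYrBlt entirely, since it is strongly
# correlated (about 0.83) with YearBuilt
trainingSet.drop(columns=['GarageYrBlt'], inplace=True)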

We can also use the heatmap to find redundant features: a descriptive feature that is strongly correlated with another descriptive feature. Removing redundant features also helps us deal with the curse of dimensionality. We see that GarageCars and GarageArea are strongly correlated (since a bigger garage can fit more cars) and have roughly the same correlation with the target feature (SalePrice), so we only need to keep one of them when building our machine learning model; GarageCars is the one that appears in our final feature list. Likewise, TotalBsmtSF and 1stFlrSF are strongly correlated with each other and have roughly the same correlation with SalePrice; we keep TotalBsmtSF and discard 1stFlrSF.
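If you want to surface such redundant pairs programmatically rather than reading them off the heatmap, one possible sketch (the 0.8 cutoff is my own choice):

import numpy as np

# Absolute pairwise correlations between the numerical columns
corr = trainingSet.select_dtypes(include='number').corr().abs()

# Keep only the upper triangle so each pair appears once, then list the
# strongly correlated pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.8])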

The list of numerical features we feed our model is as follows: OverallQual, YearBuilt, YearRemodAdd, TotalBsmtSF, GrLivArea, FullBath, TotRmsAbvGrd, GarageCars, Fireplaces, MSSubClass. This method of reducing the number of descriptive features in a dataset to just the subset that is most useful is called feature selection. The goal of feature selection is to identify the smallest subset of descriptive features that maintains the overall model performance.

Extra: We use trainingSet.plot(x="GrLivArea", y="SalePrice", kind="scatter") to check for outliers by plotting a scatter plot of GrLivArea vs. SalePrice.

The two points on the bottom right may be outliers. We can either drop the rows (instances) associated with these two points, or plot more graphs to see whether the Ids we identified as outliers in this graph also correspond to the Ids of the outliers we get in a second graph.

A second scatter plot, of GarageArea vs. SalePrice, also shows a strong correlation between the two variables. We can mark the 4 points on the lower right as outliers and use trainingSet.loc[trainingSet.GarageArea > 1200, ["SalePrice"]] to retrieve the Ids of those points. We get the following tables:

We notice that Id 1299 appears to be an outlier in both plots. Although we cannot conclude that Id 1299 is an outlier based on only 2 descriptive features, we are going to drop this row for demonstration using trainingSet.drop(axis=0, index=1298, inplace=True) (with the default zero-based index, the row with Id 1299 sits at index 1298).
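A compact sketch of this outlier check; the > 1200 cutoff for GarageArea comes from the text above, while the > 4000 cutoff for GrLivArea is simply eyeballed from the first plot:

# Scatter plots used to eyeball potential outliers
trainingSet.plot(x="GrLivArea", y="SalePrice", kind="scatter")
trainingSet.plot(x="GarageArea", y="SalePrice", kind="scatter")

# Retrieve the Ids of the suspicious points in each plot
print(trainingSet.loc[trainingSet.GarageArea > 1200, ["Id", "SalePrice"]])
print(trainingSet.loc[trainingSet.GrLivArea > 4000, ["Id", "SalePrice"]])

# Drop the row flagged in both plots; with the default zero-based index,
# the row with Id 1299 sits at index 1298
trainingSet.drop(axis=0, index=1298, inplace=True)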

Notice that GarageArea is not in our final list. I have included it to demonstrate one basic method of detecting outliers.

Categorical Features

Now it’s time to deal with the categorical features and see which of them play a significant role in affecting the sale prices of the homes. For the sake of simplicity, I will use 3 features (BsmtQual, ExterQual, and ExterCond) to introduce basic techniques for converting categorical values into numerical values.

We’ll check whether any of our selected categorical features have missing values. trainingSet.BsmtQual.isna().sum() gives the total number of missing values in the BsmtQual column. BsmtQual has 37 NA values; however, according to the project’s description file, these NA values mean the house has no basement. Therefore, we replace the 37 NA values with “no”. ExterCond and ExterQual have no NA values.

"""replaces the NA values in the BsmtQual column from both the training set and the test set"""for df in [trainingSet,testSet]:
for i in ['BsmtQual']:
df[i].fillna('no',inplace= True)

Furthermore, the selected columns all share the same set of quality levels, which makes them easier to convert into numerical values: Ex for excellent, Gd for good, TA for typical/average, Fa for fair, and Po for poor. An important part of data preprocessing is converting our descriptive features into a language that our machine learning model can understand. We use map to manually assign each level a numerical value; this method of converting is called label encoding. (Scikit-learn has a class, called LabelEncoder, that can do this automatically for you.)

for df in [trainingSet, testSet]:
    for i in ['ExterQual', 'ExterCond', 'BsmtQual']:
        df[i] = df[i].map({'Ex': 1, 'Gd': 2, 'TA': 3, 'Fa': 4, 'Po': 5, 'no': 5})

The way I assigned each value a number is arbitrary; you can assign 1 to Po, 2 to Ex, and so on.
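For completeness, a sketch of the Scikit-learn alternative mentioned above. Note that LabelEncoder assigns integer codes in alphabetical order of the category strings, so the resulting ordering is just as arbitrary as the manual map:

from sklearn.preprocessing import LabelEncoder

# Encode each selected categorical column with integer codes.
# Fitting per DataFrame means the codes could differ between train and test
# if a category is missing from one of them; fitting on the combined values
# avoids that.
le = LabelEncoder()
for df in [trainingSet, testSet]:
    for col in ['ExterQual', 'ExterCond', 'BsmtQual']:
        df[col] = le.fit_transform(df[col].astype(str))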

The final list of descriptive features that our model will use to train is as follows:

feature_numerical = ['OverallQual', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'Fireplaces', 'MSSubClass']
feature_categorical = ['ExterQual', 'ExterCond', 'BsmtQual']

final_features = feature_numerical + feature_categorical

As a final step, we feed these features to our model for training and predict the sale price of each house in the test set using the gradient boosting algorithm.

from sklearn.ensemble import GradientBoostingRegressor

# Training
gb = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.05, max_depth=3,
                               max_features='sqrt', min_samples_leaf=15,
                               min_samples_split=10, loss='huber')
gb.fit(training_set.values, train_target)

# Predictions
predictions_gb = gb.predict(test_set.values)

training_set.values refers to our training set, containing our chosen features, converted to a NumPy array.

train_target is the SalePrice column (our target value).

test_set.values is the test set converted to a NumPy array.
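The original post does not show how these objects are built; a plausible sketch, assuming the trainingSet and testSet DataFrames and the final_features list from above (the submission file layout follows the competition's sample submission):

import pandas as pd

# Assumed construction of the inputs fed to the model
training_set = trainingSet[final_features]   # chosen descriptive features
train_target = trainingSet['SalePrice']      # target variable
test_set = testSet[final_features]           # same features for the test houses

# Kaggle-style submission file built from the predictions
submission = pd.DataFrame({'Id': testSet['Id'], 'SalePrice': predictions_gb})
submission.to_csv('submission.csv', index=False)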

Final Result

After running this script, we get a Root-Mean-Squared-Error (RMSE) of 0.14859 (the competition computes the error on the logarithm of the sale price), which puts us in the top 62%. Participants in the top 10% have an error of around 0.11540. The difference of roughly 0.03 can be attributed to selecting only 3 of the 43 categorical features and to the lack of feature engineering. We can lower the RMSE by inspecting the features more closely and adding more of them, and by employing other techniques such as one-hot encoding for converting categorical values to numerical values or PCA for dimensionality reduction.
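As an illustration of one such technique, a sketch of one-hot encoding the remaining categorical columns with pandas (get_dummies is one common way to do this; it is not part of the original script):

import pandas as pd

# One-hot encode every remaining object-dtype column. Concatenating train and
# test first keeps the dummy columns aligned between the two sets; SalePrice
# will be NaN for the test rows and can be dropped from the test half afterwards.
combined = pd.concat([trainingSet, testSet], keys=['train', 'test'])
combined = pd.get_dummies(combined)
trainingSet_encoded = combined.xs('train')
testSet_encoded = combined.xs('test')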

Thank you for reading. Please comment if you have any questions or any suggestions.
