Top 5% Housing Price Prediction — Beginner’s Guide

Felipe Braga
Published in Analytics Vidhya
7 min read · Jul 16, 2020

The dataset used in this walkthrough is the Ames Housing dataset, compiled by Dean De Cock for use in data science education and chosen for Kaggle’s House Prices: Advanced Regression Techniques competition. I participated in this competition and decided to write an article for new data science students. This guide is split into four steps:

  1. Exploration
  2. Cleaning
  3. Feature engineering
  4. Modeling and prediction

Exploration

The first step is to take a look at our features and make some predictions about what influences the price of a house. Thinking more abstractly, I’d expect the location, size, quality, newness, and luxury features of a house to correlate strongly with its price, based on personal experience. If these predictions are confirmed, we can use them for feature engineering later on.

Below we can see the absolute Pearson correlation of our features with our target variable. Our guess about what influences a house’s price turns out to be correct, so let’s keep it in mind for feature engineering later on. Remember to look at the absolute correlations, since there can be features with a strong negative correlation.

SalePrice correlation with features
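
If you’d like to reproduce this ranking, a minimal sketch with pandas looks like the following. I’m assuming the competition’s training CSV has been downloaded as train.csv; the column names come from the dataset itself.

import pandas as pd

train = pd.read_csv("train.csv")

# Absolute Pearson correlation of every numeric feature with SalePrice
correlations = train.corr(numeric_only=True)["SalePrice"].abs().sort_values(ascending=False)
print(correlations.head(10))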

Now that we already have an idea of our important variables, let's start off with a univariate analysis of our target variable since it’s the most important.

Histogram of SalePrice

The distribution is heavily peaked (high kurtosis), positively skewed (positive skewness), and clearly deviates from a normal distribution. Below you can see the values for SalePrice; if you don’t know what these measures mean, check out this article.

Skewness: 1.882876
Kurtosis: 6.536282
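
These two numbers come straight from pandas, continuing with the train DataFrame loaded in the earlier snippet:

print("Skewness: %f" % train["SalePrice"].skew())
print("Kurtosis: %f" % train["SalePrice"].kurt())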

This distribution is also present in other features such as GrLivArea.

Histogram of GrLivArea

If we go on to choose a model that is sensitive to heteroscedasticity (unequal variance in error terms across the feature’s values), this will need to be fixed with a transformation such as Box-Cox or applying a logarithm.
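
For reference, a Box-Cox transform is a one-liner with SciPy (it needs strictly positive values, which SalePrice satisfies); in the cleaning step I’ll stick with the simpler log transform.

from scipy.stats import boxcox

# boxcox returns the transformed values and the lambda it fitted
sale_price_boxcox, fitted_lambda = boxcox(train["SalePrice"])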

We analyzed our target variable, but how do our features relate to it? This is called bivariate analysis. Below we can see a scatter plot of SalePrice vs TotalBsmtSF. Their relationship looks linear (maybe slightly exponential?), and we can start seeing some outliers, such as the house on the rightmost side of the graph, which has a huge basement of more than 6000 square feet but a low sale price and doesn’t follow the trend. This outlier could be a house in a rural area, which would explain the mismatch between price and area.

SalePrice x TotalBsmtSF scatter plot

The same pattern appears for GrLivArea: we can see the same outlier with a living area of more than 5000 square feet! We’ll have to remove these outliers, or else our model will be trained on them and our predictions will worsen.

SalePrice x GrLivArea scatter plot

For my outlier removal, I manually went through the features, plotted box and scatter plots for each of them, and analyzed the outliers to decide whether they should actually be removed.
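
The exact rows I removed came from that manual inspection, but the shape of the code is roughly this (the thresholds below are illustrative, not the exact cutoffs I used):

import matplotlib.pyplot as plt

# Visual check before dropping anything
plt.scatter(train["GrLivArea"], train["SalePrice"])
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

# Drop the huge but cheap houses that clearly break the trend
outliers = train[(train["GrLivArea"] > 4500) & (train["SalePrice"] < 300000)].index
train = train.drop(outliers)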

Cleaning

The first thing that should be done is making sure your features have the correct data types. In this dataset we have several columns, such as YrSold, MoSold, and YearBuilt, which are stored as numbers but should be treated as categorical variables.

Fixing column data types

It may seem strange to treat a year as a categorical variable. However, we can’t do meaningful mathematical operations with these numbers. Think about it this way: is the year 2000 twice as much as the year 1000? It isn’t, and since we can’t reason like that, the column should be categorical.
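
Since the original snippet is shown as an image, here is a minimal sketch of the dtype fix, casting to strings so pandas treats the columns as categorical (the column list follows the ones mentioned above):

# Years and months are labels here, not quantities
for col in ["YrSold", "MoSold", "YearBuilt"]:
    train[col] = train[col].astype(str)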

Next, we’ll remove all columns with a high percentage of missing values.

Missing values percentage per column
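
A sketch of that step; the 80% cutoff is my own assumption, so pick whatever threshold the table above suggests:

# Fraction of missing values per column, highest first
missing_pct = train.isnull().mean().sort_values(ascending=False)

# Drop columns where more than 80% of the values are missing
train = train.drop(columns=missing_pct[missing_pct > 0.80].index)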

The remaining columns still have missing values and we’ll have to impute them. I like to treat numeric and categorical features separately so I’ll start off with the numeric ones.

Missing numeric columns

For the numeric columns, I filled most of the missing values with 0, since a missing value means the feature isn’t present in the observation. If GarageCars or TotalBsmtSF is missing, the house doesn’t have a garage or basement. The only column I imputed differently was LotFrontage: it is largely determined by zoning, so I imputed it with the median of the neighborhood.

Imputation of numeric columns
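
In code, the numeric imputation looks roughly like this. I’m assuming train and test have already been concatenated into the train_test_features DataFrame that appears later in the feature-engineering snippet:

# LotFrontage is driven by zoning, so fill it with the neighborhood median first
train_test_features["LotFrontage"] = train_test_features.groupby("Neighborhood")["LotFrontage"] \
    .transform(lambda s: s.fillna(s.median()))

# For the remaining numeric columns, missing means the feature isn't present (no garage, no basement, ...)
numeric_cols = train_test_features.select_dtypes(include="number").columns
train_test_features[numeric_cols] = train_test_features[numeric_cols].fillna(0)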

We’ll repeat the same process for categorical columns. I replaced the majority of missing values with “None”, since a missing value means the observation doesn’t have the feature, and the remaining ones were imputed with their respective modes. For the MSZoning column, the mode was taken from observations with the same MSSubClass, which identifies the type of house sold.

Imputation of categorical columns
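
And the categorical counterpart, sketched under the same assumptions (the mode-filled columns are examples rather than the exhaustive list):

# MSZoning depends on the type of dwelling, so take the mode within each MSSubClass
train_test_features["MSZoning"] = train_test_features.groupby("MSSubClass")["MSZoning"] \
    .transform(lambda s: s.fillna(s.mode()[0]))

# A few columns get their overall mode...
for col in ["Electrical", "KitchenQual", "Functional"]:
    train_test_features[col] = train_test_features[col].fillna(train_test_features[col].mode()[0])

# ...and everything still missing becomes the literal string "None"
categorical_cols = train_test_features.select_dtypes(include="object").columns
train_test_features[categorical_cols] = train_test_features[categorical_cols].fillna("None")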

Our last steps will be to fix the skewness by making our data more Gaussian-like and removing features with a very low variance since they’re uninformative to machine learning models.

  • There are more sophisticated ways to make our features more normal, such as a Box-Cox or Yeo–Johnson transformation, but I’m just going to use a simple log(x + 1) transformation. I’m adding the constant because some of our features, such as TotalBsmtSF, can be 0 and log(0) is undefined.
  • Low-variance features are uninformative and make it easy for our model to overfit to the few observations with different values. If a single value accounts for more than 99.5% of a column, the column will be dropped. Both steps are sketched in code right after this list.
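
Here is what those two steps look like in code. The 0.5 skewness cutoff for deciding which columns to transform is my assumption; the 99.5% rule is the one described above:

import numpy as np

# Log-transform the skewed numeric columns; log1p handles the zeros in columns like TotalBsmtSF
skewness = train_test_features.select_dtypes(include="number").skew().abs()
skewed_cols = skewness[skewness > 0.5].index
train_test_features[skewed_cols] = np.log1p(train_test_features[skewed_cols])

# The target gets the same treatment (remember to invert with expm1 at prediction time)
y_train = np.log1p(train["SalePrice"])

# Drop columns where a single value accounts for more than 99.5% of the rows
low_variance = [col for col in train_test_features.columns
                if train_test_features[col].value_counts(normalize=True).iloc[0] > 0.995]
train_test_features = train_test_features.drop(columns=low_variance)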

In the following graphs, we can see visually how our data is closer to a Gaussian (normal) distribution after the log transformation.

GrLivArea after log transformation
SalePrice after log transformation

Feature Engineering

This is where creativity comes into play. Remember the abstract properties that drive a house’s price that I talked about in the exploration section? We will use those properties to help us create new features from our existing ones. This can improve your score significantly, and some competitions have been won through impressive feature engineering rather than advanced models. Below you can see some new numeric features such as YearsSinceBuilt and TotalOccupiedArea, which capture newness and size as we discussed before.
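
The feature list is an image in the original post, so here is the flavor of it. The exact formulas are my reconstruction; YearsSinceBuilt and TotalOccupiedArea are the names mentioned above:

# Newness: how old the house was when it was sold
train_test_features["YearsSinceBuilt"] = (
    train_test_features["YrSold"].astype(int) - train_test_features["YearBuilt"].astype(int)
)

# Size: total occupied area across the basement and above-ground floors
train_test_features["TotalOccupiedArea"] = (
    train_test_features["TotalBsmtSF"]
    + train_test_features["1stFlrSF"]
    + train_test_features["2ndFlrSF"]
)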

We can also make features from categorical columns! Some of our categorical columns are ordinal which means that they have a natural ranking and we can use that to make new features. Each of the values was replaced with an integer representing its “rank”.

map1 = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0}

train_test_features["TotalGarageQual"] = train_test_features["GarageQual"].replace(map1) * train_test_features["GarageCond"].replace(map1)
train_test_features["TotalExteriorQual"] = train_test_features["ExterQual"].replace(map1) * train_test_features["ExterCond"].replace(map1)

If you’re wondering why I didn’t apply this encoding to the columns themselves, it’s because the distance between each level is a matter of judgment and isn’t necessarily equal, so I decided to leave them as-is and one-hot encode them later. I still mapped them to build these new features because they are informative for our model, and the trade-off isn’t the same as actually encoding the ordinal columns.

Modeling

Here comes the fun part, machine learning! I decided to use a VotingRegressor from the sklearn library, which basically combines multiple regressors and averages their predictions to produce a final one. Which base models did I use?

  • Lasso
  • Ridge
  • ElasticNet
  • GradientBoostingRegressor

Our data has a lot of linear relationships, so I used Lasso, Ridge, and ElasticNet, which are popular regularized linear regressors. They implement ordinary least-squares linear regression with a penalty on large coefficient values. Lasso regression can even set some coefficients to 0 and remove unimportant variables.

GradientBoostingRegressor is the odd one out since it isn’t a linear model, but I added it to balance out the biases assumed by the linear regression models.

Prediction code

You could tune the hyperparameters and find better settings for the models, but I decided to keep it simple and leave them as-is. Also, remember to convert the predictions back, since our model outputs them on the log scale. In NumPy, you can do this with the expm1 function.
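
The prediction code is also shown as an image in the original post, so here is a rough sketch of what this setup can look like. The hyperparameters are placeholders rather than the exact ones I used, X_train/X_test stand for the encoded feature matrices, y_train is the log-transformed target, and the scaler is purely illustrative; pipe is the object the cross-validation line below expects.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Average the predictions of the four base regressors
voting = VotingRegressor(estimators=[
    ("lasso", Lasso(alpha=0.0005)),
    ("ridge", Ridge(alpha=10)),
    ("elasticnet", ElasticNet(alpha=0.0005, l1_ratio=0.9)),
    ("gbr", GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05)),
])

pipe = make_pipeline(RobustScaler(), voting)
pipe.fit(X_train, y_train)

# The target was log-transformed, so bring the predictions back with expm1
predictions = np.expm1(pipe.predict(X_test))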

Result

np.sqrt(-cross_val_score(pipe, X_train, y_train, scoring='neg_mean_squared_error')).mean()

After running cross-validation of our model on the training data, the final score (RMSE on the log-transformed target) was 0.1002929, and the official Kaggle score was 0.12070, which was in the top 5% of scores for this competition at the time of writing.

I hope this helps and inspires you on your data science journey, and if you’d like my Jupyter notebook, I’d be pleased to send it to you.
