HDSC Stage F OSP: Predicting House Prices

Ofulue Elizabeth
Published in Hamoye Blog
Nov 25, 2020
Image credit: Breno Assis (unsplash.com)

As part of the Hamoye Data Science Internship, Stage F involves working on an open source project as a team.

This team worked on predicting house prices for some states in the USA based on a selected set of features.

This article explains the approach used in cleaning, analysing, and modelling the data to predict house prices.

The dataset contains the prices of different houses along with features such as the number of bathrooms, bedrooms, and floors, the view, waterfront, and city.

Data Cleaning

The data was loaded and inspected for quality and tidiness.

It was observed that 49 entries have a house price of zero and 2 entries have zero bedrooms and zero bathrooms.

The prices of the houses with zero bedrooms and bathrooms were also implausibly high compared with houses that have such facilities, which suggests an entry error. These entries were therefore removed from the data to improve the accuracy of the analysis and results.
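
The cleaning step could be sketched as below. This is a minimal illustration, not the team's actual code: the sample rows are made up, the column names `price`, `bedrooms`, and `bathrooms` are assumed to match the dataset, and the sketch assumes that both the zero-price rows and the zero-room rows are dropped.

```python
import pandas as pd

# Tiny made-up sample; column names are assumed to match the dataset.
df = pd.DataFrame({
    "price": [313000.0, 0.0, 1095000.0, 420000.0],
    "bedrooms": [3, 4, 0, 2],
    "bathrooms": [1.5, 2.5, 0.0, 1.0],
})


def drop_invalid_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with a zero price, or with zero bedrooms and zero bathrooms."""
    zero_price = frame["price"] == 0
    no_rooms = (frame["bedrooms"] == 0) & (frame["bathrooms"] == 0)
    return frame.loc[~(zero_price | no_rooms)].reset_index(drop=True)


clean = drop_invalid_rows(df)
print(len(df), "rows before,", len(clean), "rows after")
```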

Data Exploration

The distribution and price of houses were explored based on different features like the year each house was built, the number of bedrooms and bathrooms, and the city it is situated in.

Distribution and house prices based on building year.

From the graphs above it can be seen that:

  1. Fewer houses were built in the earlier years than in recent years.

  2. Overall, the year a house was built shows no significant effect on its price.

  3. Apart from a significant price spike around 1991, houses built in the earliest and the most recent years were costlier than those built in the years between.

  4. The higher prices of the oldest houses could be due to their historical value, while higher prices for recent houses are expected.

Distribution and house prices based on city.

From the graphs above, it is observed that:

  1. Cities with many houses tended to be low-priced, while cities with few houses tended to be costly. For example, Seattle has the highest number of houses despite its low prices, whereas Medina and Clyde Hill have high house prices but very few houses.

House prices based on the number of bedrooms, bathrooms, area of living space, and area of the house.

From the charts above, it can be observed that house prices tend to increase with the number of bathrooms, the number of bedrooms, the area of living space, and the area of the house.

Data Preprocessing

It was noticed that the feature yr_renovated contains zero values, which could mean that those houses have never been renovated. These zero values were therefore replaced with the corresponding yr_built values for the purpose of this analysis.

Four variables (date, street, state zip, country) were also dropped because they were not needed.
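
A minimal sketch of these two steps with pandas is shown below. The rows are made up for illustration, and the column name `statezip` is an assumption about how the state/zip column is spelled in the dataset.

```python
import pandas as pd

# Made-up rows; "statezip" is an assumed spelling of the state/zip column.
df = pd.DataFrame({
    "yr_built": [1955, 1991, 2005],
    "yr_renovated": [2005, 0, 0],
    "date": ["2014-05-02", "2014-05-02", "2014-05-03"],
    "street": ["A St", "B Ave", "C Rd"],
    "statezip": ["WA 98133", "WA 98119", "WA 98042"],
    "country": ["USA", "USA", "USA"],
})

# Keep yr_renovated where it is non-zero; otherwise fall back to yr_built.
df["yr_renovated"] = df["yr_renovated"].where(df["yr_renovated"] != 0, df["yr_built"])

# Drop the columns that are not used in the analysis.
df = df.drop(columns=["date", "street", "statezip", "country"])

print(df)
```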

The data preprocessing included:

  • Encoding the categorical variable “city” into numerical variables using the Label Encoder.
  • Detecting and removing outliers using the 1st, 2nd, 3rd, 5th, 10th, 50th, 90th, 92nd, 95th, and 99th percentiles.
  • Checking the correlation coefficient for the possible presence of multicollinearity.

From the above matrix, it is observed that there is a strong correlation between variables “sqft_living”, “sqft_above”, “bathrooms”, “bedrooms”, and “floors”.

Variables “sqft_living” and “sqft_above” were therefore removed to avoid multicollinearity.
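
As an illustration of how such a check might look, the sketch below builds a correlation matrix on synthetic data and lists the feature pairs whose absolute correlation exceeds a cutoff. The data, the helper `correlated_pairs`, and the 0.8 threshold are all assumptions made for this example; the article does not state which cutoff the team used.

```python
import numpy as np
import pandas as pd

# Synthetic data: sqft_above is built to track sqft_living closely.
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 4000, size=200)
df = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_above": sqft_living * 0.8 + rng.normal(0, 50, size=200),
    "sqft_lot": rng.uniform(1000, 20000, size=200),
})

# Absolute Pearson correlation between every pair of columns.
corr = df.corr().abs()


def correlated_pairs(corr_matrix: pd.DataFrame, threshold: float = 0.8):
    """Return (feature_a, feature_b, correlation) for pairs above the threshold."""
    cols = list(corr_matrix.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            value = float(corr_matrix.loc[a, b])
            if value > threshold:
                pairs.append((a, b, value))
    return pairs


pairs = correlated_pairs(corr)
print(pairs)
```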

  • Transforming the target variable.
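
The encoding, outlier-removal, and target-transformation steps listed above could look roughly like the sketch below. Several details are assumptions, because the article does not specify them: the data is synthetic, outliers are trimmed on `price` only, the trimming keeps rows between the 1st and 99th percentiles, and the target transformation is taken to be `log1p`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the real data.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "city": rng.choice(["Seattle", "Medina", "Clyde Hill"], size=500),
    "price": rng.lognormal(mean=13.0, sigma=0.5, size=500),
})

# 1. Encode the categorical "city" column as integer labels.
encoder = LabelEncoder()
df["city"] = encoder.fit_transform(df["city"])

# 2. Inspect the percentiles named in the article, then trim the tails.
percentiles = [0.01, 0.02, 0.03, 0.05, 0.10, 0.50, 0.90, 0.92, 0.95, 0.99]
print(df["price"].describe(percentiles=percentiles))

# Assumed cutoffs: keep rows between the 1st and 99th percentiles of price.
low, high = df["price"].quantile([0.01, 0.99])
trimmed = df[df["price"].between(low, high)].copy()

# 3. Transform the target; log1p is an assumption, not the confirmed method.
trimmed["log_price"] = np.log1p(trimmed["price"])

print(len(df), "rows before trimming,", len(trimmed), "rows after")
```

Note that a label encoder assigns arbitrary integers to cities, so the encoded values carry no meaningful order. Tree-based models can usually cope with this, while linear models may not.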

Modelling

For the purpose of this analysis, 80% of the data was used for training the models, while 20% was used for testing.
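
An 80/20 split like this is commonly done with scikit-learn's `train_test_split`. The sketch below uses placeholder arrays, and the `random_state` value is an arbitrary choice made for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and target: 50 samples with 2 features each.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Hold out 20% of the rows for testing; random_state is an arbitrary seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```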

Several models were explored to obtain the best performance for the data: Linear Regression, Random Forest, K-Nearest Neighbors, Support Vector Regression, Lasso, and Ridge Regression.

Boosting techniques such as Extreme Gradient Boosting (XGBoost) and CatBoost were also used for better performance.
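
A comparison loop over several regressors might be sketched as follows. It runs on synthetic data from `make_regression` and uses default hyperparameters, so the scores it prints say nothing about the real dataset. To keep the example dependent on scikit-learn only, XGBoost and CatBoost are left out; their regressors (`XGBRegressor`, `CatBoostRegressor`) follow the same `fit`/`predict` pattern and could be added to the dictionary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic regression problem with 11 features (illustrative only).
X, y = make_regression(n_samples=300, n_features=11, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Default hyperparameters throughout; tuning is out of scope for this sketch.
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Nearest Neighbors": KNeighborsRegressor(),
    "Support Vector Regression": SVR(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": float(r2_score(y_test, pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }

# Pick the model with the highest R² on the held-out data.
best = max(results, key=lambda name: results[name]["r2"])
for name, scores in results.items():
    print(f"{name}: R2={scores['r2']:.3f}, RMSE={scores['rmse']:.3f}")
print("Best by R2:", best)
```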

Model Performance

Of the models used, CatBoost produced the best performance for the data, with R² = 0.64 and RMSE = 0.091.
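
For readers unfamiliar with the two metrics, the sketch below computes them from scratch on a few made-up numbers. R² is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared error.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]

print("RMSE:", rmse(y_true, y_pred))
print("R2:", r2_score(y_true, y_pred))
```

An R² of 0.64 therefore means the model explains about 64% of the variance in the target it was evaluated on.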

Feature Importance

To understand the contribution of each feature to the model, the model’s feature importance was obtained and plotted.

From the above plot, it can be deduced that the variable “city” contributed the most to the model, while “waterfront” contributed the least.
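
The sketch below shows one way feature importances can be extracted and ranked. It uses a scikit-learn random forest as a stand-in for the CatBoost model, and the data and feature names are synthetic and hypothetical, so the ranking it prints is a product of how the data was generated and not a result from the house price dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical feature names: the target depends
# strongly on one feature, weakly on another, and not at all on the third.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "strong_feature": rng.normal(size=400),
    "weak_feature": rng.normal(size=400),
    "noise_feature": rng.normal(size=400),
})
y = 5.0 * X["strong_feature"] + 0.5 * X["weak_feature"] + rng.normal(scale=0.1, size=400)

# Random forest used here as a stand-in for the CatBoost model.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With matplotlib installed, `importances.plot.barh()` would draw a bar chart of the same ranking.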

Pipeline

A Kubeflow pipeline was created to manage and automate the house price prediction model.

A web application was also developed to predict the price of houses using the model obtained from this analysis.

Summary

The following are key points from the analysis carried out:

  • It was observed that in predicting the price of a house in the USA, 11 key features are required: the number of bedrooms, the number of bathrooms, land occupied (sqft_lot), floors, waterfront, view, condition, area of basement, year built, year renovated, and city.
  • Of the nine (9) algorithms considered, CatBoost performed best, as it explains 64.3% of the variance in house prices.
  • The relatively low R² score may be attributable to the limited size of the dataset; more data is recommended for an improved score and better model performance.

Recommendations

For further studies into house price prediction models:

  • Economic variables like inflation, construction cost, population increase, and housing demand should be taken into account in estimating house prices. This helps to capture prevailing market conditions and economic growth of the country at any time.
  • Larger datasets, and model-based imputation of prices for houses with no recorded price, should be used for better results.
  • Additional feature engineering could be considered.
  • Training the model with a transfer learning approach could also be considered.

TEAM MEMBERS

  • Qudus Opeyemi Adebayo
  • Elizabeth Ofulue
  • Amoo Eno
  • Iloh Miracle Ugochukwu
  • Chidinma Kalu
  • Osagie Eboigbe
  • Abubakar Alaro
  • Chibuikem Nwagwu
  • Bello Faheedah Bukola
  • Adeyemi Anuoluwapo
