How Machine Learning can be applied to the real estate market

Published in

Neuronio

5 min read · Aug 6, 2019

Note: a Portuguese version of this article is available at “Como o Aprendizado de Máquina pode ser aplicado ao mercado imobiliário”

If you have ever bought or sold anything, you have already faced the question: what is the price of this product?

To answer it, you probably had to search around to find a price range for the product. Maybe you even hired a specialist to look at the product and give you some insight into its price.

Properties are no different: you will need to study the real estate market, possibly with help from a real estate agent. This is not an easy task, and you may spend a lot of time on research.

So, a tool that gives insights about house prices more quickly can help you decide whether to buy or sell a house.

Thus, one possible Machine Learning application is a model that predicts house prices, and the goal of this article is to create such a model.

Let’s start the process.

The dataset and preprocessing

To get better results from our model, we can take some steps to improve the data quality and help the model identify the useful features in the data. These steps are what we call data preprocessing.

In this example we will use the data from a Kaggle competition, available here.

This is a simple dataset with several pieces of information about houses, like the number of bedrooms, the number of garages, the year built, etc. Moreover, we have the target, the price variable, which we need to predict.

So, we will use the competition to evaluate our model when we finish.

Now that we have the data, we need to see what it looks like. For that we will use the pandas-profiling module. This module has tools to prepare an Exploratory Data Analysis (EDA) of all the features in our data, and it can also save the report as an HTML file.

Screenshot of the pandas-profiling report

You can check the result of pandas-profiling here, hosted with the Surge tool.

The pandas-profiling report shows that this dataset has some missing values and that some features are very unbalanced.
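The same two findings can also be checked directly with pandas. The sketch below is only an illustration: it uses a tiny hypothetical DataFrame in place of the competition's training file, with column names taken from the dataset and invented values.

```python
import pandas as pd

# Toy stand-in for the training data (values are invented).
df = pd.DataFrame({
    "Alley": [None, None, "Grvl", None, None],
    "LotFrontage": [65.0, None, 80.0, 60.0, None],
    "Street": ["Pave", "Pave", "Pave", "Pave", "Grvl"],
})

# Number of missing values per column, largest first.
missing = df.isnull().sum().sort_values(ascending=False)

# Share of rows taken by the most frequent value of each column
# (missing values count as a value here); a share close to 1.0
# signals a very unbalanced feature.
dominant_share = df.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

print(missing)
print(dominant_share)
```

A column such as Street, where one value covers most rows, carries little information for the model and is a candidate for removal.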

The first preprocessing step is to fill in the missing values. Some features we fill with None or 0; other features, which do not accept these options, we fill with their most common value.
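A minimal sketch of these three fill strategies follows. The assignment of columns to strategies is an assumption made for illustration, not necessarily the split used in the original code, and the data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Alley": [None, "Grvl", None],           # missing means "no alley"
    "MasVnrArea": [196.0, None, 0.0],        # missing means "no veneer"
    "Electrical": ["SBrkr", None, "SBrkr"],  # no natural "none" value
})

# Hypothetical grouping of columns by fill strategy.
fill_with_none = ["Alley"]
fill_with_zero = ["MasVnrArea"]
fill_with_mode = ["Electrical"]

df[fill_with_none] = df[fill_with_none].fillna("None")
df[fill_with_zero] = df[fill_with_zero].fillna(0)
for col in fill_with_mode:
    # mode() returns the most frequent value(s); take the first one.
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```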

Feature Engineering

An important step in data preprocessing is Feature Engineering. This is the process of changing the feature set: creating new features, modifying existing ones, or dropping others.

After filling the missing values, we can apply some feature engineering. First we create some features:

Creating features with other features

HasAlleyAccess is a new feature created from the Alley feature. Alley holds the type of alley and its most frequent value is ‘None Alley Access’, so we just create a new feature that indicates whether or not a house has alley access;
TotalPorchSF is the sum of the other porch square feet features (OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, WoodDeckSF), because those features are very unbalanced individually;
HasMasVnr is a feature created from MasVnrArea, because the data in that feature is also unbalanced and has a lot of 0 values;
Has2ndFlr is created from 2ndFlrSF and indicates whether or not the house has a second floor;
TotalHouseSF is the sum of the total square feet of the house.
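The features listed above can be sketched as follows on a toy DataFrame with invented values. The text does not say which columns make up TotalHouseSF, so the sum of basement, first floor and second floor used here is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Alley": ["None", "Grvl", "None"],
    "OpenPorchSF": [61, 0, 42],
    "EnclosedPorch": [0, 0, 30],
    "3SsnPorch": [0, 0, 0],
    "ScreenPorch": [0, 120, 0],
    "WoodDeckSF": [0, 298, 0],
    "MasVnrArea": [196.0, 0.0, 162.0],
    "TotalBsmtSF": [856, 1262, 920],
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
})

# Binary flag: 1 if the house has any alley access, 0 otherwise.
df["HasAlleyAccess"] = (df["Alley"] != "None").astype(int)

# One porch total instead of five sparse columns.
porch_cols = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch",
              "ScreenPorch", "WoodDeckSF"]
df["TotalPorchSF"] = df[porch_cols].sum(axis=1)

# Binary flags derived from area columns with many zeros.
df["HasMasVnr"] = (df["MasVnrArea"] > 0).astype(int)
df["Has2ndFlr"] = (df["2ndFlrSF"] > 0).astype(int)

# Assumption: total area = basement + first floor + second floor.
df["TotalHouseSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

print(df[["HasAlleyAccess", "TotalPorchSF", "HasMasVnr",
          "Has2ndFlr", "TotalHouseSF"]])
```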

Now we will change some numerical and categorical features. With pandas-profiling we can see the distributions of the numerical features and apply some processing. For some features we transform the numerical data into categorical data, separating the values into categorical intervals.

The numerical features we will transform are: LotFrontage, LotArea, TotalBsmtSF and 1stFlrSF. For that we will use the cut() function from the pandas module.

Numerical to categorical transformation example

In the example above, bins defines the edges of the numerical intervals and labels names the resulting categories.
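A minimal sketch of such a transformation with cut() is shown here. The bin edges and label names are hypothetical, chosen only to illustrate the call, and the LotArea values are invented.

```python
import pandas as pd

lot_area = pd.Series([4500, 8450, 11250, 21000], name="LotArea")

# Hypothetical interval edges and category names.
# Intervals are right-closed by default: (0, 5000], (5000, 10000], ...
bins = [0, 5000, 10000, 15000, float("inf")]
labels = ["small", "medium", "large", "very_large"]

lot_area_cat = pd.cut(lot_area, bins=bins, labels=labels)
print(lot_area_cat.tolist())
# ['small', 'medium', 'large', 'very_large']
```

Note that cut() needs one more bin edge than labels, since n + 1 edges delimit n intervals.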

Then we transform the categorical data into numerical data for use in the model. Some features we can simply map to integer values, because they are ordinal.
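For an ordinal feature, the mapping can be a plain dictionary. The sketch below uses the dataset's quality codes on ExterQual as an example; the specific integers are an assumption, and any increasing sequence would preserve the order.

```python
import pandas as pd

df = pd.DataFrame({"ExterQual": ["TA", "Gd", "Ex", "Fa"]})

# Quality codes from worst to best: Poor, Fair, Typical/Average, Good, Excellent.
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

df["ExterQual"] = df["ExterQual"].map(quality_order)
print(df["ExterQual"].tolist())
# [3, 4, 5, 2]
```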

To the features that are not ordinal we will apply One-Hot encoding, a technique that creates a new binary feature for each unique value of a categorical feature. To apply it we can use the get_dummies() function from the pandas module.
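As a small illustration, here is get_dummies() applied to one non-ordinal column. Foundation is used as an example column and the values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Foundation": ["PConc", "CBlock", "PConc"],
})

# Each unique value of Foundation becomes its own indicator column;
# the original Foundation column is dropped from the result.
encoded = pd.get_dummies(df, columns=["Foundation"])
print(encoded.columns.tolist())
# ['LotArea', 'Foundation_CBlock', 'Foundation_PConc']
```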

We can also remove some unbalanced and correlated features, as well as features we no longer need. We will remove features such as: Utilities, GarageYrBlt, 2ndFlrSF, MasVnrArea, PoolQC, Alley, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, Street, MiscFeature, PoolArea, RoofMatl, LowQualFinSF, Heating, MiscVal, WoodDeckSF.
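Dropping columns is a single call in pandas. The sketch below uses a toy DataFrame and only a few of the names from the list above; passing errors="ignore" is a design choice made here so the call does not fail when a listed column is absent.

```python
import pandas as pd

df = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Utilities": ["AllPub", "AllPub"],
    "PoolArea": [0, 0],
    "Street": ["Pave", "Pave"],
})

# MiscVal is not present in this toy frame; errors="ignore" skips it.
to_drop = ["Utilities", "PoolArea", "Street", "MiscVal"]
df = df.drop(columns=to_drop, errors="ignore")

print(df.columns.tolist())
# ['LotArea']
```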

If you are interested in Feature Engineering, or have questions about it, see this series of articles; it may help you understand the process better.

The Model

Now we can use the data to train our model. We will use the train data and the test data. There are many models to choose from, but in this example we will use XGBoost to make our predictions.

As the evaluation metric we used the Root Mean Squared Logarithmic Error (RMSLE) to track training over 150 rounds. RMSLE is a good choice when we do not want to penalize large absolute differences heavily: because it compares logarithms, it measures relative rather than absolute error. In our case the target values are large, so the absolute differences between predictions and actual prices can also be large. There are other metrics besides RMSLE; if you want to read about metrics you can see this article. The results are:
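For reference, the metric itself is short to write down. The sketch below is a plain NumPy version based on the usual definition (the square root of the mean squared difference between log(1 + prediction) and log(1 + actual)); it is not taken from the original code, and the prices are invented.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which keeps zero values well defined.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# The same relative error (10% too high) on a cheap and an expensive house.
cheap = rmsle([100_000], [110_000])
expensive = rmsle([1_000_000], [1_100_000])
print(cheap, expensive)
```

Both calls give nearly the same value, although the absolute errors differ by a factor of ten, which is why the metric suits targets with a wide price range.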

During training, we can follow the evolution of our model in each round. You can see part of the training log below:

After training finishes, we can use the model to predict on the test data:

So, we can use the model to predict a house price as long as we have the same features used in training.

Now, we can use the Kaggle competition to see how well our model scores.
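Kaggle scores a CSV file of predictions. For this competition the file is expected to contain an Id column and a SalePrice column; the sketch below builds one from invented ids and predictions, so the numbers are placeholders only.

```python
import pandas as pd

# Invented ids and predicted prices, standing in for the model output.
test_ids = [1461, 1462, 1463]
predictions = [169000.0, 187500.0, 183500.0]

submission = pd.DataFrame({"Id": test_ids, "SalePrice": predictions})

# With no path argument, to_csv returns the CSV content as a string;
# pass a file name instead to write the file to upload.
csv_text = submission.to_csv(index=False)
print(csv_text)
```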

Conclusion

One application of Machine Learning in the real estate market, as described in this article, is house price prediction. The ML model can be used to build a tool that estimates the price of a house from its features. So, anyone who intends to buy or sell a house can use this tool to save time when researching house prices.

You can check all the code on GitHub.

Educ71/house-price

References

Pandas Profiling: https://github.com/pandas-profiling/pandas-profiling
Competition discussion: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion
Understanding Feature Engineering: https://towardsdatascience.com/tagged/tds-feature-engineering
Surge: https://surge.sh/
XGBoost: https://xgboost.readthedocs.io/en/latest/