Auto Valuation Model
Our auto-valuation model estimates rental and resale prices for properties in India's top cities.
An Automated Valuation Model (also referred to as an AVM) is a service that leverages machine learning models to estimate the value of a real-estate property.
Overall Summary
In this blog, we talk about what an AVM is and walk through its high-level lifecycle:
EDA + feature engineering
Modeling
Evaluation
Deployment
Live demo
Before deep-diving, let's look at a high-level flowchart of the AVM.
Data collection
All historical data (2016 onwards) uploaded to the housing.com site was considered for AVM training. As Housing.com is one of the leading companies in the real-estate category, the historical data for both segments (rent and resale) is huge.
EDA + Feature engineering
- Understand the problem. We looked at each variable and assessed its meaning and likely importance for this problem.
- Univariate study. We explored each variable in the dataset separately, looking at the range of values as well as their central tendency, which describes the pattern of responses for that variable. For example, Fig-1 shows the univariate analysis of the feature ‘listing_price’.
- Multivariate study. We tried to understand how the dependent variable and the independent variables relate. For example, Fig-2 shows the linear relationship between sale price and area of the house.
- Basic cleaning. We cleaned the dataset and handled missing data, outliers, and categorical variables. Please refer to Fig-3 for missing-value imputation.
- Test assumptions. We checked whether our data meets the assumptions required by most multivariate techniques.
- Feature transformation and extraction. We applied mathematical transformations to particular columns (features) to make their values more useful for further analysis. For example, applying a log transformation to SalePrice turned its right-skewed distribution (Fig-4, left) into a near-normal one (Fig-4, right).
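As a minimal sketch of the cleaning and transformation steps above (the column name `sale_price`, the synthetic data, and median imputation as the fill strategy are illustrative assumptions, not the production pipeline):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed sale prices (INR); a few values left missing
prices = rng.lognormal(mean=15, sigma=0.8, size=1000)
prices[:20] = np.nan
df = pd.DataFrame({"sale_price": prices})

# Median imputation for missing numeric values (cf. Fig-3)
df["sale_price"] = df["sale_price"].fillna(df["sale_price"].median())

# Log transform pulls the right-skewed distribution toward normal (cf. Fig-4)
df["log_sale_price"] = np.log1p(df["sale_price"])

print(df["sale_price"].skew(), df["log_sale_price"].skew())
```

The skew of the log-transformed column comes out close to zero, while the raw prices remain strongly right-skewed.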
A few of the features selected after EDA + feature engineering —
- price
- property age
- size
- bedrooms
- bathrooms
- balcony
- furnishing
- floor
- total floors
Note — In practice, we used more than 30 features.
Modeling
For the record, we tried several machine learning algorithms such as linear regression, SVM, XGBoost, and random forest, but random forest outperformed the rest.
The advantage of random forest is that it resists overfitting and is less affected by outliers in the data, and we already know our original dataset is noisy.
We trained random forest models with the k-fold cross-validation technique: we divide the data into k folds and iterate training over k-1 folds, ensuring each fold serves as the validation set exactly once.
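The cross-validation setup above can be sketched like this (the synthetic data, k=5, and the R² scoring choice are assumptions for illustration; the real models trained on Housing.com listing features):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the listing features
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

# 5-fold CV: each fold serves as the validation set exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores)
```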
We won't go into detail about random forest here, because there are plenty of articles about it on the internet.
We didn't get good results when we trained a single model for all cities, because of the high variability and noise in real-estate data. To overcome this, we trained separate models for each city and segment.
Let’s break down which models are trained and how —
- Project Apartment — apartments that are part of a project
- Non-Project Apartment — apartments that do not belong to any project
- Independent Floor and House
For each of the above three segments in each city, we trained one model for rental and one for resale property, so six models per city in total.
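A sketch of how such a per-city, per-segment model registry could be organised (the city names and segment/listing-type keys are illustrative; the actual registry is internal):

```python
from itertools import product

from sklearn.ensemble import RandomForestRegressor

# Hypothetical city list; segment and listing-type keys mirror the breakdown above
cities = ["Mumbai", "Bengaluru"]
segments = ["project_apartment", "non_project_apartment", "independent_floor_house"]
listing_types = ["rent", "resale"]

models = {}
for city, segment, listing_type in product(cities, segments, listing_types):
    # Each model would be fit only on listings from its own (city, segment, type) slice
    models[(city, segment, listing_type)] = RandomForestRegressor(
        n_estimators=100, random_state=42
    )

print(len(models) // len(cities))  # → 6 models per city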
For each city and segment, we then fit the random forest model on our training data and checked its accuracy.
Hyperparameters tuned during training —
n_estimators — the number of decision trees to use.
max_depth — the maximum depth of each decision tree, set via the max_depth parameter. The larger the value of max_depth, the more complex the tree.
min_samples_split — the minimum number of samples required to split a node. For example, we can require at least 10 samples before making a split.
min_samples_leaf — the minimum number of samples required at a leaf node. Increasing this value makes the trees more conservative and helps guard against overfitting.
max_features — the number of features to consider when looking for the best split.
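A hedged sketch of tuning these hyperparameters with a randomized search (the synthetic data and the specific value ranges are illustrative assumptions, not the grids used in production):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; real tuning ran on the per-city training sets
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Illustrative search space over the hyperparameters listed above
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", 0.5, 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```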
Evaluation
Mean Absolute Percentage Error (MAPE) is used as the evaluation metric.
Why did we select MAPE as the evaluation metric?
Prices vary enormously across our dataset, and a metric expressed in absolute terms would penalize errors on high-priced properties much more heavily; a percentage-based metric treats all price ranges equally.
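The metric itself is simple to compute; the example values below are made up to show that the same relative error scores identically at both ends of the price range:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# A 10% error counts the same at both ends of the price range (values in INR)
actual = [500_000, 50_000_000]
predicted = [550_000, 55_000_000]
print(mape(actual, predicted))  # approximately 10.0
```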
We got an average accuracy of ~90% across all trained models.
Deployment
The simplest way to deploy a machine learning model is to create a web service for prediction. In this case, we use the Flask web framework to wrap the random forest regressors, then dockerize the Flask application and deploy it to the cloud.
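A minimal sketch of such a Flask service, assuming a `/predict` endpoint and a three-feature payload for illustration (in production the trained per-city regressors would be loaded from disk, not fit on synthetic data at startup):

```python
from flask import Flask, jsonify, request
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

app = Flask(__name__)

# Stand-in model trained on synthetic data; a real deployment would
# deserialize the trained random forest regressors here instead
X, y = make_regression(n_samples=200, n_features=3, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

@app.route("/predict", methods=["POST"])
def predict():
    # Hypothetical payload, e.g. {"size": 1200, "bedrooms": 3, "bathrooms": 2}
    payload = request.get_json()
    row = [[payload["size"], payload["bedrooms"], payload["bathrooms"]]]
    return jsonify({"predicted_price": float(model.predict(row)[0])})

# To serve: app.run(host="0.0.0.0", port=5000); then dockerize and deploy
```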
AVM in action
Sharing the link to the Housing.com price-prediction application. Do try it out, and happy predicting!