Auto Valuation Model

Auto-valuation evaluates rental and resale pricing of the properties in top cities of India.

Niteshyadav
Engineering @ Housing/Proptiger/Makaan
Apr 5, 2022



An Automated Valuation Model (also referred to as an AVM) is a term used to describe a service that leverages machine learning models to estimate a real-estate property's value.

Overall Summary

In this blog, we mainly talk about what an AVM is and its high-level lifecycle:

  • EDA + feature engineering
  • Modeling
  • Evaluation
  • Deployment
  • Live demo

Before deep-diving, let's look at a high-level flowchart of the AVM.

AVM Flowchart

Data collection

All historical data (2016 onwards) uploaded to the housing.com site was considered for AVM training. As Housing.com is one of the leading companies in the real-estate category, the historical data for both segments (rent + resale) is huge.

EDA + Feature engineering

  • Understand the problem. We looked at each variable and did a philosophical analysis of its meaning and importance for this problem.
  • Univariate study. We explored each variable in the data set separately, looking at the range of values as well as their central tendency, which describes the pattern of response for the variable. For example, Fig-1 shows the univariate analysis of the feature ‘listing_price’.
Fig 1 — listing_price Univariate Analysis
  • Multivariate study. We tried to understand how the dependent and independent variables relate. For example, Fig-2 shows the linear relationship between Sale Price and Area of the house.
Fig 2- Multivariate study b/w Sale Price and Area
  • Basic cleaning. We cleaned the dataset and handled missing data, outliers, and categorical variables. Please refer to Fig-3 for missing-value imputation.
Fig 3— Missing Value imputation technique
  • Test assumptions. We checked whether our data meets the assumptions required by most multivariate techniques.
  • Feature transformation and extraction. We applied mathematical transformations to particular columns (features) to make their values more useful for further analysis. For example, applying a log transformation to SalePrice turned its right-skewed distribution (Fig-4, left) into a near-normal distribution (Fig-4, right). A short code sketch of these EDA steps follows this list.
Fig 4— Log Transformation on SalePrice
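
To make these steps concrete, here is a minimal sketch of the EDA flow in Python with pandas. The column names (listing_price, area_sqft, bathrooms, furnishing) and the median/mode imputation choices are illustrative assumptions, not the production schema or pipeline:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical listings export; column names are illustrative only.
df = pd.read_csv("listings.csv")

# Univariate study: range, central tendency, and distribution of the
# listing price (cf. Fig-1).
print(df["listing_price"].describe())
df["listing_price"].hist(bins=50)

# Multivariate study: sale price vs. area of the house (cf. Fig-2).
plt.scatter(df["area_sqft"], df["listing_price"], s=2)
plt.show()

# Basic cleaning: impute missing numeric values with the median and
# missing categorical values with the mode (one common choice; the
# technique in Fig-3 may differ).
df["bathrooms"] = df["bathrooms"].fillna(df["bathrooms"].median())
df["furnishing"] = df["furnishing"].fillna(df["furnishing"].mode()[0])

# Feature transformation: log-transform the right-skewed price so its
# distribution becomes near normal (cf. Fig-4).
df["log_price"] = np.log1p(df["listing_price"])
```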

A few of the features selected after EDA + feature engineering:

  • price
  • property age
  • size
  • bedrooms
  • bathrooms
  • balcony
  • furnishing
  • floor
  • total floors

Note — In practice, we used more than 30 features.

Modeling

For the record, we tried several machine learning algorithms such as Linear Regression, SVM, XGBoost, and Random Forest, but Random Forest outperformed them.

The plus point of Random Forest is that it helps prevent overfitting and is less impacted by outliers in the data, and we already know our original dataset is noisy.

We trained Random Forest models with the K-fold cross-validation technique: divide the data into k folds and iterate training on k-1 folds, ensuring each fold is used as the validation set exactly once.
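
As a minimal sketch of this setup (assuming a feature matrix X and target y built during feature engineering, and illustrative values for the number of folds and trees):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# X, y: engineered feature matrix and target price from the EDA step.
model = RandomForestRegressor(n_estimators=200, random_state=42)

# 5-fold CV: each fold is used as the validation set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    model, X, y, cv=cv,
    scoring="neg_mean_absolute_percentage_error",  # our evaluation metric
)
print("MAPE per fold:", -scores)
```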

Random Forest architecture

We won't go into detail about Random Forest here, because there are plenty of articles about it on the internet.

We didn’t get good results when we trained a single model for all cities, the reason being the high variability and noise in real-estate data. To overcome this, we trained separate models for each city and segment.

Let’s break down which models are trained and how:

  1. Project Apartment — apartments that are part of a project
  2. Non-Project Apartment — apartments that do not belong to any project
  3. Independent Floor and House
Model Distribution

For each of the above three segments in every city, we trained separate models for rental and for resale properties, giving six models per city in total.

For each city and segment, we then fit the training data to a Random Forest model and checked its accuracy.
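
A rough sketch of that training loop (the segment labels, column names, and FEATURES list are assumptions for illustration, reusing the dataframe from the EDA sketch):

```python
from sklearn.ensemble import RandomForestRegressor

SEGMENTS = ["project_apartment", "non_project_apartment", "independent"]
FEATURES = ["property_age", "size", "bedrooms", "bathrooms", "floor"]

models = {}
for city in df["city"].unique():
    for segment in SEGMENTS:
        for listing_type in ["rent", "resale"]:  # 6 models per city
            subset = df[(df["city"] == city)
                        & (df["segment"] == segment)
                        & (df["listing_type"] == listing_type)]
            X, y = subset[FEATURES], subset["price"]
            m = RandomForestRegressor(random_state=42).fit(X, y)
            models[(city, segment, listing_type)] = m
```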

Hyperparameters tuned while training (a tuning sketch follows this list):

n_estimators — the number of decision trees to be used.

max_depth — sets the maximum depth of each decision tree. The larger the value of max_depth, the more complex the tree will be.

min_samples_split — the minimum number of samples required to split a node. For example, we can require a minimum of 10 samples before making a split.

min_samples_leaf — the minimum number of samples required in a leaf node. The smaller this value, the higher the possibility of overfitting.

max_features — the number of features to consider when looking for the best split.
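
A hedged sketch of how such a search might look with scikit-learn's GridSearchCV; the grid values below are illustrative, not the ranges used in production:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative search grid over the hyperparameters listed above.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt", 0.5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_percentage_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```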

Evaluation

Mean Absolute Percentage Error (MAPE) is used as the evaluation metric.

Why did we select MAPE as the evaluation metric?

The price range varies quite a lot across our entire dataset, and choosing a metric in absolute terms would penalize errors more heavily when the price is high.
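
For reference, MAPE is simply the mean of the absolute errors expressed as a percentage of the true values; a small sketch:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# A property priced at 100 predicted as 110, and one priced at 200
# predicted as 180, are both 10% off, so MAPE treats them equally.
print(mape([100, 200], [110, 180]))  # -> 10.0
```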

We got an average accuracy of ~90% (a MAPE of roughly 10%) across all trained models.

Deployment

The simplest way to deploy a machine learning model is to create a web service for prediction. In our case, we use the Flask web framework to wrap the Random Forest models, then dockerize the Flask application and deploy it on the cloud.
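
Below is a minimal sketch of such a service. The route, payload schema, and model filename are assumptions for illustration, not the actual Housing.com API:

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
# Hypothetical serialized model for one (city, segment, type) combination.
model = joblib.load("rf_mumbai_resale.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()        # e.g. {"size": 1200, "bedrooms": 2, ...}
    features = pd.DataFrame([payload])  # single-row feature frame
    price = model.predict(features)[0]
    return jsonify({"predicted_price": float(price)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

The containerized app can then be built with a standard Dockerfile and shipped to any cloud container service.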

AVM in action

Sharing the link to the Housing.com price prediction application. Do try it out, and happy predicting!
