Auto Valuation Model
Our auto-valuation model estimates rental and resale prices for properties in India's top cities.
An Automated Valuation Model (also referred to as an AVM) is a service that leverages machine learning models to estimate the value of a real-estate property.
Overall Summary
In this blog, we talk about what an AVM is and walk through its high-level lifecycle:
EDA + feature engineering
Modeling
Evaluation
Deployment
Live demo
Before deep-diving, let's look at a high-level flowchart of the AVM.
Data collection
All historical data (2016 onwards) uploaded to the housing.com site was considered for AVM training. As Housing.com is one of the leading companies in the real-estate category, the historical data for both segments (rent and resale) is huge.
EDA + Feature engineering
- Understand the problem. We looked at each variable and assessed its meaning and likely importance for this problem.
- Univariate study. We explored each variable in the dataset separately, looking at the range of values as well as their central tendency, which describes the pattern of responses for that variable. For example, Fig-1 shows the univariate analysis of the feature ‘listing_price’.
- Multivariate study. We tried to understand how the dependent variable and the independent variables relate. For example, Fig-2 shows the linear relationship between sale price and area of the house.
- Basic cleaning. We cleaned the dataset and handled missing data, outliers, and categorical variables. Please refer to Fig-3 for missing-value imputation.
- Test assumptions. We checked whether our data meets the assumptions required by most multivariate techniques.
- Feature transformation and extraction. We applied mathematical transformations to particular columns (features) to make their values more useful for further analysis. For example, applying a log transformation to SalePrice turned its right-skewed distribution (Fig-4, left) into a near-normal one (Fig-4, right).
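As a minimal sketch of the cleaning and transformation steps above (the column name `sale_price`, the synthetic data, and median imputation as the fill strategy are illustrative assumptions, not the production pipeline):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed sale prices (INR); a few values left missing
prices = rng.lognormal(mean=15, sigma=0.8, size=1000)
prices[:20] = np.nan
df = pd.DataFrame({"sale_price": prices})

# Median imputation for missing numeric values (cf. Fig-3)
df["sale_price"] = df["sale_price"].fillna(df["sale_price"].median())

# Log transform pulls the right-skewed distribution toward normal (cf. Fig-4)
df["log_sale_price"] = np.log1p(df["sale_price"])

print(df["sale_price"].skew(), df["log_sale_price"].skew())
```

The skew of the log-transformed column comes out close to zero, while the raw prices remain strongly right-skewed.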
A few of the features selected after EDA + feature engineering —
- price
- property age
- size
- bedrooms
- bathrooms
- balcony
- furnishing
- floor
- total floors
Note — In practice, we used more than 30 features.
Modeling
For the record, we tried several machine learning algorithms such as linear regression, SVM, XGBoost, and random forest, but random forest outperformed the rest.
The advantage of random forest is that it resists overfitting and is less affected by outliers in the data, and we already know our original dataset is noisy.
We trained random forest models with the k-fold cross-validation technique: we divide the data into k folds and iterate training over k-1 folds, ensuring each fold serves as the validation set exactly once.
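The cross-validation setup above can be sketched like this (the synthetic data, k=5, and the R² scoring choice are assumptions for illustration; the real models trained on Housing.com listing features):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the listing features
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

# 5-fold CV: each fold serves as the validation set exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores)
```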
We won't go into detail about random forest here, because there are plenty of articles about it on the internet.
We didn't get good results when we trained a single model for all cities, because of the high variability and noise in real-estate data. To overcome this, we trained separate models for each city and segment.
Let’s break down which models are trained and how —
- Project Apartment — apartments that are part of a project
- Non-Project Apartment — apartments that do not belong to any project
- Independent Floor and House
For each of the above three segments in each city, we trained one model for rental and one for resale property, so six models per city in total.
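A sketch of how such a per-city, per-segment model registry could be organised (the city names and segment/listing-type keys are illustrative; the actual registry is internal):

```python
from itertools import product

from sklearn.ensemble import RandomForestRegressor

# Hypothetical city list; segment and listing-type keys mirror the breakdown above
cities = ["Mumbai", "Bengaluru"]
segments = ["project_apartment", "non_project_apartment", "independent_floor_house"]
listing_types = ["rent", "resale"]

models = {}
for city, segment, listing_type in product(cities, segments, listing_types):
    # Each model would be fit only on listings from its own (city, segment, type) slice
    models[(city, segment, listing_type)] = RandomForestRegressor(
        n_estimators=100, random_state=42
    )

print(len(models) // len(cities))  # → 6 models per city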
For each city and segment, we then fit the random forest model on our training data and checked its accuracy.
Hyperparameters tuned during training —
n_estimators — the number of decision trees to use.
max_depth — the maximum depth of each decision tree, set via the max_depth parameter. The larger the value of max_depth, the more complex the tree.
min_samples_split — the minimum number of samples required to split a node. For example, we can require at least 10 samples before making a split.
min_samples_leaf — the minimum number of samples required at a leaf node. Increasing this value makes the trees more conservative and helps guard against overfitting.
max_features — the number of features to consider when looking for the best split.
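A hedged sketch of tuning these hyperparameters with a randomized search (the synthetic data and the specific value ranges are illustrative assumptions, not the grids used in production):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; real tuning ran on the per-city training sets
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Illustrative search space over the hyperparameters listed above
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", 0.5, 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```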
Evaluation
Mean Absolute Percentage Error (MAPE) is used as the evaluation metric.
Why did we select MAPE as the evaluation metric?
Prices vary enormously across our dataset, and a metric expressed in absolute terms would penalize errors on high-priced properties much more heavily; a percentage-based metric treats all price ranges equally.
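The metric itself is simple to compute; the example values below are made up to show that the same relative error scores identically at both ends of the price range:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# A 10% error counts the same at both ends of the price range (values in INR)
actual = [500_000, 50_000_000]
predicted = [550_000, 55_000_000]
print(mape(actual, predicted))  # approximately 10.0
```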
We got an average accuracy of ~90% across all trained models.
Deployment
The simplest way to deploy a machine learning model is to create a web service for prediction. In this case, we use the Flask web framework to wrap the random forest regressors, then dockerize the Flask application and deploy it to the cloud.
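A minimal sketch of such a Flask service, assuming a `/predict` endpoint and a three-feature payload for illustration (in production the trained per-city regressors would be loaded from disk, not fit on synthetic data at startup):

```python
from flask import Flask, jsonify, request
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

app = Flask(__name__)

# Stand-in model trained on synthetic data; a real deployment would
# deserialize the trained random forest regressors here instead
X, y = make_regression(n_samples=200, n_features=3, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

@app.route("/predict", methods=["POST"])
def predict():
    # Hypothetical payload, e.g. {"size": 1200, "bedrooms": 3, "bathrooms": 2}
    payload = request.get_json()
    row = [[payload["size"], payload["bedrooms"], payload["bathrooms"]]]
    return jsonify({"predicted_price": float(model.predict(row)[0])})

# To serve: app.run(host="0.0.0.0", port=5000); then dockerize and deploy
```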
AVM in action
Sharing the link to the Housing.com price-prediction application. Do try it out, and happy predicting!