Predicting Ulaanbaatar’s apartment price
In this project we will build a model that predicts the price per square meter of apartments in Ulaanbaatar. Web Application here
Let’s look at some recent data to get an overview of the residential building industry. The majority of residential buildings are in Ulaanbaatar, the capital city of Mongolia. The industry peaked in 2015, when 1.6 trillion tugriks worth of residential buildings were completed. In subsequent years the supply of residential buildings in Ulaanbaatar decreased dramatically, by 44 percent. One possible explanation is that supply fell because of very low demand and other unknown economic factors, which in turn pushed residential building prices up. Is this the reason why sellers are asking high prices without considering other factors? Or are there other important factors, such as crime rate, number of schools and kindergartens, and distance from the city center?
The reason I wanted to make this predictive web application is to encourage both apartment sellers and buyers to be reasonable about prices. Most of the time sellers set their prices mainly by location, such as the city center or the city “A” or “B” area, yet there is no clear definition of the city “A” center or of how it is measured. I wanted sellers to take into account the crime rate in the location and how many schools and kindergartens are available in that specific area. Buyers, on the other hand, should be able to compare apartment prices in different places, but there is no web site that shows apartment prices across locations based on user inputs such as number of rooms and square meters. Even an e-commerce website like Unegui.mn can only show the places that sellers have listed, which is a limited resource. My goal was to create a web application that helps both parties, and I will walk through the important steps of creating it.
Where should we look for our data?
This was a big problem when we first decided to implement our idea, because there was no place other than Unegui.mn to get the main data set. Unegui.mn is an online marketplace that lists a variety of goods and job offers and, most importantly, apartments for sale and rent. When people enter an apartment listing, a few features are mandatory (location, district, price, number of rooms) and a few are optional (year built, number of windows and balconies, direct sunlight, children’s playground, garage, etc.). After acquiring our main data set we needed to find our other data sets: number of kindergartens and schools, distance from the city center, and crime rate. Some of these were collected manually, such as the distance from the city center (using Google Maps for each location) and the number of kindergartens and schools, while the crime rate came from public data sets.
An important thing to note is that we scraped 3863 rows from Unegui.mn, covering listings from 2018-12-25 to 2019-02-25, with the following features:
date - the date the listing was entered,
title - a user-generated text describing the listing,
rooms - number of rooms in the apartment,
price - full price or price per square meter of the apartment,
location - the user can only choose from about 50 available locations,
district - district of the apartment.
First 5 rows of a main data frame
First, we should ask: how can this model be used, and how useful would it be?
We started this project to help house sellers and potential buyers get a general estimate of an apartment’s price per square meter based on the khoroolol (district), the surrounding places (schools and kindergartens), the distance from the city center, and how safe the living area is. These features have a significant impact on house prices, although other features such as the year the house was built, the number of windows, and direct sunlight also have a high impact. Our model is useful for anyone who is interested in house prices and information about the surrounding area, because there is no easy way to access all of this information at once.
Goals in this project are:
- Try different predictive models
- Tune the best model’s hyper-parameters
- Build a Plotly Dash application
Data cleaning steps were:
- We found every NaN or infinite value and removed the affected rows. Out of 3863 rows, only 2941 had valid values.
- Square meter outliers can be found by taking the 99th percentile and looking at the values above it. There we can see that people entered absurd amounts such as 1.7 million square meters. It is reasonable to assume that a normal apartment is at most 500 square meters.
- We also need to take care of values that are too small. The smallest plausible listing is a 16 square meter one-room apartment, so everything smaller than that should be removed.
- Humans make mistakes! When people enter their place on Unegui.mn, some of them enter the whole price and some of them enter the price per square meter, which makes the price column a big mess.
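The cleaning steps above could look roughly like the sketch below. The column names (`sqm`, `price`) and the toy rows are assumptions for illustration. The rule for telling a per-square-meter price from a whole price (`PER_SQM_THRESHOLD`) is also a made-up heuristic, not the rule used in the project.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the scraped data frame may use different ones.
MIN_SQM, MAX_SQM = 16, 500        # size bounds taken from the text above
PER_SQM_THRESHOLD = 10_000_000    # assumed cut-off in tugriks, not from the project


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Drop invalid rows and implausible sizes, then unify the price column."""
    # Treat infinities as missing, then drop every row with a missing value.
    df = df.replace([np.inf, -np.inf], np.nan).dropna()
    # Keep only plausible apartment sizes.
    df = df[df["sqm"].between(MIN_SQM, MAX_SQM)].copy()
    # Assumed heuristic: a small "price" is already a per-square-meter price,
    # so convert it to a whole price.
    per_sqm = df["price"] < PER_SQM_THRESHOLD
    df.loc[per_sqm, "price"] = df.loc[per_sqm, "price"] * df.loc[per_sqm, "sqm"]
    return df


# Toy rows that mimic the problems described above.
raw = pd.DataFrame({
    "sqm":   [45.0, 1_700_000.0, 10.0, 80.0, np.nan, 60.0],
    "price": [90e6, 100e6, 30e6, 2.5e6, 70e6, np.inf],
})
print("99th percentile of sqm:", raw["sqm"].quantile(0.99))
clean = clean_listings(raw)
print(clean)
```

Only the 45 and 80 square meter rows survive here, and the 80 square meter listing has its per-square-meter price converted to a whole price.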
This is before and after data cleaning
You can clearly see that some people entered hugely inflated prices, which distort our main data set. Removing them keeps the data closer to a normal distribution.
After data cleaning we need to create a new feature which is our prediction target: price per square meter
- To get the price per square meter we simply divide the total price by the apartment’s area in square meters
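As a small illustration of the target variable, here is a sketch with two made-up listings. The column names `price`, `sqm`, and `price_per_sqm` are assumptions.

```python
import pandas as pd

# Toy listings: "price" is the whole price in tugriks, "sqm" is the area.
listings = pd.DataFrame({"price": [90e6, 200e6], "sqm": [45.0, 80.0]})

# Prediction target: price per square meter.
listings["price_per_sqm"] = listings["price"] / listings["sqm"]
print(listings)
```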
Now we need to merge the main data frame with our additional features: distance from the city center, number of schools in each district, and number of kindergartens in each district
- We create another feature called dis_horoo, a combination of district + horoo (e.g. Чингэлтэй, 1), so that multiple data frames can be merged easily without column name conflicts
- While merging the data sets we found that for some locations the number of schools and kindergartens or the distance from the city center could not be found on the internet, that some locations are too far from the city center, or that there is too little data to use. So we decided to use only specific districts, “Баянзүрх, Хан-Уул, Баянгол, Сүхбаатар, Чингэлтэй”, and to omit “Сонгино Хайрхан, Налайх, Багахангай, Багануур”
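A minimal sketch of the merge on the dis_horoo key, using pandas. All values below are invented for illustration, and every column name other than `dis_horoo` is an assumption.

```python
import pandas as pd

# Toy main data frame (values are made up).
listings = pd.DataFrame({
    "district": ["Чингэлтэй", "Баянгол", "Налайх"],
    "horoo": [1, 2, 1],
    "price_per_sqm": [2.4e6, 2.1e6, 1.2e6],
})

# Toy area information keyed by dis_horoo (illustrative numbers, not real statistics).
area_info = pd.DataFrame({
    "dis_horoo": ["Чингэлтэй, 1", "Баянгол, 2"],
    "schools": [3, 5],
    "kindergartens": [4, 6],
    "distance_km": [1.5, 4.0],
})

# Build the shared key, e.g. "Чингэлтэй, 1".
listings["dis_horoo"] = listings["district"] + ", " + listings["horoo"].astype(str)

# An inner join keeps only locations that have area information,
# which also drops the omitted districts.
merged = listings.merge(area_info, on="dis_horoo", how="inner")
print(merged)
```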
After merging all the data sets into one big data frame
Correlation heat map of the main data frame after merging all the data frames
Here we can see that the individual correlations are weak, and we have to remember that price is determined by multiple variables together!
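The numbers behind such a heat map come from a correlation matrix. Below is a sketch on synthetic data; the columns and the relationship between them are invented, so the values say nothing about the real data set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged data frame (not the real data).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "distance_km": rng.uniform(0, 15, n),
    "schools": rng.integers(1, 10, n),
})
df["price_per_sqm"] = 3.0e6 - 8.0e4 * df["distance_km"] + rng.normal(0, 2.0e5, n)

# Pairwise Pearson correlations between the numeric columns.
corr = df.select_dtypes("number").corr()
print(corr["price_per_sqm"].sort_values())
```

The resulting `corr` matrix is what a plotting library would render as the heat map.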
We will try three different predictive models without hyper-parameter tuning and compare them using four metrics
Mean Absolute Error is the average absolute difference between the predictions and the actual values, so negative and positive errors cannot cancel each other out.
Root Mean Squared Error is the square root of the average squared residual. It shows how spread out our residuals are around the best fitted line and penalizes large errors more heavily.
R squared shows what proportion of the variance in our target variable can be explained by the other variables.
Accuracy shows how close we are to the true value, and this is the metric we will focus on. It is calculated as 100 minus the Mean Absolute Percentage Error. Simply put, the more accurate the model is, the less error it makes.
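To make the four metrics concrete, here is a small sketch that computes them with NumPy. The function name `regression_report` and the example numbers are ours, chosen only for illustration.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return MAE, RMSE, R squared and accuracy (100 - MAPE) as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    # R squared: 1 - (unexplained variation / total variation).
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # MAPE in percent; assumes no true value is zero.
    mape = 100.0 * np.mean(np.abs(residuals / y_true))
    return {"mae": mae, "rmse": rmse, "r2": r2, "accuracy": 100.0 - mape}


# Three made-up prices per square meter and their predictions.
report = regression_report([2.0e6, 3.0e6, 4.0e6], [2.2e6, 2.7e6, 4.0e6])
for name, value in report.items():
    print(f"{name}: {value:,.3f}")
```

Because accuracy is defined through percentage error, a fixed error in tugriks hurts more on a cheap apartment than on an expensive one.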
The first model we will try is Linear Regression
- Linear Regression is used to study and summarize the relationship between multiple quantitative variables, and it is a very common and often well-performing predictive model
- Several points at the top end, around 3-4 million tugriks, are predicted wildly far from the true value, and that is a problem we need to keep an eye on
- 83 percent accuracy is decent but can get better
The second model we will try is Random Forest Regressor
- Random Forest Regressor operates by constructing multiple decision trees at training time and outputting the mean prediction of the individual trees (for classification it would output the mode of the predicted classes)
- The metrics are getting better, and this time the outliers at the top end have moved closer to the best fitted line
- Accuracy has increased by almost 1 percent
The third model we will try is XGBoost
- XGBoost is short for eXtreme Gradient Boosting, an implementation of gradient-boosted decision trees that is focused on computational speed and performance
- Notice that R squared jumped by almost 10 percent, which is a good sign that decision trees are working out
- Accuracy increased by 1 percent again. No wonder, because this is boosted trees at work.
- The first two models’ results are really different from the others. The mean absolute errors and root mean squared errors differ by about 30,000 in price even though the R squared and accuracy values are almost the same.
- The best performing model among them is the XGBoost Regressor. However, despite the high R squared and accuracy, none of the models directly gives us the output we want.
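The comparison of the three untuned models can be sketched as follows. The data is synthetic and the feature layout is assumed, so the printed scores will not match the ones reported above. If XGBoost is not installed, the sketch falls back to scikit-learn's `GradientBoostingRegressor`, which is a stand-in and not the same library.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

try:  # XGBoost is optional in this sketch.
    from xgboost import XGBRegressor
    boosted = XGBRegressor(random_state=0)
except ImportError:  # Stand-in: scikit-learn's own gradient boosting.
    boosted = GradientBoostingRegressor(random_state=0)

# Synthetic data with an assumed feature layout (not the real data set).
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(0, 15, n),    # distance from city center (km)
    rng.integers(1, 10, n),   # number of schools
    rng.integers(1, 10, n),   # number of kindergartens
    rng.integers(1, 6, n),    # number of rooms
])
y = 3.0e6 - 8.0e4 * X[:, 0] + 3.0e4 * X[:, 1] + rng.normal(0, 1.5e5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Boosted trees": boosted,
}

accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Accuracy as defined earlier: 100 minus the mean absolute percentage error.
    mape = 100.0 * np.mean(np.abs((y_test - pred) / y_test))
    accuracy[name] = 100.0 - mape
    print(f"{name}: accuracy = {accuracy[name]:.1f}%")
```

On this toy data the target is linear by construction, so Linear Regression has an unfair advantage; the ranking of the models here says nothing about the real listings.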
Remember that the previous models struggled to give accurate predictions for expensive places, around 4-5 million, and showing a user such a strongly deviating number would be a problem. That is why we need to take another approach and try a different model. The model does not include factors that most people consider when buying an apartment, such as the quality of the apartment, its age, direct sunlight, and economic ups and downs. So there will be some variability or inconsistency in our model, and giving out a general price range is therefore recommended.
So what do we do now? A price range?
We will use a model that can output prediction quantiles. We should be aware that the range from the 2.5th to the 97.5th percentile is so wide that it would not be useful. So we will give a price range made of the 25th (lower), 50th (middle), and 75th (upper) percentiles.
Our objective in this project is to select the best model that can give us a price range between the 25th and 75th percentiles. For this we use Random Forest Quantile Regressor from scikit-garden, which gives us exactly the result we want. When predicting, Random Forest Quantile Regressor takes two arguments: (1) the input data and (2) the quantile.
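To show the idea of a lower, middle, and upper prediction without depending on scikit-garden, the sketch below uses a plain scikit-learn random forest and takes percentiles over the predictions of its individual trees. This is a simpler stand-in, not the scikit-garden algorithm: it measures how much the trees disagree rather than the full conditional distribution, so its ranges tend to be narrower. The helper `predict_quantiles` and the data are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def predict_quantiles(forest, X, quantiles=(25, 50, 75)):
    """Approximate prediction quantiles from the spread of per-tree predictions."""
    # Shape: (n_trees, n_samples)
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    # Shape: (n_quantiles, n_samples)
    return np.percentile(per_tree, quantiles, axis=0)


# Synthetic data: price per square meter falls with distance from the center.
rng = np.random.default_rng(0)
X = rng.uniform(0, 15, size=(400, 1))
y = 3.0e6 - 8.0e4 * X[:, 0] + rng.normal(0, 2.0e5, 400)

forest = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=5, random_state=0
).fit(X, y)

# Price range for two hypothetical apartments, 2 km and 10 km from the center.
lower, middle, upper = predict_quantiles(forest, np.array([[2.0], [10.0]]))
for lo, mid, hi in zip(lower, middle, upper):
    print(f"lower={lo:,.0f}  middle={mid:,.0f}  upper={hi:,.0f}")
```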
We will be using Random Forest Quantile Regressor and tuning its hyper-parameters with RandomizedSearchCV, because GridSearchCV would be computationally expensive.
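A randomized search could be set up as in the sketch below. It tunes a standard scikit-learn `RandomForestRegressor` on synthetic data as a stand-in for the quantile regressor. The parameter ranges, `n_iter`, and the scoring choice are assumptions, not the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (not the real data set).
rng = np.random.default_rng(0)
X = rng.uniform(0, 15, size=(300, 2))
y = 3.0e6 - 8.0e4 * X[:, 0] + rng.normal(0, 2.0e5, 300)

# Assumed search space: 27 combinations in total.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}

# Randomized search tries only n_iter sampled combinations,
# whereas a grid search would fit all 27.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
print(f"best cross-validated MAE: {-search.best_score_:,.0f}")
```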
Final Model Evaluation
- R squared and accuracy are both somewhat higher than for the previous models, and the output is what we wanted (25th percentile, 50th percentile, and 75th percentile)
- The model performs better in the lower price range
- Observing the graph we can see a clear linear trend. Red triangles mark the upper end of the prediction interval, blue dots the middle, and green triangles the lower end
- Once again the struggle in the 4-5 million price range remains, but this time we are giving out a price range that makes sense to people
- In the end this model is what we wanted. It gives us the lower, middle, and upper ends of the prediction interval, and we can widen the interval, for example to 2.5th to 97.5th, if the model gets more accurate.
- In short, after cleaning and merging the data sets, only 1966 of the 3863 rows remained. So almost half of the data was not useful. Since the model is fairly accurate even with this amount of data, we expect its performance to increase as the amount of useful data grows.
We are almost done with the project, but one question is left: “How did I make my web application?”
I used Dash from Plotly. Plotly is a visualization package available in Python that can generate static HTML graph files, and Dash builds on it to serve beautiful dashboards on a live server. This means that our web application is a kind of dashboard running on Flask. You don’t have to be an expert to deploy your model in a Plotly Dash app. You just need to learn a few scripts, and it is a good idea to have a basic knowledge of HTML and CSS styling.
After creating the dashboard on my local machine, I deployed it on Heroku (free web hosting), and there are lots of tutorials out there if you want to make one for yourself.
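For a feel of what “a few scripts” means, here is a minimal Dash sketch with two inputs and one output. It is not the actual application: the layout, the component ids, and the `predict_range` placeholder (which returns made-up numbers instead of calling a trained model) are all assumptions.

```python
def format_price_range(lower, middle, upper):
    """Format three price-per-square-meter values (tugriks) for display."""
    return (
        f"Lower: {lower:,.0f} | Middle: {middle:,.0f} | "
        f"Upper: {upper:,.0f} tugriks per sq. m"
    )


def predict_range(rooms, sqm):
    """Placeholder for the trained model; returns made-up numbers.

    A real app would pass the inputs to the fitted model. This placeholder
    ignores sqm entirely.
    """
    middle = 2_000_000 + 50_000 * rooms
    return 0.9 * middle, middle, 1.1 * middle


def build_app():
    # Imported here so the helpers above also work without Dash installed.
    from dash import Dash, Input, Output, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        html.Label("Rooms"),
        dcc.Input(id="rooms", type="number", value=2),
        html.Label("Square meters"),
        dcc.Input(id="sqm", type="number", value=50),
        html.Div(id="result"),
    ])

    # Re-run the prediction whenever one of the two inputs changes.
    @app.callback(
        Output("result", "children"),
        Input("rooms", "value"),
        Input("sqm", "value"),
    )
    def update_result(rooms, sqm):
        if rooms is None or sqm is None:
            return "Enter the number of rooms and the area."
        return format_price_range(*predict_range(rooms, sqm))

    return app


# To serve locally: build_app().run(debug=True)
# (older Dash releases use run_server instead of run)
```

The callback is the core pattern: Dash calls `update_result` with the current input values and writes the returned string into the `result` element.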