End to End Machine Learning Project
The best way to improve a skill is to practice it on a real-world problem. To that end, I built a web application that estimates the rent for a given locality in a given city, based on the user's inputs, using machine learning models trained for that city.
Motivation behind the project
By and large, I had noticed that little work has been done applying machine learning to real estate in the Indian context. The websites that do exist, such as magicbricks.com and makaan.com, are too granular: they ask for many inputs that someone planning to move to a new city may not know.
The main motivation behind the project was to create a web app that uses machine learning to give a good estimate of rent from the inputs given. The focus was on a simple user interface together with accurate results.
Fetching the data
For this project, I used a dataset from Kaggle that contains housing prices for 8 different cities in India.
Initial set up
For this project, I used two resources from the AWS free tier:
- A free-tier t2.micro EC2 instance for maintaining a server
- A free-tier RDS database with a minimal configuration and automatic backups disabled, for maintaining a dynamic database in the cloud
You may want to take care of the following point while creating the resources:
Authorize the public IP addresses of both your personal machine and the server you have created for the SQL database, so that you can connect to it from either the PC or the server.
For this project, I used the _All_Cities_Cleaned.csv file from the Kaggle dataset.
Although this file had already been cleaned, it still required further preprocessing.
After cleaning and preprocessing the file, I created 2 SQL files containing insert queries, so that the data can be read dynamically from the database and the models can be updated accordingly.
Initially, I ran the SQL files in MySQL Workbench to load the data. After that, inputs from users are added to the table with insert queries, so that the model trains on the updated data.
Python code for preprocessing
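The step in which user inputs are written back to the table can be sketched roughly as below. This is a minimal sketch and not the project's code: it uses Python's built-in sqlite3 module as a stand-in for the MySQL database, and the table name `listings`, the column names, and the sample values are all assumptions made for illustration. The point it shows is passing values as query parameters instead of formatting them into the SQL string.

```python
import sqlite3

# Hypothetical column names; the real table schema is not shown in the article.
COLUMNS = ("city", "locality", "area", "bedrooms", "price")


def create_table(conn):
    """Create the (assumed) listings table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "city TEXT, locality TEXT, area REAL, bedrooms INTEGER, price REAL)"
    )


def insert_listing(conn, listing):
    """Insert one user-submitted listing using a parameterized query."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO listings ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    # Values travel separately from the SQL text, so user input is never
    # interpreted as SQL.
    conn.execute(sql, tuple(listing[col] for col in COLUMNS))
    conn.commit()


# In-memory SQLite database as a stand-in for the MySQL database on RDS.
conn = sqlite3.connect(":memory:")
create_table(conn)

# Made-up example input.
insert_listing(
    conn,
    {"city": "Pune", "locality": "Kothrud", "area": 650.0, "bedrooms": 2, "price": 18000.0},
)

stored = conn.execute(
    "SELECT city, locality, area, bedrooms, price FROM listings"
).fetchall()
print(stored)
```

With a MySQL driver such as mysql-connector-python or PyMySQL, the placeholder is `%s` instead of `?`, but the pattern of passing the values as a separate tuple stays the same.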
Exploratory Data Analysis
For the EDA, I loaded the cleaned and preprocessed data from SQL.
I then added two columns to each DataFrame: a city column, denoting which city the data came from, and an Affordability column (price divided by area), denoting how affordable the houses in each city are.
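The Affordability column and its per-city average can be sketched as follows. This is an illustration only: it uses plain Python dictionaries rather than DataFrames so that it runs without extra libraries, and the rows, the key names, and the helper functions `add_affordability` and `mean_by_city` are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration; the values are not from the Kaggle dataset.
listings = [
    {"city": "Mumbai", "price": 50000.0, "area": 500.0},
    {"city": "Mumbai", "price": 90000.0, "area": 1000.0},
    {"city": "Pune", "price": 20000.0, "area": 800.0},
]


def add_affordability(rows):
    """Return copies of the rows with affordability = price / area.

    Rows with a non-positive area are skipped to avoid division by zero.
    """
    return [
        {**row, "affordability": row["price"] / row["area"]}
        for row in rows
        if row["area"] > 0
    ]


def mean_by_city(rows, column):
    """Average the given column within each city."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["city"]].append(row[column])
    return {city: mean(values) for city, values in groups.items()}


enriched = add_affordability(listings)
city_affordability = mean_by_city(enriched, "affordability")
print(city_affordability)
```

A lower value means a lower price per unit of area, so in this toy data Pune would count as more affordable than Mumbai.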
I then looked at the number of houses for rent in each city and found that most were in Mumbai, Delhi and Pune. This may be because Mumbai is the financial capital of India, Delhi is the political capital, and Pune is known as the Oxford of the East for its educational institutes.
Next, I plotted the average price of houses in each city to see which cities had the most expensive houses; these turned out to be Delhi and Mumbai.
I then plotted the average area of houses in each city to check whether houses are priced in line with their area. The houses in Delhi, Ahmedabad and Hyderabad are the most spacious.
After that, I plotted the affordability of houses in each city: the lower the price per square foot, the more affordable the houses in that city are. By this measure, Ahmedabad, Kolkata and Hyderabad are the most affordable cities in the dataset.
To analyze the data at a deeper level, I plotted the categorical columns ['SELLER TYPE', 'LAYOUT TYPE', 'PROPERTY TYPE', 'FURNISH TYPE'] for each city as a 2x2 grid of pie charts showing the proportion of each category, with text annotated on the side.
I then plotted the numerical columns for each city as a 2x2 grid: the top row shows the distributions of price and area, and the bottom row shows histograms of the number of bedrooms and the number of bathrooms.
I then plotted the 10 most affordable and 10 least affordable localities in each city side by side. Localities were ranked by the average of the Affordability column in that city's data, grouped by locality.
Similarly, I plotted the 10 most spacious and 10 least spacious localities in each city side by side, ranked by the average of the area column in that city's data, grouped by locality.
Python code for Exploratory Data Analysis
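The locality ranking described above (group by locality, average a column, take the 10 lowest and 10 highest) can be sketched as below. It is a rough sketch under assumptions: the helper `rank_localities`, the key names, and the sample rows are hypothetical, and plain Python is used in place of a DataFrame group-by.

```python
from collections import defaultdict
from statistics import mean


def rank_localities(rows, column, n=10):
    """Return (n localities with the lowest mean, n with the highest mean).

    For affordability (price per unit area), the lowest means are the most
    affordable localities; for area, the highest means are the most spacious.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["locality"]].append(row[column])
    # Sort localities by their mean value, ascending.
    ordered = [
        locality
        for _, locality in sorted((mean(values), locality) for locality, values in groups.items())
    ]
    lowest = ordered[:n]
    highest = ordered[::-1][:n]
    return lowest, highest


# Made-up rows; locality names are placeholders.
sample = [
    {"locality": "A", "affordability": 10},
    {"locality": "A", "affordability": 20},
    {"locality": "B", "affordability": 5},
    {"locality": "C", "affordability": 30},
]

most_affordable, least_affordable = rank_localities(sample, "affordability", n=2)
print(most_affordable, least_affordable)
```

The same helper works for the area column by passing `"area"` instead of `"affordability"`, provided the rows carry that key.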
Now that the data has been preprocessed and analyzed, we can move on to the main element of the project: building the machine learning model that powers the web app in the backend.
Since this is a regression problem, I evaluated the models on the basis of the R2 score and the Mean Absolute Error (MAE).
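For reference, the two metrics can be written out in a few lines of plain Python. This is only a sketch of the standard definitions, not the project's evaluation code; in practice the functions of the same names in `sklearn.metrics` would normally be used. The sample values are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant, otherwise SS_tot would be zero.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up rents and predictions.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mae, r2)
```

A lower MAE and an R2 score closer to 1 both indicate a better fit, which is the sense in which the models are compared here.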
I tried the following models for this project:
- Linear Regression
- Decision Tree Regression
- Random Forest Regression
- AdaBoost Regression
- Gradient Boosting Regression
- XGBoost Regression
Of the models above, the XGBoost Regressor had the lowest Mean Absolute Error and the highest R2 score on both the train and test sets.
Since XGBoost was the best model, I tried hyperparameter tuning on the XGBoost Regressor.
The tuned model did not show much improvement, so I kept the original XGBoost model.
Python code for model building
Putting all the components together
Now that the models have been created, we can build a web app with several endpoints that show the analysis and information about each city to end users, and that provide a simple user interface to the machine learning models.
Python code for creating the web app using Flask
Retraining the model
The model has been trained once, but it needs to be retrained on new data every month. For that, I created a Python script that retrains the model and overwrites the graphs with updated versions.
Python code for retraining the model
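As a separate illustration of the "retrain and overwrite" idea, here is a minimal sketch that is not the script used in the project. The `train_model` function is a hypothetical stand-in that just averages prices, and writing to a temporary file before swapping it in with `os.replace` is a design choice assumed here so that the web app never reads a half-written model file.

```python
import os
import pickle
import tempfile


def train_model(rows):
    """Hypothetical stand-in for the real training step: stores the mean price."""
    prices = [row["price"] for row in rows]
    return {"mean_price": sum(prices) / len(prices)}


def save_model_atomically(model, path):
    """Write the model to a temp file in the same directory, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            pickle.dump(model, tmp_file)
        # os.replace overwrites the destination if it already exists.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def retrain(rows, path):
    """Train on the current rows and overwrite the saved model."""
    model = train_model(rows)
    save_model_atomically(model, path)
    return model


# Demo in a throwaway directory with made-up data.
demo_dir = tempfile.mkdtemp()
model_path = os.path.join(demo_dir, "model.pkl")

retrain([{"price": 100.0}, {"price": 300.0}], model_path)
# A later run with more data overwrites the earlier file.
retrain([{"price": 100.0}, {"price": 300.0}, {"price": 500.0}], model_path)

with open(model_path, "rb") as model_file:
    loaded = pickle.load(model_file)
print(loaded)
```

A script along these lines would be what a monthly scheduled job invokes, with the data read from the database instead of the hard-coded rows.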
To deploy the model, I created a server on Linode, deployed the app using Nginx and Gunicorn, and then linked it to a domain using Namecheap.
To get a domain, buy one from any domain provider, configure its nameservers for the server provider you are using, and then configure the DNS records to point the domain to your server. You need records for 2 hosts, www and the blank host, so that both www.YOUR_DOMAIN_NAME.com and YOUR_DOMAIN_NAME.com resolve to the IP address of your server.
For the SSL certificates, I used Let's Encrypt, a free, non-profit certificate authority. For retraining the model every month, I used the crontab utility available in Ubuntu.
Here are some commands that I used