Predicting the Price of Resale HDB Flats Using Linear Regression
Metis Bootcamp Project 2
For the second project of the Metis Bootcamp, I applied linear regression to predict the price of resale HDB flats. HDB flats in Singapore are expensive: in a mature estate, a resale flat costs on average about $300k for a 3-room, $400k for a 4-room and $500k for a 5-room. Web scraping was a requirement of the project, so BeautifulSoup and Selenium were used to scrape the data from data.gov.sg.
After scraping the data, exploratory data analysis was done. The dataset has 10 features and 1 target (resale price).
Features:
1. Month (of transaction)
2. Town
3. Flat type
4. Block
5. Street name
6. Storey range
7. Floor area (in sq metres)
8. Flat model
9. Lease commence date
10. Remaining Lease
Target:
1. Resale Price
Next, a heatmap was used to find the correlations between the variables. In linear regression, the input variables should not be strongly correlated with each other (multicollinearity makes the coefficient estimates unstable), so we eliminate one variable from each correlated pair. For this project, floor area correlates with flat type, and lease commencement date correlates with remaining lease years. Flat type and lease commencement date are therefore removed from the input variables.
The target, resale price, has a right-skewed distribution, so we apply a log transformation to bring it closer to a normal distribution and get a better fit with linear regression.
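The correlation check described above can be sketched in a few lines of pandas. This is only an illustration on synthetic data: the column names, the 0.8 threshold and the reference year 2019 are my assumptions, not values from the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and values are illustrative assumptions.
rng = np.random.default_rng(0)
n = 500
lease_commence_year = rng.integers(1970, 2016, size=n)
df = pd.DataFrame({
    "floor_area_sqm": rng.normal(95, 20, size=n),
    "lease_commence_year": lease_commence_year,
    # Remaining lease is almost a direct function of the commencement year
    # (99-year lease, hypothetical reference year 2019).
    "remaining_lease_years": 99 - (2019 - lease_commence_year)
    + rng.normal(0, 0.5, size=n),
})

def correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for every pair with |Pearson r| >= threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs

pairs = correlated_pairs(df)
print(pairs)

# Keep one variable from the correlated pair, as in the write-up.
reduced = df.drop(columns=["lease_commence_year"])
```

A heatmap shows the same correlation matrix visually; listing the pairs above a threshold just makes the decision about what to drop explicit.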
Resale Price Plot
After applying log transformation
Log Resale Price Plot
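To illustrate the effect of the log transformation, here is a minimal sketch on synthetic right-skewed prices. The lognormal parameters are assumptions chosen only to produce a skewed sample; they are not fitted to the HDB data.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic right-skewed "prices" (assumed parameters, not real data).
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=np.log(400_000), sigma=0.35, size=10_000)

log_prices = np.log(prices)

print("skewness before:", skewness(prices))
print("skewness after: ", skewness(log_prices))

# A model trained on the log target predicts log prices;
# np.exp maps those predictions back to dollars.
recovered = np.exp(log_prices)
```

The skewness drops to roughly zero after the transformation, which is the visual change between the two plots.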
We categorise the towns into 5 regions (Central, East, North, North-East and West) and create dummy variables. The North-East dummy is dropped, since one category serves as the baseline and its variable is redundant.
We arrive at a model with the following features: floor area, remaining lease years, storey range, region_Central, region_East, region_North and region_West. It has an R-squared of 0.817 and an adjusted R-squared of 0.816 for predicting the resale price.
We used 60 percent of the dataset for training, 20 percent for validation and another 20 percent for final testing.
Three models (Linear Regression, Ridge Regression and Degree 2 Polynomial regression) were used for validation and testing.
The validation results are as follows:
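The region encoding could look roughly like the sketch below. The town-to-region mapping shown here is a small illustrative subset that I am assuming for the example; the full mapping used in the project is in the repository.

```python
import pandas as pd

# Illustrative subset of a town-to-region mapping (assumed for this example).
TOWN_TO_REGION = {
    "BISHAN": "Central",
    "TAMPINES": "East",
    "WOODLANDS": "North",
    "SENGKANG": "North-East",
    "JURONG WEST": "West",
}

df = pd.DataFrame({
    "town": ["BISHAN", "TAMPINES", "WOODLANDS", "SENGKANG", "JURONG WEST", "TAMPINES"],
})
df["region"] = df["town"].map(TOWN_TO_REGION)

# One 0/1 column per region, then drop North-East so it becomes the baseline.
dummies = pd.get_dummies(df["region"], prefix="region", dtype=int)
dummies = dummies.drop(columns=["region_North-East"])

print(dummies)
```

A North-East flat is then represented by zeros in all four remaining columns, so no information is lost by dropping its dummy.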
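For reference, adjusted R-squared penalises R-squared for the number of features. A small sketch of the standard formula follows; the sample size and feature count in the example call are hypothetical and not taken from the project.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples <= n_features + 1:
        raise ValueError("need more samples than features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical numbers, only to show the calculation.
example = adjusted_r2(0.80, n_samples=101, n_features=5)
print(round(example, 4))
```

With many rows and only a handful of features the penalty is small, so the two values end up close to each other.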
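One common way to get a 60/20/20 split is to call scikit-learn's train_test_split twice. This is a sketch on placeholder arrays; the random_state values are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the feature matrix and target.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First hold out 20% for final testing.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The test set is only touched once at the end, after the model choice has been made on the validation set.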
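A minimal sketch of how the three models can be set up and compared on a validation set with scikit-learn. The data here is synthetic, and the Ridge alpha, the scaling step and the pipeline layout are my assumptions rather than the exact settings used in the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a mild quadratic effect (not the HDB data).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (
    1.0
    + X @ np.array([0.5, -0.3, 0.2])
    + 0.4 * X[:, 0] ** 2
    + rng.normal(scale=0.2, size=600)
)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    # Ridge is sensitive to feature scale, so standardise first.
    "Ridge Regression": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    # Degree 2 polynomial = linear regression on squared and interaction terms.
    "Degree 2 Polynomial": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
    ),
}

val_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_scores[name] = model.score(X_val, y_val)  # R-squared on validation data
    print(name, round(val_scores[name], 3))
```

On this synthetic data the polynomial model wins because the target really does contain a squared term; on real data the gap depends on how non-linear the relationships are.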
Linear Regression 0.813
Ridge Regression 0.813
Degree 2 Polynomial 0.847
Cross-validation results:
Linear Regression mean 0.813 +- 0.021
Ridge Regression mean 0.813 +- 0.021
Degree 2 Polynomial mean 0.846 +- 0.012
The final testing results are as follows:
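A "mean +- spread" figure like the ones above can be produced with scikit-learn's cross_val_score. This sketch assumes 5 shuffled folds, R-squared scoring and the standard deviation as the spread; the write-up does not state these, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (not the HDB data).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=500)

# Assumed setup: 5 shuffled folds, scored with R-squared.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

summary = f"Linear Regression mean {scores.mean():.3f} +- {scores.std():.3f}"
print(summary)
```

The spread across folds gives a rough sense of how much a single validation score could move just from the choice of split.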
Linear Regression 0.818
Ridge Regression 0.818
Degree 2 Polynomial 0.818
The code can be found at https://github.com/victortan83/project2