Predicting the Price of Resale HDB Flats using Linear Regression

Metis Bootcamp Project 2

Tan Boon Kiat Victor
Nov 15, 2019
Photo by Max Oh on Unsplash

For the second project of Metis Bootcamp, I applied Linear Regression to predict the price of resale HDB flats. HDB flats in Singapore are expensive: in a mature estate, a resale flat costs on average about $300k for a 3-room flat, $400k for a 4-room flat and $500k for a 5-room flat. Web scraping was a requirement for this project, so BeautifulSoup and Selenium were used to scrape the data from data.gov.sg.

After scraping the data, exploratory data analysis was done. The dataset has 10 features and 1 target (resale price).

Features:

1. Month (of transaction)

2. Town

3. Flat type

4. Block

5. Street name

6. Storey range

7. Floor area (in sq metres)

8. Flat model

9. Lease commence date

10. Remaining Lease

Target:

1. Resale Price

Next, a heatmap was used to examine the correlations between the variables. Linear regression works best when the input variables are not strongly correlated with one another, since correlated inputs make the coefficients unstable and hard to interpret, so we eliminated variables that were highly correlated with others. In this project, floor area correlates with flat type, and lease commencement date correlates with remaining lease years. Flat type and lease commencement date were therefore removed from the input variables.
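The correlation check described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's code: the column names, the 2019 reference year and the 0.8 threshold are assumptions made for the example. The same matrix is what a heatmap would visualise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the real columns (assumed names, made-up values).
lease_commence_year = rng.integers(1970, 2015, size=n).astype(float)
# Assumption: remaining lease is derived from the commencement year
# (99-year lease, 2019 as reference year), so the two move together.
remaining_lease = 99 - (2019 - lease_commence_year)
floor_area = rng.normal(95, 20, size=n)

names = ["floor_area", "lease_commence_year", "remaining_lease"]
X = np.column_stack([floor_area, lease_commence_year, remaining_lease])

# Pearson correlation matrix; rowvar=False treats each column as a variable.
corr = np.corrcoef(X, rowvar=False)

# List feature pairs whose absolute correlation exceeds the assumed threshold.
threshold = 0.8
correlated_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(correlated_pairs)
```

For each flagged pair, one of the two variables is a candidate for removal.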

The target, resale price, has a distribution that is skewed to the right, so we applied a log transformation to bring it closer to a normal distribution and get a better fit with linear regression.

Resale Price Plot

After applying log transformation

Log Resale Price Plot
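The effect of the log transformation can be illustrated with a small sketch. The prices below are synthetic (drawn from a log-normal distribution with made-up parameters), and the `skewness` helper is a hypothetical function written for this example; it is not taken from the project code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic right-skewed "prices" (made-up parameters, not the HDB data).
prices = rng.lognormal(mean=12.9, sigma=0.35, size=5000)


def skewness(x):
    """Sample skewness: mean of the cubed standardised values."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


# Natural log of the target; this is what the model would be trained on.
log_prices = np.log(prices)

print(f"skewness before: {skewness(prices):.2f}")
print(f"skewness after:  {skewness(log_prices):.2f}")

# Predictions made on the log scale are converted back with the exponential.
recovered = np.exp(log_prices)
```

A skewness close to zero after the transformation indicates a roughly symmetric distribution. Note that a model trained on the log target predicts log prices, so predictions need `np.exp` to get back to dollars.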

We categorised the towns into 5 regions (Central, East, North, North-East and West) and created dummy variables for them. The North-East dummy was dropped because, with five categories, only four dummies are needed: a flat with all four set to 0 is in the North-East.
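The region encoding above might look like the sketch below. The `town_to_region` lookup is a hypothetical, partial mapping written for illustration (the grouping actually used in the project may differ), and the town column name is assumed.

```python
import pandas as pd

# Hypothetical, partial lookup for illustration only.
town_to_region = {
    "BUKIT MERAH": "Central",
    "BEDOK": "East",
    "WOODLANDS": "North",
    "SENGKANG": "North-East",
    "JURONG WEST": "West",
}

df = pd.DataFrame(
    {"town": ["BEDOK", "SENGKANG", "WOODLANDS", "JURONG WEST", "BUKIT MERAH", "BEDOK"]}
)
df["region"] = df["town"].map(town_to_region)

# One 0/1 indicator column per region.
dummies = pd.get_dummies(df["region"], prefix="region", dtype=int)

# Drop North-East so it becomes the baseline category:
# a row with all remaining dummies equal to 0 is a North-East flat.
dummies = dummies.drop(columns="region_North-East")
print(dummies)
```

Keeping all five dummies alongside an intercept would make the columns perfectly collinear, which is why one of them is dropped.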

We arrived at a model with the following features: floor area, remaining lease years, storey range, region_Central, region_East, region_North and region_West. It predicts the resale price with an R-squared of 0.817 and an adjusted R-squared of 0.816.
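For readers unfamiliar with the adjusted R-squared, here is a small sketch of the standard formula. The R-squared of 0.817 and the 7 features come from the text above; the sample size of 1000 is a hypothetical value, since the number of rows is not stated here.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R-squared: penalises R-squared for the number of features used."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# n_samples=1000 is hypothetical; 0.817 and 7 features are from the text.
print(round(adjusted_r2(0.817, n_samples=1000, n_features=7), 4))
```

With many rows and few features the penalty is small, which is consistent with the two values being so close.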

We used 60 percent of the dataset for training, 20 percent for validation and another 20 percent for final testing.
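One common way to get a 60/20/20 split is to call scikit-learn's `train_test_split` twice. This is a sketch under that assumption, using toy arrays and an arbitrary random seed; it may not match how the split was done in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features and target.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# Step 1: hold out 20% of all rows as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into 75% train / 25% validation,
# which is 60% / 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The test set is set aside first and only touched once, after the model has been chosen on the validation set.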

Three models (Linear Regression, Ridge Regression and degree 2 Polynomial Regression) were used for validation and testing.
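A comparison of the three model types could be set up roughly as below with scikit-learn. The data is synthetic, the Ridge `alpha` is an assumed value, and the score is assumed to be R-squared (what `.score()` returns for these regressors), so the numbers printed here have nothing to do with the results reported in this post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data with a quadratic effect (made up, not the HDB dataset).
X = rng.uniform(0, 1, size=(600, 3))
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] ** 2 - X[:, 2] + rng.normal(0, 0.1, size=600)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),  # alpha=1.0 is an assumed value
    # Degree 2 polynomial = linear regression on squared and interaction terms.
    "Degree 2 Polynomial": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
    ),
}

val_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, .score() returns R-squared on the given data.
    val_scores[name] = model.score(X_val, y_val)
    print(f"{name}: validation R^2 = {val_scores[name]:.3f}")
```

Each model is fitted on the training set only and scored on the validation set, so the scores are comparable across models.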

The validation results are as follows:

Linear Regression 0.813

Ridge Regression 0.813

Degree 2 Polynomial 0.847

Cross-validation results:

Linear Regression mean 0.813 ± 0.021

Ridge Regression mean 0.813 ± 0.021

Degree 2 Polynomial mean 0.846 ± 0.012
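Cross-validation figures like the ones above (a mean and a spread) can be produced with `cross_val_score`. This sketch assumes 5 shuffled folds, R-squared as the score and the standard deviation as the spread; none of these are stated in the post, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)

# Synthetic linear data (made up, not the HDB dataset).
X = rng.uniform(0, 1, size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.2, size=500)

# 5 shuffled folds is an assumption; each fold serves once as the held-out part.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(f"mean R^2 = {scores.mean():.3f} +- {scores.std():.3f}")
```

A small spread across folds suggests the score does not depend much on which rows happened to land in the validation split.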

The final testing results are as follows:

Linear Regression 0.818

Ridge Regression 0.818

Degree 2 Polynomial 0.818

The code can be found at https://github.com/victortan83/project2
