Predicting House Prices with Linear Regression

Ash
4 min read · Mar 11, 2019


Project Overview

I recently finished the Linear Regression section of Professor Andrew Ng’s Machine Learning MOOC, and wanted to apply what I’d learned by incorporating Machine Learning into a “real” project.

After some brainstorming, I decided to build a REST API that predicts housing prices in 4 geographies: Vancouver, Toronto, New York, and San Francisco.

You can check out the sample app here (GitHub), the REST API here (GitHub), and a demo of the app directly below.

REST API Demo

Curl Demo

Sample App Demo

Building the App

Planning Phase

Before writing a single line of code, I made a technical sketch of the project using Draw.io:

Image A: A technical sketch of the project. The numbering relates to the order in which each component was built.

The following is a list of all the components in Image A and their respective functionality:

  • House Price Prediction Model: a multivariate linear regression model written with scikit-learn that predicts home prices on the basis of size (square footage), number of bedrooms, and number of bathrooms.
  • Data Sources: data from 4 different websites (1 per geography) are used as training data for the model. Their contents are scraped once every 24 hours to continuously improve the model’s accuracy.
  • Python Data Scraping Script: a Beautiful Soup-based web scraper configured to scrape the 4 aforementioned websites. After each scrape, the data is fed into a Heroku Postgres database for persistence. The script is scheduled to run every 24 hours to continuously add new data to the database.
  • REST API: an API written in Flask that allows third-party applications to query the ML model via a GET request.
  • Sample App: a simple app created to demonstrate usage of the ML model’s REST API.

Building the House Price Prediction Model

Python’s scikit-learn library makes writing ML models fairly easy. For example, creating a linear regression model whose features and labels are in arrays feature_array and price_usd (respectively) takes just 2 lines of code:
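(The snippet below is a sketch from memory rather than the exact code in the repo; feature_array holds the size/bedroom/bathroom columns and price_usd the corresponding prices.)

```python
from sklearn.linear_model import LinearRegression

# feature_array: one row per listing -> [size_sqft, bedrooms, bathrooms]
# price_usd: the corresponding prices, both built from the scraped data
model = LinearRegression()
model.fit(feature_array, price_usd)
```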

Scikit-learn models are serializable using Python’s pickle library, which allows them to be saved to the hard drive once they are trained. This means that if the training data is ever deleted (accidentally or otherwise), I’ll still have a copy of the model to fall back on.
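Saving and re-loading the trained model is only a few more lines (the file name here is illustrative):

```python
import pickle

# Save the trained model to disk...
with open("house_price_model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and load it back later without having to retrain
with open("house_price_model.pkl", "rb") as f:
    model = pickle.load(f)
```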

Writing the Data Scraping Script

Writing the data scraping script was straightforward; for all 4 websites I used the technique outlined by Data Science Dojo in this video. Namely, I grabbed the HTML from the given website using urllib.request, then parsed and searched through it with Beautiful Soup to get each listing’s price (in USD), size (in square feet), number of bedrooms, and number of bathrooms.
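The per-site code boils down to something like the sketch below. The URL and CSS class names here are made up for illustration; each of the 4 sites needed its own selectors.

```python
import urllib.request
from bs4 import BeautifulSoup

# Illustrative URL and class names -- each site has its own markup
url = "https://example-listings.com/vancouver"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.find_all("div", class_="listing"):
    price = card.find("span", class_="price")
    size = card.find("span", class_="sqft")
    beds = card.find("span", class_="beds")
    baths = card.find("span", class_="baths")

    # Skip any listing that is missing one of the 4 fields
    if not all((price, size, beds, baths)):
        continue

    rows.append((
        float(price.get_text().replace("$", "").replace(",", "")),
        float(size.get_text().replace(",", "")),
        int(beds.get_text()),
        float(baths.get_text()),
    ))
```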

If a listing was missing any of those 4 fields, I simply skipped it (the continue in the loop above).

After collecting all of the listings from a given website, I bulk-inserted them into a Heroku Postgres database using psycopg:
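(The sketch below uses psycopg2; the table and column names are illustrative, and rows is the list built by the scraper above.)

```python
import os

import psycopg2
from psycopg2.extras import execute_values

# DATABASE_URL is the connection string Heroku Postgres exposes to the app
conn = psycopg2.connect(os.environ["DATABASE_URL"], sslmode="require")
with conn, conn.cursor() as cur:
    execute_values(
        cur,
        "INSERT INTO listings (price_usd, sqft, bedrooms, bathrooms, city) VALUES %s",
        [(*row, "vancouver") for row in rows],  # tag each row with its geography
    )
conn.close()
```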

Creating the REST API

The REST API requires 4 URL parameters: house size, number of bedrooms, number of bathrooms, and location. Thankfully, Flask makes it incredibly easy to accept the GET request and parse its parameters.

Once the parameters are extracted, I validate them (i.e. make sure none are negative) and then query the model, responding to the GET request with the predicted price.
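Put together, the endpoint looks roughly like the sketch below; the route, parameter names, and model file names are illustrative, with the per-geography models folded into one dict.

```python
import pickle

from flask import Flask, request, jsonify

app = Flask(__name__)

# One pickled model per geography (file names are illustrative)
MODELS = {}
for city in ("vancouver", "toronto", "new_york", "san_francisco"):
    with open(f"{city}_model.pkl", "rb") as f:
        MODELS[city] = pickle.load(f)

@app.route("/predict", methods=["GET"])
def predict():
    # Parse the 4 required URL parameters
    size = request.args.get("size", type=float)
    bedrooms = request.args.get("bedrooms", type=int)
    bathrooms = request.args.get("bathrooms", type=float)
    location = request.args.get("location", "").lower()

    # Reject requests with missing, negative, or unknown-location parameters
    if None in (size, bedrooms, bathrooms) or location not in MODELS \
            or min(size, bedrooms, bathrooms) < 0:
        return jsonify({"error": "invalid parameters"}), 400

    predicted = float(MODELS[location].predict([[size, bedrooms, bathrooms]])[0])
    return jsonify({"predicted_price_usd": round(predicted, 2)})
```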

Building the Sample App Script

Demoing the REST API with curl is useful for a technical audience, but it doesn’t communicate much to a layperson. So, to make the demo more accessible, I also created a simple web app that queries the API with user-supplied parameters (home size, etc.) and then displays the predicted home price.
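Under the hood, each request the web app makes amounts to something like this (the host and parameter names are illustrative):

```python
import requests

resp = requests.get(
    "https://house-price-api.example.com/predict",
    params={"size": 1200, "bedrooms": 2, "bathrooms": 2, "location": "vancouver"},
)
print(resp.json()["predicted_price_usd"])
```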

Conclusion: Lessons Learned & Potential Future Improvements

  • I found the 10,000-row limit on Heroku Postgres’s free tier extremely limiting; I had to delete old data several times a week to free up space for new data. In the future, I could look into provisioning 4 separate free-tier databases, 1 per geography (increasing my capacity from 10,000 rows to 40,000 rows).
  • From my understanding, the REST API I’ve built can be classified as a monolith, so I could look into breaking it up into smaller microservices in the future.
  • Once I’ve gotten more comfortable with Machine Learning, I could swap out the off-the-shelf model from scikit-learn for a custom model built with TensorFlow.
  • At the moment the REST API has no authentication mechanism (i.e. anyone can query it). Although I doubt it’ll see enough traffic to justify any level of security, I could secure the API using OAuth.
  • Linear regression is susceptible to outliers (i.e. uncharacteristically overpriced or underpriced houses), especially when, as in my case, the dataset is small. To avoid this distortion, I could preprocess my data by removing outliers before building the model (a simple version of this is sketched below).
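For that last point, even a simple IQR-based filter on prices would help; here is a sketch of one possible version (not something the current pipeline does), assuming feature_array and price_usd are NumPy arrays:

```python
import numpy as np

# Keep only listings whose price falls within 1.5 * IQR of the middle 50%
q1, q3 = np.percentile(price_usd, [25, 75])
iqr = q3 - q1
mask = (price_usd >= q1 - 1.5 * iqr) & (price_usd <= q3 + 1.5 * iqr)
feature_array, price_usd = feature_array[mask], price_usd[mask]
```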
