As a data-driven person, I have always wanted to analyze real estate data in Toronto, because I knew such a piece of work would be useful to many people.
Unlike in the US, it has been notoriously difficult for an average home buyer in Toronto to get their hands on clean real estate transaction data. The Toronto Real Estate Board (TREB) continually fought to keep transaction data from the public, and we always had to go through a broker to get the details.
This changed last year, when the Supreme Court of Canada ordered TREB to release the data. Since I was looking to buy a house in the near future, I decided to write this article from a buyer's perspective.
My family and I started looking for new houses in a particular neighborhood and we were trying to do the initial viewings on our own. An agent is usually helpful, but as a data analyst, I thought we could do things a bit differently.
With a budget in mind, we would often filter for houses listed under a certain amount. We soon found that many houses were intentionally listed too high (as the starting anchor for the bargaining process) or too low (to trigger a bidding war). While both of these selling strategies have their merits, they made it difficult for us to figure out whether a particular house was worth our time. Oftentimes we would go to a house that we liked, only to be told by the agent that the sellers had no intention of selling anywhere close to the listing price. Other times we would skip a listing because we thought it was out of our price range.
At first we tried using a popular website called HouseSigma. It is an excellent place for house searching in Toronto but we found that the price estimates for active listings were not suitable for our purposes (influenced too much by the listing price).
To make the process easier, I decided to collect some recent sales data in the area I was interested in and run some analysis to see what type of house was within my budget. I also figured that if we did start to negotiate, the analysis would come in handy.
Although Toronto real estate transaction data is now available to the public, it is still difficult to get all the data for the entire city in a format suitable for data analysis (like an Excel spreadsheet). Therefore I ended up only collecting the last six months of data for the areas that we were personally interested in, and I limited the housing type to detached houses. This gave me 84 real estate transactions to analyze.
Yes, that seems like a small dataset, but I think looking at what 84 houses sold for in a small geographic area, within a specific timeframe, could give us good enough estimates for our purposes. It would have been great to look at different areas and historical price trends as well, but that data just wasn't available to me at the time.
Here is the list of variables I collected for each transaction:
- Date — the date of the sales transaction. April 2019 to September 2019.
- Address — the address of the house.
- Price — the amount of the sales transaction.
- Bedrooms (upper) — the number of bedrooms above the basement.
- Bedrooms (lower) — the number of bedrooms in the basement.
- Bathrooms — the number of bathrooms.
- Garage — the number of cars that can fit into the garage. Zero if no garage.
- Tax — the amount of annual property tax on the house for the previous year. Zero if it is a new house (number of new houses = 7).
- Lot width — the width of the lot that the house is built on.
- Lot Length — the length of the lot that the house is built on.
- Floors — the number of floors in the house.
- Front facing — the direction that the front door is facing.
- School — the school rating (average of elementary and high school) of the area of the house (worst=1, best=10). In Toronto, the public schools that children attend depend on where they live.
- Corner Lot — whether the house is on a corner lot.
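To give a concrete picture of the schema, here is a minimal sketch of the dataset as a pandas DataFrame. The column names and the two example rows are hypothetical stand-ins I made up for illustration; the real transaction data is not reproduced here.

```python
import pandas as pd

# Hypothetical column names mirroring the variables listed above.
columns = [
    "date", "address", "price", "bedrooms_upper", "bedrooms_lower",
    "bathrooms", "garage", "tax", "lot_width", "lot_length",
    "floors", "front_facing", "school", "corner_lot",
]

# Two made-up example rows (not real transactions).
rows = [
    ["2019-05-14", "12 Example Ave", 1_650_000, 4, 1, 4, 2, 7200,
     40.0, 120.0, 2, "N", 8.5, False],
    ["2019-08-02", "87 Sample St", 2_400_000, 5, 2, 5, 2, 0,  # tax = 0 -> new build
     45.0, 130.0, 3, "S", 8.5, True],
]

df = pd.DataFrame(rows, columns=columns)
df["date"] = pd.to_datetime(df["date"])

print(df.shape)               # (2, 14)
print(df["price"].median())
```

In the real dataset there would be 84 such rows, one per transaction.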
There were a few things that caught my eye when looking at the data:
- The area we were looking at is expensive! The median house price is $1.67 million. I had read that the overall average detached house in Toronto was in the $1.2–1.3 million range. I’m starting to wonder if there are more affordable areas in the city that would also suit our needs.
- The school rating is above average. I’m not sure if it’s because of the superior education the school provides or because of the abilities/preparation of the students attending.
- It’s interesting that most of the houses face the North-South direction. I know that my family prefers this configuration, so perhaps this was by design? It will be interesting to see if this affects the price (more on this later).
Building a predictive model
We want a tool that predicts the final selling price of the houses we are looking at. If the predicted price is within our budget, we may spend more effort looking into the house further (like actually visiting the place).
84 observations is not usually considered a large dataset for a machine learning model, but it should give decent estimates if they are recent and restricted to a specific area. Imagine you were a real estate agent specializing in that area and had seen 84 detached-house transactions; you would probably have a good idea of what houses there sell for. There were also some features I wish I had (like square footage), but they weren't easily accessible.
There were a couple of approaches I considered, but I ended up deciding on a gradient boosting machine (GBM) model. I won't go into the details of why I chose this, but it seemed like the simplest option and often gives good accuracy on this type of problem.
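The article does not include the modeling code, but a minimal GBM fit might look like the sketch below. I'm assuming scikit-learn's GradientBoostingRegressor here, and using random synthetic stand-ins for a few of the features, since the real 84 transactions aren't public.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 84  # matches the number of transactions in the article

# Synthetic stand-ins for a few of the listed features.
X = np.column_stack([
    rng.integers(2, 7, n),       # bathrooms
    rng.integers(2, 6, n),       # bedrooms (upper)
    rng.uniform(3000, 9000, n),  # annual tax
    rng.uniform(25, 60, n),      # lot width
])
# Fake price: driven mostly by bathrooms and tax, plus noise.
y = 300_000 * X[:, 0] + 100 * X[:, 2] + rng.normal(0, 50_000, n)

model = GradientBoostingRegressor(random_state=0)  # default hyperparameters
model.fit(X, y)
print(round(model.score(X, y), 2))  # training R^2; high on this toy data
```

With real data you would of course evaluate out of sample, which is the subject of the next section.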
Evaluating the model
We need to be careful when evaluating a model trained on such a small amount of data. I wanted a realistic estimate of the bias and variance of the predicted prices the model was spitting out. Usually we would hold out a test set and cross-validate on the rest of the data, but since the dataset is so small, I opted not to use a test set. I just selected one set of hyperparameters (no optimizing) and used leave-one-out cross-validation. This trades hyperparameter optimization for a better evaluation of the model.
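The leave-one-out procedure described above can be sketched as follows. Again I'm assuming scikit-learn and synthetic data; the metric choices (MAE and MAPE) are my own, since the article doesn't specify which it reports.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n = 84
X = np.column_stack([rng.integers(2, 7, n), rng.uniform(3000, 9000, n)])
y = 300_000 * X[:, 0] + 100 * X[:, 1] + rng.normal(0, 50_000, n)

model = GradientBoostingRegressor(random_state=0)  # one fixed setting, no tuning
# Each of the 84 predictions comes from a model trained on the other 83 rows.
preds = cross_val_predict(model, X, y, cv=LeaveOneOut())

mae = mean_absolute_error(y, preds)
mape = np.mean(np.abs((y - preds) / y)) * 100
print(f"MAE: ${mae:,.0f}  MAPE: {mape:.1f}%")
```

Because every prediction is made on a held-out row, these numbers are an honest, if high-variance, estimate of out-of-sample error.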
Here is the resulting evaluation of the model:
A lot of room for improvement, but it was good enough to work with for our purposes.
What is the model telling us?
Knowing what the model is really doing is useful in many cases. In applications like credit underwriting or pricing, there could be ethical issues with the model if you aren't careful about which features go in. For our purposes, I just wanted to know whether things were making sense. Many times you can find errors in your data, or gain insight into whether to drop certain features. It's not always possible to interpret your model, but when you can, it's a really good thing to do.
For models like GBMs, feature importance and partial dependency plots are useful for interpretation.
Here is the feature importance plot for our model:
The most important feature to the model is… the number of bathrooms??? Now that is interesting. My theory is that the number of bathrooms more accurately reflects the size/capacity of the house, and maybe even its age or whether it has been renovated recently. It still caught me by surprise that bathrooms were more important than bedrooms, but I guess it does make sense.
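For GBMs, feature importances come essentially for free from the fitted trees. A minimal sketch with scikit-learn's impurity-based importances, again on synthetic stand-in data where bathrooms is deliberately the dominant driver:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
n = 84
features = ["bathrooms", "bedrooms_upper", "tax", "lot_width"]
X = np.column_stack([
    rng.integers(2, 7, n),
    rng.integers(2, 6, n),
    rng.uniform(3000, 9000, n),
    rng.uniform(25, 60, n),
])
# Toy price: bathrooms carries most of the signal by construction.
y = 400_000 * X[:, 0] + 50 * X[:, 2] + rng.normal(0, 50_000, n)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
# Impurity-based importances sum to 1; higher = used more by the trees.
for name, imp in sorted(zip(features, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:15s} {imp:.2f}")
```

One caveat: impurity-based importances can be biased toward high-cardinality features, so permutation importance is a common cross-check.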
Let’s take a look at the partial dependence plot of bathrooms:
One can see that there is a large jump from 4 to 5 bathrooms. What this chart is telling us is that, on average, going from 4 to 5 bathrooms increases the price of the house from around $1.7M to $2.5M: an $800K difference!
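A partial dependence curve like this one is simple to compute by hand: fix the feature of interest at each candidate value, leave every other column as observed, and average the model's predictions. A sketch under the same synthetic-data assumptions as above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n = 84
X = np.column_stack([rng.integers(2, 7, n), rng.uniform(3000, 9000, n)])
y = 400_000 * X[:, 0] + 100 * X[:, 1] + rng.normal(0, 50_000, n)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Manual partial dependence for feature 0 (bathrooms): hold bathrooms fixed,
# keep every other column as observed, and average the predictions.
for b in range(2, 7):
    X_mod = X.copy()
    X_mod[:, 0] = b
    print(b, f"${model.predict(X_mod).mean():,.0f}")
```

scikit-learn's `sklearn.inspection` module offers the same computation plus plotting, but the manual loop makes the averaging explicit.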
This is a good example of the saying "correlation does not imply causation". Adding a lot of bathrooms while keeping everything else about your house the same will not improve its value by $800K. The number of bathrooms is likely correlated with the real causal driver of price (probably size). That being said, I think we can keep this variable in the model, since it is still useful for prediction.
The next two most important features are tax and lot size, which is reassuring: common sense says they should be closely associated with the selling price.
Corner lot and fronting direction were two features that we personally found important, but did not seem to be associated with the price very much. This is a good thing, because we don’t need to pay extra for things that we like :)
Overall, this project was a fun and useful one. It does give us more confidence that the houses we are visiting are in our budget range and will save us time in the future. It will be interesting to test this model out of sample to see how it continues to perform.
It would have been better to get more data in different areas from further back so we could do more analysis on trends around the city, but the access to data isn’t there yet. Hopefully we can find a way to do this in the future.
Happy house searching!