AirBnB Listings Data — Toronto, October 2018
In this blog post, we will learn about Toronto neighborhoods through the lens of AirBnB listings. In the process of doing so, we will be answering some exploratory questions about these neighborhoods, for example:
- Which neighborhoods have the most expensive price per bed?
- Which neighborhoods have the highest number of listings?
- Which property types are the most listed on AirBnB in Toronto?
- What’s the average listing price by property type?
- What’s the average price per bed by property type?
- What does the occupancy rate look like by neighborhood?
From there, we will attempt to answer some specific questions, such as:
- What factors affect the price the most?
- Can the price be accurately predicted, given other information about the listing?
The reason for creating this project is to potentially build an application for helping Airbnb hosts price their listings in an easy and data-driven way.
Imagine being a new host on Airbnb: a tool that does the required analysis for you, covering both the neighborhood that you are in and information about the property, would get you up and running quickly.
Another reason is curiosity about the effects of applications such as AirBnB on the city’s rent prices, but that is outside the scope of this project.
Edit: The code has been removed from the blog post itself to make it more readable. If you would like a walk-through of the code, you can view the Github repo linked below for all the resources, or view the iPython notebook which accompanies this article.
In data science, a quick way to explore a dataset is to try and visualize some trends about major data points (i.e., features).
The data for this article can be found on the insideairbnb webpage.
Graph for price distribution (with outliers)
A quick glance shows that most prices are less than $700; however, we do have some outliers, such as listings priced as high as $12000 (but how?? who pays for this??!).
In the next graph, we will remove those outliers and plot the distribution of the price again.
Graph for price distribution (no outliers)
We can deal with outliers by calculating the standard deviation of all the price values. A common rule of thumb is to treat values more than three standard deviations away from the mean as outliers.
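The three-standard-deviations rule can be sketched in a few lines of pandas. This is a minimal illustration rather than the code from the repo: the `price` column name, the helper `remove_price_outliers`, and the toy data are all assumptions made for the example.

```python
import pandas as pd


def remove_price_outliers(listings: pd.DataFrame, n_std: float = 3.0) -> pd.DataFrame:
    """Keep rows whose price lies within n_std standard deviations of the mean."""
    mean = listings["price"].mean()
    std = listings["price"].std()
    within_range = (listings["price"] - mean).abs() <= n_std * std
    return listings[within_range]


# Toy data (not from the real dataset): 19 ordinary prices and one extreme listing.
listings = pd.DataFrame({"price": [100] * 19 + [12000]})
filtered = remove_price_outliers(listings)
```

Because a few extreme values also inflate the mean and the standard deviation themselves, a rule like this is best treated as a rough first pass.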
Map of AirBnB listing clusters in Toronto
We can also visually inspect clusters of listings on a map. It is clear that the highest concentration of listings is around the core of Downtown Toronto.
In the map below, the different colours represent different room types. The most common room types are:
- Entire Apartment / Home [orange]
- Private Room [green]
- Shared Room [blue — very few comparatively]
We will begin by asking some general questions to extract some insights from our dataset.
Which 5 neighbourhoods have the most expensive listing prices on average?
We can get the average listing price if we group by neighbourhood and then take the mean of the price column.
We have visualized the results for the top 40 neighbourhoods below:
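As a rough sketch of that grouping step, assuming a DataFrame with `neighbourhood` and `price` columns (both names, and the toy rows, are placeholders rather than the real dataset's schema):

```python
import pandas as pd

# Toy listings; the neighbourhood labels and prices are made up for illustration.
listings = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "C"],
    "price": [100.0, 200.0, 300.0, 500.0, 50.0],
})

# Average price per neighbourhood, most expensive first.
avg_price = (
    listings.groupby("neighbourhood")["price"]
    .mean()
    .sort_values(ascending=False)
)
top_5 = avg_price.head(5)
```

Using `.mean()` rather than `.sum()` matters here: a sum would mostly reward neighbourhoods that simply have more listings.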
Which 5 neighbourhoods have the highest price per bed?
We can also estimate a “price per bed” value which we derive by dividing the price by the number of beds.
Note that some units, such as bachelor or studio apartments, have “0 beds” listed. In those cases, we can assume 0 beds to mean 1 bed because the division above would not work otherwise. Alternatively, you can use a default value such as 0.1 instead of 1, keeping in mind that this makes the price per bed of those listings ten times higher than it would be with 1.
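One way to express the "0 beds means 1 bed" assumption is sketched below. The `price` and `beds` column names and the toy values are assumptions for the example, and missing bed counts are not handled.

```python
import pandas as pd

# Toy listings; the second row stands in for a studio listed with 0 beds.
listings = pd.DataFrame({
    "price": [120.0, 90.0, 300.0],
    "beds": [2, 0, 3],
})

# Treat "0 beds" as 1 bed so that the division is defined.
beds_for_division = listings["beds"].replace(0, 1)
listings["price_per_bed"] = listings["price"] / beds_for_division
```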
Results ordered by Average Price per Bed
We have visualized the results for the top 40 neighbourhoods below:
Which property types are the most listed on AirBnB in Toronto?
The graph below illustrates the percentage of each property type listing in Toronto on AirBnB. We have limited the property types to the top 10 as the other types are too infrequent (less than 0.5% of listings).
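The percentages behind a chart like this can be computed with `value_counts`. The sketch below assumes a `property_type` column and uses made-up rows purely for illustration.

```python
import pandas as pd

# Toy listings; the property types and their counts are made up.
listings = pd.DataFrame(
    {"property_type": ["Apartment"] * 6 + ["House"] * 3 + ["Loft"]}
)

# Share of listings per property type, in percent, most common first.
share = listings["property_type"].value_counts(normalize=True) * 100

# Keep only the ten most common types.
top_types = share.head(10)
```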
What’s the average listing price by property type?
What’s the average price per bed by property type?
Which 10 neighbourhoods have the highest number of listings?
How can we estimate the occupancy rate in a listing?
The occupancy rates are not provided in the dataset but can be estimated through a heuristic, which we will define below.
Let’s consider some of the relevant points available in the dataset:
- Reviews per month: insights into frequency of visits of the listing
- Minimum nights: indicator of minimum stay length, to be used with the number of monthly reviews
- Availability 365: an indicator of the total number of days the listing is available for during the year (i.e. if all the available days are rented out, then the listing’s occupancy rate is 100%)
A possibly naive formula for estimating the occupancy rate could be as follows:
occupancy = ((reviews per month * min. nights) * 12) / availability_365
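Note that the intended goal of the above formula is to give us a value between 0 and 1. However, in practice this will not work for some listings: the result exceeds 1 whenever the estimated number of booked nights is larger than availability_365, and it is undefined when availability_365 is 0.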
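The heuristic above could be written roughly as follows. The column names `reviews_per_month`, `minimum_nights` and `availability_365` are assumed, and so are the two choices for awkward listings: capping the estimate at 1 and leaving it missing when no days are available. Neither choice comes from the original analysis.

```python
import pandas as pd


def estimate_occupancy(listings: pd.DataFrame) -> pd.Series:
    """Naive occupancy estimate: (reviews per month * min. nights * 12) / availability."""
    booked_nights = listings["reviews_per_month"] * listings["minimum_nights"] * 12
    # Assumption: a listing with 0 available days gets no estimate (NaN).
    available = listings["availability_365"].where(listings["availability_365"] > 0)
    # Assumption: estimates above 100% are capped at 1.0.
    return (booked_nights / available).clip(upper=1.0)


# Toy rows: a typical listing, one whose estimate overshoots, and one with no availability.
listings = pd.DataFrame({
    "reviews_per_month": [2.0, 5.0, 1.0],
    "minimum_nights": [2, 3, 1],
    "availability_365": [240, 90, 0],
})
occupancy = estimate_occupancy(listings)
```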
Bigger Questions & Data Modeling
What factors affect the price the most?
To gain insights into factors which have a big influence on the price, we can build a Linear Regression model and then later inspect the coefficients of that model.
Below is the function used to split the data into 2 sets (train and test) and then fit the model with our training data.
Below are the top 5 influential factors in our linear model.
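Since the original code lives in the repo, here is only a minimal sketch of what such a function might look like with scikit-learn. The function name `fit_price_model`, the 70/30 split, the feature names and the synthetic data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def fit_price_model(features, prices, test_size=0.3, seed=42):
    """Split into train and test sets, then fit a linear model on the training set."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, prices, test_size=test_size, random_state=seed
    )
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model, X_test, y_test


# Synthetic data: price = 50 + 40 * beds - 30 * shared_room + noise.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "beds": rng.integers(1, 5, size=50),
    "shared_room": rng.integers(0, 2, size=50),
})
prices = 50 + 40 * features["beds"] - 30 * features["shared_room"] + rng.normal(0, 5, size=50)

model, X_test, y_test = fit_price_model(features, prices)
test_r2 = model.score(X_test, y_test)  # R2 on data the model has not seen
```

Scoring on the held-out test set, rather than on the training set, is what tells us how the model behaves on listings it has not seen.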
A clear factor is the neighbourhood which a listing is in. Another factor, which indicates lower prices, is having a shared room.
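Ranking factors by the size of their coefficients could look something like this. The helper `top_factors`, the feature names and the coefficient values are invented for the example and are not the results of the actual model.

```python
import pandas as pd


def top_factors(feature_names, coefficients, n=5):
    """Return the n coefficients with the largest absolute value, signs preserved."""
    coefs = pd.Series(coefficients, index=feature_names)
    order = coefs.abs().sort_values(ascending=False).index
    return coefs.reindex(order).head(n)


# Hypothetical coefficients, e.g. as read from a fitted model's coef_ attribute.
ranked = top_factors(["beds", "shared_room", "bathrooms"], [40.0, -30.0, 5.0], n=2)
```

A negative coefficient, as with the shared room here, points to lower prices. Coefficient sizes are only comparable when the features are on similar scales, so this kind of ranking should be read with some care.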
We can inspect the distributions of both the test set labels as well as our predictions to visualize how our model fitted the data.
This is the distribution plot for the true prices in our test set. We can see that the fitted line is not perfectly smooth.
The second graph is the distribution plot of the predicted prices (with Linear Regression).
We can see that our model overestimates the number of properties in the $100 to $200 range. The model also underestimates the number of higher-priced listings.
The R2 score can be interpreted in different ways; a score of 1, for example, could actually indicate that we over-fitted the model. Here’s a great article about R2 score cautions.
We can conclude that the XGBoost model fits the training data it has already seen well but does not generalize as well as expected; in other words, it is overfitting. This could be due to the choice of hyper-parameters.
The next step is to try various machine learning pipelines that could help us reduce the number of features we are dealing with or capture non-linear relationships in the data. An update will be posted in this article once the modeling process has been completed.
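A quick way to spot this kind of overfitting is to compare the R2 score on the training set with the score on the test set. The sketch below uses scikit-learn's `r2_score` on made-up prices and predictions. The numbers are not from the XGBoost model; they only illustrate the pattern.

```python
from sklearn.metrics import r2_score

# Made-up values: the predictions match the training prices exactly...
y_train_true = [100.0, 150.0, 200.0, 250.0]
y_train_pred = [100.0, 150.0, 200.0, 250.0]

# ...but miss noticeably on prices the model has not seen.
y_test_true = [120.0, 180.0, 240.0]
y_test_pred = [160.0, 150.0, 200.0]

train_r2 = r2_score(y_train_true, y_train_pred)
test_r2 = r2_score(y_test_true, y_test_pred)

# A large gap between the two scores is a typical sign of overfitting.
gap = train_r2 - test_r2
```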
Here’s the Github repo containing all the code discussed above as well as some snippets that were left out of the article.