Understand The Performance of Airbnb in Toronto by Using Python.

Understanding the Airbnb situation in Toronto by analyzing freely available data from Inside Airbnb.

Vinicius Porfirio Purgato

Published in

Analytics Vidhya

11 min read · Feb 9, 2021

Goal of the Project

Understand the popularity of Airbnb in Toronto.
Analyze the most expensive areas, most popular types of rooms, reviews and price/night average.
Show some of the Airbnb Legal Issues in Toronto.
Teach how to handle missing data and outliers.

Before we start…

If you want to see the full code and understand the methods I used to get my results, make sure to check the notebook for this project on Google Colab. Also, make sure to follow me on LinkedIn and GitHub.

Introduction to Airbnb

Airbnb is often described as the biggest hotel chain in the world, and the interesting part is that it owns no hotels at all!

Through an innovative way of finding accommodation, Airbnb connects people who want to travel (and need somewhere to stay) with hosts who want to rent out their places.

By the end of 2018, ten years after its foundation, Airbnb had already hosted over 300 million people from all over the world, challenging traditional hotel brands to reinvent themselves.

In addition to this, data about Airbnb listings is freely available on the internet. You can download it from the website Inside Airbnb (an independent project, not run by the company) and get access to data from some of the biggest cities in the world, which allows you to build countless projects and Data Science solutions.

Take a look at some of Airbnb’s numbers:

Intro to The Airbnb Community from Shirley Chen

By the way, the dataset used here is the “summarized” version. On the same page where we downloaded the file listing.csv, there is a longer version called listing.csv.gz with the full dataset.

Presenting Toronto

Photo by Author’s father, yes that’s me in the pic

Toronto is Canada’s most populated city. It is located on the shores of Lake Ontario in the province of Ontario, and it is home to the Toronto Raptors, the Toronto Blue Jays, Drake, the University of Toronto, and many other famous people and institutions. Toronto is a very ethnically diverse city with a healthy population of immigrants: almost half of its population is foreign-born, and you can find restaurants from all over the globe. Like any good big city, Toronto doesn’t stop; it is a 24/7 kind of place. The city is also known as the 6ix, a term popularized by Drake, who is from Toronto, in his songs.

The city also has two airports:

Toronto Pearson International Airport (YYZ).
Billy Bishop Toronto City Airport (YTZ).

With a population of 2,600,000 people in the city proper and 6,231,765 in the urban area, it is also the fifth-largest city in North America, after Mexico City, New York, Los Angeles, and Chicago. Toronto is also known as Hollywood North. Did you know that 25% of Hollywood movies are filmed in Toronto, either on set or in the streets? Cool, right? The tourism sector in Toronto is also promising. However, the city regulates short-term rentals such as Airbnb with the following rules:

Short-term rentals are permitted across the city in all housing types in residential and the residential component of mixed-use zones.
People can host short-term rentals in their principal residence only — both homeowners and tenants can participate.
People who live in secondary suites can also participate, as long as the secondary suite is their principal residence.
An entire home can be rented as a short-term rental if the owner/tenant is away — to a maximum of 180 nights per year.
To rent out a secondary residence, the rental must last more than 28 nights.

Given the city’s context, let’s start our analysis.

Obtaining the Data

One of the main reasons why Python is great is the number of libraries and packages it has. Today we will be using four libraries to analyze our dataset:

1. pandas - Used to manipulate our dataset.
2. matplotlib - Used to plot our histograms.
3. seaborn - Used to plot our heatmap.
4. plotly - Used to plot our interactive map.
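The loading step itself is not shown in this article, so here is a minimal sketch of how the setup could look. The helper `load_listings` and the in-memory `SAMPLE_CSV` are assumptions made purely for illustration (the rows are made up). In the real project you would pass the path of the file downloaded from Inside Airbnb instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: a few made-up rows with a
# subset of the real columns, so the sketch runs without any download.
SAMPLE_CSV = """id,name,neighbourhood_group,neighbourhood,room_type,price,minimum_nights
1,Cozy room,,Annex,Private room,60,2
2,Downtown condo,,Waterfront Communities-The Island,Entire home/apt,150,3
3,Shared loft,,Kensington-Chinatown,Shared room,35,1
"""


def load_listings(source):
    """Read a listings CSV from a path, URL, or file-like object."""
    return pd.read_csv(source)


# In the real project, `source` would be the path to the downloaded CSV.
df = load_listings(io.StringIO(SAMPLE_CSV))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each column
print(df.head())  # first rows of the dataset
```

Checking `shape`, `dtypes`, and `head()` right after loading is a quick way to confirm that the file was parsed as expected before any analysis starts.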

Variables dictionary

id - the id number generated to identify the place.
name - name of the listing.
host_id - the id number of the place’s host.
host_name - host’s name.
neighbourhood_group - this column has no valid data.
neighbourhood - neighbourhood’s name.
latitude - property’s latitude coordinate.
longitude - property’s longitude coordinate.
room_type - type of room offered.
price - price to rent the place.
minimum_nights - minimum number of nights required to book the place.
number_of_reviews - number of reviews the place has.
last_review - last review’s date.
reviews_per_month - number of reviews per month.
calculated_host_listings_count - number of properties the host has listed.
availability_365 - number of days available for booking in a year.

Missing Data and Outliers

As I took a look at our dataset, I realized there was missing data and even some huge outliers, so I had to clean it. To do that, I first checked the percentage of missing data in each column. Here is our output:

# sorting variables by their percentage of null values, in descending order
(df.isnull().sum() / df.shape[0]).sort_values(ascending=False) * 100
output:
neighbourhood_group               100.000000
reviews_per_month                  22.474678
last_review                        22.474678
host_name                           0.060224
name                                0.005475
availability_365                    0.000000
calculated_host_listings_count      0.000000
number_of_reviews                   0.000000
minimum_nights                      0.000000
price                               0.000000
room_type                           0.000000
longitude                           0.000000
latitude                            0.000000
neighbourhood                       0.000000
host_id                             0.000000
id                                  0.000000
dtype: float64

As you can see, neighbourhood_group has 100% of its values missing, and both reviews_per_month and last_review have around 22% of their values missing. host_name and name are each missing less than 0.1% of their values.
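One possible way to act on the missing-data table above is sketched below. It is not the approach used later in this article (which only drops neighbourhood_group). The sample rows are made up, and filling reviews_per_month with 0 rests on the assumption that a listing without reviews simply has zero reviews per month.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the pattern in the missing-data table above.
df = pd.DataFrame({
    "name": ["Cozy room", None, "Shared loft", "Big house"],
    "neighbourhood_group": [np.nan, np.nan, np.nan, np.nan],
    "number_of_reviews": [12, 0, 3, 0],
    "reviews_per_month": [0.8, np.nan, 0.2, np.nan],
})

# 1. Drop columns that contain no data at all.
cleaned = df.dropna(axis=1, how="all")

# 2. Assumption: a listing with no reviews has zero reviews per month,
#    so filling with 0 is reasonable for this column only.
cleaned = cleaned.assign(
    reviews_per_month=cleaned["reviews_per_month"].fillna(0)
)

# 3. Drop the few rows that lack a name.
cleaned = cleaned.dropna(subset=["name"])

# Every remaining column should now report zero missing values.
print(cleaned.isnull().sum())
```

Whether to fill or to drop depends on the column: filling makes sense when the missing value has an obvious meaning, while dropping is safer when only a handful of rows are affected.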

Okay, now that we know there are missing data, let’s look for outliers. We will be using two methods to do this. First of all let’s plot histograms:

# plot the histogram of the numerical variables
df.hist(bins=15, figsize=(15,10));

As we can see above, we most likely do have outliers. Pay attention to the minimum_nights histogram: some values go beyond the 180 nights per year that the City of Toronto allows for entire-home short-term rentals. Let’s check how we can get rid of them.

Finding the outliers

If you take a look at the distribution of the histograms, you may notice the existence of outliers in variables like price, minimum_nights and calculated_host_listings_count. These values don’t follow the distribution and they also mess up the whole graphical representation. There are two quick ways that help us identify the existence of outliers. These are:

Statistical summary using the .describe() function.
Plotting box plots for the variables.

# statistical summary of the variables
df[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']].describe()

By analyzing the statistical summary, we can affirm:

The variable price has 75% of its values below 149, yet its maximum value is 13000.
The maximum value of minimum_nights is greater than 365, that is, more than a full year.
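The observations above can also be turned into a numeric rule. The sketch below uses the common 1.5 × IQR rule, which is an alternative to the fixed cut-offs chosen later in this article, not the method used here. The function name `iqr_bounds` and the price values are made up for illustration.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) limits of the k * IQR outlier rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices with one extreme value, similar to the 13000 maximum above.
prices = pd.Series([50, 60, 75, 80, 90, 100, 120, 149, 13000], name="price")

lower, upper = iqr_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]

print("lower limit:", lower)
print("upper limit:", upper)
print(outliers)
```

The rule flags anything far outside the middle 50% of the data, so it adapts to each variable instead of relying on a hand-picked threshold.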

Boxplot for minimum_nights

You can see below how scattered some dots are from the majority. They certainly are outliers and might be messing with our results.

# importing matplotlib
import matplotlib.pyplot as plt

# minimum_nights box plot
df.minimum_nights.plot(kind='box', vert=False, figsize=(15,3))
plt.show()

# checking the number of values over 30 days in the minimum_nights column
print("minimum_nights: values over 30:")
print('{} entries'.format(len(df[df.minimum_nights > 30])))
print('{:.4f}%'.format((len(df[df.minimum_nights > 30]) / df.shape[0])*100))

547 minimum_nights values exceed 30 days, which represents 3.8351% of the values.
It’s important to remember that Toronto caps entire-home short-term rentals at 180 nights a year, but listings with very long minimum stays are a small minority. Most of the rentals don't require more than 30 nights.

Boxplot for price

Let’s have a look at how distributed the price is:

# price box plot
df.price.plot(kind='box', vert=False, figsize=(15,3))
plt.show()

# checking the number of values over 1500 in the price column
print("\nprice: values over 1500")
print("{} entries".format(len(df[df.price > 1500])))
print('{:.4f}%'.format((len(df[df.price > 1500]) / df.shape[0])*100))

Since we identified outliers both in price and minimum_nights let's clean our DataFrame and plot those histograms again.

# removing outliers in a new DataFrame
df_clean = df.copy()
df_clean.drop(df_clean[df_clean.price > 1500].index, axis=0, inplace=True)
df_clean.drop(df_clean[df_clean.minimum_nights > 30].index, axis=0, inplace=True)

# removing neighbourhood_group, because it's empty
df_clean.drop('neighbourhood_group', axis=1, inplace=True)

# plotting the histograms again
df_clean.hist(bins=15, figsize=(15,15));

There we go, now we have a clean DataFrame, so we can analyze the true data.

What’s the average renting price?

Now that we have a clean DataFrame, we can check our average booking price. To do this we can use the .mean() method, which returns the mean of all the values in the price column.

# average of the price column
df_clean.price.mean()
output: 124.63443072702331

There it is: we can see how the outliers affected our average. The statistical summary reported around CAN$138, but after cleaning it is around CAN$125.
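To make the effect of outliers on the mean concrete, here is a small sketch with made-up numbers. The helper `remove_outliers` is a hypothetical name. It applies the same cut-offs used above (price over 1500, minimum_nights over 30) as a boolean mask instead of in-place drops.

```python
import pandas as pd


def remove_outliers(df, max_price=1500, max_nights=30):
    """Keep only rows within the price and minimum_nights cut-offs."""
    mask = (df["price"] <= max_price) & (df["minimum_nights"] <= max_nights)
    return df.loc[mask].copy()


# Made-up listings: one extreme price and one extreme minimum stay.
df = pd.DataFrame({
    "price": [80, 100, 120, 150, 13000],
    "minimum_nights": [1, 2, 3, 365, 1],
})

df_clean = remove_outliers(df)

print("mean before cleaning:", df["price"].mean())
print("mean after cleaning: ", df_clean["price"].mean())
print("median before cleaning:", df["price"].median())
```

A single extreme price is enough to drag the mean far away from typical values, while the median barely moves. That is why it helps to look at both before trusting an average.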

What Are the Correlations Among the Variables?

Correlation means that two (or more) variables are connected in a certain way. We are looking for connections and similarities among them. These connections can be measured and the correlation coefficient will tell us how connected/related they are. In order to find connections, let’s…

make a correlation matrix.
plot a heatmap using the matrix data, by using the seaborn library.

# importing seaborn
import seaborn as sns

# making a correlation matrix
corr = df_clean[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month',
                 'calculated_host_listings_count', 'availability_365']].corr()

# plotting the heatmap with the matrix data
sns.heatmap(corr, cmap='RdBu', fmt='.2f', square=True, linecolor='White', annot=True);

With the RdBu colormap, the bluer a cell is, the higher the correlation coefficient, and the redder it gets, the lower the coefficient. There isn’t any major relationship in our DataFrame, except between reviews_per_month and number_of_reviews.

By the way, the 1.00 values on the diagonal of the heatmap don’t mean anything, since each variable is being compared to itself, for example availability_365 to availability_365.
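For readers curious about what the coefficient in each cell actually measures, the sketch below computes the Pearson correlation by hand and compares it with pandas. The function `pearson` and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up listings where reviews per month grows with the number of reviews.
df = pd.DataFrame({
    "number_of_reviews": [0, 5, 10, 20, 40],
    "reviews_per_month": [0.0, 0.4, 0.9, 1.5, 3.2],
})


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator


manual = pearson(df["number_of_reviews"], df["reviews_per_month"])
builtin = df["number_of_reviews"].corr(df["reviews_per_month"])

print("manual: ", manual)
print("pandas: ", builtin)
```

The coefficient always falls between -1 and 1. Values close to 1 mean the two variables rise together, values close to -1 mean one falls as the other rises, and values near 0 mean there is little linear relationship.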

What Type of Places are Rented the Most in Toronto?

The column room_type contains all the kinds of places you can rent on Airbnb, and there are quite a few options available. By using the method value_counts() we can check which is the most common type of Airbnb property for rent in Toronto.

# shows the number of listings of each type in the dataset
df_clean.room_type.value_counts()
output:
Entire home/apt    10934
Private room        6199
Shared room          305
Hotel room            58
Name: room_type, dtype: int64

# shows the percentage of each type of place in the dataset
df_clean.room_type.value_counts() / df_clean.shape[0] * 100
output:
Entire home/apt    62.494284
Private room       35.430956
Shared room         1.743256
Hotel room          0.331504
Name: room_type, dtype: float64

As our analysis shows, the most popular room type in Toronto is the entire home or apartment. Thus if you live in Toronto and want to rent your place, you should probably invest in whole-home/apartment rentals or maybe private rooms.
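As a side note, pandas can produce the percentages directly through the `normalize` parameter of `value_counts()`. The sketch below uses a made-up series of ten listings to show counts and percentages side by side.

```python
import pandas as pd

# Made-up sample of ten listings.
room_type = pd.Series(
    ["Entire home/apt"] * 6 + ["Private room"] * 3 + ["Shared room"],
    name="room_type",
)

# Absolute number of listings per room type.
counts = room_type.value_counts()

# normalize=True returns fractions that sum to 1; multiply by 100 for percent.
shares = room_type.value_counts(normalize=True) * 100

# Put both views side by side in one small table.
summary = pd.DataFrame({"count": counts, "percent": shares.round(1)})
print(summary)
```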

What’s the average of minimum nights in Toronto?

We can also check the average minimum number of nights required for booking in Toronto. To do this we can use the .mean() method. Since we are talking about days, I will round it to the nearest whole number.

# checking the mean of column 'minimum_nights'
print(round(df_clean['minimum_nights'].mean()))
output: 8

# checking the standard deviation of column 'minimum_nights'
df_clean['minimum_nights'].std()
output: 10.967971072381717

Isn't it weird to have an average minimum stay of 8 nights? It is. That's a lot of nights, over a week. That's why I checked the standard deviation of minimum_nights. It returned almost 11, which is pretty high. If you don't know statistics, the standard deviation is basically a measure of how dispersed a dataset is relative to its mean.

To deal with this, let's make a copy of df_clean called df_new, just so we don't mess up our original data, and then check the mean for values smaller than 10 nights:

df_new = df_clean.copy()
df_new = df_new[df_new['minimum_nights'] < 10]
df_new['minimum_nights'].mean()
output: 2.294784406831585

As you can see, once the longer stays are filtered out, the average minimum stay that hosts are asking for in Toronto is about 2 nights.
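An alternative to filtering by a hand-picked limit is to look at the median, which is much less sensitive to a few long stays. This is only a sketch with made-up values, not the approach taken in this article.

```python
import pandas as pd

# Made-up minimum stays: mostly short, with a few month-long requirements.
minimum_nights = pd.Series(
    [1, 1, 2, 2, 2, 3, 3, 28, 30, 30], name="minimum_nights"
)

mean_all = minimum_nights.mean()      # pulled up by the long stays
median_all = minimum_nights.median()  # middle value, robust to extremes
std_all = minimum_nights.std()        # sample standard deviation (ddof=1)

# Same filter as in the article: keep stays shorter than 10 nights.
mean_short = minimum_nights[minimum_nights < 10].mean()

print("mean:", mean_all)
print("median:", median_all)
print("standard deviation:", std_all)
print("mean of stays under 10 nights:", mean_short)
```

When the mean and the median disagree this much, it is usually a sign that a small group of unusual listings is shaping the average.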

What Is The Most Expensive Location in Toronto?

One way to check one variable against another is to use .groupby(). In this case, we want to compare neighbourhoods by rental price.

# check average price per neighbourhood
df_clean.groupby(['neighbourhood']).price.mean().sort_values(ascending=False)[:10]
output:
neighbourhood
Yonge-St.Clair                       225.000000
Kingsway South                       196.583333
Maple Leaf                           190.333333
Leaside-Bennington                   189.636364
Rosedale-Moore Park                  185.718182
Lawrence Park South                  179.886364
Waterfront Communities-The Island    170.668984
St.Andrew-Windfields                 169.782609
Forest Hill South                    166.000000
Etobicoke West Mall                  160.733333
Name: price, dtype: float64

In this case, Yonge-St. Clair (Deer Park) is the most expensive area in Toronto to book an Airbnb, followed by Kingsway South and Maple Leaf. Although these are affluent regions of Toronto and the ranking can be taken as accurate, non-representative data can sometimes be misleading. You have to take into consideration the number of places available for booking in each neighbourhood: the fewer there are, the more a single expensive listing can raise the mean. However, in this case, Yonge-St. Clair, Kingsway South, Etobicoke, Forest Hill South, and the others are genuinely expensive areas in Toronto. Therefore, we can trust this output.
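One way to check how many listings stand behind each average is to aggregate the mean and the count together. The sketch below uses made-up neighbourhood names and prices, and the `min_listings` threshold is an arbitrary assumption.

```python
import pandas as pd

# Made-up data: neighbourhood "A" has a single, very expensive listing.
df = pd.DataFrame({
    "neighbourhood": ["A", "B", "B", "B", "B", "C", "C"],
    "price": [400, 150, 160, 170, 180, 90, 110],
})

# Mean price and number of listings per neighbourhood, most expensive first.
summary = (
    df.groupby("neighbourhood")["price"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

# Arbitrary threshold: ignore neighbourhoods with very few listings.
min_listings = 3
reliable = summary[summary["count"] >= min_listings]

print(summary)
print(reliable)
```

Here "A" tops the ranking only because of one listing, so the count column makes it easy to tell a real pattern from a small-sample effect.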

Distribution Map

By the way, since we are provided with latitude and longitude, we can plot a distribution map. Let x=longitude and y=latitude. The redder a point is, the more expensive the listing.
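The code for this first map is not included in the article, so the following is only a sketch of how such a plot could be produced with the pandas plotting interface (which uses matplotlib under the hood). The coordinates and prices are made up, and the "Reds" colormap, point size, and transparency are assumptions.

```python
import pandas as pd

# Made-up listings around Toronto.
df = pd.DataFrame({
    "latitude": [43.65, 43.66, 43.70, 43.64, 43.77],
    "longitude": [-79.38, -79.40, -79.42, -79.37, -79.26],
    "price": [150, 90, 220, 130, 60],
})

# Scatter plot with x=longitude and y=latitude; the colour encodes the price,
# so darker red points are more expensive listings.
ax = df.plot(
    kind="scatter",
    x="longitude",
    y="latitude",
    c="price",
    cmap="Reds",
    alpha=0.6,
    s=20,
    figsize=(12, 8),
)
ax.set_title("Listings by location, coloured by price")

# In a notebook the figure is displayed automatically; in a script, call
# matplotlib.pyplot.show() afterwards.
```

With thousands of real listings, a low `alpha` helps dense areas stand out without hiding individual points.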

Now let’s plot an interactive map with all the locations available in our dataset. By the way, this next map has no filter for price, it is just showing the distribution in Toronto. To plot the map I will import the library plotly.

# importing plotly
import plotly.express as px

# plotting our map
fig = px.scatter_mapbox(df_clean, lat="latitude", lon="longitude", hover_name="name",
                        color_discrete_sequence=["red"], zoom=10, height=500)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Plot by Author

As you can see, Toronto is big and full of places available for booking.

Conclusion

As we come to an end, we learned that data isn’t always perfect. We had lots of missing values and outliers that had to be cleaned, since they made our results and outputs differ a lot from reality.

We also learned that sometimes the outputs are misleading and non-representative, causing the data to be distorted.

Even though we had great insights, this is just a summarized version of the real dataset. In order to truly explore the data, it would be great to have the full version of this csv which holds way more variables and attributes.

By the way, Toronto is a great city and I highly encourage you to visit. Don’t forget to buy some timmies and to have fun, eh?

Did you like this article?

If you did like this article, please follow me on LinkedIn and GitHub, and if you have any questions, don’t hesitate to reach out.