Analyzing the AirBnB Dataset for trends using Data Visualizations and Modeling
In this post, I will analyze the AirBnB dataset for Seattle using visualizations and learning models, following the CRISP-DM process.
I want to answer the following questions from my data exploration:
- What distinguishes hosts that have Superhost status? Do all Superhosts actually meet the criteria that AirBnB has set for them?
- What time of the year are AirBnBs most popular in Seattle? Are specific holiday seasons more popular?
- What are the most important characteristics of a listing, and how do they influence price?
These questions will help us build a Business Understanding of the domain.
To help us understand the data, let’s load in the dataset and explore its characteristics through visualizations.
Exploring the Data
We will first look at cross-plots between a sample of variables in the Listings dataset. We can see that many of the variables are well correlated, such as `price` and `reviews per month`, or `number of reviews` and `count of host listings`.
To better understand how the attributes in Listings are correlated, we plot a correlation matrix. We see that many attributes are strongly correlated, such as the physical features of a house (e.g. bathrooms, beds) or the different review scores.
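As a rough sketch, the cross-plots and correlation plot above could be produced along these lines (the filename and column names are assumptions based on the public Seattle listings file):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

listings = pd.read_csv("listings.csv")  # assumed filename

# price arrives as a string like "$85.00"; use a temporary numeric copy for plotting
sample = listings[["reviews_per_month", "number_of_reviews",
                   "host_listings_count", "bathrooms", "beds"]].copy()
sample["price"] = pd.to_numeric(listings["price"].str.replace(r"[$,]", "", regex=True))

# Cross-plots between the sampled variables
sns.pairplot(sample.dropna())

# Correlation plot over all numeric attributes
corr = listings.select_dtypes(include="number").corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```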
We can also explore specific attributes to understand the distribution of their values. The cross-plot above groups all price values together, so we should look at price separately, along with the categorical features that weren’t plotted.
The plots show that price is roughly normally distributed, and that some neighborhoods have far more houses listed than others, with a similar imbalance across room types. The last plot shows that entire homes are much more likely to be listed on AirBnB than shared ones.
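A minimal sketch of these distribution plots, continuing from the previous snippet (column names are again assumptions):

```python
fig, axes = plt.subplots(1, 3, figsize=(18, 4))

# Price distribution, using the numeric copy created above
sns.histplot(sample["price"].dropna(), ax=axes[0])
axes[0].set_title("Price")

# Listings per neighbourhood and per room type
listings["neighbourhood_group_cleansed"].value_counts().plot(kind="bar", ax=axes[1])
axes[1].set_title("Neighbourhood")
listings["room_type"].value_counts().plot(kind="bar", ax=axes[2])
axes[2].set_title("Room type")
plt.tight_layout()
plt.show()
```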
Preparing the Data
To answer our questions, we need to prepare the dataset: handle missing values by removing or imputing them, join the datasets together, and encode categorical columns.
The first step in our process is to remove all columns with more than 50% missing values; it would be difficult to impute these, since most of the attribute values would be guesses. Next, we handle the attributes with more than 30% missing values individually. These are `Host Response Rate`, `Host Response Time`, `Notes`, `Access`, and `Transit`. The last three are freeform text values; since we will not be doing any natural language processing, we can drop them.
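A minimal sketch of this missing-value handling, continuing with the `listings` DataFrame (the text column names are assumptions):

```python
# Fraction of missing values per column
missing = listings.isnull().mean()

# Drop all columns that are more than 50% missing
listings = listings.drop(columns=missing[missing > 0.5].index)

# Of the columns between 30% and 50% missing, drop the freeform text ones;
# host_response_rate and host_response_time are kept and handled later
text_cols = ["notes", "access", "transit"]
listings = listings.drop(columns=[c for c in text_cols if c in listings.columns])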
Next, we look at all categorical attributes. The categorical variables are varied in type, as we see below. There are freeform texts, date types, booleans, urls, arrays, and currencies. However, all of these are typed as String types. To handle cleaning of categorical attributes, we will look at them individually.
We are going to handle each of these categorical cases individually (a sketch of these transformations follows the list):
- Date attributes — factorize dates into a numbered sequence (`pd.factorize`)
- Boolean attributes — replace with an integer representation, 0 or 1 (`Series.str.replace`)
- Array attributes — encode with dummy variables for each unique value (`Series.str.get_dummies`)
- Currency attributes — trim the `$` and `,` symbols from values and store them as numbers (`Series.str.replace`)
- Full-text attributes — for columns with a small set of unique values, replace with dummy variables (`pd.get_dummies`)
- Remaining full-text attributes — these columns all contain natural language, which we have no need for, so we drop them (`DataFrame.drop`)
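Here is a rough sketch of these transformations, continuing with the `listings` DataFrame; the column lists are illustrative assumptions rather than the exact set used in the notebook:

```python
import pandas as pd

# Date attributes -> integer sequence
for col in ["host_since", "first_review", "last_review"]:
    listings[col] = pd.factorize(listings[col])[0]

# Boolean attributes stored as "t"/"f" -> 1/0
for col in ["host_is_superhost", "instant_bookable"]:
    listings[col] = listings[col].str.replace("t", "1").str.replace("f", "0").astype(float)

# Array attributes (e.g. amenities) -> one dummy column per unique value
amenity_dummies = listings["amenities"].str.strip("{}").str.get_dummies(sep=",")
listings = pd.concat([listings.drop(columns="amenities"), amenity_dummies], axis=1)

# Currency attributes -> trim "$" and "," and convert to numbers
for col in ["price", "security_deposit", "cleaning_fee", "extra_people"]:
    listings[col] = pd.to_numeric(listings[col].str.replace(r"[$,]", "", regex=True))

# Short categorical text -> dummy variables; long free text -> dropped
listings = pd.get_dummies(listings, columns=["room_type", "property_type", "bed_type"])
listings = listings.drop(columns=["summary", "description", "space"], errors="ignore")
```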
Finally, we take care of the numerical attributes. Since we have already handled all features that had more than 30% missing values, the remaining attributes have relatively few values missing; analysis showed that most are within 5%. We handle these by replacing the missing values with the corresponding column’s mode (`Series.mode`).
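For example (a sketch, continuing from above):

```python
# Impute remaining numeric columns with their mode (most frequent value)
for col in listings.select_dtypes(include="number").columns:
    if listings[col].isnull().any():
        listings[col] = listings[col].fillna(listings[col].mode()[0])
```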
Now that we have a cleaned dataset, we can explore it and see how some of the attributes correlate. This is a good time to look at the relationship between two features: Host Response Rate and Review Scores Rating. According to AirBnB’s official guidelines on their website, these are used to determine whether a host becomes a Superhost (along with some more specific criteria, for which we don’t have the data to verify). We can plot them together to see how they correlate:
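A sketch of such a plot, assuming `host_response_rate` is still stored as a percentage string like "95%" (the exact column handling may differ from the original notebook):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rate = pd.to_numeric(listings["host_response_rate"].str.rstrip("%"), errors="coerce")

sns.scatterplot(x=rate, y=listings["review_scores_rating"],
                hue=listings["host_is_superhost"])
plt.xlabel("Host response rate (%)")
plt.ylabel("Review scores rating")
plt.show()
```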
The results are really interesting. We see a large cluster of points at the extreme end of both axes, showing that hosts with a high response rate are more likely to have a high review rating. We also see that only a small number of hosts in that cluster qualify as Superhosts, showing that the status is very restrictive. Non-Superhosts in the cluster probably fail some of the other requirements, such as maintaining zero cancellations.
However, we also see Superhosts who do not meet the requirements for the status, specifically the response rate. This indicates that the criteria might not be strictly enforced, and that it is possible to be a Superhost even when you don’t qualify under the official rules. This answers our first question.
Next, we look at the second dataset, Calendar. Cleaning it is a simple three-step process: 1. factorize dates into integers, 2. replace the boolean ‘available’ column with 0 and 1, and 3. remove symbols from price and convert it to a number.
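A minimal sketch of these three steps (the filename is assumed from the Seattle dataset):

```python
import pandas as pd

calendar = pd.read_csv("calendar.csv")

# 1. Parse dates, keeping a datetime copy for the holiday join below,
#    plus a factorized integer version
calendar["date"] = pd.to_datetime(calendar["date"])
calendar["date_factorized"] = pd.factorize(calendar["date"])[0]

# 2. Boolean 'available' column ("t"/"f") -> 1/0
calendar["available"] = calendar["available"].map({"t": 1, "f": 0})

# 3. Strip "$" and "," from price and convert to a number
calendar["price"] = pd.to_numeric(
    calendar["price"].str.replace(r"[$,]", "", regex=True))
```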
Let’s go one step further and add more information to the Calendar dataset. Since all rows are tagged with a date, we can append a column indicating whether a holiday falls on that day. For this, we can use the Federal Holidays dataset.
Because holidays are rare events, the majority of days will not be tagged with one at all. There are two issues to address here. First, many people reserve AirBnBs close to a holiday rather than only on the day itself, so my hypothesis is that prices are likely to increase on days that fall near a holiday; to capture this, we tag the five days before and after each holiday with the event as well. Second, days without any event get a tag indicating that.
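As a sketch, pandas’ built-in US federal holiday calendar can stand in for the Federal Holidays dataset:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Federal holidays (with names) over the calendar's date range
holidays = USFederalHolidayCalendar().holidays(start=calendar["date"].min(),
                                               end=calendar["date"].max(),
                                               return_name=True)

# Tag each holiday plus the 5 days before and after it; everything else is "No Holiday"
calendar["holiday"] = "No Holiday"
for day, name in holidays.items():
    window = (calendar["date"] >= day - pd.Timedelta(days=5)) & \
             (calendar["date"] <= day + pd.Timedelta(days=5))
    calendar.loc[window, "holiday"] = name
```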
Let’s explore this expanded Calendar dataset. We can first group it by date and see how likely houses are to be reserved on each day.
By grouping this dataset by holiday, we can see how prices range around those events. There are some interesting trends here. First, because the dataset is heavily weighted toward “No Holiday”, its price range is more varied. However, we also see that certain holidays influence prices, such as a general increase near Christmas, New Year’s, and MLK Jr. Day. We see another increase near Independence Day, and a dip near less celebrated holidays, like Columbus Day. This answers our second question: on certain days and seasons of the year, prices in Seattle are generally higher.
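Both groupings can be done with a couple of `groupby` calls, for example:

```python
# Share of listings that are reserved (not available) on each day
reserved_by_date = 1 - calendar.groupby("date")["available"].mean()

# Price statistics per holiday tag
price_by_holiday = calendar.groupby("holiday")["price"].describe()
print(price_by_holiday[["mean", "std", "min", "max"]])
```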
Modeling Data
With cleaned and processed data, we can now move on to the next step in the CRISP-DM process, Modeling. Let’s define our problem statement, what we are trying to predict, and how we intend to achieve it.
Problem Statement: Which attributes influence price, and how relevant are they to it?
What we are predicting: Price for a listing.
How we can achieve this: Since the target is numerical, we can use Regression, and because we have a large number of attributes, we can use dimensionality reduction to find the most relevant attributes.
First, we divide the data into features X and target y, and perform a train-test split. We then fit a model and predict values with it. For regression, we will use RandomForestRegressor, an ensemble technique: with the large number of attributes we have, averaging over many trees gives a more stable prediction than a single model.
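A minimal sketch of this step; the split proportion and hyperparameters are assumptions, not necessarily what the original notebook used:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# price is the target; everything else in the cleaned listings frame is a feature
X = listings.drop(columns="price")
y = listings["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```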
For training and testing, we get Mean Squared Errors of 53.441 and 139.667, respectively. Let’s see if we can reduce the error using dimensionality reduction with Principal Component Analysis.
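A sketch of the PCA step, continuing from the previous block; standardizing the features first is an assumption on my part, since PCA is scale-sensitive:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=100).fit(scaler.transform(X_train))

X_train_pca = pca.transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))

model_pca = RandomForestRegressor(n_estimators=100, random_state=42)
model_pca.fit(X_train_pca, y_train)

print("Train MSE:", mean_squared_error(y_train, model_pca.predict(X_train_pca)))
print("Test MSE:", mean_squared_error(y_test, model_pca.predict(X_test_pca)))
```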
Results
For training, we get an MSE of 35.735, and for testing, 137.302. After trying a range of component counts, 100 principal components gives the best MSE. 100 features is still a significant reduction from the original 912 features we obtained after cleaning. We can use the results of Principal Component Analysis to look at which attributes contribute the most information.
The explained variance of the principal components can be obtained with `pca.explained_variance_ratio_[:3]`. For the first three components, these are:
- 0.357
- 0.239
- 0.226
Let’s look at these three principal components in detail. For principal component 1, we would get these by:
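A minimal sketch, assuming `pca` and the feature matrix `X` from the previous steps:

```python
import pandas as pd

# Features with the largest absolute loadings on the first principal component;
# use pca.components_[1] and pca.components_[2] for the 2nd and 3rd components
loadings = pd.Series(pca.components_[0], index=X.columns)
print(loadings.abs().sort_values(ascending=False).head(10))
```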
We can see that the most important features that influence price are `host_since`, `maximum_nights`, `security_deposit`, and `availability`. The 2nd and 3rd principal components show contributions from attributes like `host_identity`, `amenities`, and `calculated_host_listings_count`. This answers our third question about which characteristics of a listing best indicate price.
Deploying
Finally, as we complete our project, the finished code has been posted on GitHub here. The project can be run as a Jupyter Notebook. The ReadMe file provides more information on installation and running.
Through our data analysis, visualizations, and modeling, we answered three questions about the AirBnB dataset.
- The Superhost requirements published by AirBnB are not always strictly enforced, with some hosts holding the status even without meeting all the criteria.
- Average prices change with the season, and holidays influence prices as well; prices increase around Christmas and New Year’s as availability goes down.
- The characteristics that most influence a listing’s price are how long the host has been active, the listing’s availability, the amenities provided, the maximum nights allowed, and the security deposit. These qualities help distinguish low-priced listings from high-priced ones.
These findings provide a lot of interesting insights into the world of AirBnB hosting.
Now two questions remain: how well do these findings apply to other cities, such as yours?
And how can YOU use this information to monetize your listings better?