Analyzing Seattle Airbnb Datasets

Agarwal Animesh
Jul 1, 2020


In this post, I analyze the Airbnb Seattle datasets available on Kaggle. Three CSV files are provided:

  • listings.csv: This dataset has all the Airbnb listings in Seattle and the details corresponding to each listing.
  • reviews.csv: This dataset has all the reviews written for each listing.
  • calendar.csv: This dataset has the day-wise availability and price of each listing for the year 2016.

I will be analyzing the datasets to answer three pertinent questions.

  1. Seasonal trends of price/occupancy in Seattle: What are the most/least crowded months for visiting Seattle? Which months have the most/least expensive listings?
  2. Neighbourhood trends in Seattle: Which neighbourhoods have the most listings? Which neighbourhoods are the most/least expensive? How does the availability in different neighbourhoods vary over the course of a year?
  3. Can a predictive model be constructed for predicting the price of a listing from the different attributes in the datasets? How good is the predictive model? Which features are most indicative of the price of a listing?

Seasonal trends of price/occupancy in Seattle

We can use the calendar as well as the reviews data to look into the seasonal trends of listing prices and occupancies in Seattle. We first look at the total number of comments in each month, which serves as an indirect proxy for the number of people visiting Seattle in different seasons. Below is the distribution of the total number of comments received across the year.

Total number of comments per month

The plot shows that review activity, and by proxy occupancy, is relatively low from January to May. It starts rising in June and peaks in August.
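The monthly counts behind a plot like this can be computed with a few lines of pandas. The sketch below is a minimal version, not the notebook's actual code. It assumes reviews.csv has a date column holding the review date, and the function name comments_per_month is made up for illustration. Reviews from all years are pooled by calendar month.

```python
import pandas as pd


def comments_per_month(reviews: pd.DataFrame) -> pd.Series:
    """Count reviews per calendar month (1-12), pooled over all years.

    Assumes a 'date' column that pandas can parse as a date.
    """
    months = pd.to_datetime(reviews["date"]).dt.month
    counts = months.value_counts().sort_index()
    # Make sure every month appears, even if it received no reviews.
    return counts.reindex(range(1, 13), fill_value=0).rename("n_comments")
```

With the real file, comments_per_month(pd.read_csv("reviews.csv")).plot(kind="bar") should draw a comparable bar chart, provided matplotlib is installed.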

Since the calendar data has the availability as well as the price of every listing for every day in 2016, we use it to construct a time series of the mean occupancy and mean price for each day of the year, from which the seasonal patterns can be deduced. Below is the resulting time series.

Time Series showing the occupancy and price

This result contradicts our previous finding that January has low occupancy. In fact, occupancy is quite high in January and gradually decreases until April. One reason could be that prices are at their lowest of the year in January 2016. Occupancy increases from April onwards and peaks in August/September, when prices are also at their highest of the year. After September, both occupancy and prices decrease. The high occupancy during the summer corroborates our previous finding based on the reviews dataset. However, the information available to us does not explain why occupancy is so high in January 2016. This could be specific to 2016, and more data is needed to understand it.
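A daily series of this kind can be built by grouping the calendar data by date. The sketch below is one possible way to do it, under some assumptions about calendar.csv: a date column, an available column holding "t" or "f", and a price column stored as text such as "$1,250.00". It also assumes that a listing marked unavailable is occupied, which may overstate occupancy if hosts block dates for other reasons. The function name is hypothetical.

```python
import pandas as pd


def daily_occupancy_and_price(calendar: pd.DataFrame) -> pd.DataFrame:
    """Mean occupancy and mean listed price for each date."""
    cal = calendar.copy()
    cal["date"] = pd.to_datetime(cal["date"])
    # Assumption: available == "f" means the listing is booked that day.
    cal["occupied"] = (cal["available"] == "f").astype(float)
    # Strip "$" and "," so "$1,250.00" becomes 1250.0; missing prices become NaN.
    cal["price"] = pd.to_numeric(
        cal["price"].astype(str).str.replace(r"[$,]", "", regex=True),
        errors="coerce",
    )
    # NaN prices are skipped by mean(), so mean_price covers priced rows only.
    return cal.groupby("date").agg(
        occupancy=("occupied", "mean"),
        mean_price=("price", "mean"),
    )
```

If prices are missing on days when a listing is unavailable, the mean price on a given day reflects only the listings that were still open for booking.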

Neighbourhood trends in Seattle

We use the listings data and its neighbourhood_group_cleansed column to gain insight into the different neighbourhoods and how availability and price vary across the year. Below are the mean and standard deviation of the price for each neighbourhood.

From the analysis, Magnolia, Queen Anne, and Downtown are the three most expensive neighbourhoods to rent in, while Lake City, Rainier Valley, and Northgate are the three cheapest.
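The per-neighbourhood statistics can be obtained with a single groupby. This is a minimal sketch rather than the notebook's code; it assumes the price column in listings.csv is stored as text with a dollar sign and thousands separators, and price_by_neighbourhood is a name chosen for illustration.

```python
import pandas as pd


def price_by_neighbourhood(listings: pd.DataFrame) -> pd.DataFrame:
    """Mean, standard deviation and count of price per neighbourhood group."""
    # Convert "$1,250.00"-style strings to floats; unparseable values become NaN.
    price = pd.to_numeric(
        listings["price"].astype(str).str.replace(r"[$,]", "", regex=True),
        errors="coerce",
    )
    stats = price.groupby(listings["neighbourhood_group_cleansed"]).agg(
        ["mean", "std", "count"]
    )
    # Most expensive neighbourhoods first.
    return stats.sort_values("mean", ascending=False)
```

Including the count alongside the mean helps when reading the ranking, since a neighbourhood with only a handful of listings can look expensive or cheap by chance.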

We now combine the listings dataset with the calendar dataset to look into neighbourhood occupancy and prices at different times of the year, since the calendar dataset stores a date for every row. We merge the two datasets using the pandas merge function. Below is a heat map showing the occupancy in different neighbourhoods across different months.

It can be seen from the heat map above that during the peak season (July-September), people prefer to stay in neighbourhoods like Cascade, Seward Park, Downtown, and Queen Anne, while places like University District, Magnolia, and Delridge have similar occupancies throughout the year. Interbay has the lowest occupancy throughout the year. Next, we repeat the same analysis for the price trends.

The heat map above shows that prices in Beacon Hill, Delridge, Northgate, and Lake City remain low throughout the year, while prices in Downtown, Queen Anne, and Magnolia are quite high and peak around the summer.
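The merge and the month-by-neighbourhood tables behind the two heat maps can be sketched as follows. This is an illustration under assumptions, not the notebook's code: it assumes calendar.csv links to listings.csv through listing_id and id, that available is "t" or "f", that an unavailable day counts as occupied, and that price is stored as text. The function name and its value parameter are hypothetical.

```python
import pandas as pd


def monthly_by_neighbourhood(listings: pd.DataFrame,
                             calendar: pd.DataFrame,
                             value: str = "occupied") -> pd.DataFrame:
    """Neighbourhood x month table of mean occupancy or mean price."""
    cal = calendar.copy()
    cal["month"] = pd.to_datetime(cal["date"]).dt.month
    # Assumption: available == "f" means the listing is booked that day.
    cal["occupied"] = (cal["available"] == "f").astype(float)
    cal["price"] = pd.to_numeric(
        cal["price"].astype(str).str.replace(r"[$,]", "", regex=True),
        errors="coerce",
    )
    # Bring in only the neighbourhood column to avoid clashing column names.
    hoods = listings[["id", "neighbourhood_group_cleansed"]]
    merged = cal.merge(hoods, left_on="listing_id", right_on="id", how="inner")
    return merged.pivot_table(
        index="neighbourhood_group_cleansed",
        columns="month",
        values=value,
        aggfunc="mean",
    )
```

Passing value="price" gives the price table instead of the occupancy table. Either result can be handed to a plotting function such as seaborn's heatmap to draw the figure.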

Predictive Modelling

If we could predict the price of a listing from its features, visitors could use the prediction to filter over-priced properties out of their search. To this end, we use linear regression and random forest regression models to predict the price of a listing from its attributes. Some of the features used in developing the model are: ‘host_response_time’, ‘host_response_rate’, ‘neighbourhood_group_cleansed’, ‘property_type’, ‘cleaning_fee’, ‘guests_included’, ‘extra_people’, ‘review_scores_location’, ‘review_scores_value’, ‘instant_bookable’, ‘cancellation_policy’, ‘amenities’, etc. We pre-process the data by filling the missing values in each column with the mean of that column and converting the categorical variables to dummy variables, and then divide the data into training and test sets.
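The pre-processing steps described above might look like the sketch below. It is a simplified illustration, not the notebook's code: it assumes the price and other numeric columns have already been converted from text to numbers, it does not handle list-like columns such as ‘amenities’, and the function names, the 30% test size and the random seed are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_features(df: pd.DataFrame, target: str = "price"):
    """Mean-fill numeric columns and one-hot encode categorical ones."""
    # Rows without a price cannot be used for supervised training.
    df = df.dropna(subset=[target])
    y = df[target]
    X = df.drop(columns=[target]).copy()
    # Fill missing numeric values with the mean of their column.
    num_cols = X.select_dtypes(include="number").columns
    X[num_cols] = X[num_cols].fillna(X[num_cols].mean())
    # Convert categorical (text) columns to dummy variables.
    X = pd.get_dummies(X)
    return X, y


def split_train_test(X, y, test_size=0.3, seed=42):
    """Hold out a share of the rows for testing (size and seed are assumptions)."""
    return train_test_split(X, y, test_size=test_size, random_state=seed)
```

One design point worth noting: filling with column means before splitting lets the test rows influence the fill values. Computing the means on the training rows only and reusing them for the test rows would avoid that small leak.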

We first fit a linear regression model to the training data. The results are shown below, along with the features that have the largest coefficients.

Mean squared error for test data 3197.486875
R2 test data 0.606042

We then fit a random forest regression model to the training data. Below are the results of the random forest regression model as well as the important features extracted from the fitted model.

Mean squared error for training data 2620.798935
R2 training data 0.678375
Mean squared error for test data 3338.404526
R2 test data 0.588679
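Both fits follow the same pattern: train on the training set, then report the mean squared error and R2 on the training and test sets. The sketch below shows that pattern with scikit-learn. It is an illustration only; the helper names evaluate and top_features are hypothetical, and the hyperparameters used in the original notebook are not stated in this post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model, then return MSE and R2 for the training and test sets."""
    model.fit(X_train, y_train)
    scores = {}
    for name, X, y in [("training", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X)
        scores[name] = {"mse": mean_squared_error(y, pred),
                        "r2": r2_score(y, pred)}
    return scores


def top_features(model, feature_names, n=10):
    """Rank features by importance (forest) or absolute coefficient (linear)."""
    if hasattr(model, "feature_importances_"):
        weights = model.feature_importances_
    else:
        # Coefficient size depends on feature scale, so this is only a rough guide.
        weights = np.abs(model.coef_)
    ranked = pd.Series(weights, index=feature_names).sort_values(ascending=False)
    return ranked.head(n)
```

For example, evaluate(LinearRegression(), X_train, X_test, y_train, y_test) and evaluate(RandomForestRegressor(random_state=0), X_train, X_test, y_train, y_test) would produce the two sets of scores, where the random_state value is an arbitrary choice made for reproducibility.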

From the analysis above, both models reach an R2 score of only about 0.6 on the test data (0.61 for linear regression and 0.59 for the random forest), and the random forest scores just 0.68 on the training data. A model that fits even its own training data this poorly is underfitting. As far as feature importance is concerned, the number of bedrooms, the number of bathrooms, and the number of people that a listing can accommodate are three of the most important predictors. To reduce the underfitting, several things can be addressed:

  1. Use bigger datasets with more listings.
  2. Try more complex models.
  3. Use feature selection methods to reduce the features.
  4. Invest more time in feature engineering.

The notebook with all the plots and analysis can be found here.
