Airbnb in Boston and Seattle

Juan-Carlos Lopez
13 min read · Apr 20, 2019


Availability, Pricing, and Reviews

This is a summary of a quantitative research project using public Airbnb data for the cities of Boston and Seattle. My analysis relies on six datasets, three analogous files for each city:

  • Listings: home-level information about Airbnb listings.
  • Reviews: review-level information, including comments.
  • Calendar: daily price and availability data per listing. The Boston data covers 2016-09-06 to 2017-09-05; the Seattle data covers 2016-01-04 to 2017-01-02.
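
For reference, here is a minimal sketch of how these six files could be loaded with pandas; it assumes a per-city folder layout and the file names (listings.csv, reviews.csv, calendar.csv) used in the public Boston and Seattle releases.

```python
import pandas as pd

# Assumed layout: one folder per city, each holding the three public CSV files
cities = ["boston", "seattle"]
data = {}

for city in cities:
    data[city] = {
        # Home-level information about each listing
        "listings": pd.read_csv(f"{city}/listings.csv"),
        # Review-level information, including free-text comments
        "reviews": pd.read_csv(f"{city}/reviews.csv"),
        # One row per listing per date: availability flag and posted price
        "calendar": pd.read_csv(f"{city}/calendar.csv", parse_dates=["date"]),
    }
```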

I used my own background knowledge and intuitions concerning Airbnb, as well as the broader hospitality industry, to come up with three main questions that could be investigated using these datasets:

How does the average availability of Airbnb homes vary over time in each city? How does it compare between cities?

Is there a clear seasonal pattern for the average nightly rate on Airbnb in Boston and Seattle? Is there a clear weekly pattern?

Is there a clear association between a listing’s nightly rate and the number of guest reviews the home has received?

Question 1:
How does the average availability of homes vary over time?

Our calendar datasets include a variable for the availability of a home per date, over the sample timeframe of 365 days. That is, for a given date an Airbnb home can be listed as either available or unavailable. With this variable, we can build a sample distribution, for each city, of available nights per year.
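
As a rough sketch, the per-listing count of available nights can be computed from the calendar table; this assumes the columns listing_id, date, and an available flag coded as 't'/'f', as in the public calendar files.

```python
def available_nights_per_listing(calendar):
    """Count how many of the ~365 sampled nights each listing is available."""
    is_available = calendar["available"].eq("t")
    return is_available.groupby(calendar["listing_id"]).sum()

# One sample distribution per city, ready to histogram
nights_boston = available_nights_per_listing(data["boston"]["calendar"])
nights_seattle = available_nights_per_listing(data["seattle"]["calendar"])
```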

Distribution of home availability in a year

Excluding both tails of the histograms above (near 0 nights and 365 nights of availability), the distribution of available nights per year looks similar between Boston and Seattle: below approximately 300 nights per year the distribution is fairly uniform, and the counts increase rapidly after that. In other words, a large share of Airbnb homes is recorded as available on at least 300 nights of the year in both Boston (~34%) and Seattle (~52%).

Despite these similarities in the middle of the histograms, there are significant differences between the cities if we look at the extremes. For Boston, we clearly see a dominant spike in the distribution near 0 nights of availability; whereas for Seattle, the dominant spike is near 365 nights.

My best guess is that the differences in the tails are a product of different approaches to sampling the Airbnb data — home listings and dates — between the two cities, which would introduce a statistical issue called selection bias. In other words, I have a difficult time coming up with a hypothesis which would explain why these differences are a true characteristic of the underlying population of Airbnb listings in these two cities.

A plausible scenario consistent with selection bias is that all the Seattle listings were active on Airbnb during the sampled year (such that hosts were actually managing the availability of their homes on the platform), while a large group of Boston homes only became active listings on the platform after the Boston sample was collected.

Average availability of homes over time

The histograms above provide a snapshot of a year of availability per home in their respective city. But in order to understand the temporal behavior of availability in a city, we can aggregate the data by date and analyze these daily statistics over time.

Proportion of available homes per night

This figure includes plots for the proportion of available Airbnb homes per night. The statistics are calculated separately for each city and plotted against the calendar date.
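
A sketch of the daily aggregation behind this figure, under the same column assumptions as before:

```python
def availability_by_date(calendar):
    """Proportion of listings flagged as available on each calendar date."""
    is_available = calendar["available"].eq("t")
    return is_available.groupby(calendar["date"]).mean()

avail_boston = availability_by_date(data["boston"]["calendar"])
avail_seattle = availability_by_date(data["seattle"]["calendar"])
```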

In the first three months of their corresponding samples, Boston and Seattle show a similar ramp-up pattern in the proportion of available homes per night. After that, the proportions are mostly stable, with some small variation.

The similarity of the patterns can be seen even more clearly if we plot the proportions of available homes against the day-of-sample; that is, from 0 to 365.

Proportion of available homes per day-of-sample

The plots follow closely similar trends in the two cities, but they stabilize at very different values, with Seattle's availability being significantly higher. This is consistent with what I noted in the previous section concerning selection bias. My guess is that this gap between the proportions is a product of different approaches to sampling between the two cities, as opposed to Seattle and Boston inherently having such different levels of availability in the long run.

I presume that the datasets were backfilled with dates from before the homes were actually listed on Airbnb, which would make sense if the objective were to have tidy panel data with the same dates for all homes. But this approach would produce an artificially low proportion of available homes, which would then ramp up over time.

My final observation is that the proportion of available homes does not seem to follow a clear seasonal pattern, although it is plausible that some of the small dips in availability are associated with high-demand dates, such as Spring Break and the start of Summer.

Question 2:
What are the seasonal and weekly patterns of average nightly rates?

The calendar datasets also include a variable for price per night, per home, over the sample timeframe of 365 days. For a given date, if the home is listed as unavailable then no price is posted; if the home is listed as available, a nightly rate is posted. With this variable, we can build a variety of time-series plots for average nightly rates in Boston and Seattle. We will start by plotting the average rates against the calendar date.
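
A sketch of how the daily average-rate series could be built, assuming prices are stored as strings such as "$85.00" and are missing for unavailable nights (so they drop out of the mean automatically):

```python
def mean_rate_by_date(calendar):
    """Average posted nightly rate per calendar date, over available homes only."""
    price = (
        calendar["price"]
        .str.replace(r"[$,]", "", regex=True)  # "$1,250.00" -> "1250.00"
        .astype(float)                          # unavailable nights stay NaN
    )
    return price.groupby(calendar["date"]).mean()

rate_boston = mean_rate_by_date(data["boston"]["calendar"])
rate_seattle = mean_rate_by_date(data["seattle"]["calendar"])
```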

Average nightly rate vs date

The plots show that the average nightly rate is a fairly noisy statistic in both cities, with significant day-to-day variation. Since the Seattle time series coincides with a calendar year, it is easier to identify its seasonal pattern of higher prices in the Summer.

The seasonal pattern should become clearer when we look at rates by time-of-year; that is, by disregarding the year and plotting the average rate from January 1st to December 31st.
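
Reusing the daily series above, the day-of-year view is a simple re-aggregation; this sketch just drops the year before averaging.

```python
# Group each daily series by day-of-year (1..366) and average across years
rate_boston_doy = rate_boston.groupby(rate_boston.index.dayofyear).mean()
rate_seattle_doy = rate_seattle.groupby(rate_seattle.index.dayofyear).mean()
```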

Average nightly rate vs day-of-year

Comparing the last two figures, I propose the following plausible insights concerning the average nightly rates:

  1. Average rates increase independently of the seasonal pattern. Thus, I would guess that if the sample covered a longer timeframe, we would see that the average rates in these two cities were increasing year-over-year.
  2. The seasonal pattern of the average rate is analogous in Boston and Seattle; that is, peaking in the early Summer months and bottoming out in late Winter.
  3. The spike in the Boston average rate that happens in September is not a product of real conditions in the city. This high and noisy portion of the plot occurs at the beginning of the sample, when few homes were available (as seen in the previous section) and therefore few prices were posted. Thus the spike and erratic pattern are probably due to the small sample size.
  4. The other noticeable spike in the Boston average rate happens in April 2017. This spike is not a product of a smaller sample of available homes, and it is not present in Seattle, so the pattern is likely a consequence of some real event that affected lodging conditions in the city.

Airbnb rate spike in Boston — April of 2017

For a few days in April of 2017, the average nightly rate in Boston spiked from approximately $190/night to $235/night. This spike is significant, and there is no analogous pattern in Seattle.

Boston average rates in April 2017

The plot shows that the average rate spiked approximately from 2017-04-10 to 2017-04-18. Since the 2017 Boston Marathon took place on Monday, April 17, I think it is safe to say that this event caused a sharp increase in demand on Airbnb, which in turn caused the spike in nightly rates.

Within-week variation in nightly rates

Based on the time-series plots in the previous section (especially the issue concerning the small sample of available homes in Boston for a portion of the timeframe), I decided to focus on the Seattle data in order to investigate within-week variation in average rates.

A simple way to assess whether the nightly rates are consistent with my own intuitions about the lodging market — that rates are substantially higher on weekends — is to calculate the average rate per day-of-week for the Airbnb homes in Seattle:

╔═════════════╦══════════════╗
║ Day of week ║ Average rate ║
╠═════════════╬══════════════╣
║ Monday      ║ $135.68      ║
║ Tuesday     ║ $135.41      ║
║ Wednesday   ║ $135.45      ║
║ Thursday    ║ $136.48      ║
║ Friday      ║ $143.04      ║
║ Saturday    ║ $143.20      ║
║ Sunday      ║ $136.46      ║
╚═════════════╩══════════════╝
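
A sketch of how a table like this could be computed from the Seattle calendar, reusing the price parsing from the earlier snippet; the weekday names come from pandas' datetime accessor.

```python
cal = data["seattle"]["calendar"].copy()
cal["price_num"] = cal["price"].str.replace(r"[$,]", "", regex=True).astype(float)
cal["day_of_week"] = cal["date"].dt.day_name()

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
rate_by_dow = (
    cal.groupby("day_of_week")["price_num"]
       .mean()
       .reindex(weekday_order)   # keep Monday-to-Sunday ordering
       .round(2)
)
print(rate_by_dow)
```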

Based on these simple averages, we can arrive at the following plausible insights:

  1. Rates peak on weekend nights: Fridays and Saturdays. The average weekend rate is about 6% higher than the average rate from Mondays to Wednesdays.
  2. The lowest rates are posted for Monday, Tuesday, and Wednesday.
  3. Slightly higher rates are posted for Thursdays and Sundays.

Let’s close this section by taking a second look at the time-series of average rates in Seattle, but this time we will see the effect of separating the weekend rates (Friday-Saturday) from the weekday rates (Sunday-Thursday).

Average nightly rates in Seattle

These plots show that the noisy average rate in Seattle can be smoothed out significantly by separating the data into one plot for Friday and Saturday nights and another for the remaining nights (Sunday through Thursday). As noted before, the average weekend rate remains approximately 6% higher than the average weeknight rate throughout the entire sample.
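
Continuing from the previous snippet, a sketch of the Friday/Saturday versus Sunday-Thursday split behind these plots:

```python
# dt.dayofweek: Monday=0 ... Sunday=6, so Friday=4 and Saturday=5
is_weekend = cal["date"].dt.dayofweek.isin([4, 5])

weekend_rate = cal[is_weekend].groupby("date")["price_num"].mean()
weekday_rate = cal[~is_weekend].groupby("date")["price_num"].mean()
```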

Question 3:
Is there a strong association between the nightly rate and the number of guest reviews received?

In this section, we will look into the association between the nightly rates and the number of reviews a host has received on Airbnb. My preexisting ideas about reviews on Airbnb were the following:

Neutral guest experiences are less likely to end in a review.

Bad guest experiences are likely to end in a negative review, but hosts that receive too many of those have a hard time renting their places and end up leaving Airbnb.

Good guest experiences are likely to end in a positive review, and hosts with many positive reviews are successful and stay on Airbnb. Thus, more reviews tend to be a sign of quality hosting.

Given these preconceptions about reviews, my working hypothesis about the association between rates and reviews was that the more reviews a host has received, the higher the rate they are able to charge.

Distribution of reviews amongst listings on Airbnb

Before jumping into the analysis of nightly rates, let’s look at the histograms of reviews-per-listing in the sample. These provide a useful visualization of how reviews are distributed amongst Airbnb listings.

Distribution of reviews-per-listing

The sample distribution of reviews-per-listing is highly skewed in both cities, with most listings having no or very few reviews, but with a long tail of listings that have a great many reviews.

In fact, to have a better visualization of the majority of the listings, I cut off the x-axes in the figure above to only show 90% of the listings. The remaining 10% of listings were spread out approximately between 60 and 400+ reviews.
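
A sketch of these histograms, assuming the listings table carries a number_of_reviews column as in the public files; the x-axis is cut at the 90th percentile so the bulk of the listings remains visible.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, city in zip(axes, ["boston", "seattle"]):
    reviews = data[city]["listings"]["number_of_reviews"]
    cutoff = reviews.quantile(0.90)          # show ~90% of listings
    ax.hist(reviews[reviews <= cutoff], bins=30)
    ax.set_title(city.capitalize())
    ax.set_xlabel("Number of reviews")
axes[0].set_ylabel("Number of listings")
plt.show()
```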

Quantifying the association between nightly rates and number of reviews

In order to quantify the association between the nightly rate and the number of reviews corresponding to a particular Airbnb listing, I had to impose some sort of conceptual structure on the underlying relationship between these two characteristics or features. This conceptual structure is typically referred to as the model.

Many assumptions or premises go into building a model, but some are extremely important to note explicitly. For instance, my analysis does not incorporate the content of reviews. In other words, there is no distinction between positive, neutral, and negative reviews — all reviews are treated equally. A more complete analysis of the relationship between prices and reviews ought to look into the content of the reviews in order to assess how different types of reviews are associated with the nightly rates.

Another extremely important conceptual assumption of the model is known as the functional form. For this analysis, I rely on the commonly used multivariate linear regression model to quantify the relationship between rates and number of reviews. Roughly speaking, this family of statistical models can be used here to estimate the percentage by which the average nightly rate varies with the number of reviews, while holding constant other variables that are relevant but not of primary interest.

1) Nightly rate vs. Number of reviews

The first estimate for the simple association between rates and number of reviews comes from a naive model. I call this version of the model “naive” because it does not account for characteristics of a home — size, location, etc. — which in my opinion would clearly be associated with rates on Airbnb.

The naive estimate indicates that, on average, every additional guest review is associated with a decrease in listing price of 0.17%. This would imply that, for example, if an average listing starts with no reviews and a nightly rate of $100, after its first review the host would post a new rate of $99.83.
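
The article does not show the exact specification, but one plausible sketch is a log-price regression estimated with statsmodels, in which the coefficient on number_of_reviews reads roughly as the percent change in the nightly rate per additional review. The price and number_of_reviews columns are assumed to come from the public listings files, shown here for Seattle.

```python
import numpy as np
import statsmodels.formula.api as smf

df = data["seattle"]["listings"].copy()
df["price_num"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
df["log_price"] = np.log(df["price_num"])

# Naive model: log nightly rate regressed on the number of reviews only
naive = smf.ols("log_price ~ number_of_reviews", data=df).fit()
# A coefficient near -0.0017 would correspond to roughly a 0.17% decrease per review
print(naive.params["number_of_reviews"])
```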

This finding is counterintuitive to me, given my preconception that reviews tend to be positive and associated with good quality of Airbnb homes.

2) Nightly rate vs. (Number of reviews + Home Size + Neighborhood)

Building on the naive model, I used the available data related to home size/capacity and neighborhood to obtain a new estimate that accounts for how homes of different sizes and locations likely post different average nightly rates, independently of their number of reviews.

Adding this data for home size and location produces an estimate that is about half of that of the naive model: on average, every additional guest review is associated with a decrease in listing price of 0.09%.
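
As a sketch of this richer specification, home size and location could enter the same formula; here accommodates stands in for size and neighbourhood_cleansed for location (both are columns in the public listings files), with C() turning the neighborhood into dummy variables.

```python
extended = smf.ols(
    "log_price ~ number_of_reviews + accommodates + C(neighbourhood_cleansed)",
    data=df,
).fit()
print(extended.params["number_of_reviews"])
```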

3) Nightly rate vs. (No. of reviews + Size + Neighborhood + Day of year)

The final version of the model builds on the previous one — which used home size and location data — by attempting to capture some of the variation in nightly rates which is seasonal or dependent on the day of the week.

But adding a time dimension to the model only slightly changes the previous estimate; that is, on average, every additional guest review is associated with a decrease in listing price of 0.08%. So it looks like my simple addition of time-trends is mostly irrelevant.

The tiny change in the estimate after adding time-trends to the model seems counterintuitive, especially after what we saw in the discussion of Question 2 above: that average nightly rates follow clear within-week and seasonal trends. However, given the Airbnb data we have, there is a reasonable statistical explanation of why adding time-trends did not make a huge difference.

In the data we have, the number of reviews is a constant number for each home, which means that we do not observe how each host modifies their own rate as they receive more reviews. This implies that, roughly speaking, the estimate of 0.08% is driven entirely by average differences in rates across many homes with different numbers of reviews, as opposed to being driven by differences in nightly rates that would occur as each home receives reviews on different days of the week and times of the year.

Synthesis

The empirical or statistical findings in this project were interesting and even surprising to me, especially as they relate to my preconceived notions about Airbnb and the broader lodging markets. Some of my preconceptions were not consistent with the statistical observations — I wrongly presumed listings with more reviews would post higher nightly rates — while some others were indeed consistent — for instance, average nightly rates follow seasonal patterns and are highest on Fridays and Saturdays.

An extremely important takeaway of this project should be that the data does NOT speak for itself. In each step of the analysis, we had to establish very important conceptual constraints (both domain-specific and statistical) that allowed us to draw business and engineering insights from the datasets.

For instance, consider this figure, which includes the plots of average nightly rates in both Boston and Seattle. The Seattle rates follow a pattern that makes business sense: consistent within-week behavior and a broad, modest peak in the Summer.

In contrast, the average rates in Boston follow an analogous general pattern but also show some significant differences: the within-week behavior is basically the same as in Seattle, but there are two dominant spikes in the nightly rate (one in April and one in September) that muddy the broader seasonal pattern.

In my analysis I argued that the spike in April was due to a real-world event, the 2017 Boston Marathon, which caused a significant increase in demand for Airbnb homes. On the other hand, I argued that the peak in September does not represent a real spike in demand and rates, but it is a result of the average rate being calculated with a small sample of available homes in Boston during the first few months of the sample. This distinction between a real-world demand and rate spike versus the issue of a nonrepresentative sample average can only be made confidently if one has knowledge about the underlying business domain and the data collection.

Finally, we saw that in our sample having more guest reviews is associated with slightly lower nightly rates. I am highly skeptical that this "finding" reflects a real causal link between rates and reviews on Airbnb. In other words, the simple associations I was able to quantify should be tested further using more robust business research strategies, such as prospective experimentation or richer analyses of existing data.

Acknowledgments
