For many of our potential guests, planning a trip starts at the search engine. At Airbnb, we want our product to be painless to find for past guests, and easy to discover for new ones. Search engine optimization (SEO) is the process of improving our site, and more specifically our landing pages, to ensure that when a traveller looks for accommodations for their next trip, Airbnb is one of the top results on their favorite search engine.
Search engines such as Google, Yahoo, Naver, and Baidu deploy their own fleet of “bots” across the internet to build a map of the web and scrape information, or “index”, from the pages that they hit. When indexing pages and ranking them for specific search queries, search engines take into account a variety of factors, including relevance, site performance, and authority. To improve our rankings we can make changes to pages, such as clarifying the purpose of our content (relevance), improving page load time (performance), or increasing the number of quality links that point to it (authority). These examples only scratch the surface of how we can optimize our pages to improve rankings.
For example, in late 2017 we created a new landing page, referred to internally as “Magic Carpet”, to replace our normal search results landing page. The page featured a large header with an image & search box, along with extra content such as reviews & listings beneath.
We hypothesized that this landing page would increase relevance with its clearer content, and decrease page load time with its lighter code structure, among many other improvements. This would consequently lead to a higher ranking for our page on search engine results.
But since we can’t know the exact ranking of our pages, we rely on traffic as a proxy for an increase in rankings. That is, when our ranking for our San Francisco search page jumps, we’d expect to see an increase in the traffic to this page from search engines. But how can we measure this effect?
Limitations of A/B Testing
Our Growth team leans heavily on iterative experimentation for nearly every product change to make sure that we can measure effectiveness and to learn as we build. Most data scientists are able to leverage a traditional A/B test at the device- or user-level for all of their experimentation needs. In this setup, users that enter the experiment are randomly bucketed into the treatment groups, and we can directly compare the outcome of the treatment group with that of the control group.
A/B tests have very good power and allow for complete randomization. At Airbnb, they are useful for measuring the treatment effect on metrics pertaining to engagement and conversion. These are events that we log on the Airbnb site, and we can easily measure the incremental uplift in these metrics through a difference-in-means hypothesis test, such as a t-test.
However, in the case of our new “Magic Carpet” page, an A/B test will not allow us to measure an increase in traffic caused by a change in its external search engine ranking. With visitor-level bucketing, a given page would look different from one search engine bot visit to the next, so we cannot isolate the effect of Magic Carpet on our rankings.
Therefore, quantifying the impact of such a product change requires a more sophisticated approach.
Leveraging a Market-Level Approach
A key realization is that our search results page isn’t just a single page; in fact, there are many different versions for different cities, towns, and regions. Each of these has what is called a unique “canonical URL”, and we actually have over 100,000 of them that are surfaced on search engines! Therefore, instead of assigning a single visitor to treatment or control, we can set the unit of randomization of our experiment to be a specific canonical URL. We’ll then measure the effect using an approach commonly used in market- or cluster-level experiments.
For example, the San Francisco search results page may be in treatment, and we’d update it with our Magic Carpet design. Meanwhile, the Paris page in the control group will be left the same. This random assignment will then continue to be applied to each of our ~100,000 URLs. This way, when a search engine bot scrapes our site, it will consistently see the same treatment for each page, and the ranking for that page will update accordingly.
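The post does not say how the URL-level randomization was implemented. As a minimal sketch, assuming a hash-based scheme, each canonical URL can be mapped to a stable bucket so that a crawler always sees the same variant for a given page. The function name `assign_group`, the salt string, and the example URLs are hypothetical.

```python
import hashlib

def assign_group(canonical_url: str,
                 experiment_salt: str = "magic_carpet_demo",
                 treatment_share: float = 0.5) -> str:
    """Deterministically map a canonical URL to 'treatment' or 'control'."""
    key = f"{experiment_salt}:{canonical_url}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Read the first 8 hex digits as an integer in [0, 2**32) and scale to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical canonical URLs, for illustration only.
urls = [f"/s/city-{k}" for k in range(10_000)]
groups = {url: assign_group(url) for url in urls}
treated_share = sum(g == "treatment" for g in groups.values()) / len(urls)
print(f"share of URLs in treatment: {treated_share:.3f}")
```

Because the bucket depends only on the salt and the URL, repeated requests for the same page return the same variant, and changing the salt reshuffles the assignment for a new experiment.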
However, a more nuanced statistical approach needs to be taken in this case, because we cannot simply make a direct comparison between the traffic of treatment and control URLs. This is because the baseline traffic between different URLs can differ substantially, and many times this difference is larger than the treatment effect we wish to detect. The San Francisco page may have a similar amount of traffic as the Paris page, but it probably has something like 100% more traffic than the page of a smaller city, such as New Orleans. This makes it very difficult to measure something like a 2% lift in traffic!
For this reason, we need a mechanism to account for the inherent differences that exist between these URLs by leveraging pre-experiment data, before the change took effect.
Developing a Model: Difference-in-Differences
A difference-in-differences framework is one technique that utilizes pre-experiment data to control for these baseline differences in the absence of any interventions. We can use this method to measure the treatment effect and its statistical significance by using an estimator from a linear model, where for each page i and day t:
Our principal variables include:
- traffic_it = number of landing page impressions of page i on day t. We apply a log to this outcome variable to account for its right-skew, and to reduce the heteroskedasticity commonly present in traffic data.
- treatmentᵢ = treatment group indicator (equal to 1 if in the treatment group, 0 otherwise)
- post_t = pre/post-period indicator (equal to 1 if in the post-period, 0 otherwise)
However, there is still a lot of variation across time and markets in our traffic data that may hinder us from detecting a treatment effect. The difference-in-differences approach allows for an elegant solution to this problem; we can simply add in covariates to our model in order to control for the various effects:
- aᵢ = fixed effect (or mean) for a page, to allow for flexible intercepts of each URL
- t = time index, to account for overall time trends
- dowⱼ = weekday indicators, to account for weekly seasonality
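Putting the variables and covariates above together, one specification consistent with these definitions is sketched below. Only b₂ is named in the text; the other coefficient labels (b₀, b₁, cⱼ) and the error term ε are assumed here for illustration, and the main effect of treatmentᵢ is taken to be absorbed by the page fixed effect aᵢ.

```latex
\log(\text{traffic}_{it}) = a_i + b_0\, t + b_1\, \text{post}_t
  + b_2\, (\text{treatment}_i \times \text{post}_t)
  + \sum_{j} c_j\, \text{dow}_{j,t} + \varepsilon_{it}
```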
Since we’d like to know the effect of the treatment group in the post-period, the b₂ coefficient reflects the “difference-in-differences”, and therefore the treatment effect that we wish to estimate.
In simpler terms, we’re looking for the incremental effect on the treatment group after the experiment started.
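To make the idea concrete, here is a minimal sketch of the simplest two-group, two-period version of the estimator on log traffic, using only the Python standard library. It omits the fixed effects, time trend, and weekday covariates described above, and the page names and traffic numbers are made up. In this balanced case the result matches the interaction coefficient of a regression on treatment, post, and their product.

```python
import math
from statistics import mean

def did_estimate(rows):
    """2x2 difference-in-differences on log traffic.

    rows: iterable of (treated: bool, post: bool, traffic: float).
    """
    cells = {(tr, po): [] for tr in (False, True) for po in (False, True)}
    for treated, post, traffic in rows:
        cells[(treated, post)].append(math.log(traffic))
    m = {cell: mean(values) for cell, values in cells.items()}
    treated_change = m[(True, True)] - m[(True, False)]
    control_change = m[(False, True)] - m[(False, False)]
    return treated_change - control_change

# Toy pages (hypothetical): (name, treated, baseline daily traffic).
pages = [("big_city_a", True, 100_000), ("small_town_a", True, 500),
         ("big_city_b", False, 90_000), ("small_town_b", False, 700)]

rows = []
for _name, treated, base in pages:
    for day in range(28):
        post = day >= 14                 # second half is the post-period
        traffic = float(base)
        if post:
            traffic *= 1.10              # seasonal growth shared by every page
            if treated:
                traffic *= 1.03          # true treatment lift of 3%
        rows.append((treated, post, traffic))

estimate = did_estimate(rows)
naive = (mean(math.log(t) for tr, po, t in rows if tr and po)
         - mean(math.log(t) for tr, po, t in rows if not tr and po))
print(f"naive post-period gap in log traffic: {naive:.3f}")
print(f"difference-in-differences estimate:   {estimate:.4f}")
```

In this toy data the naive post-period gap mostly reflects baseline differences between pages, while the difference-in-differences estimate recovers log(1.03), the lift that was injected.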
However, it is a common trap to overstate statistical significance when analyzing time series data in a difference-in-differences framework. This is because without any correction of our standard errors, we’re basically assuming that each additional day of traffic data for a given page is independent from the previous traffic information that we have already collected. This assumption is faulty, since we expect the traffic to have a high serial correlation within a specific market over time.
Therefore, to lower our Type I error, in our model we cluster the standard errors at the URL-level to correct for this serial correlation, where our variance-covariance matrix for our model coefficients is calculated as:
Where nᵤ is the number of URLs, and eᵢ is the raw residual for the iᵗʰ observation. When traffic is correlated within canonical URLs across days, clustering causes the standard errors of our coefficients to increase. We therefore become effectively stricter in our criterion for declaring an experiment significant. As a result, we can be more confident that experiments with statistically significant estimates are true positives.
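For reference, the cluster-robust estimator described above is commonly written in the following "sandwich" form. This is a sketch of the standard textbook expression rather than a reproduction of the original formula: the cluster index j, the per-URL score vector u_j, and the regressor vector x_i are notation assumed here, and any finite-sample correction factor is omitted.

```latex
\hat{V}_{\text{cluster}} = (X^\top X)^{-1}
  \left( \sum_{j=1}^{n_u} u_j u_j^\top \right)
  (X^\top X)^{-1},
\qquad
u_j = \sum_{i \,\in\, \text{URL } j} e_i\, x_i
```

Here X is the design matrix and x_i is the column vector of regressors for observation i, so each URL contributes a single summed score instead of one independent term per day.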
Before we launch an experiment, it’s important that we understand our statistical power. Since we mainly care about the b₂ estimator, we are essentially carrying out the hypothesis test H₀: b₂ = 0 against H₁: b₂ ≠ 0, where our power is defined as the probability of rejecting H₀ when H₁ is true: power = P(reject H₀ | H₁ is true).
In other words, we’d like to know the probability of being able to detect a treatment effect in the case that there truly is one. If the power of our experiment is very low, the experiment may be useless, because we would be unlikely to detect an effect even if one exists.
There are a number of ways to estimate power, one of the most common being a simulation-based estimation.
Using historical traffic data, we can run a set of simulations in which we randomly assign canonical URLs to treatment & control, and apply varying levels of traffic lifts to the treatment group in a predefined time period. We can then run our model on these data and see how many times we can detect the effect to a specific degree of statistical significance.
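The simulation loop described above can be sketched as follows. This is an illustration under stated assumptions, not the model used in the post: synthetic traffic stands in for historical data, and instead of refitting the full regression with clustered errors, each URL is collapsed to a single pre-to-post change in mean log traffic and the two groups are compared with a z-test. All function names and parameter values are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def simulate_history(n_urls, n_days, rng):
    """Synthetic stand-in for historical traffic: per-URL baseline times daily noise."""
    history = []
    for _ in range(n_urls):
        base = math.exp(rng.gauss(6.0, 1.5))          # widely varying baselines
        history.append([base * math.exp(rng.gauss(0.0, 0.10)) for _ in range(n_days)])
    return history

def url_changes(history, n_pre):
    """Per-URL change in mean log traffic from the pre window to the post window."""
    out = []
    for series in history:
        logs = [math.log(x) for x in series]
        out.append(mean(logs[n_pre:]) - mean(logs[:n_pre]))
    return out

def detects_effect(changes, lift, rng, z_crit=1.96):
    """One simulated experiment: random URL assignment, injected lift, z-test."""
    treated, control = [], []
    for change in changes:
        if rng.random() < 0.5:
            # Multiplying post-period traffic by (1 + lift) adds log(1 + lift).
            treated.append(change + math.log1p(lift))
        else:
            control.append(change)
    diff = mean(treated) - mean(control)
    se = math.sqrt(stdev(treated) ** 2 / len(treated)
                   + stdev(control) ** 2 / len(control))
    return abs(diff / se) > z_crit

def estimate_power(changes, lift, n_sims, rng):
    """Share of simulated experiments in which the lift is detected."""
    return sum(detects_effect(changes, lift, rng) for _ in range(n_sims)) / n_sims

rng = random.Random(0)
history = simulate_history(n_urls=400, n_days=28, rng=rng)
changes = url_changes(history, n_pre=14)
for lift in (0.0, 0.01, 0.02, 0.04):
    print(f"lift={lift:.0%}  estimated power={estimate_power(changes, lift, 200, rng):.2f}")
```

With a lift of zero, the detection rate approximates the false positive rate of the test, which is a useful sanity check before reading the power at nonzero lifts.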
Using these simulated results, we can then plot how many times the model measured a statistically significant difference across different treatment effects:
Given that we’d ideally have at least 80% power, our experiment likely has enough power in scenarios in which the treatment effect is around 2% or greater. This is quite a granular detectable difference, and given that we expect Magic Carpet’s effect to be on the order of multiple percentage points, we conclude that this model has enough power for us to run a full-fledged URL-level experiment.
Launching the Experiment
Once we set up our model with the appropriate assumptions and asserted that we had sufficient power to run a test, we launched the Magic Carpet experiment and randomly released the new design to half of our landing pages. The test lasted three weeks, in which we saw a visible lift in traffic:
When we ran our difference-in-differences model, we found that there was in fact a statistically significant positive result:
Because we applied a log transformation to our outcome traffic variable, we can interpret our coefficient in terms of a percentage: a b₂ coefficient of 0.0346 means that Magic Carpet resulted in an (e^(0.0346) - 1) = 3.52% increase in traffic. This may not seem large at first, but consider that this equates to tens of millions of additional visitors a day! After this testing period, we decided to launch Magic Carpet to 100% of our search results landing pages, and for the last year we have continuously iterated on designs using this same experimentation framework.
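The conversion from a log-scale coefficient to a percentage lift is a one-line calculation. The helper name `coef_to_lift` below is hypothetical; the coefficient value is the one reported above.

```python
import math

def coef_to_lift(coef: float) -> float:
    """Convert a coefficient on a log outcome into a proportional change."""
    return math.expm1(coef)   # e**coef - 1, computed accurately for small coef

b2 = 0.0346                   # coefficient reported in the post
lift = coef_to_lift(b2)
print(f"{lift:.2%}")          # prints 3.52%
```

For small coefficients the lift is close to the coefficient itself, which is why 0.0346 maps to roughly 3.5%.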
Approaching our SEO landing page experiments through a market-level framework has proven to be very useful for measuring the effectiveness of changes to our product in terms of search engine rankings. In fact, we were able to scale this framework using our open-sourced Airflow scheduler to automate the analysis of over 20 experiments, ranging from sweeping design changes to small HTML tweaks.
Yet there is always room for improvement. Investing in tracking our exact search engine rankings would allow for a more granular outcome variable in our model, rather than using traffic as a proxy. In addition, there are plenty of other models utilized in market-level experiments, such as synthetic controls, that could be considered beyond our difference-in-differences approach.
However, regardless of the exact model used, the same key lessons apply. When making inferences on a given treatment effect, we must always make sure that (1) our experimentation model has the correct assumptions baked in, especially when failing to do so could inflate our Type I error rate (false positives), and (2) the test has sufficient power to detect a treatment effect, so that our Type II error rate (false negatives) is not too high. When such frameworks are used correctly, they can be powerful tools for measurement.
Interested in building more impactful & interesting experimentation frameworks? We’re always looking for talented data scientists to join our team!
Special thanks to Lilei Xu, who developed this original framework & helped review, and to Robert Chang who provided great feedback and guidance throughout the review process.