By Jan Overgoor
Airbnb is an online two-sided marketplace that matches people who rent out their homes (‘hosts’) with people who are looking for a place to stay (‘guests’). We use controlled experiments to learn and make decisions at every step of product development, from design to algorithms. They are equally important in shaping the user experience.
While the basic principles behind controlled experiments are relatively straightforward, using experiments in a complex online ecosystem like Airbnb during fast-paced product development can lead to a number of common pitfalls. Some, like stopping an experiment too soon, are relevant to most experiments. Others, like the issue of introducing bias on a marketplace level, start becoming relevant for a more specialized application like Airbnb. We hope that by sharing the pitfalls we’ve experienced and learned to avoid, we can help you to design and conduct better, more reliable experiments for your own application.
Experiments provide a clean and simple way to make causal inference. It’s often surprisingly hard to tell the impact of something you do by simply doing it and seeing what happens, as illustrated in Figure 1.
The outside world often has a much larger effect on metrics than product changes do. Users can behave very differently depending on the day of week, the time of year, the weather (especially in the case of a travel company like Airbnb), or whether they learned about the website through an online ad or found the site organically. Controlled experiments isolate the impact of the product change while controlling for the aforementioned external factors. In Figure 2, you can see an example of a new feature that we tested and rejected this way. We thought of a new way to select what prices you want to see on the search page, but users ended up engaging less with it than the old filter, so we did not launch it.
When you test a single change like this, the methodology is often called A/B testing or split testing. This post will not cover the basics of running an A/B test. A number of companies provide out-of-the-box solutions for running basic A/B tests, and a couple of bigger tech companies have open-sourced their internal systems for others to use. See Cloudera’s Gertrude, Etsy’s Feature, and Facebook’s PlanOut, for example.
The case of Airbnb
At Airbnb we have built our own A/B testing framework to run experiments, which you will be able to read more about in our upcoming blog post on the details of its implementation. A couple of features of our business make experimentation more involved than a regular change of a button color, and that’s why we decided to create our own testing framework.
First, users can browse when not logged in or signed up, making it more difficult to tie a user to actions. People often switch devices (between web and mobile) in the midst of booking. Also given that bookings can take a few days to confirm, we need to wait for those results. Finally, successful bookings are often dependent on available inventory and responsiveness of hosts — factors out of our control.
Our booking flow is also complex. First, a visitor has to make a search. The next step is for a searcher to actually contact a host about a listing. Then the host has to accept the inquiry, and finally the guest has to actually book the place. In addition, we have multiple flows that can lead to a booking: a guest can instantly book some listings without a contact, and can also make a booking request that goes straight to booking. This four-step flow is visualized in Figure 3. We look at the process of going through these four stages, but the overall conversion rate between searching and booking is our main metric.
How long do you need to run an experiment?
A very common source of confusion in online controlled experiments is how much time you need before you can draw a conclusion about the results of an experiment. The problem with the naive method of using the p-value as a stopping criterion is that the statistical test that gives you a p-value assumes that you designed the experiment with a sample size and effect size in mind. If you continuously monitor the development of a test and the resulting p-value, you are very likely to see an effect, even if there is none. Another common error is to stop an experiment too early, before an effect becomes visible.
Here is an example of an actual experiment we ran. We tested changing the maximum value of the price filter on the search page from $300 to $1000 as displayed below.
In Figure 5 we show the development of the experiment over time. The top graph shows the treatment effect (Treatment / Control - 1) and the bottom graph shows the p-value over time. As you can see, the p-value curve hits the commonly used significance threshold of 0.05 after 7 days, at which point the effect size is 4%. If we had stopped there, we would have concluded that the treatment had a strong and significant effect on the likelihood of booking. But we kept the experiment running and found that the experiment actually ended up neutral. The final effect size was practically null, with the p-value indicating that whatever effect remained should be regarded as noise.
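The two quantities plotted in that figure can be computed from raw counts. The sketch below is a minimal illustration, not the code behind the post: it assumes the p-value comes from a pooled two-proportion z-test (the post does not say which test was used), and the function names and the counts in the usage lines are made up.

```python
from math import sqrt
from statistics import NormalDist


def relative_lift(control_rate: float, treatment_rate: float) -> float:
    """Treatment effect as defined in the post: Treatment / Control - 1."""
    return treatment_rate / control_rate - 1.0


def two_proportion_p_value(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n_c + 1.0 / n_t))
    if se == 0.0:
        # No variation at all (e.g. zero conversions in both groups).
        return 1.0
    z = (p_t - p_c) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 10,000 users per group, 500 vs. 520 bookings.
lift = relative_lift(500 / 10_000, 520 / 10_000)
p = two_proportion_p_value(500, 10_000, 520, 10_000)
print(f"lift = {lift:.1%}, p-value = {p:.3f}")
```

With these made-up numbers a 4% lift is nowhere near significant, which shows how much data a small effect needs before the p-value means anything.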
How did we know not to stop when the p-value hit 0.05? It turns out that this pattern of hitting “significance” early and then converging back to a neutral result is actually quite common in our system. There are various reasons for this. Users often take a long time to book, so the early converters have a disproportionately large influence in the beginning of the experiment. Also, even small sample sizes in online experiments are massive on the scale of the classical statistics in which these methods were developed. Since the statistical test is a function of the sample size and the effect size, if an early effect size is large through natural variation, the p-value is likely to be below 0.05 early. But the most important reason is that you are performing a statistical test every time you compute a p-value, and the more often you do it, the more likely you are to find an effect.
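The last point is easy to check with a small simulation. The sketch below is an illustration under assumed parameters (traffic, base rate, and number of days are invented, and the p-value again comes from an assumed pooled two-proportion z-test). It runs many experiments in which treatment and control are identical, recomputes the p-value every day, and compares how often the p-value dips below 0.05 at any point with how often it is below 0.05 on the final day.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n_c + 1.0 / n_t))
    if se == 0.0:
        return 1.0
    z = (conv_t / n_t - conv_c / n_c) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_peeking(n_experiments=300, days=20, users_per_day=200,
                     base_rate=0.1, alpha=0.05, seed=7):
    """Return (share ever 'significant', share 'significant' on the last day).

    Both groups convert at the same base_rate, so every 'significant'
    result is a false positive.
    """
    rng = random.Random(seed)
    ever_significant = 0
    significant_at_end = 0
    for _experiment in range(n_experiments):
        conv_c = conv_t = n = 0
        crossed = False
        p = 1.0
        for _day in range(days):
            n += users_per_day
            conv_c += sum(rng.random() < base_rate for _ in range(users_per_day))
            conv_t += sum(rng.random() < base_rate for _ in range(users_per_day))
            p = p_value(conv_c, n, conv_t, n)  # the daily "peek"
            if p < alpha:
                crossed = True
        ever_significant += crossed
        significant_at_end += p < alpha
    return ever_significant / n_experiments, significant_at_end / n_experiments


if __name__ == "__main__":
    peeking, final = simulate_peeking()
    print(f"false positives when stopping at the first p < 0.05: {peeking:.1%}")
    print(f"false positives when testing once at the end:       {final:.1%}")
```

The final-day rate should sit near the nominal 5%, while the "ever crossed" rate is several times higher, even though there is no real effect in any of these experiments.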
As a side note, people familiar with our website might notice that, at time of writing, we did in fact launch the increased max price filter, even though the result was neutral. We found that certain users like the ability to search for high-end places and decided to accommodate them, given there was no dip in the metrics.
How long should experiments run for then? To prevent a false negative (a Type II error), the best practice is to decide, before you start the experiment, on the minimum effect size that you care about, and then to compute how long to run the experiment for, based on the sample size (the number of new samples that come in every day) and the certainty you want. Here is a resource that helps with that computation. Setting the time in advance also minimizes the likelihood of finding a result where there is none.
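As a rough sketch of that computation, the code below uses the standard normal-approximation formula for comparing two proportions. The formula is a textbook one, not something taken from the post, and the base rate, minimum effect, and daily traffic in the usage lines are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size_per_group(base_rate: float,
                                   min_relative_effect: float,
                                   alpha: float = 0.05,
                                   power: float = 0.8) -> int:
    """Users needed in each group to detect the given relative lift.

    Normal-approximation formula for a two-sided test on two proportions.
    """
    if min_relative_effect == 0:
        raise ValueError("the minimum effect size must be non-zero")
    p1 = base_rate
    p2 = base_rate * (1.0 + min_relative_effect)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # false-positive control
    z_power = NormalDist().inv_cdf(power)              # false-negative control
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(n_per_group: int, new_users_per_day: int, n_groups: int = 2) -> int:
    """Days until every group has enough users, with traffic split evenly."""
    return ceil(n_per_group * n_groups / new_users_per_day)


# Hypothetical inputs: 10% baseline conversion, 5% relative lift worth
# detecting, 10,000 new users entering the experiment per day.
n = required_sample_size_per_group(base_rate=0.10, min_relative_effect=0.05)
print(n, "users per group,", days_to_run(n, new_users_per_day=10_000), "days")
```

Halving the minimum effect size roughly quadruples the required sample, which is why it pays to be honest up front about the smallest effect you actually care about.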
One problem, though, is that we often don’t have a good idea of the size, or even the direction, of the treatment effect. It could be that a change is actually hugely successful and major profits are being lost by not launching the successful variant sooner. Or, on the other side, sometimes an experiment introduces a bug, which makes it much better to stop the experiment early before more users are alienated.
The moment when an experiment dabbles in the otherwise “significant” region could be an interesting one, even when the pre-allotted time has not passed yet. In the case of the price filter experiment example, you can see that when “significance” was first reached, the graph clearly did not look like it had converged yet. We have found this heuristic to be very helpful in judging whether or not a result looks stable. It is important to inspect the development of the relevant metrics over time, rather than to consider the single result of an effect with a p-value.
We can use this insight to be a bit more formal about when to stop an experiment before the allotted time. This can be useful if you want to make an automated judgment call on whether the change that you’re testing is performing particularly well or particularly badly, which is helpful when you’re running many experiments at the same time and cannot manually inspect them all systematically. The intuition behind it is that you should be more skeptical of early results. Therefore, the p-value threshold below which you call a result is very low at the beginning. As more data comes in, you can increase the threshold, as the likelihood of finding a false positive is much lower later in the game.
We solved the problem of how to figure out the p-value threshold at which to stop an experiment by running simulations and deriving a curve that gives us a dynamic (in time) p-value threshold to determine whether or not an early result is worth investigating. We wrote code to simulate our ecosystem with various parameters and used this to run many simulations with varying values for parameters like the real effect size, variance and different levels of certainty. This gives us an indication of how likely it is to see false positives or false negatives, and also how far off the estimated effect size is in case of a true positive. In Figure 6 we show an example decision boundary.
It should be noted that this curve is very particular to our system and the parameters that we used for this experiment. We share the graph as an example for you to use for your own analysis.
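To make the idea of a time-dependent threshold concrete, here is a minimal sketch. The power-law shape, the `steepness` parameter, and the 30-day planned duration are all assumptions chosen for illustration; they are not the curve from Figure 6, which was derived from simulations of our own system. In practice you would fit the shape to simulations of your own traffic.

```python
def dynamic_p_threshold(day: int, planned_days: int,
                        final_alpha: float = 0.05,
                        steepness: float = 3.0) -> float:
    """P-value threshold that starts near zero and rises to final_alpha.

    Assumed shape: final_alpha * (elapsed fraction) ** steepness.
    """
    if day <= 0 or planned_days <= 0:
        raise ValueError("day and planned_days must be positive")
    fraction = min(day / planned_days, 1.0)
    return final_alpha * fraction ** steepness


def worth_investigating(p_value: float, day: int, planned_days: int,
                        final_alpha: float = 0.05,
                        steepness: float = 3.0) -> bool:
    """True if an early result clears the (stricter) early threshold."""
    return p_value < dynamic_p_threshold(day, planned_days, final_alpha, steepness)


# Hypothetical 30-day experiment: p = 0.05 on day 7 is not flagged,
# because the threshold on day 7 is far below 0.05.
print(dynamic_p_threshold(7, 30))
print(worth_investigating(0.05, day=7, planned_days=30))
```

Under this assumed curve, a result like the price filter experiment, which touched 0.05 after 7 days, would not have triggered an early call.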
Understanding results in context
A second pitfall is failing to understand results in their full context. In general, it is good practice to evaluate the success of an experiment based on a single metric of interest. This is to prevent cherry-picking of ‘significant’ results in the midst of a sea of neutral ones. However, by just looking at a single metric you lose a lot of context that could inform your understanding of the effects of an experiment.
Let’s go through an example. Last year we embarked on a journey to redesign our search page. Search is a fundamental component of the Airbnb ecosystem. It is the main interface to our inventory and the most common way for users to engage with our website. So, it was important for us to get it right. In Figure 7 you can see the before and after stages of the project. The new design puts more emphasis on pictures of the listings (one of our assets since we offer professional photography to our hosts) and the map that displays where listings are located. You can read about the design and implementation process in another blog post here.
A lot of work went into the project, and we all thought it was clearly better; our users agreed in qualitative user studies. Despite this, we wanted to evaluate the new design quantitatively with an experiment. This can be hard to argue for, especially when testing a big new product like this. It can feel like a missed marketing opportunity if we don’t launch to everyone at the same time. However, to keep in the spirit of our testing culture, we did test the new design — to measure the actual impact and, more importantly, gather knowledge about which aspects did and didn’t work.
After waiting for enough time to pass, as calculated with the methodology described in the previous section, we ended up with a neutral result. The change in the global metric was tiny and the p-value indicated that it was basically a null effect. However, we decided to look into the context and to break down the result to try to see if we could figure out why this was the case. Because we did this, we found that the new design was actually performing fine in most cases, except for Internet Explorer. We then realized that the new design broke an important click-through action for certain older versions of IE, which obviously had a big negative impact on the overall results. When we fixed this, IE displayed similar results to the other browsers, a boost of more than 2%.
Apart from teaching us to pay more attention to QA for IE, this was a good example of what lessons you can learn about the impact of your change in different contexts. You can break results down by many factors like browser, country and user type. It should be noted that doing this in the classic A/B testing framework requires some care. If you test breakdowns individually as if they were independent, you run a big risk of finding effects where there are none, just like in the example of continuously monitoring the effect in the previous section. It’s very common to look at a neutral experiment, break it down many ways and find a single ‘significant’ effect. Declaring victory for that particular group is likely to be incorrect. The reason for this is that you are performing multiple tests with the assumption that they are all independent, which they are not, and every additional test adds another chance of a false positive. One way of dealing with this problem is to lower the p-value threshold below which you decide an effect is real. Read more about this approach here. Another way is to model the effects on all breakdowns directly with a more advanced method like logistic regression.
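One standard way to lower the threshold is the Bonferroni correction, which divides the significance level by the number of breakdowns tested. Naming Bonferroni is an assumption here, since the text above does not say which correction it has in mind, and the browser p-values below are invented for illustration.

```python
def bonferroni_significant(p_values: dict[str, float],
                           alpha: float = 0.05) -> dict[str, bool]:
    """Flag segments whose p-value clears the Bonferroni-corrected threshold.

    With m segments, each one is tested at alpha / m instead of alpha.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    threshold = alpha / len(p_values)
    return {segment: p < threshold for segment, p in p_values.items()}


# Hypothetical per-browser p-values from one experiment.
by_browser = {"Chrome": 0.30, "Firefox": 0.04, "Safari": 0.60, "IE": 0.004}
print(bonferroni_significant(by_browser))
```

With four segments the per-segment threshold becomes 0.0125, so a p-value of 0.04 that would look 'significant' on its own no longer counts, while 0.004 still does. The correction is conservative, which is one reason to consider modeling all breakdowns jointly instead.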
Assuming the system works
The third and final pitfall is assuming that the system works the way you think or hope it does. This should be a concern if you build your own system to evaluate experiments as well as if you use a third party tool. In either case, it’s possible that what the system tells you does not reflect reality. This can happen either because it’s faulty or because you’re not using it correctly. One way to evaluate the system and your interpretation of it is by formulating hypotheses and then verifying them.
Another way of looking at this is the observation that results too good to be true have a higher likelihood of being false. When you encounter results like this, it is good practice to be skeptical of them and scrutinize them in whatever way you can think of, before you consider them to be accurate.
A simple example of this process is to run an experiment where the treatment is equal to the control. These are called A/A or dummy experiments. In a perfect world the system would return a neutral result (most of the time). What does your system return? We ran many ‘experiments’ like this (see an example run in Figure 9) and identified a number of issues within our own system as a result. In one case, we ran a number of dummy experiments with varying sizes of control and treatment groups. A number of them were evenly split, for example with a 50% control and a 50% treatment group (where everybody saw exactly the same website). We also added cases like a 75% control and a 25% treatment group. The results that we saw for these dummy experiments are displayed in Figure 10.
You can see that in the experiments where the control and treatment groups are the same size, the results look neutral as expected (it’s a dummy experiment so the treatment is actually the same as the control). But, for the case where the group sizes are different, there is a massive bias against the treatment group.
We investigated why this was the case, and uncovered a serious issue with the way we assigned visitors who are not logged in to treatment groups. The issue is particular to our system, but the general point is that verifying that the system works the way you think it does is worthwhile and will probably lead to useful insights.
One thing to keep in mind when you run dummy experiments is that you should expect some results to come out as non-neutral. This is because of the way the p-value works. For example, if you run a dummy experiment and look at its performance broken down by 100 different countries, then at a significance threshold of 0.05 you should expect, on average, 5 of them to give you a non-neutral result. Keep this in mind when you’re scrutinizing a third-party tool!
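The arithmetic behind that expectation, and a quick simulated check of it, can be sketched as follows. The segment sizes, the conversion rate, and the pooled two-proportion z-test are assumptions made for the sake of the example; the simulation also treats the 100 segments as independent.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n_c + 1.0 / n_t))
    if se == 0.0:
        return 1.0
    z = (conv_t / n_t - conv_c / n_c) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def expected_false_positives(n_segments: int, alpha: float = 0.05) -> float:
    """Average number of 'significant' segments when nothing changed."""
    return n_segments * alpha


def count_false_positives(rng: random.Random, n_segments: int = 100,
                          users_per_group: int = 1000, rate: float = 0.1,
                          alpha: float = 0.05) -> int:
    """Run one dummy (A/A) experiment and count 'significant' segments."""
    hits = 0
    for _segment in range(n_segments):
        conv_c = sum(rng.random() < rate for _ in range(users_per_group))
        conv_t = sum(rng.random() < rate for _ in range(users_per_group))
        if p_value(conv_c, users_per_group, conv_t, users_per_group) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    rng = random.Random(2014)
    counts = [count_false_positives(rng) for _ in range(10)]
    print("expected per run:", expected_false_positives(100))
    print("simulated runs:  ", counts)
```

Individual runs scatter around the expected value of 5, so seeing 3 or 8 'significant' countries in a single dummy experiment is not, by itself, evidence that a tool is broken.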
Controlled experiments are a great way to inform decisions around product development. Hopefully, the lessons in this post will help prevent some common A/B testing errors.
First, the best way to determine how long you should run an experiment is to compute, in advance, the sample size you need to make an inference. If the system gives you an early result, you can try to make a heuristic judgment on whether or not the trends have converged. It’s generally good to be conservative in this scenario. Finally, if you do need to make procedural launch and stopping decisions, it’s good to be extra careful by employing a dynamic p-value threshold to determine how certain you can be about a result. The system we use at Airbnb to evaluate experiments employs all three ideas to help us with our decision-making around product changes.
It is important to consider results in context. Break them down into meaningful cohorts and try to deeply understand the impact of the change you made. In general, experiments should be run to make good decisions about how to improve the product, rather than to aggressively optimize for a metric. Optimizing is not impossible, but often leads to opportunistic decisions for short-term gains. By focusing on learning about the product you set yourself up for better future decisions and more effective tests.
Finally, it is good to be scientific about your relationship with the reporting system. If something doesn’t seem right or if it seems too good to be true, investigate it. A simple way of doing this is to run dummy experiments, but any knowledge about how the system behaves is useful for interpreting results. At Airbnb we have found a number of bugs and counter-intuitive behaviors in our system by doing this.
Together with Will Moss, I gave a public talk on this topic in April 2014. You can watch a video recording of it here. Will published another blog post on the infrastructure side of things; you can read it here. We hope this post was insightful for those who want to improve their own experimentation.
Originally published at nerds.airbnb.com on May 27, 2014.