4 Principles for Making Experimentation Count

Published in The Airbnb Tech Blog, Mar 21, 2017

For over two years I’ve been a Data Scientist on the Growth team at Airbnb. When I first started at the company, we were running fewer than 100 experiments in a given week; we’re now running about 700. As those of us in Growth know all too well, such growth does not happen organically. Instead, it comes about through cultivation. For us, this has meant not just building the right tools, such as our internal Experiment Reporting Framework (ERF), but actively shaping a robust culture of experimentation across all functions. Here I summarize four key principles that underlie our work and have led to step changes in the impact of experimentation on our business:

Product experimentation should be hypothesis-driven,
Defining the proper ‘exposed population’ is paramount,
Understanding power is essential, and
Failure is an opportunity.

Practicing these principles will not only save your org a ton of engineering time and money, but will also give you incredible insight into your users and product.

Product Experimentation Should be Hypothesis-, not Feature-driven

We have incredible engineering talent at Airbnb. This means that it’s easy to build a feature, but that doesn’t mean it should be built, or that the product will necessarily be better because of it. On the Growth team at Airbnb, we always start with the question, “What do the data say?” If you are not asking that question, you are pursuing an incredibly inefficient product optimization strategy. If you have to guess, you’re putting the cart before the horse and should do more work before experimenting.

Why do hypotheses matter? Without them you’re untethered, easily distracted by results that appear positive but could well be statistical flukes. In this situation it’s easy to make up a story that fits your findings instead of doing the hard work of understanding what’s going on. Things may surprise you, so let them! If there is something we do not understand, we typically update our hypotheses and add metrics for clarity mid-flight, as our experimentation pipeline incorporates them with ease.

As an example, my team ran an experiment using a new translation service on our web and native apps. Naturally, we assumed that this new and improved service would increase conversion for both platforms. We saw that booking conversion in our native apps was jumping up in the experiment group, but we couldn’t understand why something similar was not happening on the web. After puzzling over it we hypothesized this was due to a product change by another team whereby visitors on our native apps were more likely to be using the translation service than those on web. We added a measure for this and were right! A higher rate of users in our experiment group were using translation services in the first place, and our new translation service encouraged them to use it more often. We couldn’t detect a change in our web group because they weren’t getting the exogenous lift that the native group had. Learning this opened up a whole area of strategic opportunity for us. If we hadn’t updated our hypotheses, who knows where we would have ended up.

Defining the Proper ‘Exposed Population’ is Paramount

Don’t just launch a feature or set up an experiment and wait for the magic to happen. More often than not, there will be no magic. This does not mean you are not awesome, but it is a reminder that our job is hard. One area where I’ve seen countless teams struggle in experimentation is properly defining the exposed population. The exposed population defines who should see a feature and who should not, and is distinct from the exposure rate, which determines how much of the exposed population is going to be included in your experiment.

For example, my team wanted to launch a message translation feature for guests who speak a different language than their host, with the hypothesis that this feature would improve conversion to book. Determining language is easy enough, but it’s not enough. If we launched this feature for all guests who message and speak a different language than their host, we’d be overexposing our experiment, because not all guests who message are doing so to book. Some are on a trip and need to ask where the towels are. Others may have left a phone charger behind and would like it returned. While this may seem simple, you’d be surprised how often it is glossed over or even ignored, since preventing it often takes seasoned knowledge of your business’ APIs. As a result, the first question I ask when chatting with an engineer to understand what’s going on in an experiment is simple: ‘When does exposure happen and how is it determined?’ Make people draw it out for you if necessary. This conversation will pay dividends for both parties later. (If they say that properly exposing users may be too much work, first push back, then see the Hail Mary option below.)
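To make the distinction between exposed population and exposure rate concrete, here is a minimal Python sketch of the message translation example. Everything in it is an assumption for illustration: the `MessageContext` fields, the experiment name, and the hash-bucketing scheme are hypothetical and do not describe how ERF assigns users.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageContext:
    # Hypothetical fields; real signals would come from your own services.
    guest_id: int
    guest_language: str
    host_language: str
    has_reservation_with_host: bool  # already booked, on a trip, or checked out


def in_exposed_population(ctx: MessageContext) -> bool:
    """Who should see the feature: guests messaging before booking, across languages."""
    speaks_other_language = ctx.guest_language != ctx.host_language
    return speaks_other_language and not ctx.has_reservation_with_host


def in_experiment(guest_id: int, experiment: str, exposure_rate: float) -> bool:
    """How much of the exposed population to include, using a stable hash bucket."""
    digest = hashlib.sha256(f"{experiment}:{guest_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # deterministic value in [0, 1)
    return bucket < exposure_rate


def should_expose(ctx: MessageContext, experiment: str, exposure_rate: float) -> bool:
    # Check eligibility first, so ineligible guests never enter the analysis.
    return in_exposed_population(ctx) and in_experiment(
        ctx.guest_id, experiment, exposure_rate
    )


# The guest asking about towels already has a reservation; the other does not.
towel_question = MessageContext(1, "fr", "en", has_reservation_with_host=True)
booking_inquiry = MessageContext(2, "fr", "en", has_reservation_with_host=False)
```

With this split, the guest asking about towels is never counted, whatever the exposure rate, while the exposure rate only controls how many pre-booking guests are sampled.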

Sanity metrics can be helpful here. If your experiment is limited to current users, add some metrics that would indicate whether there are non-users (visitors) in your experiment, like signups. If you see significant numbers of signups happening in your experiment, you are probably not exposing it correctly. Another approach is to compute global coverage for your metrics. If you expect an entire population to see your feature, confirm that they do. Overexposure will dilute metrics, with implications for power. It’s really awful to build a great feature but not be able to detect its impact!
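A rough way to see why dilution hurts power is sketched below. It rests on two simplifying assumptions that are mine, not measurements: users who should not have been exposed convert at the same base rate as everyone else and are unaffected by the feature, and the required sample size scales with one over the squared effect (the usual approximation for a difference in means or proportions). The function names are hypothetical.

```python
def diluted_effect(true_effect: float, eligible_share: float) -> float:
    """Effect measured across everyone in the experiment when only
    `eligible_share` of them can actually be affected by the feature."""
    if not 0 < eligible_share <= 1:
        raise ValueError("eligible_share must be in (0, 1]")
    return true_effect * eligible_share


def sample_size_inflation(eligible_share: float) -> float:
    """Factor by which the number of analyzed users must grow to detect the
    diluted effect, assuming sample size scales with 1 / effect**2."""
    if not 0 < eligible_share <= 1:
        raise ValueError("eligible_share must be in (0, 1]")
    return 1 / eligible_share**2


# If only half of the users in the experiment could be affected by the
# feature, a 2-point lift among them shows up as a 1-point lift overall.
measured = diluted_effect(0.02, 0.5)
inflation = sample_size_inflation(0.5)
```

Under these assumptions, exposing twice as many users as you should means analyzing about four times as many to detect the same underlying change.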

Hail Mary: If you cannot accurately expose your experiment, make sure you have a way to identify the users who shouldn’t be in the experiment and drop them in the analysis stage. At Airbnb we do this by uploading an “exclusion table” to our experimentation pipeline, which includes all users that should be dropped from analysis due to improper exposure. Identifying these users can sometimes be incredibly onerous. If you are doing this work, make sure you share it with your partners, as it is in the best interest of your whole team to understand data challenges and resolve them in scalable ways.
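The idea of dropping users at analysis time can be sketched in a few lines of Python. This is only an illustration, assuming assignments are available as a mapping from user id to arm and the exclusion table is a list of user ids. The function name and data shapes are hypothetical and do not reflect the actual ERF pipeline.

```python
from collections import Counter


def apply_exclusions(assignments, excluded_user_ids):
    """Drop improperly exposed users and report how many were dropped per arm."""
    excluded = set(excluded_user_ids)
    kept = {uid: arm for uid, arm in assignments.items() if uid not in excluded}
    dropped_per_arm = Counter(
        arm for uid, arm in assignments.items() if uid in excluded
    )
    return kept, dropped_per_arm


assignments = {101: "control", 102: "treatment", 103: "treatment", 104: "control"}
exclusion_table = [103, 104, 999]  # 999 was never assigned, so it is ignored

kept, dropped_per_arm = apply_exclusions(assignments, exclusion_table)
```

Returning the per-arm drop counts is a deliberate choice: if far more users are dropped from one arm than the other, the exclusion rule may depend on the treatment itself, which is worth investigating before trusting the results.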

Understanding Power is Essential

Power determines your ability to detect an effect in your experiment if there is one. You should not be running experiments if you do not understand this. You can, and should, do better than guessing.

Three suggestions:

Get a sense for base rates before you start an experiment using historical data. Without base rates, you are essentially in the dark about whether you can actually detect the impact of your awesome feature change.
Unless you are working on features you understand inside and out, go big or go home. Do not launch something if you do not think it will move the needle. This is especially important if you are unsure about base rates. Unless your base rates are massive, the only way to detect changes is to make big ones that will move metrics in a major way.
Remember that experimentation is not the only way to learn things about your users. Just because you lack power and therefore shouldn’t run a controlled experiment does not mean the game’s over. At Airbnb, we work closely with a team of researchers and survey scientists who do cutting-edge and thoughtful work on user behavior. See researchers as partners: their insights can be the very bridge your team needs to understand your users to develop a really impactful feature.
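To show how base rates and effect size drive power, here is a minimal sketch of a sample size calculation using the textbook normal approximation for comparing two proportions. The function name, the default significance level of 0.05, and the default power of 0.8 are my assumptions for illustration, not Airbnb settings.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed in each arm of a two-sided test to detect
    a relative lift over a historical base conversion rate."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must be in (0, 1) and the lift must be nonzero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


small_change = users_per_arm(base_rate=0.05, relative_lift=0.10)
big_change = users_per_arm(base_rate=0.05, relative_lift=0.50)
high_base_rate = users_per_arm(base_rate=0.20, relative_lift=0.10)
```

With a 5% base rate, detecting a 10% relative lift takes roughly 31,000 users per arm under these assumptions, while a 50% lift needs far fewer. That is the arithmetic behind "go big or go home": when base rates are low, only large changes are detectable with realistic traffic.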

Failure is an Opportunity: Use it

It’s sometimes tempting to use experimentation as a way to prove that you can move metrics, and if (when) you don’t, move on (to another awesome moonshot idea). You should be moving metrics, and when you do, you should be able to show it. But if you only focus on the wins you are going to miss a ton of insight, and risk being blind to mistakes.

Experiments do not fail; hypotheses are proven wrong

When this happens, make it your job to understand why. Some questions to get you started include:

Was the hypothesis wrong, or was the implementation/execution of the hypothesis flawed? We typically start with the latter and make our way to the former. Our work is complicated, often in ways that defy full comprehension by any single person. This means we do not always get things right the first time. If you are adding a feature that you hypothesize will affect downstream conversion, do not just look for changes in conversion: you may not be seeing any because your feature may not be working as intended. One easy way to test this is to make sure you have logging on the features you are testing. (This is why any good data scientist will push for proper logging before launch: we cannot measure things that do not exist.)
Are metrics moving together in the ways that I would expect? Funnels follow predictable patterns. If upstream metrics are not supporting your downstream metric movement, you’d better have a good explanation for it. (Pushing on this front is a quality of a great Product Manager.) But keep in mind confirmation bias: most of us are incentivized to confirm that our work is impactful. For this reason, at Airbnb we try to make it a regular practice to socialize our experiments through both informal check-ins and bi-weekly Experiment Reviews, whereby teams present on lessons learned from experimentation. The more people who hear your finding, the more feedback you will get about it. This process will test you and your team in all the right ways.
Am I testing bold enough hypotheses? You will probably feel awesome if your hypotheses are consistently borne out. But be careful! Don’t get caught in local maxima optimization. Keep pushing.
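For the first question, a quick check against feature logs can tell you whether the feature reached users at all. The sketch below assumes you can list the users who logged at least one feature event. The data shapes and the "translation_shown" event name are hypothetical.

```python
def feature_usage_rate(assignments, users_with_feature_event, arm):
    """Share of users in `arm` who logged at least one feature event."""
    arm_users = {uid for uid, assigned in assignments.items() if assigned == arm}
    if not arm_users:
        return 0.0
    return len(arm_users & set(users_with_feature_event)) / len(arm_users)


assignments = {1: "treatment", 2: "treatment", 3: "control", 4: "control"}
# Users who logged a hypothetical "translation_shown" event.
users_with_feature_event = {1}

treatment_rate = feature_usage_rate(assignments, users_with_feature_event, "treatment")
control_rate = feature_usage_rate(assignments, users_with_feature_event, "control")
```

A treatment rate near zero suggests the feature is not rendering or not being logged, and a nonzero control rate suggests the feature is leaking into the wrong group. Either one points to implementation rather than to the hypothesis.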

There is a learning curve here. But like most things, it gets easier. It is incredible what you can learn by routinely starting from a place of curiosity.

Conclusion

Experimentation is hard work. A sophisticated experimentation tool is just that: a tool. It does no work on its own. That leaves all of us who experiment with an incredible opportunity to shape the businesses we care about.

Intrigued? We’re always looking for talented people to join our Data Science and Analytics team!