The Three Types of A/B Tests

Carson Forter
Towards Data Science
6 min read · Dec 31, 2017

If you work in or around data you’ll likely know that the term data science is much contested. What it means and who gets to call themselves a data scientist is discussed, disputed, and mulled over in countless articles and blog posts. This post is not part of that dialogue — but it is about a similarly ambiguous and often misunderstood concept in the world of data: A/B tests.

In the tech world, the term A/B test is used to refer to any number of experiments where random assignment is used to tease out a causal relationship between a treatment, typically some change to a website, and an outcome, often a metric that the business is interested in changing.

But the case I’ll try to make in this post is that there are really (at least) three different types of experiments that web businesses run, and classifying them all under a single umbrella can lead to poorly designed experiments and misunderstood results.

A/B Tests

The simplest kind of experiment typically focuses on UI changes. A product team will test two or more variations of a webpage or product feature that are identical except for one component, say the headline copy of an article or the color of a button. Google famously tested 41 different shades of blue for a button to see which one got the best click-through rate. While A/B refers to the two variations being tested, there can of course be many variants, as with Google’s experiment.

In this type of test, there is usually just one, or perhaps two, metrics the product team cares about, and whichever variant performs best on that metric (or metrics) will be picked. In other words, the hypothesis is always: this UI change will increase/decrease metric X. Once that has been assessed, the winning change is made permanent and the team moves on to the next test.
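
As a rough illustration of what the analysis might look like, here’s a minimal sketch in Python that compares click-through rates between two variants with a two-proportion z-test. The counts and the choice of statsmodels are my own assumptions for the example, not a description of how any particular team runs these tests.

    # A two-proportion z-test on click-through rates for two variants.
    # The click and impression counts below are hypothetical.
    from statsmodels.stats.proportion import proportions_ztest

    clicks = [1320, 1405]         # clicks in variant A and variant B
    impressions = [50000, 50000]  # users shown each variant

    z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
    print(f"CTR A: {clicks[0] / impressions[0]:.4f}, CTR B: {clicks[1] / impressions[1]:.4f}")
    print(f"z = {z_stat:.2f}, p = {p_value:.4f}")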

My Take

The benefit of any given change identified by one of these tests is going to be tiny. Think fractions of a percent. For these tests to have any material impact on your business you need two things:

  1. the infrastructure to run and analyze them rapidly — ideally automatically
  2. a user base big enough that your tests are powered appropriately even over a short period of time

The upshot is that I don’t see these types of tests being very effective anywhere but the largest companies: Google, Facebook, Netflix, etc. They have both the mature infrastructure to run many of these tests quickly and the huge user bases that allow them to identify tiny treatment effects with statistical significance. If you’re at one of these companies, this type of rapid testing is quite valuable, since the small changes can add up quickly, but otherwise your effort is better spent elsewhere.

If you are in a position to run these types of tests, you really want to pay attention to the details. Appropriate power levels and p-value corrections for multiple comparisons are critical to making sure that the wins from these tests add up to a material overall improvement.
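
To make those two details a bit more concrete, here’s a hedged sketch using statsmodels: a sample-size calculation for a small lift in conversion rate, and a Holm correction across several p-values. The baseline rate, minimum detectable effect, and p-values are all invented for illustration.

    # Power and multiple-comparison corrections with statsmodels.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.multitest import multipletests

    # 1. Users needed per variant to detect a lift from a 2.0% to a 2.2%
    #    conversion rate with 80% power at alpha = 0.05 (two-sided).
    effect = proportion_effectsize(0.022, 0.020)
    n_per_variant = NormalIndPower().solve_power(effect_size=effect, power=0.8, alpha=0.05)
    print(f"Users needed per variant: {n_per_variant:,.0f}")

    # 2. Holm correction when several variants or metrics are tested at once.
    raw_p_values = [0.003, 0.04, 0.20, 0.01]  # hypothetical results from four tests
    reject, corrected, _, _ = multipletests(raw_p_values, alpha=0.05, method="holm")
    print(list(zip(corrected.round(3), reject)))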

Product Rollouts

The second common scenario where a randomized experiment can be helpful is when rolling out a complete product that a company is already committed to launching. Think something like Facebook’s newsfeed launch or Linkedin’s full site redesign. By the time something this big has been built, the launch is very, very unlikely to be permanently rolled back no matter what the metrics say.

Rather, the randomized experiment in this case is for visibility, and to provide information that might help with making future decisions.

Visibility here typically means bugs — did you somehow break a fundamental feature with this launch? Future decisions, on the other hand, can be informed by a randomized rollout in that you’ll know the true impact of your launch. If your redesign or new feature had a positive impact, similar endeavors might be worth looking into. If your results were neutral or negative, that will help you assess whether projects like this are really worth working on going forward.

My Take

A staged rollout like this is not hypothesis driven. You’re not trying to find evidence for a particular idea — you’re just monitoring a new product to look for encouragement or red flags.

These types of experiments shouldn’t be analysis heavy, and I wouldn’t sweat the statistical details as much as with an A/B test. Worrying about statistical power or p-value corrections is not particularly relevant and is likely time that could be better used elsewhere — you’re really just looking for directional evidence on whether a launch was net positive or not.
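
To show what I mean by directional evidence, here’s a minimal Python sketch of the kind of lightweight check I have in mind, assuming a hypothetical per-user metrics table with a group column; the metrics and the flag threshold are illustrative, not a standard.

    # A lightweight, directional rollout check. Column names, the metrics,
    # and the 5% flag threshold are all illustrative assumptions.
    import pandas as pd

    def rollout_check(df, metrics, flag_threshold=0.05):
        """Per-metric means by group, relative change, and a flag for big moves."""
        summary = df.groupby("group")[metrics].mean().T
        summary["rel_change"] = summary["treatment"] / summary["control"] - 1
        summary["flag"] = summary["rel_change"].abs() > flag_threshold
        return summary

    # Hypothetical per-user metrics logged during the staged rollout.
    data = pd.DataFrame({
        "group": ["control"] * 3 + ["treatment"] * 3,
        "pageviews": [10, 12, 11, 11, 13, 12],
        "errors": [0.1, 0.0, 0.2, 0.1, 0.4, 0.3],
    })
    print(rollout_check(data, ["pageviews", "errors"]))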

I would add one caution: the results from these rollouts suffer from two big sources of uncertainty that can sometimes make them difficult to interpret.

One is that when you build an entirely new feature and roll it out through this type of test, the code that collects the data in the treatment group is often new as well. The data that comes from the control group, however, has typically existed for some time. This means big differences between the two groups are sometimes driven not by user behavior but by differences in the way the data was collected. Because of this, care needs to be taken both in instrumenting your data and in interpreting results. I’d be suspicious of double-digit percentage changes and investigate the data-logging logic as the most likely cause.

The second bit of uncertainty is that even if you’re totally confident in your data, a big change like this differs from the pre-launch version of your product in so many ways that identifying why the metrics changed is challenging. The best way to mitigate this is to get ahead of the issue and collect lots of behavioral user data, so you’re not blindly trying to explain a big drop.

Scientific Experiments

Finally we come to what I think of as true scientific experiments. These are the closest analogue to randomized controlled trials in the social sciences and economics: you have a non-obvious hypothesis about the way the world (or in this case your product) works and you design an experiment that will test it empirically with data.

Some example hypotheses (the first is sketched in code after the list):

  • Subscriptions increase logarithmically with the volume of upsell messages
  • Encouraging users to add friends on your website increases daily active users
  • Recommending similar products increases, rather than cannibalizes, revenue
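
To make the first hypothesis concrete, here’s a rough sketch of one way it might be tested: randomize users to different upsell-message volumes, then compare a logistic model with a log term against a linear one. The simulated data and the modeling choice are assumptions for illustration only.

    # Testing the first hypothesis: randomize upsell volume, then compare a
    # log-term model against a linear one. All data below is simulated.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 5000
    upsells = rng.integers(1, 21, size=n)         # randomly assigned message volume
    p_subscribe = 0.02 + 0.03 * np.log(upsells)   # simulated logarithmic relationship
    subscribed = rng.binomial(1, p_subscribe)

    df = pd.DataFrame({"upsells": upsells, "subscribed": subscribed})
    log_model = smf.logit("subscribed ~ np.log(upsells)", data=df).fit(disp=0)
    lin_model = smf.logit("subscribed ~ upsells", data=df).fit(disp=0)
    print(f"log model AIC: {log_model.aic:.1f} vs linear model AIC: {lin_model.aic:.1f}")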

My Take

In my view these types of questions are the most important as they don’t just provide information for a one-time decision (should I make this button blue or red?), but instead offer generalizable knowledge that will inform how you think about and build your product in perpetuity. These are the types of insights on which successful products are built.

Perhaps unsurprisingly then, I think it is these types of questions that data scientists are uniquely positioned to answer. Whereas A/B tests can be automated, and rollouts can be monitored by someone without much technical knowledge, scientific experiments need a mix of business, product, and statistical skills that usually only data scientists will have.

For experiments like these, the statement of the hypothesis and the experimental design are the focus. Thinking about issues like the generalizability of results, heterogeneous treatment effects, and the choice of outcome metrics and control variables is critical. Getting these right has serious implications for the quality and usefulness of the experiment’s findings.
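
As one concrete illustration of those design issues, here’s a hedged sketch of probing heterogeneous treatment effects with a treatment-by-segment interaction, plus a pre-period covariate as a control. The data, the segment, and the column names are all hypothetical.

    # Checking for heterogeneous treatment effects with an interaction term.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 4000
    df = pd.DataFrame({
        "treated": rng.integers(0, 2, size=n),       # random assignment
        "new_user": rng.integers(0, 2, size=n),      # segment of interest
        "pre_activity": rng.poisson(5, size=n),      # pre-experiment control variable
    })
    # Simulated outcome: the treatment helps new users more than existing ones.
    df["active_days"] = (
        2 + 0.5 * df["treated"] + 1.0 * df["treated"] * df["new_user"]
        + 0.3 * df["pre_activity"] + rng.normal(0, 1, size=n)
    )

    model = smf.ols("active_days ~ treated * new_user + pre_activity", data=df).fit()
    print(model.summary().tables[1])  # the treated:new_user row is the extra effect for new users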

While dashboards and charts are probably the best way to communicate the first two types of tests, scientific findings need to be written up. Numbers alone won’t communicate the results — you need numbers, of course, but also the context, the implementation details, and, perhaps most importantly, a narrative that fits your findings into a broader understanding of your products, users, and business. Though these findings will be credible and useful, they should never be final: your whole organization should continue to learn and update its ideas as your body of research expands.

Three Tests, One Goal

I’ve outlined what I think of as the broad categories of experimentation at a software company: A/B tests, rollouts, and scientific experiments. I’m advocating for labelling and conceiving of each of these as distinct techniques.

I think doing this will improve our process and expectations for each of these types of tests, and ultimately will lead to better decision-making for our organizations — which after all is what data science is all about.

I’m a data scientist and researcher working in the tech industry and writing about it here on Medium. You can also follow me on Twitter and Linkedin.
