What do AB tests actually measure?

If you’re like me, you’ve done a lot of AB tests before thinking too deeply about the experiment methodology. After all, they’re standard, battle-tested methods for making data-driven product decisions. We randomize exposure to variants A and B; we make sure the test sticks to users across sessions; we wait long enough for dynamic effects to stabilize. It turns out there’s still a potentially big problem. There are also strategies for overcoming it. Let’s start with a motivating example.

Website X has a pretty good user following. They’re a proud, data-driven site, and, whenever possible, make product decisions based on the results of carefully designed AB tests. When a user lands on a page where a test is running, the user is assigned to variant A or variant B at random, and the assignment sticks to that user across different sessions on the site.

When website X runs an AB test, they let the test run. Users who visit the site during the test are included in it: at the time of their visit, each user is assigned at random to test or control by hashing their cookie ID to a test bucket. The experimenters wait until the sample is large enough to give them adequate statistical power to test their hypothesis, and then they stop the test.
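To make the assignment step concrete, here is a minimal sketch of hash-based bucketing. The function name, the experiment key, the number of buckets, and the choice of SHA-256 are all assumptions for illustration; the post doesn’t say how site X actually implements it.

```python
import hashlib


def assign_variant(cookie_id: str, experiment: str,
                   n_buckets: int = 100, treatment_share: float = 0.5) -> str:
    """Deterministically map a cookie ID to variant 'A' or 'B'.

    Hypothetical helper: hashing the (experiment, cookie) pair means the same
    cookie always lands in the same bucket, so the assignment sticks across
    sessions without storing any state.
    """
    key = f"{experiment}:{cookie_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % n_buckets  # bucket index in [0, n_buckets)
    return "B" if bucket < treatment_share * n_buckets else "A"


if __name__ == "__main__":
    # The same cookie gets the same variant on every call.
    print(assign_variant("cookie-123", "new-homepage"))
    print(assign_variant("cookie-123", "new-homepage"))
```

Note that the hash only randomizes which variant a visitor sees. It has no say over who shows up to be hashed in the first place.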

Can you see what has gone wrong here? It’s really one critical mistake that can manifest in several different ways. The problem is called “self-selection bias,” and this example manifests at least two different versions of it. (I’ve looked around for some papers on self-selection bias in AB testing, but haven’t found much on the subject. If anyone has references, I’d love for you to post them in the comments!)

Self-selection Bias

The basic problem is that people aren’t selected at random into the experiment. There is a whole universe of people, U. Some subset of them, I, are users of the internet, where I ⊆ U. Some subset of them, W, are users of your website, so W ⊆ I. Some subset, T, of those will be online and assigned into your AB test while the test is running, so T ⊆ W. This subset T is not a random subset of all potential site users (i.e. the set I, or arguably, U). It’s not even a random subset of your current site users, W.

When most of us think of running an AB test, we’re thinking about running the test in W, not T. Have you ever considered whether the user group selected into your test is representative of the general population of your site? There’s a mechanism that can make sure it’s not: time. We’ll elaborate more on this in the next section.

Even worse than this temporal bias, sometimes we’d like to be running the test in I, the whole internet population, not W. Consider the case when your performance metric (your key performance indicator, or KPI) is related to growth. If your user base, W, is much smaller than I, and you successfully change your site in a way that people in I will love but people in W will hate, then your test will give very negative results in the short term, even though it has opened up enormous potential for growth.

Time to get Quantitative

Now that we understand the problem, we’d like to know how to solve it. It’s time to get some code and math involved.

The following embed will work better if you view it on GitHub. You can find it here.

So the upshot is that the experiment is really measuring a conditional average treatment effect: the treatment’s effect on the people who entered the experiment. This makes sense: we’re really measuring a treatment effect within the population that visits the site while our experiment is running. The problem arises when you don’t consider whether that is the right population for your experiment. If it’s not, then you have to worry about whether the effect you measure is really the one you’re interested in. In this example, there was a large bias!
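For readers without the notebook at hand, here is a small stand-in simulation of the same idea. It is a sketch under my own assumptions, not the original notebook: a trait u drawn from a standard normal, an individual treatment effect equal to u, and a logistic probability of entering the test that increases with u.

```python
import numpy as np


def simulate(n_users: int = 200_000, seed: int = 0) -> dict:
    """Compare the population-wide effect with what the AB test measures."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n_users)              # assumed user-level trait
    y0 = rng.normal(size=n_users)             # potential outcome under control
    y1 = y0 + u                               # assumed: individual effect equals u

    # Assumed selection mechanism: higher u makes entering the test more likely.
    p_enter = 1.0 / (1.0 + np.exp(-2.0 * u))
    in_test = rng.random(n_users) < p_enter

    # Randomized assignment; only users in the test are ever observed.
    treated = rng.random(n_users) < 0.5
    y_obs = np.where(treated, y1, y0)
    ab_estimate = (y_obs[in_test & treated].mean()
                   - y_obs[in_test & ~treated].mean())

    return {
        "population_ate": float((y1 - y0).mean()),
        "test_population_ate": float((y1 - y0)[in_test].mean()),
        "ab_estimate": float(ab_estimate),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the average effect over everyone is close to zero, while the randomized comparison inside the test recovers the clearly positive average effect among the people who entered. The randomization is doing its job; it just answers the question for T rather than for W.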

The problem can get worse than this. Suppose instead that the u variable is actually the user’s level of activity: a user with higher u would tend to click on more things, visit the site more often, share more often, etc.

In the early days of an AB test, you really select users with high visit rates: they’re more likely to be on the site at a given time. If a treatment is more effective on highly active users, then your test ends up biased because it really only measures the treatment effect on very active users of the site.
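A quick way to see this selection is to simulate who gets enrolled as a function of how long the test runs. This sketch assumes each user visits independently each day with a personal probability drawn from a Beta(1, 9) distribution (mean 0.1); both the independence and the distribution are illustrative assumptions, not measurements from any real site.

```python
import numpy as np


def mean_visit_rate_of_enrolled(days: int, n_users: int = 100_000,
                                seed: int = 0) -> tuple:
    """Return (mean visit rate of all users, mean visit rate of enrolled users)."""
    rng = np.random.default_rng(seed)
    p_visit = rng.beta(1.0, 9.0, size=n_users)   # assumed daily visit probability

    # A user is enrolled if they visit at least once during the test window.
    p_enrolled = 1.0 - (1.0 - p_visit) ** days
    enrolled = rng.random(n_users) < p_enrolled

    return float(p_visit.mean()), float(p_visit[enrolled].mean())


if __name__ == "__main__":
    for days in (1, 7, 28):
        overall, enrolled = mean_visit_rate_of_enrolled(days)
        print(f"{days:>2} days: all users {overall:.3f}, enrolled {enrolled:.3f}")
```

In this setup the users enrolled after one day are noticeably more active than the user base as a whole, and the gap narrows, without fully closing, as the window gets longer.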

Even worse, if a test is very successful among power users, and hurts normal users, it would be possible to have a positive AB test result while whittling away your audience!
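Here is that scenario as a small worked calculation. The two segments and their numbers are made up for illustration: 10% power users who visit on 90% of days and gain +2.0 on the KPI, and 90% normal users who visit on 5% of days and lose 0.5. Daily visits are assumed independent.

```python
def population_effect(segments):
    """Average effect over all users: each segment weighted by its share."""
    return sum(share * effect for share, _, effect in segments)


def measured_effect(segments, days):
    """Expected effect among users enrolled within `days` days.

    Each segment is (share_of_users, daily_visit_probability, effect).
    A user is enrolled if they visit at least once during the test.
    """
    weights = [share * (1.0 - (1.0 - p) ** days) for share, p, _ in segments]
    total = sum(weights)
    return sum(w * effect for w, (_, _, effect) in zip(weights, segments)) / total


# Hypothetical segments: (share, daily visit probability, effect on KPI)
SEGMENTS = [
    (0.1, 0.90, 2.0),    # power users: love the change
    (0.9, 0.05, -0.5),   # normal users: hurt by the change
]

if __name__ == "__main__":
    print(f"all users: {population_effect(SEGMENTS):+.3f}")
    for days in (1, 7, 30):
        print(f"enrolled within {days:>2} days: {measured_effect(SEGMENTS, days):+.3f}")
```

With these made-up numbers the effect over all users is negative, yet a one-day test is expected to report a positive effect, because the enrolled sample is dominated by power users.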

A way to avoid some of this bias is by doing assignment offline from users who have visited over a long time range (more representative of W), and doing intent-to-treat analysis after the fact. With this approach, you give up on getting a point estimate of the treatment effect in exchange for confidence intervals you can be sure about.

Another way that’s consistent with the AB testing implementation on site X is to let the test run for a longer period of time, to get a more representative sample of the general population on the website, and not just the ones who perform the most actions.
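How long is “longer”? One rough way to think about it, assuming a user visits independently each day with probability p, is to ask how many days it takes before that user has had a good chance of showing up at least once. The function name and the 95% default are my own choices for this sketch.

```python
import math


def days_to_reach(p_daily_visit: float, coverage: float = 0.95) -> int:
    """Smallest d such that 1 - (1 - p)^d >= coverage.

    Assumes independent daily visits with a constant probability, which is a
    simplification: real visit patterns are bursty and vary by weekday.
    """
    if not 0.0 < p_daily_visit < 1.0:
        raise ValueError("p_daily_visit must be strictly between 0 and 1")
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - coverage) / math.log(1.0 - p_daily_visit))


if __name__ == "__main__":
    for p in (0.9, 0.2, 0.05):
        print(f"daily visit probability {p}: {days_to_reach(p)} days")
```

The point of the exercise is that infrequent visitors set the required duration: the less often your typical user visits, the longer the test needs to run before they are fairly represented.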

Both of these methods still only address typical site visitors, W. There is a good body of literature on adjusting a non-representative sample (T isn’t representative of I) to look more like a representative sample. Post-stratification weighting is commonly used, e.g. in Google Consumer Surveys.
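As a rough illustration of post-stratification, the sketch below estimates the treatment effect within each stratum and then reweights the strata by their share of the target population. The record layout, the stratum labels, and the function name are hypothetical, and the approach only corrects for differences captured by the strata you choose.

```python
from collections import defaultdict


def post_stratified_effect(records, target_shares):
    """Reweight per-stratum effects to match a target population.

    records: iterable of (stratum, treated: bool, outcome: float)
    target_shares: dict mapping stratum -> share of the target population
    """
    if abs(sum(target_shares.values()) - 1.0) > 1e-9:
        raise ValueError("target shares must sum to 1")

    # Per stratum: [treated sum, treated count, control sum, control count]
    stats = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for stratum, treated, outcome in records:
        s = stats[stratum]
        if treated:
            s[0] += outcome
            s[1] += 1
        else:
            s[2] += outcome
            s[3] += 1

    effect = 0.0
    for stratum, share in target_shares.items():
        t_sum, t_n, c_sum, c_n = stats[stratum]
        if t_n == 0 or c_n == 0:
            raise ValueError(f"stratum {stratum!r} lacks treated or control users")
        effect += share * (t_sum / t_n - c_sum / c_n)
    return effect


if __name__ == "__main__":
    # Toy sample that over-represents power users relative to the target.
    sample = [
        ("power", True, 3.0), ("power", True, 3.0),
        ("power", False, 1.0), ("power", False, 1.0),
        ("normal", True, 0.5), ("normal", False, 1.0),
    ]
    print(post_stratified_effect(sample, {"power": 0.1, "normal": 0.9}))
```

In the toy sample a naive difference in means looks positive because power users are over-represented, while the reweighted estimate reflects the assumed 10/90 split of the target population.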

In summary, AB tests don’t measure treatment effects over the general population, U, of all people, or even the general internet population, I, or even the general population of your website, W. They measure treatment effects over a population, T, that selects itself into the AB test by visiting your site during the time the test is running. As a result, these users may not be representative of the population you would prefer to run your experiment on!

If you’d like more from my series on causal inference in data science, check out the index here!