When to Stop A/B Experiments Early

Iqbal Ali
Published in The Startup
Nov 30, 2020 · 9 min read

Let’s talk about decision processes for stopping A/B experiments early.

By that, I don’t mean concluding an experiment, but rather stopping one because we suspect something is wrong.

This could be something related to the build, the data collection, or the design of the experiment. Or it could be that a change is so dramatic that it impacts users in a concerning way.

Whatever the issue, it’s vital to know and act as early as possible because restarting an experiment usually means also losing the data that’s been collected so far. As you can imagine, the decision to stop gets harder the longer an experiment has been running.

So, how can we make this decision with confidence and clarity? What we need is an “early warning system” to help make these kinds of decisions quickly and effortlessly.

In this article I’ll detail a process I rolled out at Trainline where I had multiple teams of product owners, designers and developers monitoring their own respective experiments for signs of early problems. So, this is an accessible process — i.e. no analytics and/or statistics skills are required.

Just remember the purpose of this process is not to conclude an experiment or read into results early, but rather to identify issues.

The early warning process

There are three questions at the core of the monitoring process:

  1. Do we have an even split across variation groups?
  2. Do we have the traffic we expect in our experiment?
  3. Are any important health metrics identifying potential issues?

Let’s look at each of these in turn…

1. Do we have an even split across variation groups?

Analysing A/B experiments is all about comparing the A and B groups. A skew in counted traffic between the groups usually leads to invalid test reads.

So how do we check this? Well, first imagine each user on our website as a dot. We might expect the traffic split for an A/B experiment to look like this:

Traffic split in an A/B

But wait, this isn’t accurate! You see, we’re only interested in users counted in our experiment. This could be different from the above view. The real view is likely to look like this:

Red dots indicate “counted” traffic

The red dots represent counted users. This is a much more accurate view.

Now, if the visual above represented the ideal, what we’re trying to avoid is something like this:

Counted split is not equal

Basically, we’re looking for an imbalance in counted traffic in our variation groups. If you want to know more, check out my article on the subject.

We can check the traffic distribution of our experiment using this simple formula:

Insert the volume of counted users into the formula

The formula above asks: what percent of counted users are in group A? We’re hoping to see a number as close to 50% as possible.

Note that we check the volume of “users”. Not “visits” or “hits”. This is because we might expect to see skews to “visits” and/or “hits” as a result of the experiment design.

Also note that since traffic in an A/B experiment is counted randomly, we expect to see values like 49.8% and 50.2%. These kinds of variances are natural, even with larger volumes of traffic.

When traffic volumes are low, the variances could be slightly larger, but they shouldn't be massively off. When you do see larger variances, monitor closely to check if the splits stabilise.

If the split deviates from 50% by 1 percentage point or more, and the test has been running for a few days, then there’s likely a problem to investigate.
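To make the rule concrete, here’s a minimal sketch of this check. It assumes you can export the counted-user volumes per group from your experiment tool; the names and the example numbers are hypothetical, and the 1-point threshold is the rule of thumb above, so tune it to your own A/A results.

```python
# A minimal sketch of the split check (assumed inputs: counted users per group).

def split_check(users_a: int, users_b: int, threshold_pp: float = 1.0) -> None:
    """Flag the experiment if the counted split drifts too far from 50/50."""
    total = users_a + users_b
    pct_a = users_a / total * 100           # percent of counted users in group A
    deviation = abs(pct_a - 50.0)           # distance from an even split

    print(f"Group A holds {pct_a:.2f}% of counted users "
          f"(deviation: {deviation:.2f} percentage points)")
    if deviation >= threshold_pp:
        print("Possible split problem: investigate before the test runs any longer.")

# Hypothetical example: 51,200 counted users in A, 48,800 in B
split_check(51_200, 48_800)
```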

It might be useful to run the same check for important segments, like new and returning users. When you do this, expect to see slightly larger variances. Also, just to reiterate: you should use “users” here too.

Ideally, we’d have some predefined rules to validate the experiment splits. To help define these rules, consider running and monitoring a few A/A experiments to get familiar with the natural variances of your site.

Make sure you run a few in different areas of your site, and note that due to the random nature of counting, no two A/A experiments will run in exactly the same way.

FYI: an “A/A experiment” is one where the control and variation are identical. We run A/A experiments when we want to rule out variation differences as a reason for seeing results.
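If it helps to build intuition before running real A/A tests, a quick simulation can show how much a 50/50 split naturally wobbles at your traffic volumes. This is only a rough sketch, assuming a simple random coin-flip assignment per user; real traffic is messier, so treat the numbers as a lower bound.

```python
# Sketch: simulate A/A splits to see the natural variance of a 50/50 assignment.
import random

def worst_aa_deviation(n_users: int, runs: int = 1000) -> float:
    """Largest deviation from 50% (in percentage points) across simulated A/A runs."""
    worst = 0.0
    for _ in range(runs):
        in_a = sum(random.random() < 0.5 for _ in range(n_users))
        worst = max(worst, abs(in_a / n_users * 100 - 50.0))
    return worst

# With ~10,000 counted users, deviations of a few tenths of a point are normal.
print(f"Worst deviation in 1,000 simulated A/As: {worst_aa_deviation(10_000):.2f} pp")
```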

If you’re unsure about anything, consult someone with experience in these matters — like an Analyst and/or an Optimisation Specialist.

2. Do we have the traffic we expect in our experiment?

Missing traffic from our experiment is usually an indication of a larger problem, which may invalidate our test read. Therefore, it’s important to track this for our experiments.

I refer you once again to the image of traffic counted into our experiment:

Users in our experiment

If we take the traffic in its entirety:

Traffic we expect in our experiment

The red dots convey the total volume of traffic counted.

Essentially, for this check, we’re looking for this:

Crossed out red dots are users “missed” from the experiment

That is, we’re looking for users who are missed from being counted in our experiment — the crossed out red dots. Ideally we should have as few crossed-out red dots as possible.

Now, there are multiple ways of doing this, and each site may have a different process. But ultimately, we’re looking for this:

Percentage of users counted

i.e. How many users who should be in the test are actually counted?

Note: you’re not likely to get the value of “Users expected” from an experiment tool dashboard. So, you might have to get this from your site’s analytics package.

For example, if you’re using Google Analytics, then “Users expected” would be a recreated segment which most closely resembles your experiment counting criteria. You might need an Analyst to help here.

An Analyst can also help set up processes to repeat this check for other experiments. They might ultimately decide on a different way to find the answer, for example by using a segment like this to get the percentage of missing users:

Formula to find missing users from your experiment
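In code, a rough sketch of this check might look like the following. The names users_counted and users_expected are hypothetical: the expected figure would come from your analytics segment, the counted figure from the experiment tool.

```python
# Sketch of the coverage check: what share of eligible users was actually counted?

def coverage_check(users_counted: int, users_expected: int) -> None:
    counted_pct = users_counted / users_expected * 100   # percentage of users counted
    missing_pct = 100 - counted_pct                      # percentage missed

    print(f"Counted {counted_pct:.1f}% of expected users "
          f"(missing {missing_pct:.1f}%)")

# Hypothetical example: 92,000 counted in the tool vs 100,000 expected from analytics
coverage_check(92_000, 100_000)
```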

As with the previous check, it might be useful to run some A/A tests to figure out what your expectations should be here. Use these expectations as guidelines to validate experiments.

We could even do this ahead of each experiment launch in order to predict experiment traffic volumes beforehand. Although it may not be possible for all experiments, we should still be able to get close enough for this to be useful.

We could go further and check whether those missing users actually saw the experiment, i.e. whether this is a data-collection problem. But I’ll leave that as part of the debugging process that an analyst will do. Here, we’re just trying to detect that there is a problem.

3. Are any important health metrics identifying potential issues?

You should have already determined some critical health metrics for your experiments beforehand—e.g. conversion rate. Some may even be test-specific health metrics.

If you haven’t, then you might want to read this.

What we’re trying to determine are signs of unnaturally strong negative or positive impacts to our health metrics. These are usually signs of a problem with our test.

Here are some of the problems I’ve encountered in the past:

  • a defect in the code
  • a defect with the metrics
  • an impactful variation design
  • an extreme interaction between overlapping experiments
  • a skew in counted traffic (hopefully caught by the previous checks)

So, how do we determine what a “strong” impact is? It’s certainly not with statistical significance! It’s common to see high levels of significance, especially when the data is young.

What we’re looking for is something much less common. So, we’ll be using z-scores for this.

Suppose we’re monitoring the health of the overall conversion rates for an A/B experiment. We might see a visualisation like this as a conversion rate comparison between the two groups:

A/B comparison with 99.99% significance

The above shows a comparison of the two conversion rates. The statistical significance here is just above 99.9%. Notice there is still some overlap between the values.

The z-score is the distance between the two means (or “averages”). It uses standard deviations as units of measurement.

z-score: the difference between means using standard deviation as units

The z-score for the example above is about +3.5.

I’ll cover z-scores and basic stats in more detail in later articles. But just know that you don’t need intimate knowledge of z-scores to use them effectively.
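If you’d like to compute it yourself, here’s a minimal sketch using the standard two-proportion z-test. This is an assumption on my part; your experiment tool may use a slightly different formula, but the interpretation is the same: how many standard deviations apart the two rates sit.

```python
# Sketch: z-score for a conversion-rate comparison via a two-proportion z-test.
from math import sqrt

def z_score(conversions_a: int, users_a: int,
            conversions_b: int, users_b: int) -> float:
    p_a = conversions_a / users_a
    p_b = conversions_b / users_b
    p_pool = (conversions_a + conversions_b) / (users_a + users_b)   # pooled rate
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    return (p_b - p_a) / std_err    # positive when B converts better than A

# Hypothetical numbers in the spirit of the example above: roughly a +3.5 z-score
print(f"z = {z_score(2_500, 50_000, 2_750, 50_000):+.1f}")
```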

99.9% significance and a z-score of 3.5 could be a sign of an issue, but it’s still not a compelling enough signal. What we’re looking for are higher z-scores. What does that look like?

Here is a view of a z-score of +5.5 (the “significance” tops out at 99.99%):

Notice there’s no longer an overlap. This is a much more compelling signal.

Let’s go further. Here is where the z-score is +10 (again, statistical significance is at 99.99%):

The magnitude of the z-score continues to increase, while the statistical significance tops out at 99.99%.

Here is an example where B is performing worse. The z-score is negative: -10 to be precise.

My pal Tim Stewart first pointed out the value of z-scores to me. Since he mentioned it, I’ve tracked the z-scores of many metrics across hundreds of experiments to determine the values to use for the early warning system. Here is what I settled on:

  • For a high magnitude negative impact, look for a z-score of -4 or less
  • For a high magnitude positive impact, look for a z-score of +4 or more

The greater the z-score (positive or negative), the greater the magnitude and the less likely that the results are due to chance.

Don’t use this rule for “hits”, though. Hits are too noisy. Instead, only use this on “users” and/or “visits”. Also, make sure you still have a decent amount of traffic (at least hundreds in volume). The lower your volume, the higher your z-score thresholds will need to be.
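Tying it back to the z_score() sketch above, the early-warning rule itself is then just a threshold check. The ±4 cut-offs are the values I settled on here; adjust them for your own traffic and A/A results.

```python
# Sketch: apply the early-warning thresholds to a computed z-score.

def health_flag(z: float, threshold: float = 4.0) -> str:
    if z <= -threshold:
        return "high-magnitude negative impact: investigate"
    if z >= threshold:
        return "high-magnitude positive impact: investigate"
    return "no early-warning signal yet"

print(health_flag(-10.0))   # e.g. the -10 example above
print(health_flag(+3.5))    # below threshold: keep monitoring
```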

These rules have helped capture countless issues early in the experiment runtime, saving the business time and money. Just remember to use them as signals of issues that need to be investigated.

Not all experiment dashboards show you the z-score, which is a shame. I’ve been spoilt by the fantastic SiteSpect performance matrix which visualises this nicely.

However, don’t worry if your experiment tool doesn’t show it. You can use this calculator instead.

Conclusion

An early warning system is vital to the success of an experimentation program. Without one, you could lose weeks of time and data, not to mention potential losses in revenue and unwanted impacts to the user experience.

Is this a complete list of checks? Absolutely not. But it’s a start. Every company and website is different, which means you might need to add your own specific checks to your early warning system.

For example, if you analyse experiment results in a different analytics tool, you might need to verify the data capture between the analytics packages. Doing so ensures the data meets your standards for accurate reporting.

Finally, you might want to automate as many of these checks as possible, or at least document the processes very clearly. Whatever the case, don’t get caught out without a defined process for this!

I’m Iqbal Ali. Former Head of Optimisation at Trainline. Now an Optimisation Specialist, helping companies achieve success with their experimentation programs through training and setting up processes. I’m also a graphic novelist and writer. Here’s my LinkedIn if you want to connect.
