Extending A/B tests to reach significance: the trap even the smartest people fall into
A room full of smart people with access to a boatload of information on the internet about A/B testing wouldn’t make basic mistakes when experimenting, would they?
In our case, absolutely. Our room full of smart people with access to a boatload of information made a boatload of mistakes in the early days of experimenting.
In 2019, one of the biggest mistakes teams across Gousto kept making was extending tests to try to reach significance, while using the same basic online statistical calculator and 95% significance threshold each time they extended. Experimenters would run a test for a pre-determined time to collect a pre-determined sample size. They might then do two things:
- If it was insignificant but close to 95% significance by its endpoint, they would extend it for more weeks to try and reach significance.
- If the test achieved 95%+ significance on our usual online statistical calculator after extending it, they celebrated and banked the result.
When I read up to try to understand whether this practice was acceptable, I found many others who believed this was a valid approach to A/B testing. Spoiler alert: it is not! When you extend tests to achieve results, you invalidate the rigour of your A/B testing program unless you use a different statistical model or significance threshold for your test when you extend.
A quick note on Gousto. We’re the UK’s largest recipe box business, still growing our triple-digit revenues by close to 100% y/o/y. There are many experimenters at Gousto, but I lead the Growth team: a 7-person Growth Marketing team working alongside an analyst and a 10-person tech squad consisting of web and app developers, a designer and a product manager.
Over the past year, the Growth team at Gousto has run over 150 A/B tests. This cadence of testing is up from fewer than 12 in the previous year. We’ve also challenged ourselves to learn and institutionalise best-practice experimentation processes. We now feel we are in a strong position when it comes to approaching A/B testing correctly. However, it took us many months to reach this state, and we continue to learn every day.
Extending A/B tests to reach significance: the context
The statistical issues that come from stopping your test early when it first becomes significant are well documented: two great articles are Evan Miller’s and Defazio’s. However, it is surprising that almost no one has written about the statistical invalidity of extending tests beyond your calculated sample size to reach significance. Defining your sample size in advance is often called out, but people don’t specifically address concerns around test extension. Because of the lack of explicitly condemning material on the subject, extending tests seems a far more prevalent and insidious behaviour than stopping tests early.
The arguments for extending experiments usually consist of the following two points:
- If I’m accumulating even more data, shouldn’t that make the experiment even more accurate?
- Given it’s so hard to identify the right minimal detectable effect before the test, what’s wrong with extending it when it looks like the true effect is a little lower than we thought?
The way to disprove both of these arguments would be by simulating the practice of extending tests when they don’t reach significance but using the same statistical model and significance threshold, like Defazio does for stopping tests early. But I lack the statistical chops to do so. If anyone else can, please help me out! Instead, I plan to refute those arguments with a whole lot of references and two fun thought experiments.
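As a starting point, here is one minimal sketch of what such a simulation could look like. It models an A/A test (no true effect at all) under the extension policy: check for significance after the planned first week, and extend for up to two more weeks whenever the result is insignificant, banking any “significant” result immediately. Each week’s data is approximated as an independent standard-normal increment to the test statistic, so the z-score after k weeks is the cumulative sum divided by √k. All the numbers here are assumptions for illustration, not Gousto’s actual parameters.

```python
import math
import random

random.seed(1)

Z_CRIT = 1.96   # two-sided critical value for a 95% significance threshold
MAX_WEEKS = 3   # planned 1 week, extended by up to 2 more
SIMS = 100_000

def extended_test_rejects():
    """Simulate one A/A test under the extend-to-significance policy.

    Each week adds an independent standard-normal increment; the z-score
    after k weeks is the cumulative sum over sqrt(k). We stop (and bank a
    'win') the moment the test looks significant.
    """
    total = 0.0
    for week in range(1, MAX_WEEKS + 1):
        total += random.gauss(0, 1)
        if abs(total / math.sqrt(week)) > Z_CRIT:
            return True   # banked as significant, but there is no real effect
    return False          # test correctly ends insignificant

fp_rate = sum(extended_test_rejects() for _ in range(SIMS)) / SIMS
print(f"False-positive rate with extension: {fp_rate:.3f} (nominal 0.05)")
```

Even with no effect present, the simulated false-positive rate comes out well above the nominal 5%, because every extension is another chance for noise to cross the threshold.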
If you’re using basic online statistical significance calculators, you need to collect a fixed sample size.
Fixing the sample size in advance is a critical assumption of frequentist A/B testing and of most online A/B test significance calculators. When you change your sample mid-test or at the end of a test, you violate this assumption and lose the ability to lean on the scientific rigour these statistical tests give us.
I cannot prove this through mathematics, nor do I think it would be useful to do so in a blog post. But I can reference respected individuals in this field who are crystal clear in the requirement to fix sample sizes in advance for frequentist A/B testing.
A whole lot of references that refute changing sample sizes mid-test without changing your significance calculation
- Evan Miller: the creator of some of the great pre-test analysis calculators
- Georgi Georgiev specifically addresses this question in his comment here and more widely clarifies his view on fixed-sample testing here. He is a writer of a very detailed book on experimentation and writes extensively on his excellent blog.
- Callie McCree and Kelly Shen from Etsy, both MIT Mathematics graduates
- Ramesh Johari, Associate Professor at Stanford and Optimizely advisor
And despite hours of research, I have yet to find anyone who concretely advocates that extending a test after you’ve collected the planned sample size for the fixed-sample approach is acceptable.
There are alternate approaches where you can check multiple times for significance. Decades ago, Pocock (1977), O’Brien and Fleming (1979), Kim and DeMets (1987) and others worked out ways to check medical trials regularly for significance while maintaining their statistical integrity. More recently, Optimizely released its new stats engine, and Georgiev created the AGILE A/B testing calculator on Analytics Toolkit. But all of these approaches use different statistical models from traditional fixed-sample frequentist models, precisely so that they can check results repeatedly as they collect more data.
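To give a flavour of how these group-sequential corrections work, the sketch below compares peeking three times at the naive z = 1.96 boundary against Pocock’s constant boundary for three equally spaced looks (z ≈ 2.289, i.e. a per-look p-value of roughly 0.022). Again this is an A/A simulation with weekly standard-normal increments, purely for illustration:

```python
import math
import random

random.seed(7)

LOOKS = 3
SIMS = 100_000

def rejects(z_crit):
    """One A/A test: weekly standard-normal increments, z checked at each look."""
    total = 0.0
    for look in range(1, LOOKS + 1):
        total += random.gauss(0, 1)
        if abs(total / math.sqrt(look)) > z_crit:
            return True
    return False

# Naive fixed-sample boundary vs. Pocock's boundary for 3 equally spaced looks
rates = {}
for name, z_crit in [("naive z=1.96", 1.96), ("Pocock z=2.289", 2.289)]:
    rates[name] = sum(rejects(z_crit) for _ in range(SIMS)) / SIMS
    print(f"{name}: overall false-positive rate = {rates[name]:.3f}")
```

The stricter per-look boundary brings the overall false-positive rate back down to roughly the 5% the experimenter intended, which is exactly the trade these sequential designs make: more looks, each held to a higher bar.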
And if the opinions of some of the smartest people in this space do not convince you, consider the two thought experiments below…
Our first thought experiment: running a 1-week test, and extending each time you haven’t reached significance
Let’s assume that you plan to run a test for one week, based on a given MDE, baseline CVR and weekly sample size. And at the end of the week after a full business cycle, you have not reached significance and decide to extend the test. You check for significance three times in total, extending the test for two more weeks above the original run-time.
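For concreteness, the pre-test planning step looks something like the sketch below, which uses the standard normal-approximation formula for a two-proportion test. The baseline CVR, MDE and power figures are illustrative assumptions, not values from any real Gousto test:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline_cvr, mde_abs, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion test.

    Standard normal-approximation formula:
        n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / mde^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p1, p2 = baseline_cvr, baseline_cvr + mde_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# e.g. 10% baseline CVR, looking for a 2-point absolute lift at 95%/80%
n = sample_size_per_arm(0.10, 0.02)
print(f"~{n} visitors needed per arm")
```

Dividing that per-arm figure by your weekly traffic gives the planned run-time; the point of this section is what happens when you quietly keep running past it.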
Is the behaviour above any different from running a 3-week test and peeking each week to check for significance? It’s the same! And when we do the latter, we know the implication: peeking increases the false-positive rate if you maintain the same significance threshold. Each time you peek with the intent to stop a test, you need to raise the significance level required to keep the false-positive rate at the same 5% level (see Evan Miller’s table below for a rough illustration of this).
Our second thought experiment: extending tests even when you have achieved a significant result
Let’s assume that you run a 3-week test, and at the end of the test, you have a p-value of 0.045. Hooray!! You’ve passed your significance threshold. But wait, why not extend the test further to reduce our chance of a false positive as we were only 95.5% significant? It’s what we were doing with tests that were 94.5% significant, so why not the other way around?
I’m sure we agree it would be madness to do so. But the example above illustrates how we bias towards successful results when extending tests that are beneath the significance threshold, but not extending them when they are above it.
As shown in the diagram, when you achieve a positive result, you stop. But when you observe a negative effect, you ‘roll the dice’ again to see if that result becomes positive. And with each roll of the dice, you increase your chance of getting a false positive. More data will not decrease this chance. You have the same chance of getting a false positive at a given significance threshold whatever the size of the dataset you collect!
If we don’t extend tests, what can we do instead?
Stopping tests even when they are close to significance is a necessary part of the frequentist approach to significance testing. We can’t lean on the scientific rigour of frequentist approaches when it suits us and ignore it when we are close to a result we want. We can’t have our cake and eat it. However, there are three ways we are navigating frequentist shortcomings at Gousto, which may be relevant for others.
1. We recognise that statistical significance is not the only tool in our toolbox of inductive reasoning.
Based on other tests or different kinds of data, we can draw evidence-based conclusions beyond the hardline success/failure result of a specific A/B test. I can’t publish ways we’ve done this here as it involves confidential data, but if you DM me I can talk you through it!
2. For customer-experience improving tests, we reduced the required significance levels from 95% to 90%.
It was gutting to close tests as failures when they had moved retention with 92% significance by the end of the test, even when they had moved other retention-correlating metrics like app downloads with 99% significance. We didn’t feel that 95% reflected our risk appetite when rolling out customer-experience-positive changes, where we could see only upside in getting customers to perform the actions we wanted them to take.
On the other hand, when we experiment with more costly changes, we continue to use the 95% threshold. An example of a test like this would be increasing the number of recipes we offer on our menu. Changing the number of recipes comes at considerable expense to the throughput of our factory and other cost efficiencies in the business, so we require a higher statistical bar and tighter confidence intervals for the retention uplifts that we model rollout scenarios on.
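The threshold choice also changes how much traffic a test needs. As a rough illustration under assumed numbers (10% baseline CVR, a 2-point absolute MDE, 80% power — none of these are real Gousto figures), here is how the required sample size shifts between a 95% and a 90% threshold:

```python
import math
from statistics import NormalDist

def n_per_arm(baseline, mde, alpha, power=0.80):
    """Normal-approximation sample size per arm for a two-sided test."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)
    z_b = z.inv_cdf(power)
    p2 = baseline + mde
    var = baseline * (1 - baseline) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * var / mde ** 2)

n95 = n_per_arm(0.10, 0.02, alpha=0.05)  # 95% significance threshold
n90 = n_per_arm(0.10, 0.02, alpha=0.10)  # 90% significance threshold
print(f"95%: {n95} per arm; 90%: {n90} per arm ({1 - n90 / n95:.0%} smaller)")
```

Lowering the bar to 90% accepts a doubled false-positive risk, but for low-risk, customer-experience-positive changes it also means a materially smaller sample at the same power, so more tests fit through the same traffic.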
3. We are investigating alternate methodologies, but are still early in the journey.
Earlier in the post, I mentioned other methodologies that enable you to run tests until they reach significance, without having to calculate or adhere to fixed sample sizes. We believe such approaches are beneficial for certain kinds of tests, for example, sequential A/B testing for low conversion rates. But we have not invested a significant amount of time in understanding both the technical implementation of these approaches and the estimation of test impacts these approaches identify as significant. As a consequence, we continue to use the reliable frequentist approach, but are hoping to innovate shortly — and will let you know when we do!
I hope you enjoyed the post. If you want to get updated when a new post comes out, sign up to my mailing list. Takes less than 5 seconds!