What not to do when running A/B tests

The six most common mistakes in experimentation and how to avoid them

--

Suppose you work at an awesome startup that delivers features every day. Your Product Manager has a new idea that might bring new customers to your company. Your Software Engineers develop the feature and ship it into production. Then you start to receive new customers, and this is great! However, you remember that your marketing team launched a new campaign at the same time, and you start to ask yourself: “Were these new customers acquired because of the new feature? And does this feature have a positive impact on them?”

Looking for signs that one process influences or causes an effect in another is the job of many people, especially in tech companies. A way to study this phenomenon scientifically is to run experiments, and one of the methodologies for doing so is A/B testing.

What is an A/B test?

An A/B test consists of separating users into two groups, A and B, to understand the impact of a specific treatment on a variable of interest. Since there are many articles and books about the principles behind A/B tests (such as the ones in the references section below), this article takes a different approach to the topic.
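
For concreteness, here is a minimal sketch of what that comparison often looks like in code: a two-proportion z-test on conversion rates, using statsmodels as one possible tool. The counts are made-up numbers for illustration only.

```python
# Minimal two-group comparison of conversion rates (illustrative numbers).
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 530]     # converted users in groups A and B
visitors = [10_000, 10_000]  # users assigned to each group

# Null hypothesis: both groups have the same conversion rate.
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}")
```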

The importance and power of A/B tests are already well known in the market, but it is very common for teams either not to know how to start an experiment or, maybe even worse, to assume they know it all and reach wrong conclusions. This article is an attempt to alert the reader to possible mistakes when performing an experiment and to prevent them from happening.

Also, a huge thank you to Carolina Cavalcante, Bernardo Loureiro, André Barbosa and Fernando Paiva, who helped to write this piece.

So, without further ado, here’s what you should not do when running an A/B test.

1) Not choosing the right statistical solution for the problem
As mentioned before, because by now it is common sense that A/B tests can help the business, more often than not we see people who, as soon as they hear the term, want to implement it in their very next project. But maybe the problem still needs structuring, and what you are trying to measure may not be a fit for an A/B test.

Take user conversion as the variable of interest. If you want to understand the effect of a product feature on the conversion of all of your users, without any variation of the feature or in the groups’ demographics, you simply can’t use A/B testing. You could, however, use a one-sample test, which is another solution and different from A/B tests.
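
As an illustration, here is a sketch of that one-sample case: checking whether an overall conversion rate differs from a historical baseline. The baseline rate and the counts are assumptions made up for the example.

```python
# One-sample proportion test: does the observed conversion rate differ
# from a known baseline? (All numbers are hypothetical.)
from scipy.stats import binomtest

baseline_rate = 0.05  # historical conversion rate (assumed)
converted = 560       # users who converted after the feature launch
total = 10_000        # all users exposed to the feature

result = binomtest(k=converted, n=total, p=baseline_rate)
print(f"observed rate = {converted / total:.4f}, p-value = {result.pvalue:.4f}")
```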

In this methodology, it is a given that you need two groups, and the factor whose effect you want to measure should be the only difference between them. If your problem does not have that structure, it is best to go for another solution, chosen according to the number of variables and groups you wish to study.

There are many other statistical tests, such as one-sample tests for populations with known and unknown variance (the z-test and t-test, respectively), ANOVA, and chi-squared tests, and I’m sure you will find something adequate for your problem.

2) Not correctly designing your experiment beforehand
The other consequence of rushing your experiment is not designing it correctly. What this means is that you will not consider all the possible outcomes of the experiment, and the results may be inconclusive or, worse, lead you to wrong conclusions. This could cost you and your company time and money.

When talking about designing an experiment correctly, there are some core elements to be considered: not only the variable of interest and the treatment and control groups, but also how the subjects will be assigned to each group, the null and alternative hypotheses, the significance level, how long the test will run, and any other particular characteristics of your test.
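
One of those decisions, how long the test will run, usually comes down to a sample size calculation done before launch. Below is a minimal sketch using statsmodels, where the baseline conversion rate and the minimum lift worth detecting are purely illustrative assumptions.

```python
# Sample size needed per group to detect a given lift, with the usual
# alpha = 0.05 and power = 0.80 (all input numbers are assumptions).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05  # current conversion rate
expected = 0.06  # smallest lift worth detecting
effect = proportion_effectsize(expected, baseline)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"needed per group: {n_per_group:.0f} users")
```

Dividing that number by your expected daily traffic gives a rough test duration up front, instead of deciding when to stop on the fly.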

Hurrying through this step can produce conclusions that seem scientific and useful but do not actually represent reality.

There are many ways to design your experiment correctly, and it is a whole area of research in the field of Statistics, so I left references below for you to take a deeper look.

3) Not sampling the population correctly
Not sampling correctly can result in a biased test. When talking about sampling, there are two main types of bias: randomization bias and selection bias. Randomization bias happens when the groups are not evenly distributed among the factors that might influence your outcome variable (the covariates). Selection bias happens when visitors assign themselves to one of the groups.

With A/B testing, it is important to make sure that the groups are evenly distributed across the covariates and, as much as possible, differ only in the variable of interest. In order to pursue causality, it is important to correctly sample the treatment and control groups; that also supports the assumption that the sample is representative of the population.
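
A common way to avoid self-selection is to assign users deterministically from an identifier, so a given user always lands in the same group and cannot opt in. A minimal sketch, assuming string user IDs and a hypothetical experiment name used as a salt:

```python
# Deterministic, reproducible assignment: hash the user ID with an
# experiment-specific salt and split on the hash value.
import hashlib

def assign_group(user_id: str, experiment: str = "new-feature-test") -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

print(assign_group("user-42"))  # always the same group for this user
```

Even after a random assignment like this, it is worth comparing the covariate distributions (country, platform, tenure, and so on) between the two groups before trusting the split.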

4) Stopping the test when statistical significance is reached (aka p-value hacking)
One very common practice in A/B testing is to monitor the results while the test is still running and stop as soon as the desired statistical significance is reached. This leads to stopping the test early, with an insufficient sample size and wrong conclusions. That is, you may see an effect even if there is none, and it is very likely that you will treat many insignificant results as significant (but not the other way around) (Overgoor, J. 2014).

Besides that, statistically speaking, the bigger your sample gets, the more likely you are to reject the null hypothesis, especially if your null hypothesis is a point hypothesis (value = x), so it is also important not to run the test indefinitely.

If you monitor the p-value’s convergence, it is important not to stop your test right after hitting significance, just to make sure the result is not noise. In summary, do not believe the test results right away; wait until the p-value’s variance is small and it converges to one value.
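
To see how much peeking inflates false positives, here is a small simulation sketch of repeated A/A tests (two identical groups, so there is no real effect) where we stop at the first interim look that reaches p < 0.05. The look schedule and sample sizes are arbitrary choices for illustration.

```python
# Simulate A/A tests and measure how often "stop at the first
# significant peek" declares a winner even though none exists.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
looks = [1_000, 2_000, 5_000, 10_000]  # interim sample sizes per group
n_simulations = 2_000
false_positives = 0

for _ in range(n_simulations):
    a = rng.normal(0.0, 1.0, looks[-1])
    b = rng.normal(0.0, 1.0, looks[-1])  # same distribution: no effect
    if any(ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in looks):
        false_positives += 1

print(f"false positive rate with peeking: {false_positives / n_simulations:.1%}")
```

Instead of the nominal 5%, the rate lands noticeably above it, which is exactly the trap described above.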

5) Analyzing results out of context
It can be very tempting to proceed with the business actions planned earlier based on successful results. However, it is important to check whether there are other variables in the context that your experiment does not measure and could not have measured, for example seasonal consumer behavior, risk appetite, or a global pandemic.

6) Running tests only once
This step can be skipped if you avoided the mistakes discussed previously, but if you slipped on one of them, or even if you are just really skeptical, it is good practice to run your tests more than once.

As one of the principles of classical Statistics, running several experiments and studying how their results converge can be another tool to guarantee the reliability of the test.
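
If you do repeat the experiment, the replications can be combined formally instead of eyeballed one by one. A minimal sketch using Fisher’s method, with made-up p-values from three hypothetical runs:

```python
# Combine p-values from independent replications of the same experiment
# (Fisher's method); the individual p-values here are hypothetical.
from scipy.stats import combine_pvalues

replication_pvalues = [0.04, 0.09, 0.03]
statistic, combined_p = combine_pvalues(replication_pvalues, method="fisher")
print(f"combined p-value across runs: {combined_p:.4f}")
```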

Conclusion

Hypothesis testing is a huge field in Statistics, and it would be impossible to cover all of it in one article. Hopefully, you are taking away some good practices for your next A/B test, but keep in mind the main message: take it slow and design the experiment carefully. I also hope you help spread the word and prevent the same mistakes from happening under your watch. Keep an eye on this blog for future articles covering similar topics.

References

For the article
Experiments at Airbnb (J. Overgoor, 2014)

Concepts mentioned
A/B Testing Guideline
Simple Sequential A/B Testing
Formulas for Bayesian A/B Testing
D. C. Montgomery, Design and Analysis of Experiments, 7th ed., Hoboken: John Wiley, 2009.
M. H. DeGroot, Probability and Statistics, 3rd ed., Boston: Addison-Wesley, 2002.
