The AB Testing Cookbook - Part 4

Ibtesam Ahmed
8 min read · Dec 13, 2023


This is the last article in my series called the AB testing cookbook. Before I start, here is a brief recap of what we have covered in this series so far.

In the first article we discussed the need for AB testing and why it is necessary for business stakeholders to run AB tests. In the second article, we discussed a few fundamentals you need to know before running these tests. In the third article we covered different hypothesis tests, their assumptions and limitations, and when to use which.

Finding out whether a test is statistically significant is not the end of the road. Getting from statistical significance to launch is a journey in itself, and I am going to lay it out for you here.

Before you take your results to the stakeholders, you need to verify that your experiment was trustworthy. There are several issues that can lead to misinterpretation of statistical results in AB tests. Let's go through them one by one.

Lack of Statistical Power

Just because the difference in a metric is not statistically significant doesn't mean there is no treatment effect. Maybe the experiment is underpowered to detect the actual effect size. Running it again with more users might help you detect that small effect you saw earlier. Make sure that when you run it again, you re-randomise the users and that the randomisation is independent of the previous one; we do not want any residual or carryover effects.
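To get a feel for how many users you need, you can do a quick power calculation before (re-)running the test. Here is a minimal sketch using statsmodels; the baseline rate and minimum detectable effect are made-up numbers you would replace with your own.

```python
# Sketch of a sample size / power calculation with statsmodels.
# The rates below are hypothetical placeholders, not numbers from this article.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10            # assumed control conversion rate
minimum_detectable_rate = 0.11  # smallest lift worth detecting

# Cohen's h effect size for two proportions
effect_size = proportion_effectsize(minimum_detectable_rate, baseline_rate)

# Users needed per variant for 80% power at a 5% significance level
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Users needed per variant: {n_per_variant:.0f}")
```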

Peeking at p-values

When using classical frequentist hypothesis tests (the tests we covered in part 3), you should define a predetermined duration or sample size for the experiment and not act on the p-values before it is reached. Acting on the p-values early violates the assumptions of the test and inflates the probability of a false positive (which is 0.05 only if you evaluate once, at the end).
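If you want to convince yourself (or a stakeholder) of this, a small simulation makes the point. In my own illustrative sketch below, both variants are drawn from the same distribution, yet stopping at the first peek where p < 0.05 rejects far more often than the nominal 5%.

```python
# Simulating an A/A test with repeated peeking: the false positive rate
# climbs well above the nominal 5% because we stop at the first "significant" peek.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_users, n_peeks = 1000, 5000, 10
checkpoints = np.linspace(n_users // n_peeks, n_users, n_peeks, dtype=int)

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=n_users)  # control and treatment have the same true mean
    b = rng.normal(size=n_users)
    for n in checkpoints:
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:              # acting on the p-value at this peek
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_experiments:.1%}")
```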

But in business it can sometimes be impossible to avoid this. In those cases you can use sequential hypothesis testing or Bayesian methods. In sequential hypothesis testing, the sample size is not fixed in advance. Instead, data is evaluated as it is collected, and sampling is stopped according to a predefined stopping rule as soon as significant results are observed.
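To give a flavour of how a stopping rule works, here is a minimal sketch of Wald's Sequential Probability Ratio Test for a single stream of conversions. Real experimentation platforms typically use more elaborate always-valid methods (e.g. mSPRT), and the rates p0 and p1 below are illustrative assumptions, not recommendations.

```python
import math

def sprt(observations, p0=0.10, p1=0.12, alpha=0.05, beta=0.20):
    """Wald's SPRT for Bernoulli data: decide between H0 (rate p0) and H1 (rate p1)."""
    upper = math.log((1 - beta) / alpha)  # crossing this boundary -> reject H0
    lower = math.log(beta / (1 - alpha))  # crossing this boundary -> accept H0
    llr = 0.0
    for x in observations:                # x is 1 for a conversion, 0 otherwise
        llr += x * math.log(p1 / p0) + (1 - x) * math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0 (stop early)"
        if llr <= lower:
            return "accept H0 (stop early)"
    return "keep collecting data"
```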

Multiple Hypothesis Tests

When we test multiple hypotheses at once, or the same hypothesis again and again, the chance of a type I error (false positive) increases. The different scenarios that fall under this umbrella are:

  • testing multiple metrics at once in an AB test
  • testing a single metric with different treatment groups
  • testing a segment of the population while also testing the entire population
  • running multiple iterations of the same AB test
  • running multiple AB tests in parallel

Although this seems like a pretty common problem when you are running AB tests, it is easily solvable. The most general and simple way is to divide the significance level (alpha) by the number of tests and use that stricter threshold for every test. This is called the Bonferroni correction and you can use it for any of the above multiple-hypothesis scenarios.
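In code, the correction is a one-liner; the p-values below are hypothetical per-metric results from a single AB test.

```python
# Bonferroni correction with statsmodels; the p-values are placeholders.
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.048, 0.20, 0.003]   # one per tested metric (hypothetical)
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(reject)      # which results survive the corrected threshold
print(p_adjusted)  # p-values scaled by the number of tests (capped at 1)
```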

If you are testing multiple metrics at once in an AB test, you can divide the metrics into three groups: the first contains metrics you expect to be impacted, the second contains metrics that could potentially be impacted, and the third contains metrics that should not be impacted. Then apply tiered significance levels, e.g. 0.05 for the first group, 0.01 for the second and 0.001 for the third. These rules of thumb, presented in Trustworthy Online Controlled Experiments, are based on an interesting Bayesian interpretation: how strongly do you believe the null hypothesis is true before you even run the experiment? The stronger the belief, the lower the significance level you should use.

If you have to combine the results of multiple experiments that test the same hypothesis, you can use Fisher's method for meta-analysis.
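SciPy implements Fisher's method directly; the three p-values below are made up and stand in for repeated runs of the same experiment.

```python
# Combining p-values from experiments testing the same hypothesis (Fisher's method).
from scipy.stats import combine_pvalues

stat, combined_p = combine_pvalues([0.08, 0.06, 0.11], method="fisher")
print(f"Combined p-value: {combined_p:.4f}")
```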

Leakage and interference between Variants

AB tests rely on the Stable Unit Treatment Value Assumption (SUTVA). To put it simply, experiment units should not interfere with each other: a unit's behaviour should be impacted only by its own variant assignment, not by the assignment of others. This assumption gets violated in cases such as social networks, where a feature might spill over to a user's network, or a two-sided marketplace like Uber, where giving incentives to drivers in treatment leads to fewer rides for drivers in control.

To combat this you can isolate the variants using cluster-based, geo-based or time-based randomisation. To test a new feature in the social media example, you can create clusters of users based on who they are connected to, and randomise these clusters into control and treatment instead of individual users.

For the Uber example, if you want to run the experiment in a city, you can randomise on different sub-regions or districts of the city, assuming that riders in one sub-region do not affect riders in another.
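A simple way to implement cluster-level assignment is to hash the cluster id (a friend-group id, a city district, and so on) instead of the user id, so that everyone in a cluster lands in the same variant. The salt and cluster names below are hypothetical.

```python
# Cluster-based randomisation sketch: hash the cluster, not the user.
import hashlib

def assign_variant(cluster_id: str, experiment_salt: str = "exp-42") -> str:
    """Assign an entire cluster to control or treatment via a salted hash."""
    digest = hashlib.sha256(f"{experiment_salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Every user in the same cluster gets the same variant, so spillover stays
# within a variant instead of leaking across variants.
user_to_cluster = {"u1": "district-7", "u2": "district-7", "u3": "district-9"}
print({user: assign_variant(cluster) for user, cluster in user_to_cluster.items()})
```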

Primacy and Novelty effects

Some people are averse to change and some embrace it. A treatment may appear to perform well at first, but the Treatment effect can quickly decline over time, and the opposite can also be true, while the standard analysis of experiments assumes that the Treatment effect is constant over time. In such cases, experiments need to run longer to determine when the Treatment effect stabilizes.

One additional way to highlight possible novelty/primacy effects is to compare the metric for new users vs. existing users, or to run the experiment only on new users, as they are not susceptible to primacy or novelty effects.
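A quick check is to plot the treatment effect per day: a steady decline hints at a novelty effect, a steady climb at a primacy effect. The sketch below assumes a DataFrame with one row per user-day and columns date, variant and metric; adapt the names to your own logs.

```python
import pandas as pd

def daily_treatment_effect(df: pd.DataFrame) -> pd.Series:
    """Mean(treatment) - mean(control) of the metric for each day of the experiment."""
    daily = df.groupby(["date", "variant"])["metric"].mean().unstack("variant")
    return daily["treatment"] - daily["control"]

# effect = daily_treatment_effect(experiment_data)
# effect.plot()   # a drift over time suggests a novelty or primacy effect
```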

Sample ratio mismatch

When the ratio between control and treatment in the test is different from the ratio in the design, the p-values obtained from the test are not valid. With large numbers, a ratio smaller than 0.99 or larger than 1.01 for a design that called for 1.0 more than likely indicates a serious issue. Matching the sample ratio should be treated as a guardrail when doing AB tests: if there is an SRM, the test comparing the observed split to the designed split returns a very low p-value, and the experimentation system should generate a strong warning and hide any scorecards and reports in this case.
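The usual check is a chi-square goodness-of-fit test of the observed counts against the designed split; the counts below are illustrative.

```python
# Sample ratio mismatch check: compare observed counts to the designed split.
from scipy.stats import chisquare

control_users, treatment_users = 50_210, 49_105   # observed (illustrative numbers)
designed_split = [0.5, 0.5]                       # the design called for 50/50

total = control_users + treatment_users
expected = [share * total for share in designed_split]
stat, p_value = chisquare([control_users, treatment_users], f_exp=expected)

if p_value < 0.001:
    print(f"Likely SRM (p = {p_value:.2e}): do not trust the scorecard.")
```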

Sample ratio mismatch can happen for a variety of reasons:

  • Buggy randomisation: using a bad hash function for randomisation, using browser redirects that are not random, or bugs in the data-tracking pipeline.
  • Residual or carryover effect: It is common for a new experiment to cause some unexpected egregious issue and be aborted or kept running for a quick bug fix. After the bug is fixed, the experiment continues, but some users were already impacted. In some cases, that residual effect could be severe and last for months. This is why it is important to run pre-experiment A/A test and proactively re-randomize users, recognizing that in some cases the re-randomization breaks the user consistency, as some users bounce from one variant to another. Another solution is to use entirely different users.
  • Bad trigger conditions: In the context of A/B testing, triggering usually refers to the conditions or events that cause a user to be included in a particular variation of the test. The triggering mechanism is crucial for ensuring that the allocation of users to different variations is random and unbiased. The trigger condition should include any user that could have been impacted. A common example is a redirect: website A redirects a percentage of users to website A’, their new website being built and tested. Because the redirect generates some loss, there will typically be an SRM if only users that make it to website A’ are assumed to be in the Treatment.
  • Triggering based on attributes impacted by the experiment: If triggering is done based on attributes that change over time, you must ensure that no attribute used for triggering could be impacted by the Treatment. For example, assume you run an e-mail campaign that triggers for users who have been inactive for three months. If the campaign is effective, those users become active and the next iteration of the campaign could have an SRM. Trigger conditions based on machine learning algorithms are especially suspect, because models may be updated while the experiment is running and be impacted by the treatment effect.

Sometimes, after detecting an SRM, it is possible to fix the cause during the analysis phase. In other cases, some users or segments of users have not been properly exposed to the treatment, and it is better to re-run the experiment.

Segmented view of the Treatment metric

While it is interesting to see in which segments a particular metric moves the most, no conclusion about statistical significance should be drawn without adjusting for multiple hypothesis tests.

Sometimes, if the segmented metric is unexpectedly low or unexpectedly high, it can indicate a bug in your pipeline, and you might want to run a sanity check. For example, an experiment tracking the click-through rate of a new form of advertisement finds, on inspecting the segmented metric, that the CTR is close to zero for Windows users. That could be either because clicks are not being recorded on Windows devices or because those customers never land on the new page.

Selection Bias

Before concluding anything from the results of the experiment, you also want to make sure there is no selection bias in the variants. In some experiments, there can be non-random attrition from variants. For example, you may offer all advertisers the opportunity to optimize their ad campaign, but only some advertisers choose to apply the suggested optimization. Analyzing only those who participate results in selection bias and commonly overstates the Treatment effect. The Treatment effect we measure should be based on the offer, or intention to treat, not on whether it was actually applied.
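In pandas terms, the intention-to-treat comparison groups by what was offered, not by what was adopted. The column names below (offered, adopted, revenue) are assumptions for illustration.

```python
import pandas as pd

def intention_to_treat_effect(df: pd.DataFrame) -> float:
    """Compare everyone offered the treatment to control, regardless of adoption."""
    means = df.groupby("offered")["revenue"].mean()
    return means["treatment"] - means["control"]

# Analysing only adopters, e.g. df[df["adopted"]], usually overstates the effect,
# because advertisers who opt in are not a random subset of those offered.
```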

Statistical significance is not equal to a successful experiment

After you have run the above validity checks and either corrected for the issues or re-run the experiment, here's how you decide whether the experiment was successful.

  • The difference you see between the variants, although statistically significant, has to be practically significant as well. The cost of deploying and maintaining the solution in production should be justified by the size of the effect.
  • It is rare that all the metrics move in the direction you expect. For example, if you are testing a new retention strategy by giving more discounts, retention might go up but average profit per order might go down. The trade-off between these metrics has to be justified.
  • You should also consider the downside of a false positive if the result is statistically significant. If the downside is huge, you might want to use a lower significance level.

After considering all of the above, if you decide to launch the treatment, make sure not to ramp up to 100% at once. Do it in stages while closely monitoring the health of the system and the metric of interest. You might also want to keep a 5% holdout to observe long-term effects on the metric that you could not see in the limited duration of the AB test.

As you might have already guessed by now, doing AB tests right is a lot more complicated than it seems. I have tried to keep this four-part series simple and methodical. It is by no means an exhaustive resource on AB testing; there are still nuances I might have missed. This series was heavily inspired by the book Trustworthy Online Controlled Experiments, and I definitely encourage you to read it to learn more.

For any questions or feedback, you can reach out to me on LinkedIn or in the comments section below.
