The trap of A/B testing for too long.

Abhijit Menon · Published in The Startup · Jun 26, 2020

A/B testing is an extremely common experiment that companies run to compare different versions of their products and see which performs better. A lot of the literature on A/B testing recommends running the test over a long period so that the statistical significance of the result can be established beyond doubt.

While collecting more data points over a longer duration is often a good thing, issues can arise from underlying discrepancies in how the data was collected, how the results were analyzed, and finally how inferences were drawn from those results.

I have created one such simulation of an A/B test to show that running an A/B test over a long duration is not always a good thing, and that looking at results aggregated across the whole duration can lead to skewed inferences.

Understanding The Data

This data simulates what a company might collect when testing whether changing a particular aspect of its website leads to customers buying a product, an event termed a conversion.

Group A is our control group: in this case, the people exposed to the currently existing website.

Group B is our treatment group: in this case, the people exposed to the modified website.

Users were split into these two groups with a 0.5 probability.

The data spans three days: January 1, 2020 to January 3, 2020.

Sample of the simulated data used.

Here, in the Conversion Status column, 0 implies ‘Not Converted’ and 1 (not seen in the image) implies ‘Converted’.
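The original notebook (linked at the end) generates this data. As a rough, hypothetical recreation, the simulation could look something like the sketch below; the Timestamp column name and the conversion probabilities are my assumptions, not the exact values used in the notebook.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical recreation: 10,000 visits spread over Jan 1-3, 2020,
# split 50/50 into group A (control) and group B (treatment).
n_visits = 10_000
timestamps = pd.to_datetime("2020-01-01") + pd.to_timedelta(
    rng.integers(0, 3 * 24 * 3600, size=n_visits), unit="s"
)
groups = rng.choice(["A", "B"], size=n_visits, p=[0.5, 0.5])

# Illustrative conversion probabilities only: group B converts better in
# the evening and worse in the morning, mimicking the pattern in this post.
p_convert = np.where(
    groups == "A",
    0.05,
    np.where(timestamps.hour >= 16, 0.10, 0.025),
)
conversions = rng.binomial(1, p_convert)

df = pd.DataFrame(
    {"Timestamp": timestamps, "Group": groups, "Conversion Status": conversions}
)
print(df.head())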

Now that we have understood the data, let’s statistically analyze the results of the A/B test in three different scenarios. We will be using hypothesis testing for our analysis.

Experiment 1: Test across all days and the complete population.

This experiment is what usually would happen in the case of an A/B test. The company would run the A/B test across multiple days and then statistically compare the results generated.

Pivoting across the Groups we see the result generated from our data as follows:

Experiment 1 Results.
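In code, this pivot boils down to a group-by. A minimal sketch, reusing the hypothetical df and column names from the earlier simulation:

# Conversion rate and sample size per group across the full three days.
summary = (
    df.groupby("Group")["Conversion Status"]
      .agg(conversions="sum", visitors="count", conversion_rate="mean")
)
print(summary)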

We now have two different conversion rates for our two groups. Now, how do we statistically infer whether these two are different?

We perform a hypothesis test!

Hypothesis Test 1:

Null Hypothesis: There is no difference between the two conversion rates.

p1 = p2

Alternative Hypothesis: The two conversion rates are statistically different.

p1 ≠ p2

Significance Level: 0.01

With the given information we now calculate our Z-statistic using the pooled two-proportion formula:

Z = (p1 − p2) / √( p(1 − p)(1/n1 + 1/n2) )

where p is the pooled conversion rate across both groups combined.

The Z-test is a statistical test used to compare the means or proportions of two different populations when the variances are known and the sample size is large (usually above 30 for the normal approximation to hold). The Z-score we get from this test is a number representing how many standard deviations above or below the mean the difference between the proportions lies. If the Z-score falls within the cutoff (the significance level’s Z-score), we fail to reject our null hypothesis; otherwise, we reject it.

In our case:

p1 = 0.05002 | p2 = 0.056302 | n1 = 4938 | n2 = 5062

p = 0.0532 (This is the conversion rate calculated across the whole population including both A and B groups together)

Putting these values into the formula we get a Z Score of -1.399. On converting this score to a p-value we get 0.1617.

0.1617 > 0.01

Therefore, we fail to reject our Null Hypothesis.
Inference: The two proportions are not significantly different, so there is no evidence that the treatment is effective.

This basically means that introducing the new features to the website did not do the company any measurable good; the new version performs about as well as the old one.
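As a sanity check, here is a minimal Python sketch of this pooled two-proportion Z-test (using scipy) that reproduces the numbers above; the function name is mine, not from the original notebook.

from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(p1, p2, n1, n2):
    # Pooled conversion rate across both groups combined.
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference in proportions under the null.
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))  # two-sided p-value

# Experiment 1 figures from above.
z, p_value = two_proportion_ztest(0.05002, 0.056302, 4938, 5062)
print(z, p_value)  # roughly -1.40 and 0.16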

Fair enough. However, let’s look at the next experiment.

Experiment 2: Test across all days and only the morning population.

Let’s say the company ran this test only in the mornings. Here, mornings are defined as between 8 AM and 4 PM. We then perform the same steps as above.

Pivoting across the Groups we see the result generated from our data as follows:

Experiment 2 results.
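In code, restricting the analysis to morning traffic is just a filter on the hour of day before the same group-by (again assuming the hypothetical df and Timestamp column from the earlier sketch):

# "Morning" traffic: 8 AM up to (but not including) 4 PM.
morning = df[df["Timestamp"].dt.hour.between(8, 15)]
morning_summary = (
    morning.groupby("Group")["Conversion Status"]
           .agg(visitors="count", conversion_rate="mean")
)
print(morning_summary)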

Hypothesis Test 2:

Null Hypothesis: There is no difference between the two conversion rates.

p1 = p2

Alternative Hypothesis: The two conversion rates are statistically different.

p1 ≠ p2

Significance Level: 0.01 (this is an assumption which can be varied according to the use case)

In our case:

p1 = 0.053896 | p2 = 0.023621 | n1 = 1707 | n2 = 1735

p = 0.038640 (This is the conversion rate calculated across the whole population including both A and B groups together)

Putting these values into the formula we get a Z Score of 4.60609. On converting this score to a p-value we get 0.000004.

0.000004 < 0.01

Therefore, we reject our Null Hypothesis.
Inference: The two proportions are significantly different. In fact, since the Z-score is large and positive, a one-sided test would show that the conversion rate of our treatment group is significantly lower than that of our control group.
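Plugging these figures into the same two_proportion_ztest helper sketched earlier reproduces the result:

z, p_value = two_proportion_ztest(0.053896, 0.023621, 1707, 1735)
print(z, p_value)  # roughly 4.61 and 4e-06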

That’s weird. This is not in accordance with our results from the first test.

Let’s carry on and perform one last experiment.

Experiment 3: Test across all days and only the evening population.

Let’s say the company ran this test only in the evenings. Here, evenings are defined as between 4 PM and 11:59:59 PM. We then perform the same steps as above.

Pivoting across the Groups we see the result generated from our data as follows:

Experiment 3 data summary.
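The evening slice is defined analogously (same assumptions as the morning sketch):

# "Evening" traffic: 4 PM through 11:59:59 PM.
evening = df[df["Timestamp"].dt.hour >= 16]
evening_summary = (
    evening.groupby("Group")["Conversion Status"]
           .agg(visitors="count", conversion_rate="mean")
)
print(evening_summary)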

Hypothesis Test 3:

Null Hypothesis: There is no difference between the two conversion rates.

p1 = p2

Alternative Hypothesis: The two conversion rates are statistically different.

p1 ≠ p2

Significance Level: 0.01

In our case:

p1 = 0.05153 | p2 = 0.101725 | n1 = 1635 | n2 = 1681

p = 0.07629 (This is the conversion rate calculated across the whole population including both A and B groups together)

Putting these values into the formula we get a Z Score of -5.59283. On converting this score to a p-value we get 0.0000000223.

0.0000000223 < 0.01

Therefore, we reject our Null Hypothesis.
Inference: The two proportions are significantly different. In fact, since the Z-score is large and negative, a one-sided test would show that the conversion rate of our treatment group is significantly higher than that of our control group.

Now how do we explain this?
When we performed the test across the whole population, we found that the two conversion rates were not significantly different from each other. However, in experiments 2 and 3 we see that the treatment group's conversion rate is in fact different, and depending on the time of day it can differ in either direction.

Now, how does a company deal with these confusing results?

If the company decides to just stick with the old website, they lose out on the strong results the new features were producing for the people who sign on in the evening.
However, if they move to the new website, they lose more customers than usual among the people who sign on in the mornings.

The best way to handle this would be to serve different versions of the website depending on the time of day and get the best possible results from each.

However, the takeaway from these experiments is that if the company had stuck to the first experiment alone, testing the whole population across all three days, they would have concluded that the new website makes no difference and missed out on the extra conversions it generates in the evening.

So how does one know how many splits to make, and how long is too long, when it comes to A/B testing? One solution is to apply causal inference when drawing conclusions from the results.

What is “causal inference”?

Causal inference is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect.

In this case, the time of day acts as a causal factor in whether customers buy a product on the website with the new features. It is important to understand why that is the case. It could be that the demographic of the people who sign in in the mornings is different from that of the people who sign in in the evenings, and that this difference in demographics is the true cause of the new website performing better in the evenings.

Whatever the reason, a company running an A/B test should delve deeper into the results and draw conclusions based on a deep causal understanding of why a particular result is occurring.

PS: This experiment is a pretty good example of Simpson’s Paradox, which might be an interesting read.

Find the Jupyter notebook with the data generation and the analysis here.

Thank you for reading!

You can follow my work and other posts here!
