This Is How to Best Understand Power and Significance in A/B Testing

Kirill Shmidt
Published in Wrike TechClub
Jul 6, 2020 · 7 min read

As a part of the Product Analytics team, I conduct A/B experiments to measure the effect of releases. A/B testing is a hardcore concept, and it is usually hard to explain all the major details to different stakeholders. As we try to stay transparent in our processes, I decided to write this article to share two major concepts behind the A/B tests we use at Wrike.

In our experiments, we usually compare conversion rates between the test and control groups. But what statistics lie beneath this data?

To better explain this, we need to create a simplified task and find an appropriate analogy.

  • What we call conversion rate is usually called proportion in statistics.
  • Proportion is a value between 0 and 1.
  • Because this proportion is the ratio “users_with_target_action / all_users_in_group,” it is analogous to a coin flip.
  • When a client converts, we count it as a success, or “heads,” and a non-conversion as a failure, or “tails.”
  • Our A/B testing can be boiled down to a simple task: the comparison of two types of coins that have different probabilities of falling on either heads or tails.

Simplified task

Suppose we find out that a coin is broken in some way: in a toss, it has a 40% chance of landing on one side and a 60% chance of landing on the other. Since we can only observe flip outcomes, our task is to see whether we can distinguish the broken coin from a normal one. We can run an A/B test by starting a flipping procedure for both coins. We count heads as 1 and tails as 0, so to estimate the probability of heads for each coin, we calculate the average of its outcomes.

Let’s calculate the average:
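This running average can be simulated in a few lines of NumPy. Here is a minimal sketch; which side of the broken coin gets the 60% heads probability, the number of flips, and the seed are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is reproducible

n_flips = 1000
normal_flips = rng.binomial(1, 0.5, size=n_flips)  # normal coin: P(heads) = 0.5
broken_flips = rng.binomial(1, 0.6, size=n_flips)  # broken coin: P(heads) = 0.6

# Running average after each flip: our estimate of P(heads) as the experiment goes on
flips_so_far = np.arange(1, n_flips + 1)
normal_running_avg = np.cumsum(normal_flips) / flips_so_far
broken_running_avg = np.cumsum(broken_flips) / flips_so_far

print(normal_running_avg[199], broken_running_avg[199])  # estimates after 200 flips
print(normal_running_avg[-1], broken_running_avg[-1])    # estimates after 1,000 flips
```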

It’s easy to see that the flip results for the normal coin and the broken coin are indistinguishable after a small number of flips, but the results start to diverge as the number of flips increases. Let’s focus on the part of the graph between the two black lines. By that point, we have performed more than 200 flips and are in a situation where the calculated probability of the broken coin landing heads up is lower than that of the normal coin.

Now, let’s compare the same graph for the identical coins:

Even when the coins are identical, we can see that the calculated probability of landing heads up can be higher for one coin than for the other. To see this difference, let’s draw a graph of the difference between the two outcomes.
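That difference is just one running average minus the other. A minimal sketch for two genuinely identical fair coins (the seed and flip count are, again, my own choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n_flips = 1000
coin_1 = rng.binomial(1, 0.5, size=n_flips)  # identical coins: both fair
coin_2 = rng.binomial(1, 0.5, size=n_flips)

# Difference between the two running averages after each flip
flips_so_far = np.arange(1, n_flips + 1)
running_diff = np.cumsum(coin_1) / flips_so_far - np.cumsum(coin_2) / flips_so_far

print(running_diff[199], running_diff[-1])  # difference after 200 and after 1,000 flips
```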

Looking at the difference alone, it is hard to tell which pair of coins actually differs. For example, at around 200 flips into the experiment with the different coins, the two calculated probabilities are identical, while for the identical coins there is a small difference.

Here’s the big question: How do we distinguish between the identical and non-identical coins? In real life, we never know which is which. So how can we be sure that we have drawn the right conclusions?

So let’s go with a basic assumption: We know that the longer we wait, the more likely a difference between the two types of coins is to emerge. But this assumption runs into two problems:

  1. Statistical significance: What is the probability that the difference we see is a fluke?
  2. Statistical power: If we see no difference, then what is the probability that we are wrong? What is the probability that we have overlooked the broken coin?

If we perform just 20 flips, we essentially can’t say much about the two coins because we haven’t achieved enough statistical power. In this situation, we also can’t tell the two types of coin apart because there is no statistical significance:

But when we perform numerous flips, we can then meet the requirements for both statistical power and statistical significance.

We see that eventually, the statistics diverge. But what if a situation arises where even with numerous flips, we see a very small difference between the coins?

In real life, we don’t compare the dynamics of the probability between the two groups; we compare the calculated probabilities at the end of an experiment. So, in reality, we may have the following calculated probabilities after 5,000 flips:

  • Coin A: 0.5168
  • Coin B: 0.4976
  • Difference: 0.0192

But what’s interesting about these probabilities is that they are actually samples from an underlying distribution. So if we repeat the experiment, we will get slightly different numbers. How much would they vary? We can simulate it: in this simulation, we repeat the experiment 5,000 times.

With 5,000 flips per repetition, we can see the variation in the calculated probability:
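A simulation along these lines produces that picture. This is a minimal sketch assuming true heads probabilities of 0.52 and 0.50 for the two coins (the values used later in the text); the article’s exact simulation code isn’t shown, so the details are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n_flips = 5_000
n_repetitions = 5_000

# Repeat the whole 5,000-flip experiment 5,000 times for each coin
p_a = rng.binomial(n_flips, 0.52, size=n_repetitions) / n_flips
p_b = rng.binomial(n_flips, 0.50, size=n_repetitions) / n_flips
diff = p_a - p_b

print(diff.mean())        # centred near the true difference of 0.02
print((diff < 0).mean())  # share of repetitions where coin A looks worse than coin B
```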

From this data we can derive a simple criterion: count how often the calculated probability for coin A comes out smaller than the probability for coin B. Usually, we use a 5% threshold, so we say the values are different if this overlap happens in less than 5% of cases. We don’t need to run the simulation every time, because we have a theory (represented by the black areas here) that allows a correct calculation from the basic numbers.

Here, the difference is below 0 in 2.54% of cases, which is lower than the 5% threshold. This means the difference between the two groups is significant at a 95% confidence level.
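The theory in question is the normal approximation behind the standard two-proportion z-test. Here is a sketch of what that calculation could look like with statsmodels; the one-sided alternative is my way of mirroring the “share of differences below zero” criterion, not something stated in the article:

```python
from statsmodels.stats.proportion import proportions_ztest

n = 5_000
heads_a = round(0.5168 * n)  # 2,584 heads for coin A
heads_b = round(0.4976 * n)  # 2,488 heads for coin B

# One-sided test: is coin A's proportion of heads larger than coin B's?
z_stat, p_value = proportions_ztest(
    count=[heads_a, heads_b],
    nobs=[n, n],
    alternative="larger",
)
print(z_stat, p_value)  # the p-value lands in the same few-percent range as the simulated 2.54%
```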

We can, therefore, apply a statistical test with a 95% confidence level to groups with an actual probability difference of 0.02. But how often will this test give us a negative result? In this setup, a negative result is a false negative, because we know the difference exists. In other words, what is the probability that we’ll overlook a difference where there is one?

In our real-life experiments, all we have are the two coins and their calculated probabilities, so we should know in advance how likely we are to miss an effect that is actually there.

If we repeat the experiment 750 times and compare how often we get positive and negative results from our significance test, then we find that for our setup (5,000 flips, with coin A = 0.52 and coin B = 0.5) we get a false negative in 51.73% of cases.

So this means that we have an almost 50/50 chance of finding no difference in the actual test. One minus the false negative rate is called statistical power; here we have a statistical power of about 0.48.
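The same kind of Monte Carlo run reproduces that number approximately. A minimal sketch, assuming the significance test inside each repetition is a two-sided two-proportion z-test at the 5% level (the article doesn’t say which test it used, so the resulting rate may differ slightly from 51.73%):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)
n_flips = 5_000
n_repetitions = 750
alpha = 0.05

false_negatives = 0
for _ in range(n_repetitions):
    heads_a = rng.binomial(n_flips, 0.52)  # coin A: true P(heads) = 0.52
    heads_b = rng.binomial(n_flips, 0.50)  # coin B: true P(heads) = 0.50
    _, p_value = proportions_ztest([heads_a, heads_b], [n_flips, n_flips])
    if p_value >= alpha:                   # the test misses a difference that really exists
        false_negatives += 1

false_negative_rate = false_negatives / n_repetitions
print(false_negative_rate)      # roughly one half
print(1 - false_negative_rate)  # statistical power = 1 - false negative rate
```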

Power intuition

Let’s uncover some more insights about statistical power.

We can vary (see the sketch after this list):

  1. The number of samples
  2. The difference between groups
  3. The baseline conversion rate
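Here is a sketch of how those three knobs move power, using statsmodels’ power calculator; the concrete baselines, lifts, and sample sizes below are illustrative values of mine, not numbers from our experiments:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

analysis = NormalIndPower()

def power_for(baseline, absolute_diff, n_per_group, alpha=0.05):
    """Power of a two-sided test comparing baseline vs. baseline + absolute_diff."""
    effect = proportion_effectsize(baseline + absolute_diff, baseline)
    return analysis.power(effect_size=effect, nobs1=n_per_group, alpha=alpha, ratio=1.0)

# 1. More samples -> more power
print(power_for(0.10, 0.005, 10_000), power_for(0.10, 0.005, 50_000))

# 2. Bigger difference between groups -> more power
print(power_for(0.10, 0.005, 10_000), power_for(0.10, 0.010, 10_000))

# 3. Higher baseline with the same relative lift -> more power
print(power_for(0.10, 0.10 * 0.05, 10_000), power_for(0.30, 0.30 * 0.05, 10_000))
```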

Our context

In a test, we usually want to achieve at least an 80% level of power. Still, this means that if we run 12 experiments in the next year and all of them actually have an effect, we would fail to detect a difference in about 20% of cases. We would see a real difference in only 9 or 10 out of the 12 experiments.

For an experiment with a typical baseline of 10% and an expected 0.5% absolute difference, we have the following picture in terms of sample size:

We need around 55,000 confirmed emails in each group (so 110K confirmed emails in total) to detect a 0.5% difference at that level of power.
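That figure can be reproduced approximately with the same statsmodels tooling. This is my own sketch rather than the calculation used in the article, and depending on the exact formula the answer lands in the mid-to-high fifty thousands per group:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Baseline conversion of 10%, expected absolute lift of 0.5 percentage points
effect = proportion_effectsize(0.105, 0.10)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,        # 95% confidence level
    power=0.80,        # the 80% power target mentioned above
    ratio=1.0,
    alternative="two-sided",
)
print(round(n_per_group))  # on the order of 55-60K confirmed emails per group
```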

Conclusions

Experiments should be run with the following in mind:

  1. We want a high probability of seeing differences where they exist, in other words, to increase the power. To do this, we should:
  • Increase the sample size.
  • Search for higher impact tests with a high potential difference between groups.
  • Run experiments with higher baselines.

  2. We also want to be able to distinguish between the experiment groups, that is, to achieve high statistical significance. To do this, we should:

  • Increase the sample size.
  • Search for higher impact tests with a high potential difference between groups.


