Statistical Significance Pitfalls in A/B Testing

Daniel Chatfield
Sep 14, 2015


Statistical significance is used to determine whether an experimental result provides enough evidence to support a conclusion. I’ll try to show the intuition behind it, highlight some common pitfalls, and then explain how this affects A/B testing.

Suppose you believe a coin is biased. After tossing it 5 times you get 4 heads, and you now want to answer the question “is this enough evidence to prove it?”. First we have to decide what “proof” is, and this depends on the context: the degree of proof is wildly different if you are trying to convince a friend of something rather than a jury. A friend might accept a statistical significance level of 20% (i.e. a 20% chance of seeing a result at least this extreme if the coin were actually fair), but a court might require 0.001%.

So, sticking with the “convincing a friend” significance level of 20%, what can we say about our results? The probability of an unbiased coin coming up heads exactly 4 times out of 5 is 15.6%, so this is statistically significant?
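If you want to verify that 15.6% yourself, here is a minimal Python sketch (my addition, nothing more than the binomial formula for a fair coin):

```python
from math import comb

# Probability of exactly k heads in n tosses of a fair coin: C(n, k) / 2**n
n, k = 5, 4
p_exactly_4 = comb(n, k) / 2**n
print(f"P(exactly {k} heads in {n} tosses) = {p_exactly_4:.1%}")  # 15.6%
```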

No. Well, maybe, but our reasoning isn’t solid. Instead of the probability of getting exactly 4 heads out of 5, we need the probability of getting at least 4 heads out of 5. To see why, imagine I tossed the coin 100 times and it came up heads 51 times. That isn’t a remarkable result at all, yet the probability of exactly that outcome is just 7.8%. Correcting our error, we obtain the probability of getting at least 4 heads: 18.75%. Still within the 20%, so this is statistically significant?
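Extending the sketch to tail probabilities makes both numbers fall out directly:

```python
from math import comb

def p_at_least(n, k):
    """P(at least k heads in n tosses of a fair coin)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# The "exactly" probability can be tiny even for an unremarkable result:
p_51_of_100 = comb(100, 51) / 2**100
print(f"P(exactly 51 heads in 100) = {p_51_of_100:.1%}")      # ~7.8%

# The tail probability is the one we actually want:
print(f"P(at least 4 heads in 5)   = {p_at_least(5, 4):.2%}")  # 18.75%
```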

No, our reasoning still isn’t sound. Had we set out to demonstrate that the coin was biased towards heads then we would be done, but we didn’t; we set out to demonstrate that it was biased, period. What would have happened if we had got 4 tails? We would have done the exact same calculations but argued instead that it was biased towards tails. In fact, there are only two outcomes that wouldn’t have led us to think we had proved it: 3 heads and 2 tails, or 2 heads and 3 tails. Granted, these are the most probable outcomes, but together they account for only 62.5%, not the 100% − 18.75% = 81.25% they would need to if our reasoning were sound.

So, let us correct our mistake and calculate the probability of getting at least 4 of either heads or tails. It is 37.5%: not unlikely at all, and almost twice our (already generous) statistical significance level.
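A two-sided version of the sketch, still assuming a fair coin, confirms the 37.5%:

```python
from math import comb

def p_two_sided(n, k):
    """P(at least k heads OR at least k tails in n tosses of a fair coin)."""
    one_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    # By symmetry the tails tail has the same probability; for k > n/2
    # the two events cannot overlap, so we can simply double it.
    return 2 * one_tail

print(f"{p_two_sided(5, 4):.2%}")  # 37.50%
```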

Applying this to A/B testing

Suppose I have 100 coins and I toss each of them 8 times. On average I can expect about 1 of those 100 coins to come up the same all 8 times. If I then look at that coin in isolation and work out the probability of it coming up either all heads or all tails across 8 tosses (0.78%), it looks like a statistically significant result. However, I am forgetting the context of this result: it was one of 100. It is not unreasonable for a 1-in-128 event to happen at least once when you run 100 trials.
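A small simulation (my sketch, not data from any real experiment) makes this concrete: a bit over half of all 100-coin experiments contain at least one coin that came up the same all 8 times.

```python
import random

random.seed(1)
trials = 10_000
runs_with_extreme_coin = 0

for _ in range(trials):
    for _ in range(100):  # 100 fair coins, 8 tosses each
        heads = sum(random.random() < 0.5 for _ in range(8))
        if heads in (0, 8):  # all heads or all tails
            runs_with_extreme_coin += 1
            break

# Expected: 1 - (127/128)**100, roughly 54%
print(f"At least one 'always the same' coin in "
      f"{runs_with_extreme_coin / trials:.0%} of experiments")
```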

Statistical significance has its roots in academia, where you typically come up with the hypothesis before the experiment. In A/B testing it is common to just make a change and then look at hundreds of metrics to see what differs. If you do this, then you must either work the number of comparisons into the calculations (a sketch of the simplest such correction follows) or simply run another test afterwards with the sole goal of confirming that one hypothesis. I think you’ll be surprised at how many of your “statistically significant” results disappear.
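The simplest such correction is Bonferroni: divide your significance threshold by the number of metrics you peeked at. A sketch, assuming 100 independent metrics and a 5% threshold (both numbers are mine, for illustration):

```python
# With many metrics, the chance of at least one spurious "significant"
# result grows quickly (assuming the metrics are independent).
alpha, metrics = 0.05, 100

p_false_positive = 1 - (1 - alpha) ** metrics
print(f"P(>=1 spurious 'significant' metric) = {p_false_positive:.0%}")  # ~99%

# Bonferroni correction: test each metric at alpha / m instead of alpha.
bonferroni_alpha = alpha / metrics
print(f"Per-metric threshold after Bonferroni: {bonferroni_alpha:.4f}")  # 0.0005
```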
