The common flaw in A/B testing

This is Part 2 in my Testing Series. (Part 1)

Let’s start with a story — it may be familiar to some of you.

A few years ago, I was working at a very testing-heavy email marketing startup. We tested everything. You name it, we tested it: colors, subject lines, numbers of buttons, copy, format of emails, send time, friendly froms, punctuation, placement of every single piece in the email, etc.

There were some tests that would have 200+% lift in clicks — you’d think that with results like that, I would be able to tell you that we had an optimized, amazing email that worked like clockwork, right?

Not so much. There was one fundamental flaw in our tests, and that flaw’s baked into a lot of A/B testing suites and is almost never discussed in ‘How to A/B test’ guides.

We’ll get to the problem, but first… Let’s talk about testing.


What is an A/B test?

Chances are, you know the basic answer to this — An A/B test compares two (or more) different versions / experiences for users. Users’ interactions with the versions are tracked using metrics, and the version with the better metrics is declared the winner. As discussed in Let’s Talk About Testing (Pt. 1), good tests should be structured to provide learnings about the users — but either way, A/B testing is a battle to the death between different versions of an experience.

This is basically what your creatives are doing. (The Hunger Games, Lionsgate)

If you know a little more than the basics, you might know something about statistical significance. Basically, that’s the concept that the difference between your results is large enough that it’s unlikely to be down to chance alone.

There are a lot of calculators online, and basically what they’re looking for is this:

Is the difference between the conversion rates of side A and side B extreme enough that there’s a 95% certainty that the difference will hold when you scale up the experiment?

Whenever you run a test, you should look for statistical significance. It exists to help guard against flukes.
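To make that question concrete, here’s a minimal sketch of the kind of check a significance calculator might run: a two-sided, two-proportion z-test. This is an assumption on my part, not the formula any particular calculator uses, and the function name and example numbers are made up for illustration.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates.

    conv_a / conv_b are conversion counts, n_a / n_b are visitors per side.
    Uses the normal approximation, so it is only reasonable for fairly
    large samples.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the conversion rate if both sides were really the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no conversions (or all conversions) on both sides
    z = (rate_b - rate_a) / std_err
    # Probability of a gap at least this large if there were no real difference.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical numbers: 100 of 1,000 convert on side A, 130 of 1,000 on side B.
p = two_proportion_p_value(100, 1000, 130, 1000)
print(f"p-value: {p:.3f}, significant at 95%: {p < 0.05}")
```

A p-value under 0.05 is what most tools report as ‘95% significant.’ Note that nothing in this check knows how many times you’ve already looked at the data.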

Remember that startup example? We always got statistical significance. Still, the lifts disappeared.


So what’s the problem?

Tests (scientific tests, correctly run, that use the scientific method) have a sample size (number of interactions) that is set before the test starts.

What currently happens in almost all A/B tests, and what happened at my previous job, is that the test gets called as soon as (or soon after) it reaches significance. All testing stops; the winning side has won. Hurrah! … And then the forecast rise in metrics never really makes it into reality. It’s impossible to know what you’re missing, because once you end the test, you don’t see the losing side anymore. As marketers, we can just rest on our laurels, knowing that our wins have gotten rolled up into the great big ball of ‘optimized marketing’ we all aim for.

This is actually an ongoing issue within the scientific community, too. Cherry-picking data to make studies more exciting meant that when a group of scientists tried to reproduce the studies, they failed more than 60% of the time.

If you take your data as soon as it’s significant and ignore the sample size, you haven’t eliminated the ‘fluke’ aspect. The results may clear the 95% significance bar right now, but that doesn’t mean they’ll still clear it as more data comes in.
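You can see the ‘fluke’ problem in a quick simulation. The sketch below runs a batch of pretend A/A tests, where both sides have exactly the same true conversion rate, so any ‘winner’ is a false alarm. It compares calling the test the first time it looks significant against checking once at a fixed sample size. The 10% conversion rate, batch sizes, and function names are all assumptions picked for illustration.

```python
import math
import random


def p_value(conv_a, conv_b, n):
    """Two-sided p-value for two sides with n visitors each (z-test)."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        return 1.0
    z = (conv_b - conv_a) / n / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(true_rate=0.10, batch=100, n_batches=20,
                      n_tests=400, alpha=0.05, seed=1):
    """Return (false-winner rate when peeking, false-winner rate at fixed size)."""
    rng = random.Random(seed)
    peeking_wins = 0
    fixed_wins = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _ in range(n_batches):
            # Both sides convert at the same true rate: there is no real winner.
            conv_a += sum(rng.random() < true_rate for _ in range(batch))
            conv_b += sum(rng.random() < true_rate for _ in range(batch))
            n += batch
            # "Peek" after every batch, as an impatient tester would.
            if p_value(conv_a, conv_b, n) < alpha:
                ever_significant = True
        if ever_significant:
            peeking_wins += 1
        # The disciplined version: look once, at the planned sample size.
        if p_value(conv_a, conv_b, n) < alpha:
            fixed_wins += 1
    return peeking_wins / n_tests, fixed_wins / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"False winners when calling at first significance: {peeking_rate:.0%}")
print(f"False winners when waiting for the full sample:   {fixed_rate:.0%}")
```

With 20 peeks per test, the share of false winners under peeking should come out well above the 5% you signed up for, while the fixed-sample check should stay close to 5%. The exact numbers depend on the seed and settings.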


What should you do?

First: Use a calculator (like this one from Optimizely) that explicitly says how big your sample sizes need to be per variation.

Second: Don’t resolve your tests as soon as they’re statistically significant without looking at the sample size (and using something like the calculator above). If something is performing catastrophically or amazingly right now, that doesn’t mean it will in the future; that’s one of the reasons you’re running the test.

Mailchimp’s A/B testing software sends each side to a minimum of 5,000 people, and calls the test after a minimum of 4 hours. If, say, 20% of recipients eventually open and only 25% of those opens are recorded in those 4 hours, it’s making a ‘which side wins’ decision on a sample size of 250 people per side, and that’s 250 people who opened your message, not converted. That’s too small a sample to trust the result.

Third, and the biggest takeaway: Think about your testing. What are you trying to learn? It’s worth putting some time in to make sure your results will be clear and actionable, and to work out what results you would expect to see if your hypothesis (which you have, right?) were true.
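Going back to the first step: if you’re curious what a sample size calculator is doing, here’s a rough sketch using a standard textbook formula for comparing two conversion rates. It is not necessarily the formula Optimizely’s calculator uses, so expect the numbers to differ. The function name and the defaults (95% significance, 80% power) are my assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift,
                              alpha=0.05, power=0.80):
    """Visitors needed per side to detect a given relative lift.

    baseline_rate: current conversion rate, e.g. 0.10 for 10%
    relative_lift: smallest lift worth detecting, e.g. 0.20 for +20%
    alpha:         false-positive rate (0.05 means 95% significance)
    power:         chance of detecting the lift if it is real
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Hypothetical test: 10% baseline conversion, hoping to detect a 20% lift.
needed = sample_size_per_variation(0.10, 0.20)
print(f"Visitors needed per variation: {needed}")
```

The lift you plug in is the result you’d expect to see if your hypothesis were true, which is one more reason to write the hypothesis down first. Smaller expected lifts need much bigger samples.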

Bonus: Listen to this episode of Freakonomics. It goes into more depth on non-reproducible scientific studies, and really puts all of this into an interesting, understandable light.

Next week: How to get qualitative insights out of your testing, instead of just quantitative ones. Questions? Need testing advice? Hit me up in the comments, or on Twitter (@asoehnlen).