Don’t Be Fooled by Your Testing Tool — Take a Second Look!

Simon Deichsel
Published in Project A Insights
Mar 6, 2014

The concept of A/B testing has been around for some years now, so it is far from being the latest marketing craze. In fact, we have come to the conclusion that it has to be treated with caution.

A/B testing describes the act of testing two variations of a web page and comparing their performance to determine which one produces more sign-ups, sales, leads, etc. For example, suppose you own Groupon and have a page where you encourage non-members to become members. You could perform an A/B test on the sign-up page with variations in copy and pictures: one variation might have more of an emotional focus, while another rationally stresses the advantages of becoming a member. You then split the traffic in half and count which version does a better job at motivating non-members to register.
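In code, the mechanics behind splitting traffic are simple. Here is a minimal sketch in Python, assuming each visitor can be identified by some stable id; the function and variable names are our own illustration, not taken from any particular tool.

```python
import hashlib

def assign_variant(visitor_id: str, variants=("A", "B")) -> str:
    """Roughly 50/50 split: hashing the visitor id ensures that the
    same visitor always sees the same variant on repeat visits."""
    bucket = int(hashlib.md5(visitor_id.encode()).hexdigest(), 16) % len(variants)
    return variants[bucket]

# Tally sign-ups per variant (toy data).
visits = {"A": 0, "B": 0}
signups = {"A": 0, "B": 0}
for visitor_id, signed_up in [("u1", True), ("u2", False), ("u3", True)]:
    variant = assign_variant(visitor_id)
    visits[variant] += 1
    signups[variant] += signed_up
print({v: f"{signups[v]}/{visits[v]} sign-ups" for v in visits})
```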

Before A/B testing became mainstream, Google relied on it to develop and refine its internal products; its engineers ran their first A/B test on February 27, 2000. However, one of the earliest and best-known examples is the A/B testing used in Obama's 2008 presidential campaign. Dan Siroker, founder of Optimizely (now the most widespread A/B testing tool), played a leading role in turning Obama's website visitors into subscribers, with the hope that follow-up emails would eventually convert them into campaign donors. By testing variations of the register button text, and video vs. images, Siroker and his team increased sign-ups by 140% compared to the original version of the website. This resulted in an estimated 4 million additional subscribers and an extra $75 million in campaign funds. Today, it is easy to find many A/B test examples that show how small changes can have big impacts.

As the above example demonstrates, quantitative testing should be a core focus in product development. However, for young startups it is often impossible to gather the amount of data necessary to achieve results with a high level of statistical significance.

The A/B testing service Visual Website Optimizer offers a tool that helps calculate the number of days a test should run. For example, when an e-commerce website has 2,000 visitors per day and a conversion rate of 2%, a simple A/B test needs to run for about one year to give significant results.

[Table: estimated test durations from the Visual Website Optimizer calculator]
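To get a feel for where such numbers come from, here is a minimal sketch of the standard two-proportion sample-size formula in Python. It is an illustration under our own assumptions (e.g. aiming to detect a 5% relative uplift at 80% power), not necessarily the exact calculation behind the VWO tool.

```python
from statistics import NormalDist

def days_needed(daily_visitors, baseline_cr, relative_uplift,
                alpha=0.05, power=0.80, variations=2):
    """Rough test-duration estimate via the standard two-proportion
    sample-size formula; illustrative assumptions, not the VWO formula."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    p_bar = (p1 + p2) / 2
    n_per_variation = (
        (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
         + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
        / (p2 - p1) ** 2
    )
    return n_per_variation * variations / daily_visitors

# 2,000 visitors per day, 2% baseline conversion rate, 5% relative uplift:
# roughly 315 days, i.e. about a year.
print(round(days_needed(2000, 0.02, 0.05)))
```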

In practice, there is often an even bigger problem: an A/B testing tool may report results as statistically significant when they clearly are not. How can one even detect this? The key is running A-A/B-B tests, which means including each variation twice in the test. If you then see significant differences between two identical variants, you know that what looks significant in the numbers in fact is not. Here is one real-world example:

The “Original Control” shows exactly the same page content as the “Original”, but the A/B testing tool reports a significant uplift of 9.2%. Furthermore, this does not happen after the first couple of conversions, where such effects are likely, but after more than 600.

And just when you think it can't get any worse, it can. We recently ran a test that produced highly significant uplifts on both B variations for one goal. Sadly, after a few weeks this result could no longer be replicated.
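How do two identical variants end up looking significantly different after hundreds of conversions? One common culprit is repeatedly checking the results and stopping as soon as a tool declares significance. The following sketch simulates A/A tests with such “peeking”; all parameters are illustrative assumptions, not the setup of the tests described above.

```python
import random
from statistics import NormalDist

def aa_false_positive_rate(n_tests=500, visitors=5000, cr=0.02,
                           peek_every=250, alpha=0.05, seed=1):
    """Simulate A/A tests (two identical variants) and count how often a
    naive two-proportion z-test looks 'significant' at any of the peeks."""
    random.seed(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        for n in range(1, visitors + 1):
            conv_a += random.random() < cr  # both variants share the same true rate
            conv_b += random.random() < cr
            if n % peek_every == 0:         # peek at the running results
                pooled = (conv_a + conv_b) / (2 * n)
                se = (2 * pooled * (1 - pooled) / n) ** 0.5
                if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
                    false_positives += 1
                    break
    return false_positives / n_tests

# Both variants are identical, yet with repeated peeking well over 5% of the
# simulated tests cross the "significant" threshold at least once.
print(aa_false_positive_rate())
```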

So what are we doing at Project A to overcome such obstacles? The above examples have taught us some lessons, and we have established a set of criteria that an A/B test has to fulfill in order to deliver credible results (see here for a recent paper listing the most common mistakes made when performing A/B tests):

  • We do thorough quality assurance for each test on major browsers and mobile devices. We restrict the target audience only to browsers where the test runs well.
  • We exclude all IP addresses from our office so we don’t influence the test results with internal usage.
  • We look at test results for the first time only after one week of runtime (this is the hardest part!), and we wait until each variation has at least 500 conversions.
  • We always double each variation (A-A/B-B testing), and if we see differing results within one pair of identical variations, we consider the test as failed (a minimal sketch of such a consistency check follows after this list).
  • We replicate successful tests, usually while iterating on promising results. Sometimes this shows that a result cannot be reproduced.
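To illustrate the doubling rule, here is a minimal sketch of the kind of consistency check we mean, assuming you can export visitor and conversion counts per variant; the function name and the 5% threshold are our own choices, not part of any testing tool.

```python
from statistics import NormalDist

def identical_pair_is_consistent(visitors_1, conversions_1,
                                 visitors_2, conversions_2, alpha=0.05):
    """Two-proportion z-test between two *identical* variants (A vs. A, or
    B vs. B). Returns False if they differ 'significantly', in which case
    we would treat the whole A-A/B-B test as failed."""
    p1 = conversions_1 / visitors_1
    p2 = conversions_2 / visitors_2
    pooled = (conversions_1 + conversions_2) / (visitors_1 + visitors_2)
    se = (pooled * (1 - pooled) * (1 / visitors_1 + 1 / visitors_2)) ** 0.5
    if se == 0:
        return True
    return abs(p1 - p2) / se < NormalDist().inv_cdf(1 - alpha / 2)

# Made-up counts loosely resembling the case above: two identical variants
# whose conversion rates differ by roughly 9% relative.
print(identical_pair_is_consistent(30000, 600, 30000, 655))  # True
```

With these made-up counts, a plain z-test would not call the roughly 9% gap significant, which is exactly the kind of sanity check worth doing before trusting a tool's verdict.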

To conclude, am I arguing against A/B testing here? No. I am arguing against overstating the power of quantitative testing. For large websites there is a lot to learn by means of systematic testing. For smaller startups, direct qualitative customer feedback is often far more valuable, as it can even provide insights into the reasons behind customer behavior.
