98% of A/B tests deliver zero value (or worse)

Sharon Biggar
Jul 12

A/B testing is seductive. The claims of revenue increases are bold and abundant. The internet quite simply overflows with examples of companies achieving double- or triple-digit improvements in their metrics as a result of A/B testing.

In fact, A/B testing has become so prevalent that those who aren't A/B testing are made to feel that they are “missing a trick”.

But is it really a “silver bullet” for success? No!

The reality of A/B testing is more fraught with error, and more costly, than most internet articles would have you believe.

For example, at Social Point we ran 71 A/B tests, and of those we could prove that only 1 was truly successful. I don't mean “successful” in the easy sense of the term (i.e. we ran something, got a result, and hey, look, Group A is better than Group B).

No, I mean truly successful in the sense that we ran a test, got a statistically significant result, implemented the winning group in production AND saw a real increase in real revenue. That is success!
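To make that “statistically significant result” step concrete, here is a minimal sketch in Python. The data, distributions and group sizes are invented for illustration, not Social Point's numbers; the check itself is a standard Welch's t-test on per-user revenue:

```python
# Minimal sketch: is mean revenue per user different between A and B?
# All numbers below are illustrative assumptions, not real test data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
revenue_a = rng.exponential(scale=1.00, size=50_000)  # control group (A)
revenue_b = rng.exponential(scale=1.03, size=50_000)  # variant group (B)

# Welch's t-test (does not assume equal variances between groups)
t_stat, p_value = stats.ttest_ind(revenue_b, revenue_a, equal_var=False)

print(f"mean A = {revenue_a.mean():.4f}, mean B = {revenue_b.mean():.4f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant -- a candidate for production.")
else:
    print("Inconclusive -- no evidence that B differs from A.")
```

Even then, significance in the test is only the entry ticket; the revenue increase still has to show up in production.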

In order to run A/B tests that deliver value to your firm, you must decide on your definition of success.

Defining success in AB testing

At Social Point we developed a 10-point scale for measuring the success or failure of our A/B tests. The scale was constructed around the “return on investment” to Social Point of running each test.

The best possible outcome of an A/B test is that the winning variant is included within the core product and the positive impact on financial results that was seen during the A/B test is replicated (or improved upon) in production.

In other words, let us imagine that we are choosing between leaving the product as it is or, alternatively, including Feature A. To make this decision we run an A/B test and find that including Feature A increases customer lifetime value by 20%. With these results we go ahead and make Feature A live for all users, and discover that Feature A has increased overall customer lifetime value by 25%!

This is success. An A/B test with this outcome warrants a score of “1” on the ranking system.

By contrast, the worst possible outcome is that after running the A/B test we launch Feature A and discover that customer lifetime value decreases by x%.

This is failure. An A/B test with this outcome warrants a score of “10” on the ranking system.

All other possible outcomes are between these two extremes. The table below (Table 1) describes the scoring system and the benefits and costs associated with each outcome.

Two additional points are worth making:

  1. An inconclusive test is not a failure:

Inconclusive tests teach us that Group B (the Variant) may not negatively impact the financial results, but only IF the test was run with enough power.

If you are certain that you have given the experiment a sufficiently large sample size, then you can learn something even from an inconclusive result. At the very least, an inconclusive result shows you that Group B is not (statistically) significantly different from Group A.

You probably won't increase your revenue by including Feature B, but you probably won't decrease it either, so you could put it into production to give your users something fresh (or not; it really makes no difference). The sketch below shows how a well-powered but inconclusive result can still bound the downside.
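As a rough illustration (again with invented data; nothing here is a real Social Point result), a confidence interval on the A/B difference makes that learning visible: with a large enough sample, the interval spans zero (inconclusive) yet also rules out any large negative impact:

```python
# Sketch: what a well-powered but inconclusive result still tells you.
# Data and effect sizes are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
ltv_a = rng.exponential(scale=1.00, size=100_000)  # control
ltv_b = rng.exponential(scale=1.00, size=100_000)  # variant, no true effect

diff = ltv_b.mean() - ltv_a.mean()
se = np.sqrt(ltv_a.var(ddof=1) / len(ltv_a) + ltv_b.var(ddof=1) / len(ltv_b))
z = stats.norm.ppf(0.975)  # 95% two-sided critical value

lo, hi = diff - z * se, diff + z * se
print(f"difference in mean LTV: {diff:+.4f}, 95% CI: [{lo:+.4f}, {hi:+.4f}]")
# The CI spans zero (inconclusive), but because the sample is large it is
# tight: the Variant cannot be hurting mean LTV by more than a tiny amount.
```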

  2. “Experiment time” is a cost to the firm:

Most A/B tests require a large sample size. (And in free-to-play mobile gaming, tests on monetization require a REALLY large sample size.)

While more than one A/B test can be run at the same time, there is a finite number of tests a given game or product can perform in a year.

For this reason the time given to an experiment has a cost, and every experiment should deliver at least enough learning to the firm to compensate for that cost.
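To see why monetization tests eat so much experiment time, here is a back-of-the-envelope power calculation using the standard two-proportion formula. The conversion rates are invented assumptions for a free-to-play game, not Social Point figures:

```python
# Sketch: per-group sample size needed for a two-sided two-proportion z-test.
# Baseline and target conversion rates below are illustrative assumptions.
from scipy import stats

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-group n to detect p1 vs p2 at the given alpha/power."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Detecting a lift in payer conversion from 2.0% to 2.2% (10% relative lift):
n = sample_size_two_proportions(0.020, 0.022)
print(f"~{n:,.0f} users needed per group")  # roughly 81,000 per group
```

With a 2% payer-conversion baseline, even a generous 10% relative lift needs on the order of 80,000 users per group; smaller lifts need far more, which is exactly why experiment slots are scarce.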

Table 1: 10 Possible Outcomes for an A/B test (from best to worst)

We can summarize the table above as follows:

  • Scores 1–3 are net positive: the A/B test led directly to financial gain, or it prevented the company from putting in place a Variant which would have had negative financial results;
  • Scores 4–5 are neutral: the costs and time associated with running the A/B test can be considered balanced by the benefits of the learnings gained from the results;
  • Scores 6–10 are negative: the company incurred cost but gained nothing from the A/B test.

Of the 71 tests we ran, 25% (18) were net positive for Social Point, 37% (26) were neutral, and 38% (27) were costly.

Only one test can we say for sure was an unqualified success.

Stop kidding yourself that you are running a successful A/B testing programme (unless your tests are leading to increases in real revenue, in which case you are, and I congratulate you!).

Define what success in A/B testing means for you, and take the first step to designing and running an experimentation programme for your company that delivers true and lasting value.
