Why I’ve stopped grumbling about vanity A/B tests

I used to complain about things I called “vanity A/B tests”.

Those were tests I felt were the equivalent of vanity metrics: experiments with wholly obvious, entirely predictable outcomes, which seemed less a genuine exercise in discovery than a way for a product manager to feel good about themselves and show off to stakeholders. They could even delay getting improvements out to users.

Typically they would involve testing an existing Page That Was Obviously Broken against a new Page That Fixed That Thing.

In the “Which of these looks most like an angel?” test there might be a clear winner.

So you could take a page where, for example, a call to action button was positioned so it was nigh-impossible to click on mobile. Instead of just fixing it, a team might announce it was A/B testing a new page with the new button. Then, after extensive testing, voila — the mobile metrics for the new page would be conclusively shown to be better. Pats on the back all round.

But to me that felt like indulgence. If a thing was obviously broken for your users, you should just fix it for all of them ASAP (and — in the highly unlikely case that fix accidentally introduced something that caused other unexpected problems — well, you’d see it in the metrics you were tracking and could always revert).

Unnecessary onsite tests reminded me of the medical ethics debates surrounding trialling new drugs. If you knew you’d got a medication that was working effectively against a cancer, why leave thousands of poor test patients in the control group with a supply of sugar pills instead of treatment that would actually help them?

However, if I’d stopped to think through the medical analogy properly, I might have realised earlier why my view was misguided.

Avoiding dragging out the process

Two things changed my mind last year and are why I no longer grumble about vanity A/B tests.

The first is that A/B tests are, of course, only indulgent or unnecessarily drawn-out if you let them be. We have a terrific data team where I work at Tes and they have assisted product with some handy ways to see much faster when we have a result that is statistically significant enough for us to move on.

In the past we might have been tracking the conversion rate on our Test against the Control. We would decide what proportion of users would see the test page, potentially starting conservatively with just 10% or 20%.

We would then work out how many users needed to interact with the new page before the result could be statistically significant, and run the test until we had at least hit that number, trying to resist the temptation to end the test early unless the result was looking truly awful.
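For context, that sample-size sum looks roughly like this. This is a generic two-proportion power calculation in Python with made-up conversion numbers, not our actual tooling or figures:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p_control, p_test, alpha=0.05, power=0.8):
    """Classic two-proportion sample-size formula: users needed in EACH
    group to detect a move from p_control to p_test at the given
    two-sided significance level (alpha) and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_test) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control) + p_test * (1 - p_test))
    ) ** 2
    return ceil(numerator / (p_test - p_control) ** 2)

# Detecting a lift from a 5% to a 6% conversion rate takes thousands
# of users per group; a bigger lift needs far fewer.
print(sample_size_per_group(0.05, 0.06))
print(sample_size_per_group(0.05, 0.07))
```

The point is how punishing the numbers are: small expected lifts demand very large samples, which is exactly why a fixed sample-size target can drag a test out.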

Last year my data expert colleague Rosie set up the tracking of my team’s A/B tests in Looker in a different way, so we could see the credibility margins (there is a very clear blog here on how to implement it, for those who like Looker and stuff to do with Bayesian analysis).

A few hours after the start of the test the result looked like this.

The control group was already getting far more users so its likely result was in a narrower credibility range than the test group. Overall it was too early to tell.

Just a day or two later though, we could see a very clear gap between the two groups. We had not collected as many users for the test group as we would have required in the past, but even before reaching that threshold we could tell with some confidence that the test was, um… a failure, so we stopped it.
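The kind of per-group credibility range that Looker view surfaced can be approximated with a minimal Monte Carlo sketch. The conversion numbers below are hypothetical, the prior is uniform, and this is not Rosie's actual implementation — just the underlying idea:

```python
import random

def credible_interval(conversions, visitors, level=0.95, draws=100_000, seed=1):
    """Monte Carlo credible interval for a conversion rate, based on the
    Beta(conversions + 1, visitors - conversions + 1) posterior that a
    uniform prior gives for binomial data."""
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(conversions + 1, visitors - conversions + 1)
        for _ in range(draws)
    )
    lo = samples[int(draws * (1 - level) / 2)]
    hi = samples[int(draws * (1 + level) / 2)]
    return lo, hi

# Early on, the test group has far fewer users, so its range is wide...
print(credible_interval(12, 200))
# ...while the bigger control group already has a much narrower range.
print(credible_interval(480, 8000))
```

Once the two groups' ranges have clearly separated, you can call the result with some confidence well before any classical sample-size target is hit — which is what let us kill that failing test after a day or two.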

Is there ever a guaranteed slam-dunk?

The second reason I’ve stopped being rude about vanity A/B tests has been a growing realisation that I’m afraid is naive and obvious: seemingly guaranteed slam-dunks don’t always work. See the test above as an actual example.

I had several cases last year where my teams weren’t able, or willing, to A/B test before putting out a change, but I did not worry as the outcome seemed so assured. And in fairness, as most really were fixes, they normally did the job.

But in two key cases they didn’t. One involved a change to get users to visit more sections of our site from a specific set of pages. It seemed an assured win because it introduced a direct route to those other sections where none had existed. The users had no obvious pathway before; now they did. Clearly our metric would go up. Yet instead it went down.

Another case involved a change that performed brilliantly in user testing and would obviously improve conversion, so we shipped it. Yet, again, afterwards the metric went down.

Of course, this is the reason why medical tests remain so extensive. What might seem a wonder pill in preliminary trials may not actually prove to be more effective than the placebo when tests have been done at a significant enough scale. It is a sadness for the control group if the drug did work, but without them far more patients could get ineffective treatment.

Plus, as another of my data colleagues, Nick Creagh, explained to me, medical trials of life-saving approaches will usually put a new treatment up against the best known existing approach, not simply a placebo. “In the same way, you don’t A/B test against a blank page - you test against the current best available alternative,” he said. And he’s right.

So I have stopped grumbling about vanity A/B tests on websites. They’re not an act of vanity unless you drag them out — or make too much fuss celebrating when the outcome turns out to be the obvious one. And sometimes the ugly option still wins.