5 steps to a better A/B testing process
After four years of A/B testing at King & McGaw we’ve found a process that consistently produces positive results. No magic dust, just planning and hard work.
Our process has really grown up over the years, particularly after we started working with PRWD last year. Here is what works for us:
Let’s explore the key parts of this individually.
1. Generate a master sketch per funnel step
Achieving growth for visitors to our site (and just about any e-commerce site) is a matter of improving the performance of any of these steps:
Each step has a primary metric. Simple, right? The means of achieving this growth are not always as obvious, but luckily that’s the fun part.
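To make the idea concrete, here is a minimal sketch of a funnel where each step’s primary metric is the rate of progression to the next step. The step names and counts are hypothetical, for illustration only:

```python
# Hypothetical counts for a typical e-commerce funnel; each step's
# primary metric is the rate of progression to the next step.
funnel = [
    ("visit", 10000),
    ("lister page view", 6000),
    ("product page view", 3000),
    ("add to cart", 600),
    ("transaction", 300),
]

# Step-to-step conversion rates
for (step, n), (_, n_next) in zip(funnel, funnel[1:]):
    print(f"{step} -> next step: {n_next / n:.0%}")
```

Improving any one of these rates, holding the others steady, lifts overall conversion.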
Focussing on one particular funnel step at a time, we’ll bring a cross-functional team together in a workshop to agree on the highest-impact changes we could make. We’ll usually start by reviewing current usage, user research, customer feedback, Hotjar recordings, best-in-class examples and business objectives. This is followed by a number of rounds of rapid sketching, and we end up with a “master” sketch that represents the best of our ideas.
2. Break the design down into testable components
The master sketch will then be translated into a high-fidelity image or prototype to ensure ideas survive the transition from Sharpies to pixels.
The design will then be broken down into independently testable elements. In the example above, the category header and horizontal navigation represent two testable elements.
We’ll then generate a test plan for each component covering the hypothesis, the metrics to track and any relevant research or data. The primary metric is almost always the conversion metric for that funnel step – for example product pages measure adds to cart, not completed transactions.
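One way to picture the shape of such a test plan is as a small data structure. This is a sketch, not our actual tooling, and the field and example values are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class TestPlan:
    component: str                 # the independently testable element
    hypothesis: str                # what we believe will happen, and why
    primary_metric: str            # the funnel step's own conversion metric
    secondary_metrics: list[str] = field(default_factory=list)
    supporting_research: list[str] = field(default_factory=list)

# Hypothetical example for one testable element of the master sketch
plan = TestPlan(
    component="horizontal navigation",
    hypothesis="Exposing subcategories will move more visitors to lister pages",
    primary_metric="lister_page_views",
    secondary_metrics=["add_to_cart_rate"],
    supporting_research=["Hotjar recordings", "customer feedback survey"],
)
```

Keeping the primary metric explicit in the plan stops a test being judged against the wrong funnel step after the fact.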
3. Schedule tests to iterate your way towards the master sketch
Armed with a refined sketch and a number of test plans, we schedule the tests to iterate our way towards the master sketch. This slows the rate of change, but ensures that we are creating value and learning as much as possible with every experiment.
Because some of these depend on the success of the previous step, we need to make some assumptions here about what is likely to win, but also be flexible enough that it’s OK to be wrong. Here is an example of some tests in the queue for lister and product pages:
Because each experiment’s goal is getting to the next funnel step, we can run experiments on each funnel step concurrently without them interfering with each other (or at least that’s the theory).
Lister and product page tests generally run for two weeks, and we offset these so a new test launches every week, on a Tuesday. We’ve found that alternating between big and small tests gives us higher throughput from our small development team. There are sometimes other items in the queue drawn from the backlog that are unrelated to the master sketch.
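The cadence above can be sketched as a simple schedule: one launch per week on a Tuesday, each test running for two weeks, so the pipeline settles into two tests live at any time. The function name and dates are illustrative, not part of our tooling:

```python
from datetime import date, timedelta

def schedule_launches(first_tuesday: date, n_tests: int, run_weeks: int = 2):
    """Launch one test per week on a Tuesday; each runs for run_weeks."""
    assert first_tuesday.weekday() == 1, "launches happen on a Tuesday"
    schedule = []
    for i in range(n_tests):
        start = first_tuesday + timedelta(weeks=i)
        end = start + timedelta(weeks=run_weeks)
        schedule.append((start, end))
    return schedule

# Once the pipeline is full, each test overlaps the next by one week
for start, end in schedule_launches(date(2018, 5, 1), 4):
    print(start, "to", end)
```

The one-week offset is what gives the steady weekly launch rhythm without ever running more than two of these tests at once.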
4. Build, user-test and QA
All experiments are reviewed internally using pull requests for tech QA and Heroku review apps for non techies. If the test introduces new behaviour, we will also run user tests (using WhatUsersDo) at this stage to check usability.
5. Implement winning tests and compile learnings
We keep a very close eye on which tests are likely to win, and if a win looks likely we often get the implementation started before the test has concluded. Conversion experts rarely talk about the amount of work required to implement a winning test, but in my experience it can take as long to implement the variation as it took to build the test, once you’ve accounted for all the corners you cut to validate your hypothesis.
If the test has won on the primary metric (or is not worse, or is improved but at low statistical confidence), and has not caused any negative side effects, we implement it and launch the next test before we have done deeper analysis. This bias to action gives us the right balance of speed and confidence.
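That decision rule — ship unless the variation is significantly worse on the primary metric — can be expressed with a standard two-proportion z-test. This is a generic sketch of the statistics, not the tooling or thresholds we actually use; the function names are hypothetical:

```python
from math import sqrt

def z_score(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-proportion z-score for variation B's conversion rate vs control A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

def ship_it(conv_a: int, n_a: int, conv_b: int, n_b: int,
            z_threshold: float = 1.96) -> bool:
    """Implement unless the variation is significantly *worse* on the
    primary metric; inconclusive or positive results ship (bias to action)."""
    return z_score(conv_a, n_a, conv_b, n_b) > -z_threshold
```

Note the asymmetry: a clear loss blocks implementation, but an inconclusive result does not, which is what trades a little statistical confidence for speed.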
Deep analysis can take anything from 30 minutes to many days, depending on how much there is to be learned. This usually consists of Google Sheets filled with Google Analytics data, but it can also include watching Hotjar sessions, analysing products sold, or any number of other data sources.
Users are surprising. Often the side effects are more interesting than the primary metric we are measuring, which leads us to feed more ideas back into the process.
By feeding the build-measure-learn cycle, we are continually increasing the quality of the hypotheses and generating higher stakes tests with higher degrees of confidence.