Learnings from testing at scale

Designing with data is important for making sound decisions in product design. You can’t improve what you can’t measure. One approach to measuring design changes is to use quantitative testing methods like split (A/B) or multivariate testing. Where do you start? Here are a few things I learnt while working at the Telegraph, with an audience of 44 million monthly users, and testing designs and concepts at scale.

Framing questions for the tests

The primary reason we run a test is to answer a product question. To get an answer you must first ask the question, and in testing the question takes the form of a hypothesis. So the first step in defining a test is writing that hypothesis. In addition to following the “If Then Because” structure, a hypothesis should be stated in a positive tone. The positive tone allows us to reverse it. This reversed hypothesis is called the null hypothesis, and it is usually created by adding “does not” to the hypothesis. As a hypothetical example, “If we shorten the sign-up form then more readers will register because it takes less effort” becomes “Shortening the sign-up form does not increase registrations”. With the null hypothesis we can rule out chance (aka Lady Luck) from our test. If the results let us reject the null hypothesis, then we can say that the data supports our hypothesis.

When to use quantitative testing

Before we jump deeper into the world of quantitative testing, we need to accept that not all questions can be answered with quantitative data. Data that can be measured and counted is quantitative data. Data can also come from interviews, observation or usability testing. This kind of data is known as qualitative data. My colleague Steven Coulomb uses this simple chart to help our team focus our tests. Let’s call this the Coulomb Test Filter.

Coulomb Test Filter

We can arrange our hypotheses to see where they fit in this filter. Remember this tip from Leisa Reichelt about mixing quantitative testing with qualitative testing: “you always need qualitative, you sometimes need quantitative. Qual before quant always”.

Optimisation vs. Impact

In quantitative testing you are using numbers to validate your product decisions. Making refinements to an existing design will improve it. However, the improvements will eventually reach a limit. To go beyond that limit you will need to rethink the strategy to have a bigger impact on the same goal.

What are you trying to learn? Is it optimisation or impact? Which is more valuable for your product at this time? Rochelle King simplifies this idea with a fruit metaphor.

“Are you designing strawberry vs strawberry or strawberry, apples and oranges?” — Rochelle King

Iterate the learnings

Testing is not a one-time event. Your tests should focus on one area of your product rather than the whole thing at once. Otherwise it is hard to pick out the cause and effect of any design change. Optimisation and impact tests can happen in parallel if you are testing independent features. In my experience it is easier to test in series, but it will take more time.

The reason for this structure is to carry the winning learnings forward from one round to the next. In each round of testing you will always have the baseline, or control, to measure your changes against. If your new change is better than the control, it moves on to the next round. If you are lucky enough to have multiple variants that beat the control, then you should move ahead with the variant that scored the highest. The other winning variants could be spun off into another line of tests to optimise that solution.

For the losing variants, the question becomes: with limited resources, should we explore this further or double down on our winning variants? Our best ideas will not always be our biggest winners.

Whether in series or parallel — all good tests will need to converge

How many people need to see the test?

Suppose we tested two variations, and variation A converted 60% of its users while variation B converted 40%. How confident would you be that variation A was the winner?

That confidence is what statistical significance describes. There are a few tools out there to measure whether your results are statistically significant. As a designer I wanted to know what that meant for my test. It came down to three levers that determined how our team structured the tests to reach statistical significance:

  • Sample size: Will the traffic going through these tests be a good reflection of our audience? You can use a sample size calculator to determine how many users each test variation needs.
  • Duration of test: The longer the test runs, the more people can see it. This works in two ways, both of which relate to sample size. First, if the area we are testing has low traffic then we would increase the time. Secondly, for a high traffic area, we could run the test on a smaller percentage of traffic over a longer period to get our desired sample size. This was useful if a change in our test could cause a loss of business revenue.
  • Number of test variants: More variations mean that we will have to split the traffic further. We want to focus on the tests that have the most impact. This meant that we had to narrow down options. Here is a good way that I use to focus on variants with the most impact.
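To make the three levers concrete, here is a minimal sketch of the arithmetic a typical sample size calculator performs, using the standard two-proportion formula. The function names, the 10% and 12% conversion rates, and the traffic figures are all hypothetical, and the 5% significance level and 80% power are common defaults rather than values our team prescribed.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate users needed in EACH variant to detect the given lift.

    Uses the standard normal-approximation formula for comparing two
    proportions with a two-sided test. alpha and power are assumed defaults.
    """
    if baseline_rate == expected_rate:
        raise ValueError("The two rates must differ to size a test.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # chance of catching a real lift
    p_bar = (baseline_rate + expected_rate) / 2
    spread_null = math.sqrt(2 * p_bar * (1 - p_bar))
    spread_alt = math.sqrt(
        baseline_rate * (1 - baseline_rate) + expected_rate * (1 - expected_rate)
    )
    numerator = (z_alpha * spread_null + z_beta * spread_alt) ** 2
    return math.ceil(numerator / (expected_rate - baseline_rate) ** 2)


def days_needed(per_variant, variants, daily_visitors, traffic_share=1.0):
    """Days to reach the sample size when only a share of traffic is tested.

    `variants` counts every arm of the test, including the control.
    """
    total_users = per_variant * variants
    return math.ceil(total_users / (daily_visitors * traffic_share))


if __name__ == "__main__":
    # Hypothetical numbers: 10% baseline conversion, hoping to reach 12%.
    n = sample_size_per_variant(0.10, 0.12)
    print("Users per variant:", n)
    # Control plus two variants, 5,000 visitors a day, 20% of traffic in the test.
    print("Days to run:", days_needed(n, variants=3, daily_visitors=5000, traffic_share=0.2))
```

Changing one input at a time shows the trade-offs described above: a smaller expected lift pushes the sample size up, and adding a variant or lowering the traffic share stretches the duration.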

Is it working?

It is tempting to check the metrics of a test early in the test period. It is dangerous to act on those metrics before the test has finished. You set up the test by calculating the sample size, duration and number of variations so that you can be confident about the result. Don’t ruin it all by jumping the gun. But it is fun to make a game of guessing which variant will win.
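Once the test has reached its planned sample size, a significance check along these lines can be run on the final numbers. This is a minimal sketch of a two-proportion z-test, one common way to compare conversion rates and not necessarily what any particular testing tool uses. The function name and the user counts are hypothetical; the 60% versus 40% split is the example from earlier.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conversions_a, users_a, conversions_b, users_b):
    """Two-sided p-value for the difference between two conversion rates.

    Uses a pooled two-proportion z-test. The normal approximation is rough
    for very small samples, so treat tiny tests with extra caution.
    """
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    if std_error == 0:
        raise ValueError("Cannot test when nobody or everybody converted.")
    z = (rate_a - rate_b) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # The same 60% vs 40% split, seen by 10 users per variant and by 1,000.
    p_tiny = two_proportion_p_value(6, 10, 4, 10)
    p_large = two_proportion_p_value(600, 1000, 400, 1000)
    print("10 users each:   p =", round(p_tiny, 3))
    print("1000 users each: p =", round(p_large, 6))
    # A common (assumed) convention treats p < 0.05 as statistically significant.
```

The same 60/40 split is only convincing when enough people have seen it, which is why the sample size is fixed before the test starts rather than judged from early numbers.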

Data says no to your designs

Quantitative testing and the data in the results won’t solve all of your product issues. Sometimes the result will come back saying that nothing changed or that your new idea is worse. The data is there to support your design decisions. It won’t answer all of your questions but it can help you figure out where to start.

If you want to know more about qualitative testing, find out how to ask better questions.

Good articles to read on data in design