The Measurement Problem Part 2: No Peeking

Ed Roberts
Published in We Are Systematic
Nov 11, 2020

Peeking at your A/B tests at the wrong time can invalidate the results. This is the second measurement problem, and why CRO teams need to understand the statistics behind their testing programs.

This is the second in a series of posts based on quantum physics as an extended metaphor for product development. You can read the first post here. I should caveat by saying I don’t understand quantum mechanics: no-one does. But I do find it fascinating and find it to be a wonderfully compelling and surprisingly comprehensive analogy, so why not?

This post is about A/B testing

If you work in any form of product development or digital marketing role, I’m pretty sure you’ll already be familiar with the concept of A/B testing or Conversion Rate Optimisation more broadly. If not, I’ve included a brief primer below. For those of you who know your apples from your oranges on this subject, feel free to skip to the next section where I jump back into the quantum.

A/B testing, a primer

If you manage a website, online store, app or digital product of any type, at some stage you’ll probably have been concerned with improving its performance with users.

By ‘performance’, I’m not referring to technical aspects like speed and stability (though these aren’t irrelevant to the subject) but how well the product is meeting the goals of your users and, by extension, your business.

Broadly, the practice of improving this performance is known as Conversion Rate Optimisation (CRO) and A/B testing is a way to ‘do’ CRO. What does all this awful jargon mean exactly?

What is CRO?

My favourite analogy for CRO isn’t digital at all. Let’s imagine you own a coffee shop. You measure the performance of the shop in many ways: perhaps by footfall (the number of visitors to your shop each day), by the number of cups of coffee sold, or by your revenue.

The most useful metrics are contextual: coffee-sold-per-visitor and revenue-per-order are much more useful than sales or revenue because they tell you more about your performance relative to your situation.

A small-town coffee shop with 12 customers each morning might be very happy with 12 cups of coffee sold. Whereas your standard tax-avoiding coffee chain with hundreds of visitors would probably consider that to be one of their worst morning’s sales ever.

That’s where the rate in Conversion Rate Optimisation comes from. So what is a conversion?

A conversion literally means the process of ‘converting’ a user from one ‘type’ into a better ‘type’. It’s not getting clearer yet, I know, but I promise you it will.

Back to our fictional coffee shop. Let’s say every person who ting-a-lings through the door is a visitor. Once they buy something, they’re a customer. If they come back the next day, they’re a returning visitor, and so on. Each of these types is sequentially ‘better’ for your business.

If you successfully sell someone a cup of coffee, you convert them from a visitor ‘type’ into the (better) customer ‘type’.

So if you have 10 visitors and 8 of them buy a cup of coffee, your conversion rate is 80%.

Finally, optimisation means what it means: to make better.

So Conversion Rate Optimisation is everything you do to make your rate of converting customers from one type to a better type, better. Making the coffee nicer, the queue shorter, giving out loyalty stamps — this is all CRO.

Except it’s not coffee. It’s online sales, email signups, lead generation, video plays, whatever you actually want users to do with your digital product.

Still with me?

So what is A/B testing?

A/B testing (sometimes referred to as A/B/n testing) is a tool people practising CRO use to optimise their products. It will feel very familiar to anyone who knows the scientific method.

It’s time for another analogy!

Say you’re working on a new drug that makes people taller. If you wanted to test how well your new drug works, you’d probably set up an experiment. Taking a randomised sample of potential customers for your drug, you’d split them into two groups: one that gets your new drug (the treatment group), and another that gets nothing or maybe a placebo (the control group).

You wait a bit, then measure how much taller everyone was compared to before your experiment. If the treatment group got taller at a faster rate than the control group, you declare the experiment a success, and everyone gets a bonus.

A/B testing is the same concept. You split your users into two groups and compare your existing web page or app against a different version, to see which performs better on your chosen conversion rate. One group sees the control (usually the ‘real’ version) and the other sees a potential challenger. The two groups are tested side by side to ensure the test is fair.

For example, you want to know what the best colour button to have on your web page is. You test the current red button (version A) against a blue button (version B) and see which wins. If the blue button performs better, you change the colour to blue for all users and plan your next test. Green, perhaps.

A/B testing in simple terms

So that’s why it’s called A/B testing.

What does A/B/n mean?

You will sometimes see A/B testing referred to as A/B/n testing. This is because we aren’t limited to testing just two versions in any experiment. You might actually be running an A/B/C test, an A/B/C/D test, an A/B/C/D/E test, and so on for infinite variations. There are statistical reasons this might not be a good idea, but that’s a topic for another time.

This is what the n means: it stands for however many further versions might follow in the list, written as n for brevity (and so your sentences don’t look silly).

How do you A/B test?

This is another big topic, but for the purposes of this primer, it’s enough to say there are software tools that allow you to create new versions of web pages and apps, segment your users into control and test groups, then analyse the results. Good ones, like Optimizely, VWO or Google Optimize (no, not you, Adobe Target, you aren’t one of the good ones), will ensure your results are statistically valid before you act on them.
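Those tools handle the user split for you, but if you’re curious what the mechanics might look like under the hood, here’s a minimal sketch of deterministic bucketing in Python (the function name and hashing scheme are illustrative, not any particular tool’s implementation):

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "button-colour") -> str:
    """Deterministically bucket a user into version A or B.

    Hashing the user id (salted with the experiment name) means the same
    user always sees the same version, and the split is roughly 50/50.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"

print(assign_variant("user-1234"))  # the same user always gets the same answer
```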

Back to the quantum

Now we’re all caught up on A/B testing, we can return to our quantum mechanics analogy. In my last post, I introduced the Measurement Problem, known as the Observer Effect in quantum physics.

This post continues the discussion of Measurement Problems, so if you haven’t read my first post, I’d encourage you to start there for the background.

If you don’t have time for that or would rather not, here’s a quick recap:

  • The Observer Effect is a phenomenon in quantum mechanics whereby the act of observing an event changes the outcome of the event
  • The most famous thought experiment on the Observer Effect is Schrödinger’s Cat
  • There are many similar Measurement Problems for user researchers, analysts and marketers, the subject of this mini-series
  • In Part 1, I argued the more clients are involved in projects given to agencies, the more they change the outcome — a creative and strategic error

The subject of this post is the Second Measurement Problem: peeking at A/B tests too early can change the result.

Testing for validity

If you’re taking the trouble to run an A/B test, then it’s probably safe to say the validity of the results is important.

If you end up with a false positive (you roll out your challenger even though it isn’t actually better than your champion) or a false negative (you reject your challenger even though it is genuinely better), it can have obvious and disastrous consequences for your user experience and ultimately your bottom line.

The good news is that statistics has solved this problem for us. There are a few different scientific tests we can use to validate our results; the most common are variations on Student’s t-test.

The t-test and other similar tests are known as frequentist tests. This is because they count the frequency of an event occurring in a sample and use it to estimate the probability of the event occurring in the wider population.

The test was proposed by William Sealy Gosset under the pseudonym Student. Its original application was to improve yields of barley for brewing Guinness, which I consider to be delightful trivia.

In essence, this test tells us how likely we’d be to see a difference this large between the two samples purely by random chance, if there were no real difference between the versions. This probability is expressed as a value, p.

You’ll often see experiment results reported like this:

Version B was found to be 10% better than Version A (p=0.05)

p=0.05 means that, according to a t (or similar) test, the probability of the observed difference occurring purely by random chance is 5%. So roughly 1 in every 20 times you ran this test on versions with no real difference between them, you’d see a result like this as a complete fluke.

When people refer to statistical significance this is what they mean. Is the result ‘significant’ from a statistical standpoint, i.e. real? It’s different from practical significance, which is whether or not the result is interesting to you in real life. A difference of 0.0000001% between two samples could be proved statistically, but have no practical interest.

Statistical significance is usually expressed as a percentage, commonly phrased something like ‘Yes, it’s 95% significant’ or ‘No, it’s only 80% significant’. These numbers are just the complement of p: if p=0.05, that’s the same as p=5%, which is the same as 95% significance.

This is all well and good, but there is a critical assumption woven into frequentist testing that must be understood: that you will only check your results once.

Why is this important? Because of our Measurement Problem.

The Measurement Problem

And so we get to our second Measurement Problem of the series. This one is all about compound probability.

Say you’re running an A/B test. Your boss or client taps you on the shoulder (metaphorically) and asks how it’s going. You dutifully have a look at the results.

“Version B is currently 4% better than Version A, but at only 90% significance”

“Good good, let’s leave it a bit longer,” they say, and leave you alone.

The next day, over comes the boss/client again: “How’s it doing today?”

You check the results again: “Version B is now 6% better than version A, now at 95% significance.”

“Jolly good, another successful test! Roll it out then”

Except it’s probably not a successful test. The p value used to make the claim of 95% significance the second time the results were checked is no longer an accurate measure of the probability of a false positive. In fact, the probability of a false positive has about doubled.

All because you peeked.

Observing the results has morphed reality to one where the p you see is no longer really p. It’s an imposter.

What happened?

Each time you observe your results, you’re calculating p, which you’ll remember is meant to be the probability of detecting a difference that is down to random chance. But the number itself isn’t that probability: it’s a label we use to describe it, and the label is only accurate under certain conditions. This is an important distinction.

Let’s look at two scenarios:

Universe A

You run an A/B test for five days, checking the results once at the end of the test. There are 1000 visitors in your sample, 500 who saw version A and 500 who saw version B.

In the version A sample, 50 visitors converted.
Version A conversion rate = 10%.

In the version B sample, 67 visitors converted.
Version B conversion rate = 13%

You run a significance test. There is a probability that what you see is a false positive. We call it p.

In Universe A, p=0.047, or 4.7%, which meets our 95% significance threshold. Congratulations.

In Universe A, p is equal to the probability of false positive.
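If you want to check that number for yourself, here’s a minimal sketch of the calculation in Python, using a one-tailed two-proportion z-test (a close cousin of the t-test mentioned earlier; the exact flavour of test your tool uses may differ slightly):

```python
from scipy.stats import norm

# Universe A numbers: 50/500 conversions on version A, 67/500 on version B
conv_a, n_a = 50, 500
conv_b, n_b = 67, 500

p_a, p_b = conv_a / n_a, conv_b / n_b
pooled = (conv_a + conv_b) / (n_a + n_b)

# Pooled standard error and z statistic for a two-proportion z-test
se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
z = (p_b - p_a) / se

# One-tailed p-value: the chance of a difference at least this large
# if the two versions really convert at the same rate
p_value = norm.sf(z)
print(f"z = {z:.2f}, p = {p_value:.3f}")  # z ≈ 1.67, p ≈ 0.047
```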

Universe B

You run an A/B test for five days, checking the results at the end of every day. There are 1000 visitors in your sample, 500 who saw version A and 500 who saw version B.

In the version A sample, 50 visitors converted.
Version A conversion rate = 10%.

In the version B sample, 67 visitors converted.
Version B conversion rate = 13%

You run a significance test. There is a probability that what you see is a false positive. We call it p.

In Universe B, p=0.047, the same as in Universe A.

But we know through basic logic that the probability you will see a false positive at some point has to be much higher in Universe B, because you checked the results five times instead of once.

In Universe B, p is not equal to the probability of false positive. Even though the number is the same as in Universe A, 0.047.

The “real” probability in Universe B is a number around three times larger.

This means if you base your Universe B decisions on p, you may be making an error.
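You don’t have to take this on faith: a quick simulation makes the branching visible. The sketch below (assumed numbers: 100 visitors per version per day, a true conversion rate of 10% for both versions, a one-tailed z-test at 95% significance) runs thousands of A/A tests, where any ‘significant’ result is a false positive by definition, and compares peeking daily with checking once at the end:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def one_tailed_p(conv_b, conv_a, n):
    """One-tailed two-proportion z-test: is B better than A?"""
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = ((conv_b - conv_a) / n) / se
    return norm.sf(z)

sims, visitors_per_day, days, rate, alpha = 20_000, 100, 5, 0.10, 0.05
peeking_fp = once_fp = 0
for _ in range(sims):
    # Both versions share the same true conversion rate, so any
    # "significant" result is a false positive by construction
    a = rng.binomial(visitors_per_day, rate, days)
    b = rng.binomial(visitors_per_day, rate, days)
    cum_a, cum_b = np.cumsum(a), np.cumsum(b)
    ns = visitors_per_day * np.arange(1, days + 1)
    p_by_day = [one_tailed_p(cb, ca, n) for cb, ca, n in zip(cum_b, cum_a, ns)]
    if min(p_by_day) < alpha:   # Universe B: peek at the end of every day
        peeking_fp += 1
    if p_by_day[-1] < alpha:    # Universe A: check once, at the end
        once_fp += 1

print(f"False positive rate, checking once: {once_fp / sims:.1%}")
print(f"False positive rate, peeking daily: {peeking_fp / sims:.1%}")
```

With these assumptions the single check comes out close to the nominal 5%, while the daily peeker lands somewhere around two to three times higher.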

Hello, Observer Effect

By observing the experiment more often in Universe B you have changed reality so that p is no longer a good measure of probability.

Nothing is different in Universe B other than the behaviour of the person running the test. All the numbers are the same. p is the same. But the realities have branched. They are now two parallel dimensions in which the only difference is a philosophical concept about the probability of error.

(If this doesn’t turn your head inside out, then you’re probably a statistician, in which case I am a fan of you.)

How to avoid this error

Luckily for us, the field of statistics did not give up in a huff in 1969 [See endnote 1]. There are two very simple ways and one complicated way to avoid distorting reality and making bad decisions based on a misleading p.

1. Calculate your sample size upfront

The test scenario I outlined above — even in Universe A where the tester obeyed the rules — is actually not a well-run test. This is because the sample is too small to reliably detect a difference of that size, so the p value isn’t as sound as it looks.

This is because our test is frequentist. Significance and p are driven by the difference in the frequency of an event between the two datasets, in this case, the frequency of conversions. With a small sample, a sizeable difference can easily appear by chance, and the p value alone won’t warn you that your test was underpowered.

The best way to illustrate this is to imagine flipping a coin once. Whatever the result (let’s say it’s heads), the observed frequency of that outcome is now 100%. That’s a very impressive-looking percentage, but it obviously tells you nothing reliable about the coin.

Say you flip the coin another two times. First, you get tails, then heads again. Should we conclude that there’s a 66.6% chance that any coin flip will result in heads?

No of course not. We know the chance of either outcome is 50%, and it just so happens we’ve only flipped the coin three times.

So you see — the sample size is important! You need enough data points (or visitors) in your sample to be confident that you’ve observed enough to draw conclusions about the wider population.

(As an interesting digression, how many times do you think you’d have to flip a coin to prove the probability of heads or tails is exactly 50%? The answer is an infinite number of times. So better get flipping.)

For A/B testing, this means using a sample size calculator like this excellent one from Evan Miller to work out what your sample needs to be in advance, then checking your results once you hit that sample.

Using Evan’s calculator we can see we actually needed a sample size of 1,629 per variation (so double that for an A/B test) to consider an increase from 10% to 13% statistically proven.
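If you’d rather calculate this in code than in a browser, here’s a sketch using statsmodels’ power calculator. Different tools bake in slightly different assumptions (one- vs two-sided tests, how the variance is estimated), so the number it prints won’t match Evan’s 1,629 exactly, but it lands in the same region:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Detect a lift from a 10% to a 13% conversion rate,
# at 5% significance (two-sided) with 80% power
effect = proportion_effectsize(0.13, 0.10)
n_per_variation = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0,
    alternative="two-sided",
)
print(round(n_per_variation))  # a figure in the 1,700–1,800 range
```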

2. Use Bayesian instead of frequentist tests

Bayesian probability is a different way of running tests using different maths. In Bayesian logic, each new data point added to the sample is used to gradually update your estimate of how likely it is that one version really is better than the other.

So there’s no fixed sample horizon as in frequentist testing, but the more data you get, the more accurate your estimation of the probability of error.

Bayesian testing is very handy for making faster, more risk-based calls on the outcome of experiments. If a test isn’t significant using frequentist maths, and you’ve already used up your one ‘chance’ to check the results, you could try a Bayesian calculator like this one from A/B Test Guide.

If your p value is too large for frequentist significance but Bayesian tests suggest there’s a 70% chance Version B will be better than Version A over the long term, that might still be a good bet to roll out. You might like those odds.
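To make that ‘chance that B is better’ idea concrete, here’s a minimal Bayesian sketch using a Beta-Binomial model with flat priors, run on the numbers from the earlier example. (This is one common approach; the calculator linked above may use different priors or closed-form maths.)

```python
import numpy as np

rng = np.random.default_rng(42)

# Reusing the earlier example: 50/500 conversions on A, 67/500 on B
conv_a, n_a = 50, 500
conv_b, n_b = 67, 500

# Beta-Binomial model: flat Beta(1, 1) prior on each conversion rate,
# updated by the observed conversions and non-conversions
draws = 200_000
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)

# How often is B's sampled conversion rate higher than A's?
prob_b_better = (post_b > post_a).mean()
print(f"P(version B beats version A) ≈ {prob_b_better:.0%}")  # roughly 95%
```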

Opinion is divided over whether Bayesian statistics truly solve the peeking problem, but this is probably beyond the scope of this (already quite long) post.

3. Use a sequential A/B Testing method

I’m aware of at least three proposals for sequential A/B testing: this one from Evan Miller, this one from CXL, and the AGILE method from AnalyticsToolkit. There is also this one if you aren’t afraid of mathematics.

All of these approaches propose ways to test a bit faster using structured and statistically safe methods. This is advanced level CRO, so proceed with caution. It’s enough for this post to simply say that there are ways to limit your exposure to false positives if you so desire.
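To give a flavour of the idea (and this is not a substitute for the methods linked above), the bluntest possible version is to plan your peeks in advance and demand a stricter threshold at each one. Splitting your 5% alpha across the planned looks, Bonferroni-style, is conservative, but it does keep the overall false positive rate in check; proper sequential designs achieve the same protection far more efficiently. The p values below are made up purely for illustration.

```python
# Hypothetical illustration: 5 planned looks at an overall alpha of 5%,
# with a Bonferroni-corrected threshold at each look (conservative but safe)
alpha, planned_looks = 0.05, 5
per_look_threshold = alpha / planned_looks  # 0.01

# In practice these would come from re-running your significance test
# at each planned look; these values are invented for the example
p_values_by_look = [0.21, 0.08, 0.032, 0.018, 0.007]

for look, p in enumerate(p_values_by_look, start=1):
    if p < per_look_threshold:
        print(f"Stop at look {look}: p = {p} clears the corrected threshold")
        break
else:
    print("No look cleared the corrected threshold, so no winner is declared")
```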

Conclusion

Statistics, like quantum mechanics, is sometimes strange and counter-intuitive. In this post I’ve outlined a second Measurement Problem facing analysts, researchers and optimisers hoping to take a statistics-based approach: simply observing your A/B test results at the wrong time can invalidate them.

There is hope, however. A CRO pro will know that calculating sample sizes in advance is essential. Bayesian calculations offer a risk-based approach where frequentist tests fail us, and advanced sequential A/B testing methods allow the truly nerdy to test at a great pace without tearing the fabric of spacetime too drastically.

[1] Armitage, P., McPherson, C.K., Rowe, B.C. (1969) “Repeated Significance Tests on Accumulating Data”, Journal of the Royal Statistical Society, 132: 235–244

Ed Roberts is a partner and product strategist at We Are Systematic, an agency specialising in evidence-driven design.