The fourth Ghost of Experimentation: Peeking

By Lizzie Eardley, with Tom Oliver

👻 This post is part of a series exploring common misconceptions amongst experimentation practitioners that lead to Chasing Statistical Ghosts.

The Ghost of Temptation

This ghost is all our fault as experimenters. Understandably, when we have put in lots of effort developing a new feature or campaign we want it to be a success — and we want to know it’s a success quickly — but we must be careful to not introduce bias into the experiment. Even if you have a perfect AB testing setup, the human element can always unwittingly generate false positives.

As we discussed in the Third Ghost of Experimentation, every comparison you make has a chance of showing a false positive, so more comparisons means more false positives. We usually think of this Multiple Comparison Problem in relation to testing many different metrics for one experiment, but similar logic applies to testing the same metric multiple times over the course of an experiment. This is because there is always unavoidable noise and variance in our data, and the more times you check the data, the more likely it is that you’ll happen upon an instance where this noise misleadingly suggests a meaningful result.

For example, consider an AA test: a test where we know the two variants are identical and the true impact is zero. Random variation in the data we gather causes our measured impact to fluctuate purely by chance. The below image shows, for an AA experiment which ran for 30 days, the p-values we’d get if we compared the performance of the two (identical) variants every day, using all the data available at that point in time. Even though there is no true difference in the performance of the variants, the measured p-value fluctuates continuously due to random noise. If we leave any AA experiment running long enough, its p-value will eventually dip below our significance threshold, at least momentarily.

Daily progression of p-values for an example AA experiment

When conducting AB tests, it is tempting to check your results whilst the test is running to see how it is performing. This is colloquially known as ‘peeking’, and in doing so you are giving this ghost more opportunities to fool you into thinking an ineffective treatment has had an impact. Simply looking at the data is not a problem, but taking any actions based on what you see can introduce bias. After all, it is far more common to stop an experiment halfway through the intended run time because it is significant than it is to stop it because it hasn’t reached significance yet.

Classic frequentist hypothesis testing allows you to control the False Positive Rate (FPR): the chance of incorrectly rejecting the null hypothesis when it is in fact true. The false positive rate is also equivalent to the chance of an experiment with zero impact, e.g. an AA test, appearing to have had a significant effect. However, these AB testing statistics are only valid when you make a single comparison: they are based on the assumption that you will only make an inference using a snapshot of data at one particular, predetermined, point in time. Peeking at results before the end of the intended run time, and acting on what you see, invalidates this assumption and leads to an inflated false positive rate.

Imagine an experimenter who is unaware of this ghost: they regularly run experiments and check the results every day over a 14-day period, and if a result is significant they stop the experiment and claim an impact has been found. With a p-value threshold of 0.05 they may think they’re fixing the false positive rate at 5%. However, by checking 14 times rather than just once, there is a far larger chance of a test being falsely classified as impactful. We can estimate what their real false positive rate would be by simulating many AA tests and seeing how many would be classed as significant using this flawed procedure.

Daily progression of p-values for 20 simulated AA tests

The above plot shows the daily progression of p-values for 20 simulated AA experiments run for 14 days. Once a test’s p-value crosses below the threshold, we call it significant regardless of what happens afterwards: if the experimenter stops a test when it first crosses the threshold, they do not get to see whether it eventually crosses back over to insignificant. With this daily peeking we’d class 6 of the 20 experiments as ‘significant’ (shown by the solid bold lines), which is 30%. If we instead follow the correct procedure, inferring significance only from the p-values at the end of the 14-day period, then only 1 test appears significant, matching our expected 5% false positive rate.
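The simulation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the plots: it assumes normally distributed outcomes with unit variance, a simple two-sided z-test, and a hypothetical 100 users per variant per day. The function names and parameters are our own inventions for the example.

```python
import math
import random
from statistics import NormalDist


def two_sided_p_value(sum_a, sum_b, n_per_variant):
    """z-test p-value for a difference in means, assuming unit-variance outcomes."""
    standard_error = math.sqrt(2.0 / n_per_variant)
    z = (sum_a - sum_b) / n_per_variant / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests=2000, n_days=14, users_per_day=100, alpha=0.05, seed=0):
    """Return (FPR with daily peeking, FPR when only the final day is checked)."""
    rng = random.Random(seed)
    daily_sd = math.sqrt(users_per_day)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_tests):
        sum_a = sum_b = 0.0
        crossed_threshold = False
        p_value = 1.0
        for day in range(1, n_days + 1):
            # Shortcut: the sum of users_per_day N(0, 1) outcomes is N(0, users_per_day),
            # so we draw each variant's daily total directly. Both variants are identical.
            sum_a += rng.gauss(0.0, daily_sd)
            sum_b += rng.gauss(0.0, daily_sd)
            p_value = two_sided_p_value(sum_a, sum_b, day * users_per_day)
            if p_value < alpha:
                # The peeking experimenter would have stopped and claimed a win here.
                crossed_threshold = True
        peeking_hits += crossed_threshold
        final_hits += p_value < alpha
    return peeking_hits / n_tests, final_hits / n_tests


if __name__ == "__main__":
    peeking_fpr, final_only_fpr = simulate_aa_tests()
    print(f"FPR with daily peeking:   {peeking_fpr:.3f}")
    print(f"FPR checking day 14 only: {final_only_fpr:.3f}")
```

Changing `n_days` in this sketch is one way to explore how the number of peeks affects the real false positive rate.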

The more times you check, the bigger the impact of this ghost. Below we can see how the expectation of the real false positive rate increases as the number of ‘peeks’ increases, assuming you make inferences the first time the experiment crosses the 0.05 p-value threshold.

How the real expected FPR increases the more times you peek at the results

How to avoid this ghost

The easiest way to avoid this ghost is simply to:

decide in advance how long your experiment will run and don’t act on any interim results

However, in reality this isn’t always possible when running a business — there are practical reasons why this rule may sometimes need to be set aside:

  • Minimising harm: It is wise for an experimenter to monitor their experiment for any signs of unintended harm, set-up issues or bugs. There may be times when an experiment becomes too risky or expensive to continue for its full cycle and needs to be stopped early. Whilst this may be a sound business decision in such a scenario, it is important to be aware that when an experiment is stopped early, for whatever reason, statistically speaking we learn nothing.
  • Maximising benefit: Similarly, a variant may show a large positive impact, and there may be business reasons why it would be beneficial to stop early and send all traffic to the high-performing variant so that the business can maximise the benefit as soon as possible. Again, when a test is stopped early, we learn nothing.
  • Velocity: We want to learn as much as possible as quickly as possible. If we underestimated the impact of the treatment when doing our initial power calculations, the planned run time will be longer than was really necessary, and this will slow down our velocity of experimentation more than it needs to.
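To make the Velocity point concrete, here is a rough sketch of a textbook power calculation for a two-sided z-test comparing two means. It assumes equal group sizes and a known outcome standard deviation. The function name, defaults and effect sizes are hypothetical, and a real power calculation may need to account for more than this.

```python
import math
from statistics import NormalDist


def users_per_variant(min_effect, sigma=1.0, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect min_effect (in outcome units)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)              # desired power
    return math.ceil(2.0 * ((z_alpha + z_power) * sigma / min_effect) ** 2)


if __name__ == "__main__":
    # Planning for an effect of 0.1 standard deviations versus 0.2:
    print(users_per_variant(0.1))
    print(users_per_variant(0.2))
```

Because the required sample size scales with one over the effect size squared, planning for an effect half as large as the true one roughly quadruples the planned run time.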

Peeking safely

All of the above applies to the most common statistical framework used in AB testing: Frequentist Fixed Horizon Hypothesis Testing. However, if the practical implications of no-peeking are important to you, there are alternative frameworks that allow you to ‘peek safely’ and enable you to act on results as they arrive whilst still maintaining statistical validity.

Sequential Hypothesis Testing

Sequential analyses are a family of hypothesis testing frameworks which do not require a fixed sample size or predetermined run time. They are commonly used in clinical trials where there are ethical implications of continuing a harmful trial or delaying the adoption of a beneficial treatment. Conceptually they work by ‘spreading’ the desired error-rate over multiple interim analyses. For true impacts, this can reduce the expected run time and allow an experiment to conclude earlier; however, these frameworks can also reduce the power and make it harder to detect small effects. A further downside is that stopping early can introduce a bias leading to the overestimation of real impacts. Johari et al.¹ recently presented a variant of Sequential Testing applicable to AB tests which provides ‘always valid p-values’ and allows experimenters to safely make inferences at any time.
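As a rough illustration of ‘spreading’ the error-rate, the sketch below compares two stopping rules on simulated AA tests: stopping the first time p < 0.05 at any of 14 looks, and stopping only when p < 0.05/14 at a look. The second rule is a crude Bonferroni-style split that we have swapped in for simplicity. It is more conservative than the boundaries used in real group-sequential designs, and it is not the always-valid p-value method of Johari et al. The function name, the unit-variance normal outcomes and the 100 users per look are all assumptions made for the example.

```python
import math
import random
from statistics import NormalDist


def fpr_with_interim_looks(per_look_alpha, n_looks=14, users_per_look=100,
                           n_tests=2000, seed=1):
    """Share of simulated AA tests that cross the per-look threshold at any look."""
    rng = random.Random(seed)
    z_critical = NormalDist().inv_cdf(1.0 - per_look_alpha / 2.0)
    false_positives = 0
    for _ in range(n_tests):
        diff_sum = 0.0  # running total of (variant A outcomes - variant B outcomes)
        for look in range(1, n_looks + 1):
            # With unit-variance outcomes, each look adds N(0, 2 * users_per_look)
            # to the difference in totals between the two identical variants.
            diff_sum += rng.gauss(0.0, math.sqrt(2.0 * users_per_look))
            z = diff_sum / math.sqrt(2.0 * look * users_per_look)
            if abs(z) > z_critical:
                false_positives += 1
                break  # the experimenter stops at the first significant look
    return false_positives / n_tests


if __name__ == "__main__":
    print(f"Naive threshold (0.05 at every look): {fpr_with_interim_looks(0.05):.3f}")
    print(f"Split threshold (0.05 / 14 per look): {fpr_with_interim_looks(0.05 / 14):.3f}")
```

The split threshold keeps the overall false positive rate at or below 5%, at the cost of needing much stronger evidence at every look, which is one way to see why sequential designs can reduce power.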

Multi-armed Bandits

Multi-armed Bandits are a kind of sequential testing. The name comes from an analogy with a gambler who faces multiple slot machines, each with different (unknown) payouts. To receive the biggest overall reward, the gambler must balance the temptation to play only the slot machine which appears to give the biggest payout against the benefit of learning about other slot machines. In an AB testing context, multi-armed bandit frameworks allow the experimenter to make a trade-off between ‘exploiting’ a seemingly optimum variant and ‘exploring’ other options to gather more data. Hence the framework automatically acts on interim data, and it can prioritise gaining more ‘reward’ from the experiment over more learnings. Bandit tests are particularly appropriate for seasonal or short-term tests, such as testing versions of a banner promoting a Christmas sale: if the learnings are only useful or relevant for a short time, then acting on the results as soon as possible is likely a top priority.
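One common way to implement this explore/exploit trade-off is Thompson sampling, sketched below for two hypothetical banner variants with binary (click or no click) outcomes. The choice of Thompson sampling, the uniform Beta(1, 1) priors, the function name and the made-up true click rates are all assumptions for illustration; other bandit strategies exist.

```python
import random


def thompson_sampling(true_rates, n_rounds=5000, seed=2):
    """Simulate a Bernoulli bandit and return how many times each arm was shown."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_rounds):
        # Explore/exploit: draw a plausible click rate for each arm from its
        # Beta posterior (uniform prior), then show the arm with the highest draw.
        draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        # Simulate the visitor's response using the (normally unknown) true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


if __name__ == "__main__":
    # Hypothetical banners: A converts at 5%, B at 15%.
    shown = thompson_sampling([0.05, 0.15])
    print(f"Times shown: banner A = {shown[0]}, banner B = {shown[1]}")
```

As evidence accumulates, most traffic is routed to the better-performing banner, which is the ‘reward over learnings’ behaviour described above: we end up with relatively little data about the weaker variant.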

Bayesian Methods

Bayesian methods are an alternative to classic frequentist hypothesis testing. Rather than simply aiming to accept or reject one hypothesis (the null), Bayesian methods use prior probabilities and the available data to identify which hypothesis is most likely to be true. Bayesian hypothesis testing is often considered ‘immune’ to peeking. This is not always correct²; however, with appropriate stopping rules and good practices it can be implemented in a way which allows continuous monitoring and early stopping³.
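As a small example of the kind of quantity a Bayesian analysis produces, the sketch below estimates the posterior probability that variant B has a higher conversion rate than variant A. It assumes binary conversions, independent Beta priors (uniform by default) and Monte Carlo sampling; the function name and the example counts are hypothetical. Computing this probability does not by itself make peeking safe: a stopping rule built on it still needs the care described above.

```python
import random


def prob_b_beats_a(conversions_a, users_a, conversions_b, users_b,
                   prior_alpha=1.0, prior_beta=1.0, n_draws=20000, seed=3):
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        # Beta prior + binomial data gives a Beta posterior for each conversion rate.
        rate_a = rng.betavariate(prior_alpha + conversions_a,
                                 prior_beta + users_a - conversions_a)
        rate_b = rng.betavariate(prior_alpha + conversions_b,
                                 prior_beta + users_b - conversions_b)
        wins += rate_b > rate_a
    return wins / n_draws


if __name__ == "__main__":
    # Hypothetical counts: 100/1000 conversions for A versus 150/1000 for B.
    print(f"P(B beats A) = {prob_b_beats_a(100, 1000, 150, 1000):.3f}")
```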
