A Better Way to Test for Sample Ratio Mismatches (SRMs) and Validate Experiment Implementations

Michael Lindon
Nov 12, 2020


…or why I don’t use a Chi-squared test.

Preamble:

This post is intended to be a gentle introduction to our latest whitepaper which describes a new methodology for validating experiment implementations through sequential testing of sample ratio mismatches (SRMs). The novelty of this work is that it presents a new sequential statistical test for multinomial data.

Introduction: What and Why?

In my time as a statistician at Optimizely, one of the most frequent questions I encountered from customers concerned the bucket health of their experiments and sample ratio mismatches (SRMs). A sample ratio mismatch is the colloquial term for the situation in which the total number of units in each treatment group differs significantly from what would be expected under an experimental design with random assignment, where each treatment group has its own assignment probability. As an example, consider an A/B test in which each experimental unit has an equal probability of receiving the treatment or the control. If this experiment resulted in 3123 units in the treatment group and 6877 in the control group, then one may doubt its validity. Informally, this test is said to exhibit an SRM.
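As a rough illustration (not code from the whitepaper), this is how one might run that one-shot check in Python with SciPy, testing the hypothetical 3123/6877 split above against the intended 50/50 design:

```python
# A minimal sketch, assuming Python with SciPy is available:
# one-shot Chi-squared test of the observed split against the intended 50/50 design.
from scipy.stats import chisquare

observed = [3123, 6877]   # hypothetical treatment / control counts from the example
expected = [5000, 5000]   # expected counts under equal assignment of 10,000 units

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.2e}")  # a tiny p-value suggests an SRM
```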

The two primary concerns when encountering an SRM are a problem in the assignment mechanism, potentially introducing non-ignorable assignment bias, or a problem in data collection, where data is potentially missing not at random. Realistically, online experiments can be difficult to implement and execute correctly. There are large engineering requirements, which may contain bugs, and data processing steps, which may contain incorrect logic, either of which can render causal conclusions from experiments invalid. To address this, many have popularized the practice of testing for SRMs. See, for example, Lukas Vermeer’s keynote speech “One Neat Trick to Run Better Experiments” or Ronny Kohavi’s talk on “Trustworthy A/B Tests”. Indeed, SRMs can reveal a wealth of implementation errors, and their causes and fixes have been well documented in the experimentation literature:

“One of the most useful indicators of a variety of data quality issues is a Sample Ratio Mismatch (SRM)…”

Fabijan et al

You can also rewatch our talk from Test and Learn 2020.

Why I Don't Like Chi-Squared Tests (and Friends):

To the credit of those popularizing these ideas, it has now become a best practice to validate the experiment implementation by performing an SRM test. For the most part, this checks that the assignment mechanism and the data processing steps are performing as expected. The usual procedure is to perform a Chi-squared test on the total units observed in each treatment group against the intended assignment probabilities of the design, at the end of the data collection and prior to analysis. There is nothing wrong with this statistically, but there is one obvious flaw.

The issue is that one learns about a problem with the implementation only after all data collection is completed, which is arguably far too late. Ideally one seeks to validate the implementation at the outset, yet if the Chi-squared test is performed too early, it may not have enough power to reject the null, and implementation errors may go undetected. Importantly, the Type-I (false positive) error probability guarantee of the Chi-squared test holds only when it is performed once, so when should it be performed? The tension between running the test early enough to prevent wasted units but late enough to have sufficient power makes this question very difficult to answer. Ultimately the Chi-squared test, and its fixed-horizon relatives, are inflexible. This has led to the proliferation of some bad practices in the experimentation space.

Given this difficulty, many practitioners incorrectly monitor their experiments continuously by repeatedly performing significance tests, usually in an ad hoc fashion and without any multiplicity correction, unknowingly increasing their chances of a Type-I error. The following situation is commonly observed. An experimentation team has some doubt about the validity of an experiment implementation and decides to run a Chi-squared test to investigate. The Chi-squared test does not reject the null, but the team’s doubt remains. They conclude that the Chi-squared test probably did not have enough power and decide to run the test again tomorrow. This continues for a few iterations until the null is eventually rejected, resulting in a false positive.

Why does this happen? A classic paper, “Repeated significance tests on accumulating data” by Peter Armitage, showed that the probability of obtaining a false positive using a Chi-squared test configured at the 0.05 level can increase to 0.14 with as few as five repeated tests. By allowing the decision to perform a new test to depend on the outcome of a previous test, one risks sampling to a foregone conclusion. I’ll illustrate how problematic this can be in the next section.

False Positives of Chi-Squared Test Under Continuous Monitoring

Let’s take this to extremes, for the sake of example, and perform a Chi-squared test after every new datapoint, which in this application corresponds to recording a new unit in a treatment group. Suppose there are five groups in this experiment (1 control, 4 treatments), and each unit is assigned to each group with probability 1/5. These assignment events can be simulated by creating a sequence of multinomial random variables (of size 1). An important point to stress is that these random variables are simulated using the assignment probabilities of the intended design, so that the null hypothesis is true in all simulations. After each new multinomial random variable, let’s perform a Chi-squared test, configured at the 0.1 alpha level, on the accumulating set of data. The figure below shows the resulting p-value over time up to the point at which the Chi-squared test rejected the null, marked by the red dot.
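A rough sketch of this simulation, assuming Python with NumPy and SciPy (illustrative code, not the code used to produce the figures below):

```python
# Illustrative sketch: assign units to 5 equally likely groups one at a time and
# run a Chi-squared test on the accumulated counts after every new unit.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
k, alpha, n_max = 5, 0.1, 1000
probs = np.full(k, 1 / k)          # intended design: equal assignment probabilities

counts = np.zeros(k)
for n in range(1, n_max + 1):
    counts[rng.choice(k, p=probs)] += 1            # one multinomial draw of size 1
    p = chisquare(counts, f_exp=n * probs).pvalue  # crude for small n, but that is the point
    if p < alpha:
        print(f"Null (incorrectly) rejected after {n} units, p = {p:.3f}")
        break
```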

p-value resulting from a Chi-squared test over an accumulating set of multinomial data. Null rejected at approximately the 300th multinomial draw (red dot). p-value threshold at 0.1 (dashed black line).

This is clearly not the outcome we seek as the data was simulated under the true null hypothesis. To see if we were just unlucky, let’s simulate a further 100 experiments.
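Repeating that simulation 100 times (again a sketch under the same assumed setup) estimates how often a null-true experiment gets rejected somewhere along the way:

```python
# Illustrative sketch: how many of 100 null-true experiments ever reject the null
# when a Chi-squared test is run after every one of the first 1000 units?
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(1)
k, alpha, n_max, n_sims = 5, 0.1, 1000, 100
probs = np.full(k, 1 / k)

def ever_rejects() -> bool:
    counts = np.zeros(k)
    for n in range(1, n_max + 1):
        counts[rng.choice(k, p=probs)] += 1
        if chisquare(counts, f_exp=n * probs).pvalue < alpha:
            return True
    return False

false_positives = sum(ever_rejects() for _ in range(n_sims))
print(f"{false_positives}/{n_sims} experiments falsely rejected the null at least once")
```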

p-values from Chi-squared tests on 100 accumulating sets of multinomial data simulated under the null hypothesis. p-value threshold at 0.1 (dashed black line). Null rejected when the p-value falls below 0.1 (red dots).

79 out of the 100 experiments resulted in the rejection of the null hypothesis, and this is a conservative number as we only looked at the first 1000 datapoints. This comes as a surprise to many who mistakenly believe that the Type-I error probability of a Chi-squared test, configured at the 0.1 level, is 0.1 no matter how many times it is performed.

A New Sequential Test

The previous section argued that the Chi-squared test by itself is somewhat impractical for detecting SRMs because it is hard to get the timing right. Too soon and it might be underpowered; too late and much of an expensive experiment is wasted. Furthermore, we have established that when the test is used incorrectly, through repeated testing without multiplicity corrections, as seems to be common in practice, the Type-I error increases dramatically. At Optimizely we recognized the value in validating customer implementations but found the current tooling to be lacking. We sought to develop a new statistical test that could be performed after every datapoint, so that SRMs can be detected as early as possible, while still controlling Type-I error at the customer’s desired level. These are exactly the properties possessed by our newly proposed test, described in our latest paper:

At Optimizely we refer to this as our ssrm (sequential sample ratio mismatch) test. To illustrate how the performance of the ssrm-test compares against the Chi-squared test, let’s consider the sequential p-values delivered for the same simulations. To start, let’s look at a single simulation:
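As a rough illustration of the general flavour of such a sequential test (this sketch is illustrative only, not the exact test from the paper nor our production code), one can build an anytime-valid sequential p-value for multinomial counts from a Dirichlet-multinomial mixture martingale together with Ville’s inequality; the uniform Dirichlet prior below is an arbitrary choice:

```python
# Illustrative sketch: an anytime-valid sequential p-value for multinomial counts
# built from a Dirichlet-multinomial mixture martingale (not production code).
import numpy as np
from scipy.special import gammaln

def sequential_p_values(assignments, null_probs, prior=None):
    null_probs = np.asarray(null_probs, dtype=float)
    k = len(null_probs)
    prior = np.ones(k) if prior is None else np.asarray(prior, dtype=float)
    counts = np.zeros(k)
    p_running, p_seq = 1.0, []
    for g in assignments:
        counts[g] += 1
        n = counts.sum()
        # log Bayes factor: Dirichlet-multinomial marginal likelihood of the observed
        # sequence divided by its likelihood under the null assignment probabilities
        log_marginal = (gammaln(prior.sum()) - gammaln(prior.sum() + n)
                        + np.sum(gammaln(prior + counts) - gammaln(prior)))
        log_null = float(np.sum(counts * np.log(null_probs)))
        log_bf = log_marginal - log_null
        # Under the null the Bayes factor is a nonnegative martingale, so by Ville's
        # inequality the running minimum of 1/BF is a valid sequential p-value.
        p_running = min(p_running, float(np.exp(-log_bf)))
        p_seq.append(p_running)
    return p_seq

# Example: 1000 assignments simulated under the null (equal probabilities).
rng = np.random.default_rng(2)
design = np.full(5, 0.2)
p = sequential_p_values(rng.choice(5, size=1000, p=design), design)
print(f"final sequential p-value: {p[-1]:.3f}")
```

Because this sequential p-value is a running minimum, its sample paths are non-increasing and piecewise constant, and it remains valid no matter how often it is inspected.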

Sequential p-value from the ssrm-test on an accumulating set of multinomial data simulated under the null hypothesis. Sequential p-value threshold at 0.1 (dashed black line).

Notice how the sample paths of the sequential p-value are piecewise constant and the test correctly did not reject the null. Again, is this just luck? Let’s simulate a further 100 such experiments under the true null hypothesis.

Sequential p-values from the ssrm-test on 100 accumulating sets of multinomial data simulated under the null hypothesis. Sequential p-value threshold at 0.1 (dashed black line). Null rejected when the p-value falls below 0.1 (red dots).

Only 7 of the 100 experiments rejected the null, which is in line with what one should expect for a test configured at the 0.1 alpha level and data simulated under the null. Note that the probability of making a Type-I error using the ssrm-test is controlled no matter how many samples are collected, and no matter how many times it is used.

Rapid Detection of Errors

The previous simulations illustrated the behaviour of the ssrm-test when the null is correct and demonstrated control over Type-I errors. What about Type-II errors? Suppose there is an implementation error in the experiment setup and units are being received at the end of the pipeline with a different probability than the intended design. Let’s investigate this by simulating multinomial random variables with a slightly different assignment probability, specifically, [0.2, 0.2, 0.2, 0.1, 0.3] instead of being equiprobable.
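Sketching this in the same illustrative setup as above (reusing the sequential_p_values sketch from the earlier snippet), one can record how quickly the sequential p-value drops below the threshold when the delivered probabilities differ from the design:

```python
# Illustrative sketch (reuses sequential_p_values from the earlier snippet):
# simulate assignments under mismatched probabilities and find the first rejection.
import numpy as np

rng = np.random.default_rng(3)
design = np.full(5, 0.2)                        # intended assignment probabilities
actual = np.array([0.2, 0.2, 0.2, 0.1, 0.3])    # what the (buggy) pipeline actually delivers

p = sequential_p_values(rng.choice(5, size=2000, p=actual), design)
first_rejection = next((i + 1 for i, pv in enumerate(p) if pv < 0.1), None)
print(f"null first rejected after {first_rejection} units")
```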

Sequential p-values from the ssrm-test on 100 accumulating sets of multinomial data simulated under the alternative hypothesis. Sequential p-value threshold at 0.1 (dashed black line). Null rejected when the p-value falls below 0.1 (red dots).

The ssrm-test resulted in zero Type-II errors in this simulation study, rejecting the null within the first few hundred datapoints. This is of great value to the experimentation team, as it allows the implementation to be validated right at the beginning of data collection rather than at the end.

Conclusion

In most experimentation teams, the responsibility usually falls on the experimenter to manually perform an SRM check; rarely is it built into the platform itself. Arguably this validation should be performed automatically on every test.

Sequential tests, with their ability to be performed after every datapoint, are ideal methods for testing hypotheses about streaming data — exactly the type of data received from running online experiments.

It has been well established that SRMs are excellent signals of underlying implementation issues in randomized experiments. For this reason, SRM testing methodology has garnered a lot of attention (see papers by Yahoo, LinkedIn, Microsoft and this blog post by Twitter). Existing methodologies have certain limitations, as described in this post, and so Optimizely has developed a new sequential statistical test that allows SRMs to be detected as soon as possible while still controlling Type-I error. The details can be found in our new whitepaper.
