How to better control for false positives while monitoring your experiment
Peeking and multiple testing are long-standing problems in the experimentation space as more products and features rely on experimentation to understand their impact. Peeking refers to previewing the results of an experiment and acting on them before the test completes. The multiple testing problem, as described on Wikipedia, arises when multiple comparisons are made across simultaneous statistical tests. In this article I discuss how I developed two data sets to evaluate methods that address peeking and multiple testing in order to reduce false positives during the monitoring phase of an experiment.
Business need
WExP (Windows Experimentation Platform) is the experimentation platform used internally at Microsoft to run operating system experiments. Similar to experimentation platforms used by other companies, it monitors an experiment’s progress by generating scorecards on a set cadence to ensure early detection of unwanted outcomes. The scorecards evaluate certain metrics using p-values from a standard Welch’s t-test, and these metrics consist of more than 200 measures evaluating the overall system in terms of quality and health.
Within this experimentation platform context, it is important to recognize that standard p-values are valid (in other words, they correctly control Type I error) only if the sample size is fixed in advance and the results are observed only once that sample size is reached. These conditions exist precisely to avoid the peeking and multiple testing problems mentioned earlier: if we don't control for peeking and a large volume of comparisons, Type I error is inflated, leading to a false positive rate higher than five percent. For the business, this translates into a higher rate of false alarms that require real time and effort to investigate, as well as reduced confidence in the platform. So, the goal of the analysis I describe in this article was to reduce false positives while preserving our ability to identify true positives and ensuring people investigate worthwhile regressions.
Approach
We identified three groups of corrective approaches we wanted to evaluate, and they resulted in 11 variations. To be able to evaluate the approaches, we needed data, and so we created two data sets. One set used historical experiment scorecard data to approximate a labeled dataset. The second set simulated A/A tests to also generate scorecard history and provide true labeled data, though limited to true negatives and false positives. Using these two data sets we were able to evaluate the 11 variations.
Eleven approach variations
Below is a high-level summary of the three types of corrective approaches and their combinations, totaling 11 variations.
The family-wise error rate (FWER) approach controls the probability of making at least one false discovery, which helps account for multiple testing. While it is one of the simplest multiplicity corrections to understand and implement, it can become too conservative as the number of tests grows, making it harder to detect true movement and increasing the probability of false negatives (that is, lower power, or a reduced ability to detect true positives).
- Variation B, Bonferroni: Dividing the 0.05 alpha by the number of selected measures being evaluated.
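As an illustration, here is a minimal sketch of the Bonferroni adjustment in Python. The count of 200 measures is a placeholder based on the scorecard description above, not the exact number WExP uses.

```python
# Minimal sketch of the Bonferroni adjustment (illustrative values only).
alpha = 0.05
n_measures = 200  # placeholder count of selected measures per scorecard

bonferroni_alpha = alpha / n_measures  # 0.00025
# A measure is flagged only if its Welch's t-test p-value falls below bonferroni_alpha.
```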
The false discovery rate (FDR) approach controls the expected fraction of all discoveries that are false, which also helps account for multiple testing. It is slightly more complex to implement and has greater power than FWER control, though like any multiplicity correction it still comes at the cost of some false negatives (that is, somewhat lower power, or a reduced ability to detect true positives, relative to uncorrected tests).
- Variation BH, Benjamini-Hochberg: Adjusting the 0.05 alpha based on the number of selected measures being evaluated, with the measures rank-ordered by p-value.
- Variation BY, Benjamini-Yekutieli: Adjusting the 0.05 alpha based on the number of selected measures being evaluated, with the measures rank-ordered by p-value and an additional harmonic-series modifier.
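Both FDR procedures are available in statsmodels; the sketch below applies them to a placeholder array of p-values standing in for one scorecard's Welch's t-test results.

```python
# Minimal sketch of the BH and BY procedures using statsmodels.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
p_values = rng.uniform(size=200)  # placeholder for one scorecard's p-values

# Each call returns a boolean array of which measures to flag at alpha = 0.05.
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
reject_by, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_by")

print("BH flags:", reject_bh.sum(), "BY flags:", reject_by.sum())
```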
Alpha spending controls the overall Type I error across the experiment duration (which helps to account for peeking).
A key component required for the alpha spending approach is the information fraction: the proportion of the total information expected by the scheduled end of the experiment that has accrued at the time of evaluation. Because it depends on the time of evaluation, the Day 1 scorecard has the lowest information fraction (approximately 10 percent) and the Day 28 scorecard has the full information fraction (approximately 100 percent).
It is a complex problem to accurately estimate the information fraction on a given day, so for this analysis we decided to use the true information fraction to see how well the approach works when we have full information. This will help us understand whether it is "worth" pursuing alpha spending, since it would require another set of analyses to evaluate how well we can estimate a given day's information fraction in real time. (A sketch of both spending functions follows the list of variations below.)
- Variation OBF, O’Brien-Fleming: Adjusting the 0.05 alpha based on the true information fraction.
- Variation KDM, Kim and DeMets: Adjusting 0.05 alpha based on the true information fraction.
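The sketch below shows the two spending functions as commonly defined in the Lan-DeMets framework, assuming a two-sided overall alpha of 0.05 and a known information fraction; the rho value for Kim-DeMets is a typical choice, not necessarily the one used in WExP. Note that these functions return the cumulative alpha that may be spent by a given look; deriving per-look thresholds additionally requires subtracting what was already spent at earlier looks.

```python
# Minimal sketch of the two alpha spending functions (cumulative alpha spent
# at information fraction t, where t = data observed so far / data expected
# at the scheduled end of the experiment).
import numpy as np
from scipy.stats import norm

ALPHA = 0.05  # overall two-sided Type I error budget

def obrien_fleming_spend(t, alpha=ALPHA):
    """Lan-DeMets O'Brien-Fleming-type spending function."""
    return 2.0 * (1.0 - norm.cdf(norm.ppf(1.0 - alpha / 2.0) / np.sqrt(t)))

def kim_demets_spend(t, alpha=ALPHA, rho=3.0):
    """Kim-DeMets power-family spending function: alpha * t**rho (rho = 3 is one typical choice)."""
    return alpha * t ** rho

for day, t in [(1, 0.10), (14, 0.50), (28, 1.00)]:
    print(f"Day {day}: OBF={obrien_fleming_spend(t):.5f}, KDM={kim_demets_spend(t):.5f}")
```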
Combining alpha spending with FWER or FDR control addresses both the multiple testing and peeking issues head on (one such combination is sketched in the code after the list below). From this evaluation set we hope to answer whether it is worth the additional processing of determining the information fraction to enable alpha spending as part of a combination. Again, we evaluate with the true information fraction, as it shows how well these combination approaches would work with full information.
- Variation OBF_B, O’Brien-Fleming & Bonferroni: Adjusting the Bonferroni calculated alpha based on the true information fraction.
- Variation OBF_BH, O’Brien-Fleming & Benjamini-Hochberg: Adjusting the Benjamini-Hochberg calculated alpha based on the true information fraction.
- Variation OBF_BY, O’Brien-Fleming & Benjamini-Yekutieli: Adjusting the Benjamini-Yekutieli calculated alpha based on the true information fraction.
- Variation KDM_B, Kim and DeMets & Bonferroni: Adjusting the Bonferroni calculated alpha based on the true information fraction.
- Variation KDM_BH, Kim and DeMets & Benjamini-Hochberg: Adjusting the Benjamini-Hochberg calculated alpha based on the true information fraction.
- Variation KDM_BY, Kim and DeMets & Benjamini-Yekutieli: Adjusting the Benjamini-Yekutieli calculated alpha based on the true information fraction.
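Here is a minimal sketch of one combination variation (OBF_BH), reusing the obrien_fleming_spend helper and placeholder p_values from the earlier snippets; it is an illustrative simplification rather than the exact WExP implementation.

```python
# Minimal sketch of the OBF_BH combination: the spending function sets the
# alpha available at this look, and BH distributes it across the measures.
from statsmodels.stats.multitest import multipletests

def obf_bh_flags(p_values, info_fraction, alpha=0.05):
    look_alpha = obrien_fleming_spend(info_fraction, alpha)  # from the earlier sketch
    reject, _, _, _ = multipletests(p_values, alpha=look_alpha, method="fdr_bh")
    return reject

# Example: a Day 14 scorecard at roughly half of the planned information.
flags = obf_bh_flags(p_values, info_fraction=0.5)
print("Flagged measures:", flags.sum())
```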
Approximating labeled data
Although we have a history of experiments, we don’t truly have per-metric, per-scorecard labeled data identifying whether each one was a true positive, false positive, true negative, or false negative. And so, we approximated labels like “true positive” by summarizing real experiment scorecard history and identifying “consistent statistically significant movements.” Consistent means that for a given experiment, once a metric is identified as statistically significant, it remains so for all subsequent scorecards.
In the example table above, Experiment A has three metrics.
- Metrics 1 and 2 bounce back and forth between statistically significant and not statistically significant across their scorecard history. This does not meet the consistency criterion, and they would be labeled as false positives.
- Metric 3, once identified as statistically significant, remains so across its scorecard history. This meets the consistency criterion and would be labeled as a true positive.
Using this consistency approximation, we can create a labeled dataset by which to evaluate:
- Recall: True Positives / (True Positives + False Negatives). (Of the actual positives, what proportion did we identify?)
- Precision: True Positives / (True Positives + False Positives). (When we flag something as statistically significant, how often are we right?)
- False Positive Rate: False Positives / (False Positives + True Negatives). (The probability of falsely rejecting the null hypothesis, or the proportion of true negatives misclassified as positives.)
For the consistency evaluation we filtered to experiments with a minimum of six scorecards and evaluated metrics with at least two statistically significant movements. (If a metric has only one statistically significant movement, we don't have enough history to apply the label "Consistent," and so it is labeled a false positive.)
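For concreteness, here is a minimal sketch of this consistency labeling rule; the helper name and example histories are hypothetical.

```python
# Minimal sketch of the "consistent statistical significance" labeling rule.
def label_metric(history):
    """history: chronologically ordered booleans, True = statistically significant in that scorecard."""
    n_significant = sum(history)
    if n_significant == 0:
        return "true negative"
    if n_significant == 1:
        # A single significant movement is not enough history to call it consistent.
        return "false positive"
    first_significant = history.index(True)
    consistent = all(history[first_significant:])
    return "true positive" if consistent else "false positive"

# Hypothetical histories echoing the example above:
print(label_metric([False, True, False, True, True, False]))  # bounces around -> false positive
print(label_metric([False, False, True, True, True, True]))   # stays significant -> true positive
```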
Simulated A/A scorecard data
An A/A test (or null test) is the same as an A/B test, except there is no difference between Treatment and Control. The value in running an A/A test is that because we know there are no true differences between Treatment and Control, we can accurately evaluate the false positive rate given various mitigation techniques. An A/A scorecard can have only false positives (anything identified as statistically significant) and true negatives.
Using this A/A data we can accurately evaluate:
- False Positive Rate: False Positives / (False Positives + True Negatives). (The probability of falsely rejecting the null hypothesis, or the proportion of true negatives misclassified as positives.)
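Below is a minimal sketch of how one simulated A/A scorecard can be generated and evaluated, assuming 200 placeholder metrics drawn from identical distributions for Treatment and Control; it compares the unadjusted false positive rate against a BH-adjusted one.

```python
# Minimal sketch of a simulated A/A scorecard: identical Treatment and Control,
# Welch's t-test per metric, then raw vs. BH-adjusted false positive rates.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
n_metrics, n_users = 200, 5000

p_values = []
for _ in range(n_metrics):
    treatment = rng.normal(0.0, 1.0, n_users)
    control = rng.normal(0.0, 1.0, n_users)  # same distribution: a true A/A
    p_values.append(ttest_ind(treatment, control, equal_var=False).pvalue)
p_values = np.array(p_values)

raw_fpr = np.mean(p_values < 0.05)  # every flag here is a false positive
bh_fpr = np.mean(multipletests(p_values, alpha=0.05, method="fdr_bh")[0])
print(f"Unadjusted FPR: {raw_fpr:.3f}, BH-adjusted FPR: {bh_fpr:.3f}")
```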
Results
The table below summarizes all 11 approaches evaluated:
How to read the table
Across the columns we have all the approaches listed with abbreviated names, and down the rows we have various evaluation calculations with their formulas.
- Implementation Cost summarizes the added processing and coding required to stand up the given approach.
- Risk in Correct Calculation summarizes the chance of not being able to calculate the given approach correctly.
The SS column represents the baseline scorecard alpha of 0.05 to beat and is included as a reference.
Eliminating the worst performers
- Looking at OBF and KDM, we see that although their tradeoff between Precision and Recall is good, their A/A False Positive Rate is among the highest, as illustrated by the high counts of rule violations and false positives. Given their poor performance and difficulty of implementation, we can remove these approaches from consideration.
- Comparing the combination OBF and combination KDM approaches, we see that overall the combination OBF approaches perform better in terms of Precision, Recall, and A/A False Positive Rate, so we remove the combination KDM approaches from consideration. Among the OBF combinations we also remove OBF_B, as it is the worst performing of the three.
- This leaves us with a much smaller set of five to consider:
- Looking across all five approaches, the A/A False Positive Rate is quite low. We are doing much better than SS with any of these approaches, which is awesome to see.
- Comparing the two OBF combination approaches to the three B approaches, we see that although the OBF combinations have slightly higher Precision, they also have worse Recall and only slightly better A/A False Positive Rate. These results are not enough of an improvement over the three B approaches, nor as good as we had hoped. This outcome, coupled with the fact that results in the wild would be slightly worse because we would have to predict the information fraction in real time, helped us decide to eliminate the remaining alpha spending approaches from consideration.
Reviewing the best performers
- Looking across the three B approaches (Bonferroni (B), Benjamini-Hochberg (BH), and Benjamini-Yekutieli (BY)), there is likely no real difference in Precision or A/A False Positive Rate, but BH does appear to have better Recall than both BY and B.
- This potential gain in Recall, coupled with the fact that we already use BH elsewhere in the stack, leads us to recommend BH as the methodology for adjusting alpha during evaluation.
Conclusion
Having developed two data sets, we were able to reasonably evaluate the 11 approaches and identified Benjamini-Hochberg as the best mix of performance, implementation cost, and maintenance. Today we use the Benjamini-Hochberg correction to create an adjusted alpha for evaluating the relevant quality and health measure movements during the monitoring phase, and we have greatly reduced false positives.
Anfisa Rovinsky is on LinkedIn.