Increase your chances of capturing feature impact with A/B tests

Wen Qin
Data Science at Microsoft
7 min read · Aug 10, 2021

Wen Qin is joined for this article by co-authors Alexandre Matos Martins and Widad Machmouchi.

A/B testing is a common method for evaluating the impact of specific features on an overall product or service. With this approach, each user is presented with one variant at random, and the A/B test evaluates whether the variants generate statistically significantly different outcomes. Using this type of method can greatly help in making data-driven decisions, especially in times of uncertainty such as during the COVID-19 pandemic. In this way, A/B testing is a powerful tool to establish causation, but only when the underlying metrics are sufficiently sensitive. To evaluate sensitivity, people usually focus on statistical power, but they may not be aware of the other aspect: movement probability. In this blog post, we talk about why and how to measure metric sensitivity considering both aspects.

Why should you care about metric sensitivity?

Assume for a moment that you are a member of the Microsoft Azure Identity team, and you want to ensure that the imagery on an Authenticator app download prompt visually communicates the purpose of the app. You run an A/B test to see whether the new design has an impact on app downloads. The left figure below is the existing variant, and the one on the right is a new design variant being tested. You hypothesize that the new variant will increase downloads. Accordingly, you define a metric to track app downloads.

After the A/B test concludes, if the hypothesis test shows the change in the metric value is statistically significant, the conclusion is clear. But if it’s not significant, is the new design really not having an impact on downloads? The answer is unclear, because it can also be due to lack of metric sensitivity.

This is because when conducting A/B tests, designing a version of a metric is just the start. Having a sensitive metric is critical to effectively capturing feature impact, if any. In other words, you want to maximize the chance of detecting an effect when there is one and increase confidence that no effect exists when there is none. So how can you do that?

To evaluate metric sensitivity, experimenters typically perform power analysis. Statistical power is the probability of detecting an effect when there is an effect to detect in the first place. It depends on the effect size, the sample size of the A/B test, and the chosen significance level. Power analysis allows you to estimate any one of these variables, given specific values of the remaining ones.

In practice, experimenters usually estimate the effect size given a specific statistical power (e.g., 80 percent power means that you observe the metric move eight out of ten times when a real effect exists) and all the remaining variables needed by power analysis. In A/B tests, we also call that estimated effect size the minimum detectable treatment effect. Because it is expressed directly in terms of the observed effect on a metric, it is easier to interpret than statistical power. With 80 percent power, if the real effect is greater than the minimum detectable treatment effect, you have at least an 80 percent chance of observing it. The smaller the minimum detectable treatment effect, the more sensitive the metric.
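
The calculation below is a minimal sketch of this kind of power analysis for a two-variant A/B test, using the statsmodels power module; the baseline numbers are hypothetical and not from the article. It solves for the minimum detectable treatment effect given the sample size, significance level, and desired power.

```python
# A minimal sketch of power analysis for a two-variant A/B test.
# The baseline numbers are hypothetical, not from the article.
from statsmodels.stats.power import tt_ind_solve_power

baseline_mean = 0.20    # hypothetical: average app downloads per user in control
baseline_std = 0.40     # hypothetical standard deviation of the metric
n_per_variant = 5_000   # hypothetical number of users in each variant
alpha, power = 0.05, 0.80

# Solve for the standardized effect size (Cohen's d) this design can detect.
d = tt_ind_solve_power(effect_size=None, nobs1=n_per_variant, alpha=alpha,
                       power=power, ratio=1.0, alternative='two-sided')

mde_absolute = d * baseline_std
mde_relative = mde_absolute / baseline_mean
print(f"Minimum detectable treatment effect: {mde_absolute:.4f} "
      f"({mde_relative:.1%} of the control metric value)")
```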

Let’s go back to the Azure Identity example. The absence of metric movement can be caused by the insensitivity of the metric. For example, given 80 percent power and a 0.05 significance level, suppose the minimum detectable treatment effect is 50 percent of the metric value for the existing variant. That means that even though a 10 percent increase in downloads would be a huge win, you would be very unlikely to detect that change.

Statistical power is only one aspect of metric sensitivity, however. The other essential one is movement probability: the probability that a metric will move in a statistically significant manner. If a metric is unlikely to move even when the feature being tested has a significant impact, the A/B test will not provide actionable results. Consider, for example, a metric commonly used to evaluate website usage, such as sessions per user. It does not lack statistical power but is still relatively insensitive due to its low movement probability. Because people’s search needs are limited, the metric is very hard to move, so even a feature with a big impact on user engagement may fail to register on it.

Now that you have a better understanding of two factors with an impact on metric sensitivity, you might wonder what the implications are if the metric is insensitive. The answer is that you might miss the feature impact and make a sub-optimal product feature decision. For example, without noticing a regression in the data, you might add a feature to the product that hurts user satisfaction. Without observing an improvement, you might throw away a great idea and spend extra effort seeking alternatives. So, it is critical to analyze sensitivity before using a metric in A/B tests.

How do you evaluate metric sensitivity?

Methods for assessing metric sensitivity should consider both components: (1) power analysis to verify that the minimum detectable treatment effect is attainable, and (2) movement analysis performed on historical A/B tests.

Power analysis

With the results of power analysis, metric authors can discuss with feature teams whether the minimum detectable treatment effect for the metric is attainable in typical A/B tests. If the value under discussion is in the form of a percentage, it is a good idea to convert it to an absolute value to get a sense of the magnitude of the required change. If the minimum detectable treatment effect is far beyond what is expected in a typical A/B test, that is evidence that the metric will rarely move.
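
You can also run the power analysis in the other direction: given the smallest change the feature team actually expects, estimate the sample size required to detect it and compare that with the traffic a typical A/B test receives. The sketch below does this with hypothetical numbers; none of the values come from the article.

```python
# A minimal sketch: how many users per variant would be needed to detect a
# 2 percent relative change in the metric at 80 percent power? (Hypothetical numbers.)
from statsmodels.stats.power import tt_ind_solve_power

baseline_mean, baseline_std = 0.20, 0.40   # hypothetical baseline statistics
target_relative_change = 0.02              # smallest change the team cares about

d = (target_relative_change * baseline_mean) / baseline_std  # standardized effect size
n_needed = tt_ind_solve_power(effect_size=d, nobs1=None, alpha=0.05,
                              power=0.80, ratio=1.0, alternative='two-sided')
print(f"Users needed per variant: {n_needed:,.0f}")
# If this far exceeds the traffic available to a typical A/B test,
# the metric is unlikely to move in practice.
```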

Movement analysis

To evaluate a metric’s sensitivity, metric authors should study the metric’s behavior on historical A/B test results. Historical A/B tests are a great tool for evaluating metrics — they are referred to as an “experiment corpus” and are often used to assess the quality of a proposed metric.

Movement confusion matrix. The confusion matrix below helps provide an understanding of whether the metric moves as expected. The expectation comes from labeling each A/B test in the corpus, through a careful analysis of the test and its results, according to whether a real treatment effect exists. In the matrix, the left column covers A/B tests where the alternative hypothesis (that a treatment effect exists) is true, and the right column covers those where the null hypothesis is true. Each labeled test from the experiment corpus falls into exactly one of the two columns.

                                        Treatment effect exists    No treatment effect
Statistically significant movement                N1                       N3
No statistically significant movement             N2                       N4

The confusion matrix summarizes the behavior of the metric on the labeled corpus. A sensitive metric will have a large value of N1/(N1+N2), as close to 1 as possible. A robust metric (i.e., one that is less susceptible to noisy movements) will have N3/(N3+N4), its false positive rate, very close to the chosen significance level.
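
As an illustration, the sketch below computes these two rates from a hypothetical labeled corpus; the column names and labels are made up for the example.

```python
# A minimal sketch: summarize a labeled experiment corpus into the confusion
# matrix rates discussed above. The data and column names are hypothetical.
import pandas as pd

corpus = pd.DataFrame({
    # True if a careful analysis concluded a real treatment effect existed.
    "effect_exists": [True, True, True, True, False, False, False, False],
    # True if the metric moved in a statistically significant way in that test.
    "stat_sig_move": [True, True, True, False, False, False, False, True],
})

n1 = ( corpus.effect_exists &  corpus.stat_sig_move).sum()  # moved, effect exists
n2 = ( corpus.effect_exists & ~corpus.stat_sig_move).sum()  # no move, effect exists
n3 = (~corpus.effect_exists &  corpus.stat_sig_move).sum()  # moved, no effect
n4 = (~corpus.effect_exists & ~corpus.stat_sig_move).sum()  # no move, no effect

sensitivity = n1 / (n1 + n2)          # want this close to 1
false_positive_rate = n3 / (n3 + n4)  # want this close to the significance level
print(f"Sensitivity: {sensitivity:.2f}, false positive rate: {false_positive_rate:.2f}")
```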

Observed movement probability. While a labeled corpus can be very useful for metric evaluations, it takes a lot of time and effort to label A/B tests in a confident manner. An unlabeled corpus, consisting of randomly selected tests, can instead be used to evaluate the overall movement probability of each candidate metric. The observed movement probabilities can be used to compare sensitivity among metrics as long as the differences among them are greater than the significance level, since any metric will appear to move at roughly the significance-level rate by chance alone.
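
A minimal sketch of the observed movement probability might look like the following, assuming you have recorded each candidate metric’s p-value for every randomly selected historical test (the data and column names here are made up).

```python
# A minimal sketch: observed movement probability of two candidate metrics on a
# hypothetical unlabeled corpus of randomly selected historical A/B tests.
import pandas as pd

pvalues = pd.DataFrame({
    "metric_a_pvalue": [0.03, 0.40, 0.20, 0.01, 0.70, 0.04, 0.55, 0.02],
    "metric_b_pvalue": [0.60, 0.55, 0.35, 0.02, 0.80, 0.45, 0.90, 0.30],
})
alpha = 0.05

# Fraction of tests in which each metric moved in a statistically significant way.
movement_probability = (pvalues < alpha).mean()
print(movement_probability)
# Compare candidates only when the difference between their observed movement
# probabilities exceeds the significance level.
```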

How do you design sensitive metrics?

So, suppose you performed the analysis above and found your metric to be insensitive. Before you discard the metric and start from scratch, consider two ways of improving its sensitivity: (1) variations on the metric definition to help improve overall sensitivity, and (2) a variance reduction technique to efficiently improve statistical power.

Metric design

Sometimes a metric is determined to be insensitive, and yet it still accurately measures what the team wants to assess in an A/B test. In that case, applying techniques from A/B metric design is one of the simplest ways to increase its sensitivity. Here are a few such techniques.

Apply a transformation to the metric value: One straightforward way to improve the sensitivity of a metric is by reducing the impact of outliers. Transformations (such as capping, log transformation, or a change in aggregation level) change the computation of the metric to minimize that impact or remove it altogether.
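
As a hypothetical illustration, the sketch below applies capping and a log transformation to a heavy-tailed metric before the usual t-test; the simulated data and the chosen cap are assumptions for the example.

```python
# A minimal sketch of outlier-reducing transformations on a heavy-tailed metric.
# The simulated data and the 99th-percentile cap are hypothetical choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control   = rng.lognormal(mean=1.00, sigma=1.0, size=10_000)
treatment = rng.lognormal(mean=1.02, sigma=1.0, size=10_000)

# Capping: clip values above the 99th percentile of the control distribution.
cap = np.percentile(control, 99)
c_capped, t_capped = np.clip(control, None, cap), np.clip(treatment, None, cap)

# Log transformation: compress the long tail (log1p handles zero values).
c_log, t_log = np.log1p(control), np.log1p(treatment)

print("raw    :", stats.ttest_ind(treatment, control).pvalue)
print("capped :", stats.ttest_ind(t_capped, c_capped).pvalue)
print("log    :", stats.ttest_ind(t_log, c_log).pvalue)
```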

Use alternative metric types: Typically, A/B metrics are averaged across randomization units. However, using alternative methods to aggregate these metrics can help increase sensitivity. Relevant metric types include proportion, conditional average, and percentile.
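
For example, the same underlying per-user signal can be aggregated in several ways; the sketch below shows proportion, conditional average, and percentile versions of a hypothetical download metric (the column names and data are made up).

```python
# A minimal sketch of alternative metric types computed from a hypothetical
# per-user table; names and values are made up for illustration.
import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "downloads": [0, 2, 0, 1, 5],
    "time_to_download_sec": [None, 12.0, None, 30.0, 8.0],
})

average_metric  = users["downloads"].mean()        # average per randomization unit
proportion      = (users["downloads"] > 0).mean()  # share of users with at least one download
conditional_avg = users.loc[users["downloads"] > 0, "time_to_download_sec"].mean()
p95             = users["time_to_download_sec"].quantile(0.95)  # percentile metric
print(average_metric, proportion, conditional_avg, p95)
```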

Try proxy metrics: If a metric represents a long-term goal, such as overall user satisfaction, it might be hard to move within the short period of an A/B test. Consider finding proxy metrics, for example by using predictive models to identify short-term signals that predict the long-term goal.

Variance reduction

CUPED (Controlled experiment Using Pre-Experiment Data) is a variance reduction method that leverages the data prior to the A/B test to remove explainable variance during the test. It has been shown to be effective and is widely used in A/B tests at Microsoft.
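
The sketch below shows the core of CUPED under the common choice of using the same metric from the pre-experiment period as the covariate; the simulated data is hypothetical, and in practice theta is usually estimated from the pooled treatment and control data.

```python
# A minimal sketch of CUPED: adjust the in-experiment metric y using the same
# metric x measured for the same users before the A/B test started.
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return the variance-reduced metric y - theta * (x - mean(x))."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Hypothetical data: the pre-experiment metric is strongly correlated with the
# in-experiment metric, which is what makes the variance reduction effective.
rng = np.random.default_rng(42)
x = rng.normal(10, 2, size=5_000)      # pre-experiment metric values
y = x + rng.normal(0, 1, size=5_000)   # in-experiment metric values
y_adj = cuped_adjust(y, x)

print(np.var(y, ddof=1), np.var(y_adj, ddof=1))  # adjusted metric has lower variance
```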

Summary

To learn more about metric sensitivity analysis, read our recent blog post. There we go into more detail on practical methods for conducting sensitivity analysis that consider both statistical power and movement probability, and we provide additional tips for designing sensitive metrics. We hope this post helps you understand both the importance of metric sensitivity analysis for A/B tests and the methods you can use to perform it.
