Promoting fairness through experimentation for Windows Experimentation Platform (WExp)

Fan Yin
Data Science at Microsoft
5 min read · Apr 26, 2022

Traditional experimentation practice has focused on comparing average business metrics. Despite its great success in industry, focusing exclusively on the average effect can hide significant disparities among users, leading to designs that benefit a small portion of users while harming the majority (see Figure 1).

Figure 1: Both treatments have an average treatment effect of 2. Treatment 1 lifts everyone by the same value, whereas Treatment 2 has a negative impact on a majority of users, leading to increased inequality.

The Windows Experimentation Platform (WExp) is designed to measure the end-to-end performance of new Windows features and functionalities through randomized controlled experiments. My team and many others at Microsoft use this internal platform to run hundreds of experiments each year. To reduce bias and use WExp in a responsible way, my team implemented two statistical methodologies to better equip WExp with tools for automatically detecting anomalies in distributions that are indicative of hidden bias and inequality issues.

Methodology

Our methodology combines the Atkinson index for measuring inequality with causal trees for root cause analysis.

Atkinson index for measuring inequality

The Atkinson index is a well-established measure in economic research for assessing how unequally personal income is distributed across a population, and it has been adopted by LinkedIn to measure inequality in their experimentation platform (see here for more details). Mathematically, for metric values y_1, …, y_N with mean μ, it is defined as:

$$A_\varepsilon = 1 - \frac{1}{\mu}\left(\frac{1}{N}\sum_{i=1}^{N} y_i^{1-\varepsilon}\right)^{\frac{1}{1-\varepsilon}} \;\; (\varepsilon \neq 1), \qquad A_1 = 1 - \frac{1}{\mu}\left(\prod_{i=1}^{N} y_i\right)^{1/N},$$

where ε ≥ 0 is the inequality aversion parameter that measures one's willingness to accept a smaller total value of the metric in exchange for a more equal allocation. The Atkinson index equals 0 only if everyone has the same value of the metric and approaches 1 as the metric becomes more concentrated in a small number of individuals.
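
As a concrete reference, here is a minimal sketch in Python/NumPy of the computation (the function name and the default ε are illustrative choices, not the WExp implementation):

```python
import numpy as np

def atkinson_index(y, epsilon=0.5):
    """Atkinson index for a vector of non-negative metric values.

    epsilon is the inequality aversion parameter: larger values penalize
    unequal allocations more heavily. Values must be strictly positive
    when epsilon >= 1 (the geometric mean is undefined at zero).
    """
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    if epsilon == 1.0:
        # Limiting case: one minus the ratio of geometric to arithmetic mean.
        return 1.0 - np.exp(np.mean(np.log(y))) / mu
    # "Equally distributed equivalent" value of the metric.
    ede = np.mean(y ** (1.0 - epsilon)) ** (1.0 / (1.0 - epsilon))
    return 1.0 - ede / mu

print(atkinson_index([5, 5, 5, 5]))            # 0.0 under perfect equality
print(atkinson_index([0.01, 0.01, 0.01, 20]))  # ~0.72, highly concentrated
```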

Note that the Atkinson index, like an average, is a population parameter, and it offers another perspective on the underlying distribution; namely, how skewed that distribution is. A comparison between treatment and control with respect to the Atkinson index can be formulated as a hypothesis testing problem, in which we use the collected sample data to infer, under the classic Neyman-Pearson framework, whether the Atkinson indices of the two groups are equal. The results of the hypothesis test help us answer two questions: 1) whether the treatment has caused a change in how the metric is distributed across users or devices; and 2) whether the distribution has become more equal or more unequal across users or devices.
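
As one simple way to operationalize such a test (a sketch, not necessarily the exact WExp procedure; large-scale platforms often use closed-form asymptotic variance estimates instead), the null hypothesis of equal Atkinson indices can be checked with a permutation test, reusing the atkinson_index helper sketched above:

```python
import numpy as np

def atkinson_permutation_test(treatment, control, epsilon=0.5,
                              n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in Atkinson indices."""
    rng = np.random.default_rng(seed)
    treatment = np.asarray(treatment, dtype=float)
    control = np.asarray(control, dtype=float)
    observed = (atkinson_index(treatment, epsilon)
                - atkinson_index(control, epsilon))
    pooled = np.concatenate([treatment, control])
    n_t = len(treatment)
    perm_diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Under H0 the group labels are exchangeable, so shuffling them
        # simulates the null distribution of the difference.
        rng.shuffle(pooled)
        perm_diffs[i] = (atkinson_index(pooled[:n_t], epsilon)
                         - atkinson_index(pooled[n_t:], epsilon))
    p_value = np.mean(np.abs(perm_diffs) >= np.abs(observed))
    return observed, p_value
```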

Causal tree for root cause analysis

When a hypothesis test on the Atkinson index reveals a shift in equality between treatment and control, it is of fundamental interest to understand the potential root cause. This becomes possible by estimating heterogeneous treatment effects (in other words, identifying subpopulations that are affected differently by the treatment). For example, given an average treatment effect of +2, we often assume that every individual is lifted by +2, but this is hardly ever the case in real-world problems.

Causal trees (Athey and Imbens, 2016) have recently emerged as a flexible and powerful tool for estimating heterogeneous treatment effects. Causal trees are similar to decision trees, but with the target function being the conditional average treatment effect (CATE):

$$\tau(x) = \mathbb{E}\left[\,Y(1) - Y(0) \mid X = x\,\right],$$

where Y(1) and Y(0) denote the potential outcomes with and without treatment, and X is the vector of covariates used for splitting.

The results given by causal trees are crucial for generating actionable insights to mitigate undesirable changes in inequality.
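
Our implementation follows the honest causal tree of Athey and Imbens; as a lightweight, self-contained illustration of the same idea, the sketch below uses the "transformed outcome" device from that literature: under randomized assignment with known treatment probability p, the variable Y* = Y·T/p − Y·(1−T)/(1−p) satisfies E[Y* | X = x] = τ(x), so an ordinary regression tree fit on Y* recovers subgroup treatment effects. The synthetic data mirrors the scenario in Figure 2 below; all names here are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(42)
n = 20_000

# Synthetic covariates, loosely mirroring Figure 2.
notebook = rng.integers(0, 2, n).astype(bool)  # True = notebook, False = desktop
female = rng.integers(0, 2, n).astype(bool)    # True = female, False = male
X = np.column_stack([notebook, female]).astype(int)

# Heterogeneous true effects: (notebook, female)=12, (notebook, male)=0,
# (desktop, female)=0.5, (desktop, male)=3.5 — overall average 4.
tau = np.select(
    [notebook & female, notebook & ~female, ~notebook & female],
    [12.0, 0.0, 0.5],
    default=3.5,
)

p = 0.5                                  # known assignment probability
T = rng.binomial(1, p, n)                # randomized treatment assignment
Y = 10 + tau * T + rng.normal(0, 1, n)   # observed outcome

# Transformed outcome: its conditional expectation equals the CATE.
Y_star = Y * T / p - Y * (1 - T) / (1 - p)

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=500)
tree.fit(X, Y_star)
print(export_text(tree, feature_names=["notebook", "female"]))
# Leaf values approximate the subgroup CATEs (~12, 0, 0.5, 3.5);
# the root mean approximates the overall average treatment effect (~4).
```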

Figure 2: A causal tree constructed from synthetic data. The conditional average treatment effect is 12 for the (notebook, female) subpopulation, 0 for (notebook, male), 0.5 for (desktop, female), and 3.5 for (desktop, male). In contrast, if we focus only on the overall average treatment effect, all we learn is that it equals 4 (the root node).

Implementation

Experimentation data tends to be huge and hence is typically stored in distributed databases. The tech stack involved in the implementation includes Azure Cosmos DB, Apache Spark (provided through Azure Databricks), and R.
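
To give a flavor of how the pieces fit together (a sketch only: the table name, schema, and column names below are hypothetical), the heavy per-device aggregation can be pushed down to Spark, and the much smaller per-device summary can then be collected for local statistical analysis such as the Atkinson computations sketched earlier:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fairness-assessment").getOrCreate()

# Hypothetical telemetry table: one row per device per day, with the
# assigned experiment variant and a usage metric.
telemetry = spark.read.table("wexp_telemetry")  # table/column names are made up

# Push the heavy aggregation down to Spark: one row per device.
per_device = (
    telemetry
    .groupBy("device_id", "variant")
    .agg(F.sum("usage_seconds").alias("total_usage"))
)

# The per-device summary is small enough to analyze on a single node
# (e.g., the Atkinson index and permutation test sketched earlier).
pdf = per_device.toPandas()
treatment_vals = pdf.loc[pdf["variant"] == "treatment", "total_usage"].to_numpy()
control_vals = pdf.loc[pdf["variant"] == "control", "total_usage"].to_numpy()
```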

From statistical insights to actions

Once a movement in the Atkinson index (i.e., a fairness issue) is detected, to promote awareness and facilitate actions, we ask users of our platform to go through an investigation workflow that consists of three steps:

  1. Determine the desired direction of movement. For example, we want Total System Usage Time to be higher and more equally distributed across users so that we can serve everyone better, but we might want the number of crashes to be less equally distributed across users; that is, we would rather have fewer crashes occur in total, concentrated on a smaller number of devices.
  2. Investigate what is driving the movement. Review the results from causal trees to understand the potential root cause.
  3. Resolve the investigation, either by acknowledging the undesirable movement and taking action to mitigate it, or by marking it as a false positive with feedback.

The assessment tool described above has been running in production, and the effect of this methodological update can be seen in Windows products. In the past year, we identified multiple issues, and those recognized as true positives have led to changes in decisions around feature rollout.

For example, for one recently tested feature, we identified increased inequality from control to treatment, and a deep-dive analysis via causal trees showed that certain device types were affected negatively despite a positive overall effect. We shared the result with the experiment owner team to raise awareness, which led to a very constructive internal discussion.

In another experiment, our fairness assessment tool flagged an issue at an early stage, which helped the team identify a substantial performance regression in a specific region. As a result, the feature team decided to delay the rollout and initiated an investigation to further improve the feature.

Conclusion

Making WExp fairer and more responsible is something we have been striving for and will continue to pursue. This work represents just one step toward that goal, and we envision more to come, including correcting for unrepresentative samples and assessing fairness from a quantile-shift perspective, among other directions. We will continue to build our knowledge of measuring fairness with statistical tools and, more importantly, to promote organization-wide awareness that a positive average outcome can conceal a fairness issue.

Fan Yin is on LinkedIn.
