Papers We Love — Lessons Learned from Microsoft and Yahoo Product Experimentation Experts

Split.io
5 min read · Nov 9, 2017


Author: Adil Aijaz, CEO @ Split

Metrics have been top of mind for me lately. Online experimentation boils down to measuring the outcomes of product development, and metrics are at the heart of measuring outcomes. While there is plenty of literature on experimentation, there is a dearth of published work on metric development.

Enter Alex Deng of Microsoft and Xiaolin Shi of Yahoo, who published a great, practical paper on metric development at KDD’16 titled “Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned.” I highly recommend reading it. Here are my high-level notes on the paper.

Roles of Metrics

For Deng & Shi, metrics are meant to serve the following three roles:

Goal Metrics: aka Overall Evaluation Criteria (OEC), measure whether you are improving your customer experience. For instance, rides/user may be a good goal metric for Uber.

Guardrail Metrics: are metrics you don’t want to mess up while optimizing the goal metrics. For instance, the maps team at Uber may aim to reduce pick up times. While doing so they should not reduce rides/user. In this case, rides/user is a guardrail metric.

Debugging Metrics: are meant to help you understand the impact on guardrail or goal metrics. For instance, if there is a rider satisfaction score that is made up of many signals, each of those signals would make a debugging metric.
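To make these roles concrete, here is a minimal sketch of how a team might declare its metrics by role. The `Metric` class and the Uber-style metric names are my own illustration, not something from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    GOAL = "goal"            # what you are trying to improve
    GUARDRAIL = "guardrail"  # must not regress while optimizing the goal
    DEBUGGING = "debugging"  # explains movement in goal/guardrail metrics

@dataclass
class Metric:
    name: str
    role: Role

# Hypothetical metric set for the Uber maps example above.
metrics = [
    Metric("pickup_time_seconds", Role.GOAL),
    Metric("rides_per_user", Role.GUARDRAIL),
    Metric("rider_satisfaction_signal_1", Role.DEBUGGING),
    Metric("rider_satisfaction_signal_2", Role.DEBUGGING),
]
```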

Types of Metrics

Now that we understand their roles, let’s go through the types of metrics Deng & Shi propose:

Type 1 — Business Report Driven Metrics: These are high level metrics that measure the goals of an organization. Rides / user and number of connections / user are great examples of business metrics for Uber and LinkedIn respectively. The benefit of business metrics is that the entire company understands them. Their disadvantage is that as the product matures, incremental features don’t move the needle significantly on business metrics. In general, business metrics can be great guardrail metrics.

Type 2 — Simple Heuristic Based Metrics: These metrics boil down to simple interaction metrics like CTR or latency. They are simple to compute, but don’t give the high level picture of impact on customer experience.

Type 3 — User Behavior Driven Metrics: Type 1 and 2 metrics are great starting points for organizations new to experimentation. However, as the experimentation culture matures, it may be valuable to move to user behavior model driven metrics. User behavior models are machine-learned models that take a large number of signals as inputs and output one score that measures customer experience. The ability to clearly define these models is key; otherwise, debugging becomes hard.
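As a rough illustration of a Type 3 metric, the sketch below combines a handful of per-session signals into a single learned satisfaction score using logistic regression. The signal names, labels, and model choice are assumptions of mine, not the models Deng & Shi actually use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-session signals: dwell time (min), scroll depth, quick-back rate.
X = np.array([
    [12.0, 0.8, 0.1],
    [ 2.0, 0.1, 0.9],
    [30.0, 0.9, 0.0],
    [ 1.0, 0.2, 0.8],
])
# Hypothetical labels from explicit feedback: 1 = satisfied session, 0 = not.
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# The metric reported per variant would be the average predicted satisfaction.
satisfaction_score = model.predict_proba(X)[:, 1].mean()
print(f"mean satisfaction score: {satisfaction_score:.2f}")
```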

What Defines a Good Metric

For a metric to be a goal metric, it should have two qualities:

Directionality: The metric should move consistently in one direction (positive or negative) if the customer experience is improving or degrading. Otherwise, it is worthless for the purpose of experimentation.

Sensitivity: The metric should move due to slight changes in customer experience. This allows teams to make decisions in the shortest possible time.

How Do You Measure if a Metric Is Good

Deng & Shi propose two methods to measure directionality and sensitivity of a metric.

Validation Corpus: If an organization has a history of successful experiments, it can keep those experiments around as a validation corpus. All new metrics are tested for directionality and sensitivity on this corpus. While this is a robust methodology, most companies do not have Microsoft’s history of successful experiments. Moreover, it is not clear to me whether a new metric could be re-computed retroactively over experiments that have already run.
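As a sketch of what such a corpus check could look like (the corpus structure and thresholds are my assumptions, not the paper’s implementation): for each past experiment whose outcome is already known, compute the candidate metric’s delta and a t-test; directionality is then the share of experiments where the sign of the delta agrees with the known verdict, and sensitivity is the share that reach statistical significance.

```python
import numpy as np
from scipy import stats

def evaluate_metric(corpus, alpha=0.05):
    """corpus: list of dicts with per-unit metric values for treatment/control
    and a known verdict (+1 = the change helped users, -1 = it hurt them)."""
    sign_agree = significant = 0
    for exp in corpus:
        t, c = np.asarray(exp["treatment"]), np.asarray(exp["control"])
        delta = t.mean() - c.mean()
        _, p = stats.ttest_ind(t, c, equal_var=False)  # Welch's t-test
        sign_agree += int(np.sign(delta) == exp["verdict"])
        significant += int(p < alpha)
    n = len(corpus)
    return {"directionality": sign_agree / n, "sensitivity": significant / n}

# Hypothetical corpus of two past experiments with known outcomes.
rng = np.random.default_rng(0)
corpus = [
    {"treatment": rng.normal(1.05, 1, 5000), "control": rng.normal(1.0, 1, 5000), "verdict": +1},
    {"treatment": rng.normal(0.97, 1, 5000), "control": rng.normal(1.0, 1, 5000), "verdict": -1},
]
print(evaluate_metric(corpus))
```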

Degradation Experiment: A far bolder but intuitive proposal by the authors is to run live tests in production: intentionally degrade the customer experience for a small percentage of customers and measure the directionality and sensitivity of a metric. Most people will instinctively recoil in horror at the thought of degrading the customer experience, but the authors argue from experience that if the degradation is kept within bounds, it does not have a long-term negative impact.
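Below is a hedged sketch of how a degradation test might be wired up: serve a deliberately slowed-down experience to a small, bounded slice of traffic, then check that the candidate metric drops (directionality) and does so detectably (sensitivity) using the same kind of comparison as above. The bucketing function and latency numbers are illustrative, not Split’s or Microsoft’s actual tooling.

```python
import hashlib
import time

DEGRADATION_PERCENT = 2    # keep the degraded slice small
INJECTED_LATENCY_MS = 250  # stay within bounds the team considers acceptable

def in_degradation_group(user_id: str) -> bool:
    # Stable hash-based bucketing; a real system would use its
    # experimentation/feature-flag SDK for assignment instead.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < DEGRADATION_PERCENT

def handle_request(user_id: str) -> None:
    if in_degradation_group(user_id):
        time.sleep(INJECTED_LATENCY_MS / 1000)  # intentionally degrade latency
    # ... serve the normal response, then log the candidate metric tagged with
    # the user's group so its directionality and sensitivity can be measured.
```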

Deep Dive on Rate Metrics

A rate metric is a ratio between a numerator and a denominator where the denominator is not the experimental unit. Take Uber for instance. Their experimental unit is riders. (# of rides / rider) would not be a rate metric, but distance / ride is a rate metric. For regular web traffic, CTR is an example of a rate metric.

The reason the experimental unit gets special treatment is that units are randomly assigned to treatment and control, so there is no risk of bias. But if the denominator is not the experimental unit, it is hard to control for biases, because the treatment itself can change the denominator.

To control for these biases, first keep both the numerator and the denominator around for debugging purposes; without them, it is hard to know whether there is a bias. Second, ensure that the denominator is the same (or close to the same) across treatment and control.

This is because if the treatment’s denominator drops while the numerator stays the same, the rate can go up without there being an actual improvement, which makes the result ambiguous. Refer to the paper for a more in-depth discussion of this issue.
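A minimal sketch of both safeguards, assuming you log per-user numerators and denominators (the function and data below are hypothetical): keep both series around, and flag the rate as ambiguous if the denominator itself differs significantly between treatment and control.

```python
import numpy as np
from scipy import stats

def check_rate_metric(treat_num, treat_den, ctrl_num, ctrl_den, alpha=0.05):
    """Per-user numerator/denominator arrays for each variant (e.g. clicks, views)."""
    rate_t = treat_num.sum() / treat_den.sum()
    rate_c = ctrl_num.sum() / ctrl_den.sum()
    # Guardrail on the denominator: if it moved, the rate alone is ambiguous.
    _, p_den = stats.ttest_ind(treat_den, ctrl_den, equal_var=False)
    if p_den < alpha:
        print("WARNING: denominator differs across variants; inspect the numerator "
              "and denominator separately before trusting the rate.")
    return rate_t, rate_c

rng = np.random.default_rng(1)
views_t, views_c = rng.poisson(10, 5000), rng.poisson(10, 5000)
clicks_t = rng.binomial(views_t, 0.11)  # hypothetical 11% CTR in treatment
clicks_c = rng.binomial(views_c, 0.10)  # hypothetical 10% CTR in control
print(check_rate_metric(clicks_t, views_t, clicks_c, views_c))
```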

Lastly, understand that there are two ways of computing the mean and variance of a rate metric: the ratio of averages (e.g. total # of clicks / total # of views) or the average of ratios (e.g. the average of CTR across all users). The former emphasizes the experience of your service’s power users, while the latter treats all users uniformly. If you can compute both, great; but if you have to pick one, I’d recommend the average of ratios.
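To make the distinction concrete, here is a tiny sketch computing both versions of CTR from hypothetical per-user clicks and views; with skewed usage the two diverge noticeably, which is why the choice matters.

```python
import numpy as np

# Hypothetical per-user data: a couple of power users generate most of the views.
views  = np.array([200, 150, 5, 4, 3, 2])
clicks = np.array([ 40,  30, 1, 0, 1, 0])

ratio_of_averages = clicks.sum() / views.sum()  # weights power users by their volume
average_of_ratios = np.mean(clicks / views)     # every user counts equally

print(f"ratio of averages: {ratio_of_averages:.3f}")   # ~0.198
print(f"average of ratios: {average_of_ratios:.3f}")   # ~0.156
```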

Summary

There is a lot more detail in the paper than what I covered in this summary. However, my writeup should give you a good flavor of how the team at Microsoft thinks about metrics in online experiments. Personally, I found their emphasis on directionality and sensitivity as criteria for judging metrics valuable. I also really appreciated the suggestion around degradation of user experience to test a metric.

At Split, we use number of splits / account, page load latency, and # of users / account as key metrics in our own experimentation. We haven’t yet found the need to move to Type 3 metrics. I suspect that level of sophistication is required at the scale of Facebook and Microsoft.

Hope you enjoy the paper!


Split.io

Split Software is a leader in intelligent feature management and experimentation.