Attribution analysis: How to measure impact? (Part 2 of 2)

Lisa Cohen
Data Science at Microsoft
Aug 15, 2020 · 9 min read

By Lisa Cohen, Ryan Bouchard, Jane Huang, Daniel Yehdego and Siddharth Kumar

Introduction

This is our second article in a series where we focus on methods for determining the impact of customer engagement efforts, a common question that data scientists face in the enterprise. In our last article, we showed methods for attribution analysis in cases where there are one or many treatments in play (using single or multi-attribution). Those methods provide ways to quantify the correlation between treatments and outcome metrics. In this article, we’ll cover causal inference techniques to determine causality. We’ll also share approaches that we’ve taken to make this analysis actionable for the business.

Causal inference

We’ve discussed the distinction between techniques that allow us to conclude correlation versus those that indicate causation. “Correlation does not imply causation” is a common reminder in the field of statistics and data science. This site features several amusing spurious correlations to remind us about the importance of not conflating these two concepts. Another method (beyond randomized controlled trials, or “RCTs”) that we use to evaluate causation is causal inference. This is particularly valuable in scenarios where it’s not feasible to run an experiment due to business considerations. The Book of Why, by Judea Pearl, is a fun and accessible read for those interested in an introduction to this topic. The Coursera course “A Crash Course in Causality: Inferring Causal Effects from Observational Data” is another good reference.

The first step in causal modeling is to develop the causal diagram. A key activity as part of this work is hypothesizing the variables that could have an impact on the outcomes, which requires a combination of business context and a healthy dose of imagination. A good approach to this is asking “What if?” questions. Causal diagrams include the treatment under review, the desired goal or outcome, and confounding variables (i.e., other variables that can also have an impact on the outcome):

Figure 1: Components of a causal diagram.

In our example above, the treatment is the investment program that is aimed at helping customers be successful in their adoption of the Azure cloud. A key outcome we measure is their usage of Azure. Finally, the confounding variables include customer size, geography, industry, types of Azure services used, usage levels, and so on. Ultimately, confounding variables include many of the factors that you would typically control for in an RCT.

We also consider instrumental variables, which don’t influence the outcome directly but have an impact on the treatment and therefore affect the outcome indirectly. For example, someone’s attitude toward safety influences their likelihood to wear a seatbelt, as well as their driving practices, which both lead to safety results. Similarly, someone’s lifestyle affects their likelihood of taking prescribed medications, as well as maintaining other healthy habits, which in turn lead to health results. In the context of our scenarios, the relative effectiveness and attitudes of different program administrators have an impact on the program engagement and ultimately program results. Therefore, if the assignment of program administrators is random, we can consider the program administrator assignment to be an instrumental variable and use an algorithm like two-stage least squares to estimate treatment effects.
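To make the two-stage least squares idea concrete, here is a minimal sketch assuming a hypothetical dataset in which a numerically encoded administrator assignment serves as the instrument; the file and column names are illustrative, not our actual schema:

```python
# A minimal two-stage least squares (2SLS) sketch with scikit-learn.
# The dataset and column names (admin_assignment as instrument, program_hours
# as treatment, azure_usage_growth as outcome) are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("program_engagements.csv")          # hypothetical dataset
Z = df[["admin_assignment"]].values                  # instrument (numerically encoded)
X = df[["customer_size", "prior_usage"]].values      # observed confounders
T = df["program_hours"].values                       # treatment intensity
Y = df["azure_usage_growth"].values                  # outcome

# Stage 1: predict the treatment from the instrument and confounders.
stage1 = LinearRegression().fit(np.hstack([Z, X]), T)
T_hat = stage1.predict(np.hstack([Z, X]))

# Stage 2: regress the outcome on the *predicted* treatment plus confounders.
# The coefficient on the predicted treatment is the estimated causal effect.
stage2 = LinearRegression().fit(np.column_stack([T_hat, X]), Y)
print("Estimated causal effect of one program hour:", stage2.coef_[0])
```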

Here is what our resulting causal diagram looks like:

Fig 2. Causal diagram for our example. Y: outcome; V: program treatment; X: observed confounding variables; Z: instrumental variables. (Reference: https://microsoft.github.io/dowhy/)
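Because Figure 2 references the DoWhy package, here is a hedged sketch of how a diagram like this could be encoded with DoWhy’s CausalModel; the dataset and column names are hypothetical stand-ins for our actual data:

```python
# A sketch of encoding the causal diagram in Figure 2 with the DoWhy package
# (https://microsoft.github.io/dowhy/). All column names are hypothetical.
import pandas as pd
from dowhy import CausalModel

df = pd.read_csv("program_engagements.csv")              # hypothetical dataset

model = CausalModel(
    data=df,
    treatment="program_participation",                   # V: binary program flag
    outcome="azure_usage_growth",                         # Y
    common_causes=["customer_size", "geography_code",
                   "industry_code", "prior_usage"],       # X: confounders
    instruments=["admin_assignment"],                     # Z
)

# Identify the estimand implied by the graph, then estimate the effect
# (here via propensity score weighting, the method we use in this example).
identified_estimand = model.identify_effect()
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_weighting",
)
print(estimate.value)
```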

Once we define the causal diagram, the next step is to compile the data. Here are the dimensions we include for the current example:

For causal inference modeling, developing the dataset is one of the most critical — albeit time-consuming — parts of the process. Basically, we need to construct control and treatment groups with matching confounding variable values, as if we had set them up as a controlled experiment to begin with. Then we can run the usual statistical tests, prove or disprove hypotheses, and determine causality. In the case of a single attribution (single treatment) problem, we can directly compare these two groups, either by matching subsets of the population (if the sample size is large enough) or by comparing larger populations with matching frequencies of the confounding variables.

In the case of a multi-attribution problem with n treatments, we can either model each investment separately and control for all other investments, or allow a combination of treatments in the treatment space and then allocate the treatment effects to each investment.

One challenge that we face in this work is to ensure that our treatment and “constructed control” are appropriately matched with respect to the confounding variables. Here are a few techniques that we leverage for this:

  1. Covariate matching: Obtain treated and controlled groups with similar covariate distributions (in our case, the confounding variables above) so that we can replicate a randomized experiment as closely as possible.
  2. Propensity score matching (PSM) (as outlined in Stuart, 2010): Estimate the effect of an intervention by accounting for the covariates that predict receiving the treatment. (In the case of our example, propensity refers to the propensity for a customer to participate in a particular program.)
  3. Propensity score weighting: Weight the data based on the propensity scores. Add a larger weight to individuals who are underrepresented in the sample and a lower weight to those who are over-represented. Then analyze the weighted sample. (This is the method that we actually use in the current example; a minimal sketch follows this list.)
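As a concrete illustration of propensity score weighting (method 3 above), here is a minimal inverse-propensity-weighting sketch, assuming a binary treatment flag and illustrative, numerically encoded confounder columns:

```python
# A minimal inverse-propensity-weighting sketch. The dataset and column names
# are hypothetical; "in_program" is assumed to be a 0/1 treatment flag.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("program_engagements.csv")                 # hypothetical dataset
X = df[["customer_size", "prior_usage", "industry_code"]]   # confounders
t = df["in_program"].values                                  # 1 = treated, 0 = control
y = df["azure_usage_growth"].values                          # outcome

# Estimate each customer's propensity to participate in the program.
propensity = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

# Weight under-represented customers up and over-represented customers down,
# then compare the weighted average outcomes of the two groups (the ATE).
w_treated = t / propensity
w_control = (1 - t) / (1 - propensity)
ate = np.average(y, weights=w_treated) - np.average(y, weights=w_control)
print("Estimated average treatment effect:", ate)
```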

In addition to the above traditional approaches for the average treatment effect, a core problem that arises in data-driven personalized decision scenarios is the estimation of heterogeneous treatment effects: What is the effect of an intervention on an outcome of interest as a function of a set of observable characteristics of the treated sample? Techniques at the intersection of econometrics and machine learning are increasingly popular for tackling heterogeneous treatment effect estimation. These methods offer flexibility in modeling effect heterogeneity, while at the same time leveraging techniques from causal inference and econometrics to preserve the causal interpretation of the learned model, and they usually also offer statistical validity via the construction of valid confidence intervals. The Microsoft Research ALICE team has developed a Python library called EconML, a collection of state-of-the-art techniques under a common API for the estimation of heterogeneous treatment effects from observational data via machine learning. We utilize approaches from this package, such as meta-learners and double machine learning, for our investment recommender.
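As a hedged illustration of double machine learning with EconML, the sketch below fits a LinearDML estimator on hypothetical columns and produces per-customer effect estimates that could feed a recommender; the feature names are assumptions, not our production schema:

```python
# A sketch of heterogeneous treatment effect estimation with EconML's
# double machine learning estimator. All column names are illustrative.
import pandas as pd
from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

df = pd.read_csv("program_engagements.csv")       # hypothetical dataset
Y = df["azure_usage_growth"]                      # outcome
T = df["in_program"]                              # binary treatment
X = df[["customer_size", "prior_usage"]]          # drivers of effect heterogeneity
W = df[["geography_code", "industry_code"]]       # additional confounders

est = LinearDML(
    model_y=GradientBoostingRegressor(),          # nuisance model for the outcome
    model_t=GradientBoostingClassifier(),         # nuisance model for the treatment
    discrete_treatment=True,
)
est.fit(Y, T, X=X, W=W)

# Per-customer treatment effects with confidence intervals, usable for ranking
# customers in an investment recommender.
effects = est.effect(X)
lower, upper = est.effect_interval(X, alpha=0.05)
```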

Ultimately, causal inference is an approach to reach a conclusion where we can state “we controlled for all the important features that could confound results, enough so that you can interpret the results as if a controlled experiment had been run.” That, in itself, is a high bar and requires a deep understanding of the business context, our customers, and how customers use our services. However, it is required in order to remove the selection bias that otherwise exists when comparing customer populations who participated in a program versus those who did not.

Another challenge we face in these types of problems is determining whether our causal diagram is complete, and reflecting on “what we know” versus “what we don’t know” regarding the mechanisms leading to our desired outcome. One effective technique we’ve found is to test model performance by using the placebo treatment, as explained in the refutation methods section of the DoWhy package (Amit Sharma, Emre Kiciman, et al., “DoWhy: A Python package for causal inference,” 2019). Essentially, we modify the dataset by randomly switching customers’ treatment programs and observe the impact on the model results. If the estimated effect drops toward zero under this placebo treatment, we gain confidence that the original estimate reflects the treatment variables themselves rather than a spurious relationship.
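Continuing the hypothetical DoWhy sketch from earlier, the placebo refuter can be invoked as follows; a re-estimated effect close to zero suggests the original estimate is not spurious:

```python
# Continuing the hypothetical DoWhy model, estimand, and estimate from the
# earlier sketch: randomly permute the treatment assignment and re-estimate.
refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="placebo_treatment_refuter",
    placebo_type="permute",
)
print(refutation)   # reports the new (placebo) effect and a p-value
```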

Driving action with attribution insights

In data science, we ultimately don’t want to just produce data points; we want to drive decisions and actions. So, what can we actually do with this attribution work? Of course, if we find that one program is more effective than another, we may choose to modify our program investments to invest more in higher-ROI activities. More often, however, we find that we want to use this data to improve our existing programs.

For example, we can explore the ROI of a program by geography (or various other dimensions) in order to learn where it’s working better (versus worse) and then dig in further to understand why. Here is an example of a program where we analyze the ROI for customers by geography:

In the multi-attribution scenario, we can also conduct “path analysis” to learn which combinations, durations, and orders of programs are most effective. In the example below, we find that a particular (sample) program yields maximum ROI at seven months of duration. This insight can help inform how long we engage (although we may still choose to continue engaging after the ROI peak, if the program continues to have enough impact):

Another useful perspective is to consider the combination and sequencing of investments. In the example below we see that investments B and C are most effective when they’re used in combination together, and even more so when they are preceded by A:

Fig 9. Path comparison (‘&’ refers to multiple treatments at the same time)
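As an illustration of what such a path comparison could look like in code, here is a simple pandas sketch that groups customers by the ordered sequence of programs they received and compares average ROI; the event table and its columns are hypothetical:

```python
# An illustrative path-comparison sketch, in the spirit of Fig 9. The event
# table (customer_id, program, start_date, roi) and its columns are hypothetical.
import pandas as pd

events = pd.read_csv("program_events.csv")

# Build each customer's ordered program path and take the ROI observed after
# the final program in the sequence.
paths = (
    events.sort_values("start_date")
          .groupby("customer_id")
          .agg(path=("program", lambda p: " -> ".join(p)),
               roi=("roi", "last"))
)

# Compare average ROI (and sample size) across paths.
summary = (
    paths.groupby("path")["roi"]
         .agg(["mean", "count"])
         .sort_values("mean", ascending=False)
)
print(summary)
```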

Using this dataset, we also turn our descriptive analytics into predictive analytics and build recommendation models. In one scenario, we develop a “program-level” model to predict which customers will benefit most from a particular program. Using this model, we’re able to provide the program owners with a sorted list of potential customers, as well as “model explanation” details, summarizing the factors that led to each customer’s recommendation.

We also construct a “customer-level” model to recommend which program a customer would benefit from most, next. (Of course, this also requires program capacity.) The figure below shows an example of the “customer-level” model, where we recommend treatment C for the customer and predict how much the customer will grow with the treatment (beyond the otherwise projected baseline).
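A “customer-level” recommendation can be sketched as ranking candidate programs by their predicted uplift for each customer, for example using per-program uplift estimators such as the EconML model shown earlier; all names here are hypothetical:

```python
# An illustrative "customer-level" recommendation sketch: score each candidate
# program with its fitted uplift model and recommend the program with the
# largest predicted lift. All names are hypothetical.
import numpy as np
import pandas as pd

def recommend_next_program(customer_features: pd.DataFrame, models: dict) -> pd.Series:
    """models maps program name -> fitted estimator exposing an .effect(X) method."""
    lifts = pd.DataFrame(
        {program: np.ravel(est.effect(customer_features))
         for program, est in models.items()},
        index=customer_features.index,
    )
    # Program with the highest predicted uplift for each customer.
    return lifts.idxmax(axis=1)

# Example usage (assuming per-program estimators fit as in the earlier sketch):
# recommendations = recommend_next_program(X_customers, {"A": est_a, "B": est_b, "C": est_c})
```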

Further reading: A history of attribution modeling

Those interested in data-driven attribution modeling might be interested in the following resources, which have been important research contributions to this field:

Shao and Li (2011) developed a bagged logistic regression model to predict how investments from different programs lead to conversion or revenue uplift. One potentially limiting assumption in their models is that investments have the same effect regardless of their order (i.e. whether they were the first investment that the consumer received or the tenth).

Dalessandro et al. (2012) extended this research by incorporating the sequence of investments that lead consumers to their final conversion status or revenue uplift.

Li and Kannan (2014) used a Bayesian framework to understand how consumers interact with a firm using different online channels. One of the findings of their experiment is that online channels exhibit significant carryover and spillover effects among themselves.

Wiesel et al. (2011) considered the applied nature of the literature and focused on methodologies that can easily be implemented by marketers and financiers to perform attribution.

Ron Berman (2018) proposed a game theory approach to analytically devise allocation and payment rules for multi-channel ads, in his article “Beyond the Last Touch: Attribution in Online Advertising.”

Conclusion

In this post, we’ve shared causal inference methods for determining whether a customer nurture activity caused a specific desired result. We also explored applications, including recommendation models, that data scientists can use to make these insights actionable for the business. We hope these examples can be helpful for your work as well. Please leave a comment to share your attribution scenarios and the techniques you use today.

I’d like to acknowledge Saptarshi Chaudhuri, Shijing Fang, Saurabh Kumar, and Deepsha Menghani, who have been significant contributors to this work.
