Causal inference (Part 3 of 3): Model validation and applications

Jane Huang
Data Science at Microsoft
10 min read · Nov 18, 2020

By Jane Huang, Daniel Yehdego, Deepsha Menghani, Saurabh Kumar, and Siddharth Kumar

Introduction

This is the third article in a series focusing on causal inference methods and applications. In the first article, we walk through different business scenarios and discuss when and why causal inference can help. In the second article, we explore algorithm selection. In this article, we look at the most critical step in causal inference: model validation.

In general, causal inference is an unsupervised learning task: Because the counterfactual outcomes are never observed, there is no ground truth to validate against, which makes validation challenging. The goal of causal inference is to understand the outcome of alternative courses of action, so a natural question to ask is: What would happen if the key identifying assumptions were not true? Only after confirming that an estimate is robust to these unverified assumptions can the causal estimates be considered valid and unbiased.

In this article, we discuss a few model validation techniques, such as adding an irrelevant confounder and replacing the treatment with a placebo. We discuss why it’s worth checking the confidence intervals of causal estimates whenever possible. We also discuss the need for business validation, because stakeholder feedback is critical to building trust in production models. Finally, we share a few applications of causal models within Microsoft.

Model validation

Before we go into the details of various validation approaches, it is worth highlighting the importance of confidence intervals in model outputs, because they offer an effective way to validate treatment effect estimates. Many estimators can deliver analytic confidence intervals for the final treatment effects; these intervals correctly adjust for the reuse of data across multiple stages of estimation. For estimators whose multi-stage estimation is too complex for analytic intervals, such as meta-learners, bootstrap confidence intervals are still an option, although bootstrapping is slow and may not be statistically valid.
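
For illustration, here is a minimal sketch of retrieving analytic confidence intervals with EconML’s LinearDML estimator (EconML is one library implementing such estimators; the data below is simulated, and all variable names are placeholders):

```python
# Minimal sketch: analytic confidence intervals from EconML's LinearDML
# estimator, trained on simulated data. All variable names are illustrative.
import numpy as np
from econml.dml import LinearDML

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 1))                 # effect modifiers
W = rng.normal(size=(n, 3))                 # confounders
T = W[:, 0] + rng.normal(size=n)            # treatment driven by a confounder
Y = (1.0 + 0.5 * X[:, 0]) * T + W[:, 1] + rng.normal(size=n)

est = LinearDML(random_state=0)
est.fit(Y, T, X=X, W=W)

# Point estimates and analytic 95 percent confidence intervals for the CATE.
cate = est.effect(X)
lb, ub = est.effect_interval(X, alpha=0.05)
print(cate[:3], lb[:3], ub[:3])
```
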

Sensitivity analysis / Robustness checks

The most critical part of causal analysis is checking the sensitivity and robustness of an estimate against unverified assumptions. One way to do this is through refutation. In our use cases, we refute the estimates by adding noise to the common-cause variables or by replacing the treatment with a random variable. (For more, see the DoWhy and CausalML documentation.)
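
To make the refutation examples that follow concrete, here is a minimal DoWhy sketch they build on; the dataset is simulated purely for illustration:

```python
# Minimal sketch of the DoWhy workflow that the refutation examples below
# build on. The dataset is simulated purely for illustration.
import dowhy.datasets
from dowhy import CausalModel

data = dowhy.datasets.linear_dataset(
    beta=10, num_common_causes=3, num_samples=5000, treatment_is_binary=True
)
model = CausalModel(
    data=data["df"],
    treatment=data["treatment_name"],
    outcome=data["outcome_name"],
    graph=data["gml_graph"],
)
identified_estimand = model.identify_effect()
estimate = model.estimate_effect(
    identified_estimand, method_name="backdoor.propensity_score_matching"
)
print(estimate.value)
```
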

Irrelevant additional confounder: Adding a random common cause variable. In the irrelevant additional confounder technique, a white-noise variable is added as one of the confounders, and the treatment effect after refutation is compared with the estimate before refutation. Intuitively, we expect the change in the treatment effect to be small (typically under 10 percent): If the model is robust, the causal estimate should be resistant to irrelevant additional confounders.
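
Using the DoWhy model above, this check is a single refutation call:

```python
# Irrelevant additional confounder: DoWhy adds a randomly generated common
# cause and re-estimates. A robust estimate should barely move.
refute_random_cause = model.refute_estimate(
    identified_estimand, estimate, method_name="random_common_cause"
)
print(refute_random_cause)
```
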

Placebo treatment: Replacing the treatment with a random (placebo) variable. In an experiment using a placebo treatment, the treatment variable is reshuffled while everything else stays the same. We expect the change in the treatment effect to be significant because of the irreplaceable role of the treatment. Ideally, if the causal model is trained well, without bias or excess variance, the ATE for the placebo treatment should drop to zero; if not zero, we would at least expect the causal estimates to vary significantly from the raw estimates.

There are several types of placebo treatments. The DoWhy package, for example, implements two: One generates random values for the treatment, and the other, the “permute” approach, permutes the original treatment values by row, keeping the same treatment distribution while randomizing the assignment. The idea resembles the well-known permutation method used to compute feature importance, in which shuffling breaks the relationship between a feature and the target, and the drop in model score indicates how sensitive the model is to that feature. In the context of causal inference, the random shuffling (placebo) is applied only to the treatment variable, not to the outcome or the confounding features.
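
Again using the model above, here is a sketch of DoWhy’s placebo treatment refuter with the “permute” option:

```python
# Placebo treatment: replace the treatment with a placebo. With
# placebo_type="permute", the original treatment values are shuffled by row,
# preserving their distribution. The new effect should be close to zero.
refute_placebo = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="placebo_treatment_refuter",
    placebo_type="permute",
)
print(refute_placebo)
```
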

Some additional techniques, sketched in code after this list, include:

  • Subset validation. In experiments with subset validation, a random subset of the data is selected on which the analysis is rerun. In this case we must check the causal estimate across subsets. If the causal estimate is robust, the results should not vary significantly across different subsets.
  • Adding an unobserved common cause variable. Unlike the irrelevant additional confounder technique, this experiment simulates a common cause that is correlated with both the treatment and the outcome. After completing the simulation, we add it as a confounder in the model and rerun the analysis. If the model is well designed to control for confounders, the causal estimates should not be too sensitive to the refutation.
  • Random replace. In an experiment using random replace, we randomly replace an existing covariate with an irrelevant variable. If the model is well designed, the causal estimates should not be too sensitive to the replacement.
  • Dummy outcome refuter. This approach asks: What happens to the estimated causal effect when we replace the true outcome variable with an independent random variable? Ideally, the treatment effect estimates should drop to zero; if not, we would at least like to see the causal estimates vary significantly from the raw estimates. The idea is similar to the placebo treatment, but instead of changing the treatment variable, the dummy outcome refuter changes the outcome.
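
Here is a sketch of three of these refuters in DoWhy, reusing the same model as above; the parameter values are illustrative, not recommendations. (The random-replace check is not shown; a similar check is available in CausalML’s sensitivity-analysis tools.)

```python
# Subset validation: rerun the analysis on a random 80 percent subset.
refute_subset = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="data_subset_refuter",
    subset_fraction=0.8,
)

# Unobserved common cause: simulate a confounder correlated with both the
# treatment and the outcome, then re-estimate.
refute_unobserved = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="add_unobserved_common_cause",
    confounders_effect_on_treatment="binary_flip",
    confounders_effect_on_outcome="linear",
    effect_strength_on_treatment=0.01,
    effect_strength_on_outcome=0.02,
)

# Dummy outcome: replace the true outcome with an independent random variable;
# the estimated effect should drop to roughly zero.
refute_dummy_outcome = model.refute_estimate(
    identified_estimand, estimate, method_name="dummy_outcome_refuter"
)
print(refute_subset, refute_unobserved, refute_dummy_outcome)
```
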

Business validation

Business validation is one of the most important validation steps to complete before the business starts to consume the insights generated from causal analysis. If the estimated causal impact does not make sense to business stakeholders, decisions made based on the results might affect the business in unexpected ways. How we perform business validation depends on the business scenario to which the causal analysis is applied. Below are some examples from our projects:

Look at the ratio distribution of the estimated ITE (individual treatment effect) values with respect to the outcome variable. This allows us to see the distribution of the percent of the observed outcome variable caused by the treatment variable. This validation step enables our business stakeholders to evaluate their investment effectiveness for different customer tranches and provides them with a tool to tailor custom investment programs and prioritize customers based on business objectives. For example, Figure 1 shows a sample CATE (with simulated numbers) over the previous three-month average revenue distribution across managed customers, which ranges from zero to 16.5 percent.

Figure 1: Ratio of CATE over previous three-month average revenue distribution across managed customers. (Simulated data)
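
A minimal sketch of this ratio check follows; the column names `ite` and `prev_3m_avg_revenue` are hypothetical placeholders for your own data:

```python
# Minimal sketch of the ratio check with hypothetical column names.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "ite": [120.0, 45.0, 300.0, 80.0],
    "prev_3m_avg_revenue": [4000.0, 1500.0, 2000.0, 1000.0],
})

# Percent of the observed outcome attributable to the treatment, per customer.
df["cate_ratio_pct"] = 100 * df["ite"] / df["prev_3m_avg_revenue"]

df["cate_ratio_pct"].plot.hist(bins=20)
plt.xlabel("CATE / previous 3-month average revenue (%)")
plt.show()
```
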

Rank customers based on predicted treatment effect and compare their post-intervention revenue across the rank quartiles for both treated and control groups. This allows us to diagnose whether our treatment effect estimates give appropriate directional guidance for investment recommendations. In the box plots that Figure 2 shows, we rank customers based on their predicted treatment effect for both treated and control groups. Then we bucket these customers into quartiles and compare how their respective revenue grew in the three months that follow. We find that the median incremental revenue for treated-group customers is higher than that of control-group customers. Meanwhile, the median difference between the treated and control groups is most pronounced for customers in the first quartile and gradually decreases toward the third quartile. Ideally, we would expect the trend to continue through the fourth quartile, but the samples in the fourth-quartile treated group show higher variance than the other buckets, which shows the limitations of our model in a complex real-world environment. Overall, though, the directional guidance from our models looks promising and interpretable. We will continue to monitor the model in our production system and enhance it based on telemetry and feedback from stakeholders.

Figure 2: Box plot of actual M3 (three months into future) incremental revenue across the four predicted rank buckets for control and treatment.
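
Here is a sketch of this quartile diagnostic on simulated data; the column names (`predicted_te`, `m3_incremental_revenue`, `group`) are hypothetical, and Q1 denotes the highest predicted effect:

```python
# Sketch of the quartile diagnostic on simulated data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "group": rng.choice(["treated", "control"], size=n),
    "predicted_te": rng.normal(100, 30, size=n),
})
# Simulated outcome, loosely correlated with the predicted effect.
df["m3_incremental_revenue"] = (
    0.8 * df["predicted_te"] + rng.normal(0, 40, size=n)
)

# Bucket customers into predicted-effect quartiles within each group;
# Q1 is the highest-effect bucket.
df["quartile"] = df.groupby("group")["predicted_te"].transform(
    lambda s: pd.qcut(s, 4, labels=["Q4", "Q3", "Q2", "Q1"])
)

sns.boxplot(
    data=df, x="quartile", y="m3_incremental_revenue",
    hue="group", order=["Q1", "Q2", "Q3", "Q4"],
)
plt.show()
```
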

Compare the recommended investment with the actual investment for a validation period in which the actual target output variable is known. Figure 3 shows a backtesting example (with simulated numbers) that indicates how to track the effectiveness of a recommendation following causal analysis. In this example, we look at customers who received investment A in May 2020. Among the 200 customers, 80 were recommended to receive this investment (Group A), and the remaining 120 were not (Group B). From here, we track the revenue growth difference between the two groups to evaluate the effectiveness of the model recommendations. We define the average growth ratio as the post-intervention third-month revenue divided by the previous three-month average ACR. We are excited to see that the average growth ratio is higher for Group A than for Group B. However, we need to be careful about drawing conclusions: With a small sample size, it is better to continue monitoring the results over time.

Figure 3: Backtesting treatment effect shown by following or not following recommendations.
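
A sketch of the growth-ratio comparison follows; the column names (`followed_rec` marking Group A versus Group B, `m3_revenue`, `prev_3m_avg_acr`) are hypothetical:

```python
# Sketch of the backtest comparison with hypothetical column names.
import pandas as pd

df = pd.DataFrame({
    "followed_rec": [True, True, False, False],
    "m3_revenue": [5200.0, 3100.0, 2400.0, 1800.0],  # 3rd month post-intervention
    "prev_3m_avg_acr": [4000.0, 2800.0, 2300.0, 1900.0],
})

# Growth ratio: post-intervention third-month revenue over the previous
# three-month average ACR, then averaged within each group.
df["growth_ratio"] = df["m3_revenue"] / df["prev_3m_avg_acr"]
print(df.groupby("followed_rec")["growth_ratio"].agg(["mean", "count"]))
```
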

Applications

Business decision evaluation, planning, and budgeting

Business evaluation, decision-making, planning, and budgeting are some commonly used applications of causal estimation and attribution. In the Customer Growth Analytics (CGA) organization within Microsoft, our mission is to learn from our customers, identifying patterns of success (and failure) so we can improve our cloud business and help nurture our customers. We have programs to help customers succeed with Azure, including technical and/or sales resources who work regularly with qualifying customers on their Azure use cases, and financial/monetary credits for qualifying customers to help them work through initial deployment or migrate their on-premises resources to the cloud. To gain a comprehensive understanding of the needs of these customers, we can leverage the average treatment effect (ATE) of Azure investments (as the treatment) for many corporate-level business application scenarios.

Causal-based attribution models can estimate the average effectiveness of multiple investments made over the course of customer tenure, enabling better fiscal planning and budgeting for investment programs. Here are some sample questions we can try to answer through causal modeling:

  • How much resourcing should each investment receive?
  • How can investment-level ROI be used for fiscal planning?
  • How can field performance be optimized?

Investment recommendations

Recommendation is one of the most common applications of causal analysis. We can think of applying a treatment to a unit as a recommendation and the unit’s individual treatment effect (ITE) as the estimated impact of that recommendation. In many application scenarios, we can use the ITE of customer investments (the treatment) as a proxy to rank investment programs for future recommendations. The investment recommender looks at customers with similar confounding variables and uses their ITE values from the causal model to recommend the next best program for them, as sketched below.
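
A minimal sketch of this ITE-based ranking follows. The per-program CATE models and their `.effect(X)` method are assumptions (consistent with estimators such as EconML’s); `models` maps program names to fitted estimators:

```python
# Minimal sketch: rank candidate programs by estimated ITE per customer.
import numpy as np

def recommend_next_program(models: dict, X: np.ndarray) -> list:
    # Estimate each customer's ITE under every candidate program, then
    # recommend the program with the largest estimated effect.
    program_names = list(models.keys())
    ite = np.column_stack([models[name].effect(X) for name in program_names])
    return [program_names[i] for i in ite.argmax(axis=1)]
```
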

Discounting

Causal inference also can be used to suggest the “best” discount to offer qualified customers. This can be considered in two ways. First, a price decrease can increase the likelihood of a customer purchase (the “extensive margin”). Second, a price decrease can increase the amount a customer purchases (the “intensive margin”). Economists use the term “elasticity” to describe the impact of price changes on quantity sold. Causal inference can help us understand the causal impact of a price decrease on the likelihood a qualified customer signs and uses our product/services, or the extensive margin elasticity. It also can help us understand usage, or intensive margin elasticity.
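
As a quick refresher (a standard definition, not specific to our analysis), elasticity is the percentage change in quantity Q relative to the percentage change in price P:

```latex
\varepsilon = \frac{\Delta Q / Q}{\Delta P / P}
```

An extensive-margin elasticity of −2, for example, would mean a 1 percent price cut raises the likelihood of a qualified customer signing by roughly 2 percent.
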

Causal churn model

Gaining new customers and retaining existing ones are important for any business, and our business is no exception. One widely used strategy for addressing retention is churn prediction. Churn is a drain on revenue, as discussed in an earlier article. But even though it is possible to build a good ML model that predicts who is going to churn, such a model does not provide specific guidance on what to do next. We learned about the causal churn model from Gerben Oostra, who introduced effective treatment of churn through causal modeling in his article. A machine learning model does not exist on its own: It is part of a bigger system, and its output is the input for some other business process or for someone taking action. For example, as Gerben discussed in his blog, suppose a marketing team determines offers or runs creative campaigns to retain the customers identified by a churn model. This separation between the model’s output and the marketing team acting on it results in a local optimum. The data scientist predicts who is going to churn; a marketing department determines how to retain them. And that’s the issue: The problem is divided into two separate problems, solved by different teams, resulting in locally optimal solutions.

As data scientists, we should get to know and stay in contact with all relevant business units and their members to understand the bigger picture: to know the actual actions taken and the inputs on which those actions are based. With that bigger picture, it is possible to frame predictions as close as possible to the final action. Instead of predicting who is likely to churn and then leaving campaign effectiveness to the marketing department, we can predict how best to retain each individual user. With such an approach, we have a much better opportunity to reliably improve outcomes. We are still in the early stages of this exploration; for more implementation details, we recommend Gerben’s blog post, which discusses this idea further.

Conclusion

We believe model validation is the most critical step for causal inference, because without checking the robustness of an estimate against unverified assumptions, causal estimates are invalid and biased. In addition, it is worth checking confidence intervals of causal estimates whenever possible. In this article, we discussed some model validation techniques useful for checking the robustness and sensitivity of estimates. We also shared some approaches for business validation. Last, we discussed some potential business applications of causal models. We hope this article, and the series it is part of, can help you with your own business problems. Please leave a comment to share your attribution scenarios and the techniques you are using today.

We’d like to thank the Microsoft Advanced AI school, Microsoft Research ALICE team, Finance, and Customer Program teams for being great partners in the research design and adoption of this work. We also would like to thank Ron Sielinski and Casey Doyle for helping review the work.
