Causal inference (Part 1 of 3): Understanding the fundamentals

Jane Huang
Data Science at Microsoft
17 min read · Oct 22, 2020


By Jane Huang, Daniel Yehdego, Deepsha Menghani, Siddharth Kumar, Lisa Cohen, and Ryan Bouchard

Imagine that you start extending a credit offer to certain customers to try and boost sales. You see that purchases among those customers have increased during the time you made the special offer, but can you conclude it’s because of the credit offer?

To find out, perhaps you try to compare the “before” and “after” states among customers who received the offer. Or perhaps you try comparing sales between customers receiving the offer and customers not receiving it (known as naïve attribution, for reasons that become clear below). Perhaps your stakeholders are now challenging these analyses, however, because of double counting or other confounding effects that they can observe or already know about.

So how can you confidently conclude that differences in sales are due solely to the credit offer? What would have happened had the intervention not been made? Causal inference can help answer these questions.

For decades, industries such as medicine, public health, and economics have used causal inference in the form of randomized controlled trials (RCTs). As powerful as these techniques are, however, they are also expensive. Fortunately, there is an alternative approach. Today, the wide availability of machine learning and large datasets makes causal inference practical in a variety of industries, using observational data or quasi-experiments (a research design that tests for a cause-and-effect relationship without randomly assigned groups), without having to rely on the more expensive RCTs.

In the Microsoft Cloud+AI Customer Growth Analytics (CGA) team, we use causal analysis in many scenarios. For example, we have applied causal inference to understand how Azure investment affects revenue across customers, enabling the formulation of more targeted investment policies. We have also leveraged the pricing elasticity from causal models to help inform our pricing strategies.

In this article we build on a prior two-part article on attribution analysis that focused first on ways to quantify the correlation between treatments and outcome metrics and second on introducing causal inference techniques to determine attribution. In this new multi-part article about causal inference, we expand its application to more scenarios, explore various causal algorithms, and provide guidance for tackling your own business problems.

When causal inference can help

Causal inference can be helpful in several related situations. A basic one is analyzing the impact of an investment or intervention, which is inherently a “treatment effect” problem: one in which the intervention (or “treatment,” such as a credit offer) has a causal effect on an outcome variable (such as the decision to purchase). The treatment effect can be measured at the population, treated group, subgroup, and individual levels.

Building further on this example, a treatment effect for each individual unit under study is defined as the difference between two potential outcomes: One outcome if the unit is exposed to the treatment and another outcome if the unit is exposed to the control. In practice, the individual treatment effect is unobservable because individual units can be either in the treatment or the control group, but not both. As a result, counterfactuals — what would have happened without an intervention — are the basis of causal inference, and the estimation of counterfactuals poses the biggest challenges but also provides the greatest opportunities in various scenarios.

With this in mind, causal inference can help provide answers to the following questions:

  • What is the average treatment effect of an intervention?
  • What is the effect of a treatment on a customer’s spend, engagement, retention, and so on?
  • Will providing a treatment work?
  • Why did a treatment work?
  • In the case of a multi-treatment scenario, which investment should we recommend to customers?

Causal inference bridges the gap between prediction and decision-making. This is useful because prediction models alone are of no help when reasoning about what might happen if we change a system or take an action, even when those models have extremely high accuracy. This is because going from a prediction to a decision is not always straightforward. A typical supervised machine learning algorithm optimizes for the difference between actual values and their predicted values, but a decision based on such a prediction is not always one that maximizes the intended outcome when we take action. The very act of decision-making based on a prediction model may change the environment in ways that put us into untested territory, dampening the predictive power of the model.

For example, suppose a data scientist builds an accurate churn model to predict who is going to churn, and then a marketing team initiates offers or campaigns to keep those customers. Now the problem is divided into two separate problems, solved by different teams, resulting in local optimal solutions. The action taken on potential churn might change the environment, and therefore the optimal outcome. Instead of predicting who is likely to churn as one problem and leaving campaign effectiveness to the marketing department as a second problem, we can use causal inference to predict the best action to retain each individual customer.
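
To make this concrete, here is a minimal, illustrative sketch of the idea on simulated data (all column names, features, and effect sizes are made up): a simple T-learner fits one outcome model per treatment arm and uses the difference in predictions as an estimate of each customer’s individual uplift, which could then drive a targeting decision.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5000

# Simulated customer data (hypothetical features, treatment, and outcome).
X = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, n),
    "monthly_spend": rng.gamma(2.0, 50.0, n),
})
treated = rng.integers(0, 2, n).astype(bool)              # past retention offer
true_uplift = 10.0 * np.exp(-X["tenure_months"] / 24.0)   # newer customers respond more
y = 0.1 * X["monthly_spend"] + treated * true_uplift + rng.normal(0, 5, n)

# T-learner: fit one outcome model per treatment arm.
model_treated = GradientBoostingRegressor().fit(X[treated], y[treated])
model_control = GradientBoostingRegressor().fit(X[~treated], y[~treated])

# Estimated individual uplift = predicted outcome if treated minus if not treated.
uplift_hat = model_treated.predict(X) - model_control.predict(X)

# Recommend the offer only where the estimated individual effect is clearly positive.
recommend = uplift_hat > 2.0
print(f"Share of customers recommended for the offer: {recommend.mean():.0%}")
```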

Causal inference models can be applied to both experimental datasets (e.g., A/B testing) and observational datasets. While the RCT is the traditional gold standard, it is not only expensive but also time consuming, and in many cases impossible to carry out in practice. This article focuses on the application of causal inference models to observational studies and natural experiments that can leverage large datasets.

Using the business goal to drive research design

A well-defined business goal helps to better shape any research design. In most situations, the population in a research study is heterogeneous. That is, characteristics may vary among individuals, potentially modifying treatment effects. In our use cases for Azure investment, varying customer characteristics include geography, industry type, segment, size, and others. No matter how effective an investment program is overall, the actual effect may vary depending on individual customers and the specifics of the investment program. In other words, the treatment effect within subgroups may vary considerably from the average treatment effect (ATE), and variability in the direction and magnitude of treatment effects across individuals can be explained through causal mechanisms.

In our use cases, Azure investment program owners are interested in how investment effects vary across individual customers and contexts. They wish to go beyond the information provided by ATE and understand which customers experience large or small treatment effects, and for which customers the treatment has beneficial or adverse effects. Some customers buy a product or grow anyway without promotion campaigns or investments (called “organic growth”). It’s also possible for a campaign to trigger an adverse reaction in some customers (called “do not disturb” or “sleeping dogs”) who would not have reacted that way otherwise.

Heterogeneous research can be used to inform ways of designing and deploying investment policies across multiple investment programs to maximize their effectiveness. If there are multiple investment programs within the same company working independently of each other, going beyond ATE for individual investment programs and adopting heterogeneous treatment effects in one multi-investment research design can inform optimal resource allocation.

Despite this heterogeneity, many studies may choose to estimate an ATE that implicitly assumes a similar treatment effect across the whole population. This approach is widely used in practice when the cost to implement customized investment recommendations is too high, when the sample size is too small to produce confident heterogeneous estimates, or when the goal is simply to have a rough estimate of ATE while the program is still in a pilot or trial phase.

Even when we believe in the heterogeneous nature of the treatment itself, as a first step we might still be interested in whether the variance of treatment effects across subjects is statistically distinguishable from zero.

ATE is also reported in financial reports when the program owner wants to demonstrate the overall effectiveness of the program and gain continued support from stakeholders for planning and budgeting.

Causal inference fundamentals

Potential, actual, and counterfactual outcomes

Each unit of observational data has either received the treatment or not. That means one of the potential outcomes is an actual outcome (i.e., something that we see in the data). The other potential outcome is a counterfactual one. It is the outcome that would have occurred if something different had happened. If a unit was treated, we observe the outcome for being treated, which becomes the actual outcome. But we cannot observe the outcome if the unit didn’t get treated, which becomes the counterfactual outcome.

What that means is that the question of causality comes down to comparing actual outcomes with counterfactual outcomes; in other words, what would have happened if things had been different. Causal inference methods employ various assumptions to let us estimate the unobservable counterfactual outcome. By doing this, we can use them to make the appropriate comparison and estimate the treatment effect.
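
A tiny simulation (purely illustrative) makes this concrete: both potential outcomes exist for every unit by construction, but the observed data only ever reveals the one corresponding to the treatment actually received.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 8

# Both potential outcomes exist conceptually for every unit...
y0 = rng.normal(50, 5, n)        # outcome if NOT treated
y1 = y0 + rng.normal(3, 1, n)    # outcome if treated (so the true ITE is y1 - y0)
t = rng.integers(0, 2, n)        # the treatment actually received

# ...but real data only ever reveals one of them per unit.
df = pd.DataFrame({
    "treated": t,
    "observed_outcome": np.where(t == 1, y1, y0),
    "counterfactual": np.where(t == 1, y0, y1),  # unobservable outside a simulation
    "true_ITE": y1 - y0,                         # known here only because we simulated it
})
print(df.round(1))
```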

Confounders

Causal effects are changes in outcomes due to changes in treatments, holding all other variables constant. Basically, if something other than the treatment differs between the treated and untreated groups, we cannot conclusively say that any difference observed in the outcome is due solely to the treatment. This “mixing of effects” is a confounding effect. A confounding variable (also known as a confounder) is a variable that influences both the treatment and the outcome, causing a spurious association. Any variable that differs between the treatment and control groups could potentially be a confounding variable if it also influences the outcome. A correlation between having the treatment and having a good outcome, for example, could be due to confounders.

To identify confounders, the question to ask is whether there are any variables that are not constant across the two groups of the population. If the answer is yes, confounders might be a problem. Now, if the problem is as simple as pushing a toy car and seeing it move as a result, there is no need to consider any confounders. However, in almost all real-world applications when we are considering causal modeling for treatment effects, confounders are an inherent problem. The challenges lie in how to find and control for confounders. It’s important to remember that if a proposed confounding variable could not possibly affect the outcome, it does not matter whether it changes across treated or untreated groups. Everything else, however, could possibly be a confounder. For example, consider the classic example of observing a positive correlation between ice cream sales and homicide rates. Without looking at the nature of the two variables, we might think of ice cream sales as a candidate for being a confounder when predicting homicide rates. However, because ice cream sales cannot possibly affect the homicide rate, we don’t need to control for them across treatment and control groups.
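
To make the ice cream example concrete, here is a toy simulation (illustrative only, with made-up numbers) in which temperature drives both ice cream sales and the outcome; the raw correlation between the two is strong, but it largely disappears once we condition on the confounder.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

temperature = rng.normal(20, 8, n)                  # the confounder
ice_cream = 5 * temperature + rng.normal(0, 10, n)  # driven by temperature
outcome = 2 * temperature + rng.normal(0, 10, n)    # also driven only by temperature

# Strong marginal correlation, even though ice cream sales have no causal effect here.
print("raw correlation:", round(float(np.corrcoef(ice_cream, outcome)[0, 1]), 2))

# Condition on the confounder: correlate the residuals left after removing the
# linear effect of temperature from both variables (a partial correlation).
ice_resid = ice_cream - np.polyval(np.polyfit(temperature, ice_cream, 1), temperature)
out_resid = outcome - np.polyval(np.polyfit(temperature, outcome, 1), temperature)
print("correlation given temperature:", round(float(np.corrcoef(ice_resid, out_resid)[0, 1]), 2))
```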

Selection bias

Selection bias is a phenomenon in which the distribution of the observed group is not representative of the group we are interested in. Confounders usually affect treatment choices among units (i.e., the entities under study, such as people, organizations, or anything else being studied), which leads to selection bias. For example, in medicine, age is usually a confounding variable because people of different ages usually have different treatment preferences. As a result, we may observe that the age distribution of the treated group is significantly different from the age distribution of the observed control group. This phenomenon exacerbates the difficulty of counterfactual prediction, as we need to estimate the control outcome of units in the treated group based on the observed control group, and similarly, estimate the treated outcome of units in the control group based on the observed treated group. If we directly train the potential outcome model on the data without handling selection bias, the trained model will work poorly in estimating the potential outcome for units in the other group. In the machine learning community, this type of problem brought about by selection bias is also called covariate shift.
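
A quick diagnostic for this kind of selection bias, sketched below under assumed column names, is to compare covariate distributions between the treated and control groups, for example with a standardized mean difference:

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(x_treated, x_control):
    """Difference in group means scaled by the pooled standard deviation."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

def balance_table(df, treatment_col, covariates):
    """SMD per covariate; values far from 0 (rule of thumb: |SMD| > 0.1) flag covariate shift."""
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]
    return pd.Series(
        {c: standardized_mean_difference(treated[c], control[c]) for c in covariates},
        name="SMD",
    )

# Toy example in which older units are more likely to be treated (selection bias).
rng = np.random.default_rng(7)
age = rng.normal(50, 12, 2000)
treated = (rng.random(2000) < 1 / (1 + np.exp(-(age - 50) / 10))).astype(int)
df = pd.DataFrame({"treated": treated, "age": age})
print(balance_table(df, "treated", ["age"]))  # a clearly nonzero SMD for age
```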

Instrumental variables (IV)

Experiments require that treatments be randomly assigned to units. Many causal algorithms that rely on an assumption of unconfoundedness require us to measure every possible confounder, which is often impossible in practice. What do we do if we cannot run an experiment, or if we cannot perfectly observe all possible confounders? Instrumental variables analysis is one of the oldest and most important methods for learning about causality from quasi-experimental data. In contrast to an RCT, where the treatment is randomly assigned, in instrumental variables analysis the treatment is not randomly assigned, but there is another variable that is randomized and correlated with the treatment we care about. Instrumental variables analysis involves several steps. First, we observe a variable, called the instrument, that is correlated with the outcome variable. We assume that the instrument does not have a direct causal effect on the outcome variable. The correlation that we see between the instrument and the outcome is not because the instrument has a causal effect on the outcome variable. Instead, the correlation is picking up the effect of some confounding variables.

Next, we make two more assumptions: that the instrument does have a causal effect on the treatment variable, and that the instrument is randomly assigned to units (or effectively so). Under these assumptions, the correlation between the instrument and the treatment in the data reflects the causal effect of the instrument on the treatment. In effect, it is as if we had run an RCT in which we randomly assign the instrument to units and measure the causal effect of the instrument on the treatment variable.

Because we assume the instrument is randomly assigned, we know that whatever correlation we see between the instrument and the treatment in the data is the causal effect of the instrument on the treatment, and that the instrument is not correlated with any other possible confounder apart from the treatment.

So where does that leave us? We have a variable called the instrument that is correlated with the outcome but does not have a causal effect on the outcome. Since the correlation isn’t causal, it must be the result of a confounder. The instrument does have a causal effect on the treatment, so we might be picking up the causal effect of the treatment on the outcome through this correlation, or something else. But with the instrument randomly assigned, there is no correlation with any other confounders except the treatment. We have ruled out all possible explanations for the correlation between the instrument and the outcome, except one: There is a causal effect of the treatment on the outcome, which is what we are trying to get at. That is the essence of instrumental variables analysis. For example, in Figure 1, the instrument is independent of everything, except through the treatment. In practice, the strength of the instrument is critical for IV analysis. In other words, we need to check how strongly the instrument is correlated with the treatment. Only when there is a strong instrument and it is randomly assigned are the causal estimates valid. If the instrument is very weak (i.e., the instrument and treatment are only weakly correlated), estimates of the treatment effect have extremely high variance. In that situation, IV analysis is not recommended.

Figure 1: Causal graph with instrumental variable
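
The logic of IV estimation can be sketched with a manual two-stage least squares on simulated data (a toy example with an unobserved confounder U and a single instrument Z, not a production implementation): the naive regression of the outcome on the treatment is biased, while the two-stage estimate recovers the true effect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 50_000

u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument: randomly assigned
t = 0.8 * z + 1.0 * u + rng.normal(size=n)    # treatment is driven by Z and U
y = 2.0 * t + 1.5 * u + rng.normal(size=n)    # true causal effect of T on Y is 2.0

# Naive regression of Y on T is biased because U moves both T and Y.
naive = LinearRegression().fit(t.reshape(-1, 1), y).coef_[0]

# Stage 1: predict the treatment from the instrument.
t_hat = LinearRegression().fit(z.reshape(-1, 1), t).predict(z.reshape(-1, 1))
# Stage 2: regress the outcome on the predicted (confounder-free) part of the treatment.
iv = LinearRegression().fit(t_hat.reshape(-1, 1), y).coef_[0]

print(f"naive estimate: {naive:.2f}, IV (2SLS) estimate: {iv:.2f}")  # roughly 2.6 vs 2.0
```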

Causal inference assumptions

Before going into the details of various methods for causal estimation, let’s review some of the main assumptions that we need to make before we can link the potential outcomes to observed data. The identifiability of causal effects requires making some untestable assumptions in an observational study. To describe the assumptions, we assume the observed data consist of an outcome Y, a treatment variable T, and a set of pre-treatment covariates X. Think of X as a collection of variables we want to control for, such as age, sex, race, and other demographics.

Assumption 1: Stable unit treatment value assumption (SUTVA)

The assumption made under SUTVA is that the potential outcomes for any unit/sample i do not vary with the treatments assigned to other units, and that, for each unit, there are no different forms or versions of each treatment level that would lead to different potential outcomes. SUTVA allows us to write the potential outcome for the ith unit in terms of only that unit’s treatment. (For more information, see Causal Inference for Statistics, Social, and Biomedical Sciences, Imbens and Rubin, 2015.) It emphasizes the following:

  • No interference: Units do not interfere with each other. Units, in this case, can be people, a company, a country, and so on. By no interference, we mean that the treatment assignment of one unit does not affect the outcome of another unit. An example of interference is a behavioral intervention in which the people in the study interact with each other. For instance, in motivational interviewing (MI), a behavioral intervention for treating drug use problems, two individuals with drug use problems may interact and share the experiences of their respective interventions with each other. The effectiveness of the intervention on one person might then depend on the interventions received by the people interacting with that person. Certain causal inference methods can handle interference.
  • One version of treatment: Potential outcomes should link unambiguously to observed data. For example, in studying the effect of exercise in lowering body fat, if one person exercises for 15 minutes and another for one hour, but we cannot tell which version of the treatment was assigned to which person, it will be difficult to isolate the effect of any one version of the treatment.

Assumption 2: Consistency

The potential outcome under treatment T = a, denoted Yᵃ, is equal to the observed outcome if the actual treatment received is T = a. (See Pearl 2010.) In other words, Y = Yᵃ if T = a, for all a.

Assumption 3: Ignorability

This assumption is sometimes referred to as the “no unmeasured confounders” assumption. (See A Survey on Causal Inference.) Given pre-treatment covariates X, treatment assignment is independent of the potential outcomes:

Y⁰, Y¹ ⫫ T | X

For example, suppose X is a single variable (age) that can take the values “younger” or “older”. Older people are more likely to get treatment T = 1. Older people are also more likely to have the outcome “hip fracture,” regardless of treatment. Thus, Y⁰ and Y¹ are not independent of T (marginally). However, within levels of X, treatment might be as good as randomly assigned.
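
The hip fracture example can be made concrete with a toy stratified estimate (illustrative numbers only): the crude treated-versus-control difference is distorted by age, but comparing within age strata and averaging across them recovers an adjusted estimate.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 100_000

older = rng.integers(0, 2, n)                            # X: 0 = younger, 1 = older
treated = rng.random(n) < np.where(older, 0.7, 0.2)      # older people are treated more often
fracture = rng.random(n) < np.where(older, 0.30, 0.05)   # risk rises with age; treatment adds nothing here

df = pd.DataFrame({"older": older, "treated": treated, "fracture": fracture})

# Crude comparison: treatment looks harmful, purely because of confounding by age.
crude = df[df.treated].fracture.mean() - df[~df.treated].fracture.mean()

# Within each age stratum treatment is as good as random, so take the
# per-stratum differences and average them weighted by stratum size.
per_stratum = df.groupby("older").apply(
    lambda g: g[g.treated].fracture.mean() - g[~g.treated].fracture.mean()
)
weights = df["older"].value_counts(normalize=True)
adjusted = (per_stratum * weights).sum()

print(f"crude difference: {crude:.3f}, age-adjusted difference: {adjusted:.3f}")  # ~0.13 vs ~0.00
```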

Assumption 4: Positivity

Positivity refers to the idea that every unit has some chance of receiving either treatment, conditional on covariates X. (See A Survey on Causal Inference.) At every level of X and for every treatment, units have a non-zero chance of receiving that treatment. In other words, treatment assignment is not a deterministic function of X.

P(T = a | X = x) > 0 for all a and x

If, for some values of X, treatment was deterministic, we would have no observed values of Y for one of the treatment groups for those values of X. Variability in treatment assignment is important for identification.
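
A common empirical check on positivity is to estimate propensity scores and inspect their overlap between the treated and control groups. A minimal sketch, assuming a binary treatment column and a handful of covariates (column names are hypothetical):

```python
from sklearn.linear_model import LogisticRegression

def check_overlap(df, treatment_col, covariates):
    """Fit a propensity model and report the range of scores within each treatment group."""
    model = LogisticRegression(max_iter=1000).fit(df[covariates], df[treatment_col])
    scores = model.predict_proba(df[covariates])[:, 1]
    for group in (0, 1):
        s = scores[(df[treatment_col] == group).to_numpy()]
        print(f"group {group}: propensity scores in [{s.min():.3f}, {s.max():.3f}]")
    # Scores pinned near 0 or 1 suggest a positivity violation: those units have
    # (almost) no chance of ever receiving the other treatment.
    return scores

# Hypothetical usage: scores = check_overlap(df, "treated", ["age", "monthly_spend"])
```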

Types of treatment effects

Our objective for causal inference is to estimate the treatment effect from the observational data. The treatment effect can be measured at the population, treated group, subgroup, and individual levels. In this article, we define the treatment effect under binary treatment, but it can be easily extended to multiple treatment cases. The definition for each type follows (see A Survey on Causal Inference for more information).

Average treatment effect (ATE): At the population level, the treatment effect is called the average treatment effect (ATE), which is defined as:

ATE = E[Y(T = 1) − Y(T = 0)]

where T denotes the treatment, and Y(T = 1) and Y(T = 0) are the potential treated and control outcomes of the whole population, respectively.

Average treatment effect on the treated (ATT): For the treated group, the treatment effect is called the average treatment effect on the treated group (ATT), and is defined as:

ATT = E[Y(T = 1)|T = 1] − E[Y(T = 0)|T = 1]

where Y(T = 1)|T = 1 and Y(T = 0)|T = 1 are the potential treated and control outcomes of the treated group, respectively. ATT is sometimes also referred to as the local average treatment effect (LATE).

Conditional average treatment effect (CATE): At the subgroup level, the treatment effect is called conditional average treatment effect (CATE), which is defined as:

CATE = E[Y(T = 1)|X = x] − E[Y(T = 0)|X = x]

where Y(T = 1)|X = x and Y(T = 0)|X = x are the potential treated and control outcomes of the subgroup with X = x, respectively. CATE is also known as the heterogeneous treatment effect.

Individual treatment effect (ITE): At the individual level, the treatment effect is called individual treatment effect (ITE), and the ITE of unit i is defined as:

ITEᵢ = Yᵢ(T = 1) − Yᵢ(T = 0)

where Yᵢ(T = 1) and Yᵢ(T = 0) are the potential treated and control outcomes of unit i, respectively. In related work from Künzel et al. (2019), estimators that minimize the mean square error (MSE) of the ITE of unit i also minimize the MSE for the CATE at xᵢ, and therefore the ITE is viewed as equivalent to the CATE.

Please note that ATE and ATT are estimated at the population level, while CATE and ITE are estimated at the subgroup or individual level. CATE (as mentioned, also known as the heterogeneous treatment effect) is where the treatment effect varies across different subgroups. When the goal is to understand program-level treatment effects, we focus on ATE/ATT. When the goal is to estimate the individual-level treatment effect, a heterogeneous research design is employed to estimate CATE.
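
In a simulation where both potential outcomes are known, these estimands reduce to simple averages. The sketch below computes them directly to show how they relate; with real observational data, the estimation methods covered later in this series would be needed instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 100_000

x = rng.integers(0, 2, n)                        # a single binary covariate
y0 = rng.normal(10, 2, n)                        # potential outcome under control
y1 = y0 + np.where(x == 1, 5.0, 1.0)             # treatment effect is heterogeneous in x
t = rng.random(n) < np.where(x == 1, 0.7, 0.3)   # units with x = 1 are treated more often

ite = y1 - y0                            # individual treatment effects (known only in simulation)
ate = ite.mean()                         # average over the whole population
att = ite[t].mean()                      # average over the treated units only
cate = pd.Series(ite).groupby(x).mean()  # averages within the subgroups defined by x

print(f"ATE  = {ate:.2f}")                  # about 3.0
print(f"ATT  = {att:.2f}")                  # larger than ATE: treated units skew toward x = 1
print(f"CATE = {cate.round(2).to_dict()}")  # about {0: 1.0, 1: 5.0}
```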

List of open-source Python packages for causal inference

There are multiple Python packages that implement various statistical and econometric methods within the causal inference framework, also known as treatment effect analysis or uplift modeling. There are a few popular R packages as well, such as CausalImpact by Google.

Below is a summary of a few popular open source Python packages or toolboxes for performing causal analysis. Table 1 shows the research group and commits for each package. Table 2 summarizes supported algorithms. Table 3 shows techniques for model validation and interpretation. We also list links to each package’s GitHub repository and documentation for your reference. Please be aware that the summary statistics represent a snapshot from August 2020 and are expected to evolve over time.

Historically, many machine learning approaches have focused on predicting outcomes rather than understanding causality, while many traditional causal inference approaches have struggled with high-dimensional datasets and complex environments in the absence of RCTs. Packages like EconML have played a pioneering role in incorporating causality into our AI and machine learning systems. Many of them are collections of state-of-the-art techniques exposed under common APIs for the estimation of treatment effects via machine learning.

As you can tell from the summaries, the combination of DoWhy and EconML (developed by Microsoft Research) is a powerful and comprehensive solution that covers numerous algorithms, model validations, and interpretation techniques. Moreover, the APIs of DoWhy and EconML are integrated with each other, so you can seamlessly use both libraries in the same analysis (for example, check out the example notebooks on customer segmentation or recommendations A/B testing). We highly recommend the integrated version of DoWhy and EconML. We advise readers to consider how recently and actively each package is maintained when choosing the most appropriate package for a given problem. Most of the toolkits are flexible and reusable across contexts and scenarios, and data scientists will find them easy to pick up even with limited or no background in economics.
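
To give a flavor of the DoWhy workflow, here is an illustrative sketch on simulated data with made-up column names and effect sizes (see the package documentation for the full, current API). EconML estimators can be plugged into the same estimation step, which is the integration recommended above.

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Toy observational dataset: "spend" is the outcome, "offer" the treatment,
# and "size" a common cause of both (all names and effect sizes are made up).
rng = np.random.default_rng(0)
n = 5000
size = rng.normal(0, 1, n)
offer = (rng.random(n) < 1 / (1 + np.exp(-size))).astype(int)
spend = 100 + 20 * offer + 30 * size + rng.normal(0, 10, n)
df = pd.DataFrame({"spend": spend, "offer": offer, "size": size})

# The four-step workflow: model the assumptions, identify the estimand,
# estimate the effect, then try to refute the estimate.
model = CausalModel(data=df, treatment="offer", outcome="spend", common_causes=["size"])
estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)  # should be close to the simulated effect of 20

refutation = model.refute_estimate(estimand, estimate, method_name="placebo_treatment_refuter")
print(refutation)
```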

Table 1: List of Python packages for causal analysis

Table 2: Supported algorithms by package

Table 3: Model validation and interpretation

Conclusions

In this article, we walked through different business scenarios and discussed when and why causal models can help. We provided the fundamentals of causal inference and summarized seven popular Python causal packages to get started with. Many of these packages are very handy for those new to causal inference. We highly recommend the Microsoft-integrated version (a combination of DoWhy and EconML), which is a powerful and comprehensive solution that is actively developed and features numerous algorithms as well as model validation and interpretation techniques. In the next article in this series, we’ll dive into the causal inference algorithms listed in Table 2 and provide guidance on algorithm selection for your own problem.

We’d like to thank the Microsoft Advanced AI school, Microsoft Research ALICE team, Microsoft Finance, and Microsoft Customer Program teams for being great partners in the research design and adoption of this work. We also would like to thank Ron Sielinski, Casey Doyle, Saurabh Kumar, Shijing Fang, and Saptarshi Chaudhuri for helping review the work.
