Causal inference (Part 2 of 3): Selecting algorithms

Jane Huang
Data Science at Microsoft
Nov 5, 2020

By Jane Huang, Daniel Yehdego, and Siddharth Kumar

Introduction

This is the second article of a series focusing on causal inference methods and applications. In Part 1, we discussed when and why causal models can help with different business problems. We also provided fundamentals for causal inference analysis and compared a few popular Python packages for causal analysis. In this article, we dive into details of various causal inference estimation methods and discuss algorithm selection for your own problem settings. Causal inference can be used on top of A/B tests in multiple ways to extract insights, but this article focuses mainly on estimation methods under unconfoundedness or on quasi-experimental bases when a randomized control trial (RCT) is not feasible.

Algorithm selection

As discussed in Part 1, RCT is the traditional gold standard that enables unbiased estimation of treatment effects. In observational studies, however, the possibility of bias arises because a difference in the treatment outcome may be caused by a factor that predicts treatment rather than by the treatment itself, a phenomenon known as confounding. As a result, it's desirable to replicate a randomized experiment as closely as possible through various strategies.

Causal inference consists of a family of statistical methods. In this article, we introduce two types of estimation methods: Estimation methods under unconfoundedness (also known as conditioning-based methods), and estimation methods for quasi-experiments (a research design that looks like an experimental design but lacks the key ingredient — random assignment of the treatment — studying instead pre-existing groups that received different treatments after the fact).

In quasi-experiments, commonly used approaches include simple natural experiments, instrumental-variables (IV), and regression-discontinuity models. We will dive into IV approaches in a later section of this article and briefly introduce the other two here.

  • For natural experiments, rather than assuming ignorability over the entire dataset, we need to find subsets that approximate an experiment. A few common sources are lotteries, prior A/B tests, and so on. Natural experiments allow common causes of treatment T and outcome Y, as long as the source of as-if randomness is not affected by them. They exploit “as-if random” assignment of treatments to measure outcomes. In other words, when the assignment of treatment is unrelated to the measured outcome and their common causes, we can treat it as if it were a randomized experiment. Keep in mind that as-if randomness is usually hard to find in practice, and the causal estimates are very sensitive to violations of the exclusion assumption.
  • A regression discontinuity model estimates the causal effects of treatments in situations where candidates are selected for treatment based on whether their values exceed a sharp cutoff or threshold. In most business problems that we work with, a sharp cutoff for treatment assignment rarely exists, but this approach is widely used in social policy studies and medical domains.

Most algorithms that we discuss in this article look at one snapshot of the outcome rather than at repeated observations of outcomes over time. We have carried out separate research to estimate the delay and decay of treatment effects over time, which we won't discuss in this article due to length constraints. If you would like to include repeated observations of the outcome over time in your causal research design, then depending on whether you have control time series data from a non-treated unit, you can consider approaches such as difference-in-differences, interrupted time series, and synthetic control, among others. Uber has also discussed various causal inference methods for time series observations in their post.

Figure 1 shows high-level workflow recommendations for causal algorithm selection for an observational study with one snapshot of the outcome. Please keep in mind that the list is not exhaustive, just a small representation of the approaches that we use on the Customer Growth Analytics team here at Microsoft.

Figure 1: Flow chart for causal algorithm selection in an observational study

Estimation methods under unconfoundedness

Many heterogeneous treatment effect estimation methods are theoretically valid only when all potential confounders are observed. These methods attempt to approximate the gold standard of RCT. We refer to them as causal estimation methods under unconfoundedness. It's important to note that when some potential confounders remain unobserved, these methods can reduce, but not eliminate, bias relative to raw correlations.

Matching methods

Matching methods balance the distributions of the covariates that predict treatment assignment and create a “pseudo-population” in which the treatment is independent of the measured confounding variables. They first look for units whose control variables have the same values but that receive different treatments, which can be done post hoc by taking a treated unit and then finding a non-treated unit with very similar values. That pair is called a match. The next step is to record the difference in outcome between the treated and untreated units in that match, a process that mimics an RCT. As such, matching can be used to reduce or eliminate the effects of confounding variables on observational data when estimating a treatment effect. To compare the closeness between two units, a wide variety of distance metrics can be used, such as Euclidean distance and Mahalanobis distance. Generally, distance metrics can be represented using the following equation:

D(xᵢ, xⱼ) = ‖ f(xᵢ) − f(xⱼ) ‖

where xᵢ and xⱼ denote the features of the iᵗʰ and jᵗʰ units and f is a transformation function.

Some matching methods develop their own distance metrics through the transformation function f. One commonly used transformation is based on the propensity score. In propensity score matching, we first estimate the propensity score, which can be thought of as the likelihood or probability that an individual unit receives the treatment. Estimating the propensity score can be done in different ways, but typically it is done using classification models such as logistic regression, random forest, XGBoost, or LightGBM.
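As a concrete illustration, here is a minimal sketch of one-to-one propensity score matching on synthetic data using scikit-learn. The data-generating process and variable names are our own toy example, not from any particular package:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic observational data: confounder X affects both treatment and outcome.
n = 5000
X = rng.normal(size=(n, 3))
propensity_true = 1 / (1 + np.exp(-X[:, 0]))  # treatment depends on X
T = rng.binomial(1, propensity_true)
Y = 2.0 * T + X[:, 0] + rng.normal(size=n)    # true effect = 2.0

# Step 1: estimate propensity scores e(x) with a binary classifier.
e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# Step 2: for each treated unit, find the nearest control unit on the
# estimated propensity score (one-to-one matching with replacement).
treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(e_hat[control].reshape(-1, 1))
_, idx = nn.kneighbors(e_hat[treated].reshape(-1, 1))
matched_control = control[idx.ravel()]

# Step 3: average outcome difference across matched pairs (ATT estimate).
att = np.mean(Y[treated] - Y[matched_control])
print(f"Matched ATT estimate: {att:.2f}")  # should be close to 2.0
```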

But matching methods are not the only ones available. Several others are also explored below.

Re-weighting methods

Inverse propensity weighting (IPW): Like matching methods, re-weighting methods create a “pseudo-population” to address the challenge of selection bias due to the different distributions of the treated and control groups. The main idea of weighting is to assign an appropriate weight to each sample in the observational dataset so that the distributions of the treated group and control group become similar. Then we can calculate statistics based on the re-weighted pseudo-population. When correctly applied, weighting can potentially improve efficiency and reduce the bias of unweighted estimators. If the propensity score e(x) is the conditional probability of assignment to treatment given a vector of observed covariates x, then in the weighting method the outcome of the treated units is weighted by w(x) = 1 / e(x), while the control units are weighted by w(x) = 1 / (1 − e(x)).

Figure 2: Inverse propensity weighting example

After re-weighting, the IPW estimator of ATE is defined as (see A Survey on Causal Inference):

ATE_IPW = (1/n) Σᵢ [ Tᵢ Yᵢ / ê(xᵢ) − (1 − Tᵢ) Yᵢ / (1 − ê(xᵢ)) ]

where n denotes the sample size, ê(x) is the estimated propensity score given features x, Tᵢ is the treatment assignment for the iᵗʰ unit, and Yᵢ denotes the observed outcome for the iᵗʰ unit.
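To make the estimator concrete, here is a minimal from-scratch IPW sketch on synthetic data (our own toy setup), with the usual practical caveat that extreme propensity scores should be clipped or trimmed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data with a single confounder driving treatment and outcome.
n = 10000
x = rng.normal(size=(n, 1))
T = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))
Y = 1.5 * T + x[:, 0] + rng.normal(size=n)  # true ATE = 1.5

# Estimate propensity scores and clip them away from 0 and 1 for stability.
e_hat = LogisticRegression().fit(x, T).predict_proba(x)[:, 1]
e_hat = np.clip(e_hat, 0.01, 0.99)

# IPW estimator of the ATE, matching the formula above.
ate_ipw = np.mean(T * Y / e_hat - (1 - T) * Y / (1 - e_hat))
print(f"IPW ATE estimate: {ate_ipw:.2f}")   # should be close to 1.5
```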

Theoretical results show that adjusting for the scalar propensity score is sufficient to remove bias due to all observed covariates. However, IPW relies heavily on the correctness of the propensity scores, a weakness that doubly robust learning, described in the next section, helps remedy.

Doubly robust learning (DR): Doubly robust learning is a method for estimating heterogeneous treatment effects when the treatment is categorical and all potential confounders/controls are observed, but there are either too many for classical statistical approaches to be applicable or their effect on the treatment and outcome cannot be satisfactorily modeled by parametric functions (see EconML API documentation).

Doubly robust methods reduce the problem to estimating two predictive tasks:

  • Predicting the outcome from the treatment and controls
  • Predicting the treatment from the controls

Unlike the double machine learning method, which we introduce in a later section, the first model predicts the outcome from both the treatment and the controls, as opposed to just the controls. Then the method combines these two predictive models in a final estimation stage, creating a model of the heterogeneous treatment effect. In contrast to inverse propensity weighting (IPW), the DR approach first fits a direct regression model, and then debiases that model by applying an inverse propensity approach to the residual of that model, rather than to a training sample directly. The approach allows for arbitrary machine learning algorithms to be used for the two predictive tasks, while maintaining many favorable statistical properties related to the final model (e.g., small mean squared error, asymptotic normality, and construction of confidence intervals). The latter favorable statistical properties hold if either the first or second of the two predictive tasks achieves small mean squared error (hence the name “doubly robust”).
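To give a sense of the mechanics, here is a from-scratch sketch of a doubly robust (AIPW-style) ATE estimator on synthetic data. This is a common variant that fits a separate outcome model per treatment arm; in practice you would likely use a packaged implementation such as EconML's doubly robust learners, which also handle CATE estimation and inference:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: X confounds both treatment and outcome; true ATE = 1.0.
n = 10000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 1.0 * T + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Task 1: outcome models fit separately on treated and control units.
mu1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1]).predict(X)
mu0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0]).predict(X)

# Task 2: propensity model for the treatment, clipped for stability.
e = np.clip(LogisticRegression().fit(X, T).predict_proba(X)[:, 1], 0.01, 0.99)

# AIPW: direct regression estimate, debiased with inverse-propensity-weighted
# residuals. Consistent if either the outcome or the propensity model is right.
ate_dr = np.mean(
    mu1 - mu0
    + T * (Y - mu1) / e
    - (1 - T) * (Y - mu0) / (1 - e)
)
print(f"Doubly robust ATE estimate: {ate_dr:.2f}")
```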

Meta-learning methods

Algorithms called meta-learners can take advantage of any supervised learning or regression methods in machine learning and statistics to estimate a treatment effect, such as the conditional average treatment effect (CATE) function discussed in Part 1. In meta-learner methods, the treatment space must be discrete. Meta-learners build on base algorithms — such as logistic regression (LR), random forests (RF), XGBoost, Bayesian additive regression trees (BART), or neural networks, among others — to estimate the CATE. In this section, we briefly introduce four types of meta-learners, including S-, T-, X-, and R-learners (see Causal ML API documentation).

Currently, the S-learner, T-learner, and X-learner algorithms are available in both the CausalML and EconML packages for Python. The R-learner is available in CausalML and is equivalent to the non-parametric DML CATE estimator in EconML, up to naming conventions; more generally, all DML CATE estimators in EconML are special instances of the R-learner. The following sections provide a quick walkthrough of the main ideas and formulation of each algorithm, with links to the original papers for those who want to go beyond this high-level introduction.

S-learner: This algorithm estimates the target variable using all the covariate features plus the treatment indicator, without giving the treatment indicator any special role. The estimate is produced by a single-algorithm estimator, hence the name S-learner. The estimation consists of two stages. First, a predictive model is fit with the outcome as the target and both the treatment indicator and the other features as inputs. Then the difference is calculated between the estimated values when the treatment assignment indicator is changed from control to treatment, with all other features held fixed. That difference is the CATE for an individual unit. Different regressors, such as XGBoost, can be used as the base regressor in stage 1, and the one with the lowest error in terms of SMAPE or another error metric can be chosen.

  • Stage 1: Estimate the average outcome μ(x, t) with covariates X and an indicator variable for treatment T:

μ(x, t) = E[ Y | X = x, T = t ]

  • Stage 2: Define the CATE estimate as:

τ̂(x) = μ̂(x, T = 1) − μ̂(x, T = 0)
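A minimal S-learner sketch on synthetic data (our own toy example) makes the two stages concrete:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic data with a heterogeneous effect: tau(x) = 1 + x0.
n = 8000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 0.5, size=n)
Y = (1 + X[:, 0]) * T + X[:, 1] + rng.normal(size=n)

# Stage 1: one model of the outcome on features plus the treatment indicator.
XT = np.column_stack([X, T])
model = GradientBoostingRegressor().fit(XT, Y)

# Stage 2: CATE = prediction with T=1 minus prediction with T=0, X held fixed.
cate = (model.predict(np.column_stack([X, np.ones(n)]))
        - model.predict(np.column_stack([X, np.zeros(n)])))
print(f"Mean CATE estimate: {cate.mean():.2f}")  # close to E[1 + x0] = 1.0
```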

T-learner: This algorithm estimates the response functions separately for the treatment and control populations. First, it uses base learners to estimate the conditional expectations of the outcomes separately for units under control and those under treatment. Second, it takes the difference between these estimates. We refer to the general mechanism of estimating the response functions separately as the T-learner, “T” being short for “two.”

  • Stage 1: Estimate the average outcomes μ₀(x) and μ₁(x):

μ₀(x) = E[ Y(0) | X = x ] and μ₁(x) = E[ Y(1) | X = x ]

using arbitrary machine learning models.

  • Stage 2: Define the CATE estimate as:

τ̂(x) = μ̂₁(x) − μ̂₀(x)
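The corresponding sketch for the T-learner (same synthetic setup as above, our own toy example) fits the two response surfaces separately:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic data: tau(x) = 1 + x0, randomized treatment.
n = 8000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 0.5, size=n)
Y = (1 + X[:, 0]) * T + X[:, 1] + rng.normal(size=n)

# Stage 1: fit mu0 and mu1 on the control and treated subsets separately.
mu0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0])
mu1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1])

# Stage 2: CATE is the difference between the two response-surface estimates.
cate = mu1.predict(X) - mu0.predict(X)
print(f"Mean CATE estimate: {cate.mean():.2f}")
```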

X-learner: When real-world data contain many more control units than treated units, the risk of overfitting the treatment group with a T-learner is high. The X-learner tries to avoid this by using information from the control group to derive better estimators for the treatment group, and vice versa. Like the S- and T-learners, the X-learner can use arbitrary machine learning algorithms such as XGBoost as the base regressor, choosing the one with the lowest error in terms of SMAPE or another metric. The X-learner builds on the T-learner and uses the observations in the training set crosswise between the two groups, hence the name X-learner (see Künzel 2017). It consists of three stages:

  • Stage 1: Estimate the average outcomes μ₀(x) and μ₁(x):

μ₀(x) = E[ Y(0) | X = x ] and μ₁(x) = E[ Y(1) | X = x ]

using arbitrary machine learning models.

  • Stage 2: Impute the user-level treatment effects Dᵢ¹ for user i in the treatment group based on μ₀(x), and Dⱼ⁰ for user j in the control group based on μ₁(x):

Dᵢ¹ = Yᵢ¹ − μ̂₀(xᵢ¹) and Dⱼ⁰ = μ̂₁(xⱼ⁰) − Yⱼ⁰

Then estimate τ₁(x) = E[ D¹ ∣ X = x ] and τ₀(x) = E[ D⁰ ∣ X = x ] using machine learning models.

  • Stage 3: Define the CATE estimate as a weighted average of τ̂₀(x) and τ̂₁(x):

τ̂(x) = g(x) τ̂₀(x) + (1 − g(x)) τ̂₁(x)

where g ∈ [0,1] is a weighting function.

The X-learner uses the weighting function g to minimize the variance of the CATE estimate. Using an estimate of the propensity score for g is recommended, but it can also make sense to choose g = 1 or g = 0 when the treatment group is very small or very large relative to the control group. As a result, having highly accurate propensity estimates is not as important for the X-learner.
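Putting the three stages together, here is a from-scratch X-learner sketch on an imbalanced synthetic dataset (our own toy example; packaged implementations exist in CausalML and EconML):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Imbalanced synthetic data: only ~10% treated, tau(x) = 1 + x0.
n = 10000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 0.1, size=n)
Y = (1 + X[:, 0]) * T + X[:, 1] + rng.normal(size=n)

# Stage 1: outcome models per arm, as in the T-learner.
mu0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0])
mu1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1])

# Stage 2: imputed individual treatment effects, crosswise between the arms.
d1 = Y[T == 1] - mu0.predict(X[T == 1])  # treated: observed minus imputed control
d0 = mu1.predict(X[T == 0]) - Y[T == 0]  # control: imputed treated minus observed
tau1 = GradientBoostingRegressor().fit(X[T == 1], d1)
tau0 = GradientBoostingRegressor().fit(X[T == 0], d0)

# Stage 3: combine with a weighting function g(x); propensity is recommended.
g = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
cate = g * tau0.predict(X) + (1 - g) * tau1.predict(X)
print(f"Mean CATE estimate: {cate.mean():.2f}")
```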

R-learner: The two steps within the R-learner include (as outlined by Nie and Wager 2017):

  • Step 1: Estimate marginal effects and treatment propensities to form an objective function that isolates the causal component of the signal.
  • Step 2: Optimize the data-adaptive objective function.

One critical component of the R-learner is the R-loss function for treatment effect estimation using cross-fitting. Suppose we observe n independent and identically distributed examples (xᵢ, yᵢ, tᵢ), i = 1, …, n, where xᵢ denotes per-person features, yᵢ ∈ ℝ is the observed outcome, and tᵢ is the treatment assignment. The potential outcomes {Yᵢ(0), Yᵢ(1)} correspond to the outcomes we would have observed given treatment assignment tᵢ = 0 or 1, respectively, such that Yᵢ = Yᵢ(tᵢ). Assuming unconfoundedness, i.e., that the treatment assignment is as good as randomized once we control for the features xᵢ, we estimate the CATE function τ*(x) = E[ Y(1) − Y(0) | X = x ]. The treatment propensity and the conditional mean outcome can be represented as:

e*(x) = P( T = 1 | X = x ) and m*(x) = E[ Y | X = x ]
The CATE function τ*(x) can be re-written in terms of the conditional mean outcome as follows:

Yᵢ − m*(xᵢ) = ( Tᵢ − e*(xᵢ) ) · τ*(xᵢ) + εᵢ, where E[ εᵢ | xᵢ, Tᵢ ] = 0

This decomposition is known as Robinson's transformation, and the R-learner uses it for flexible treatment effect estimation built on modern machine learning approaches. The main idea is to turn this representation into a loss function that captures heterogeneous treatment effects, so that we can accurately estimate the treatment effects by seeking a regularized minimizer of that loss function.

The transformation can be equivalently expressed as an oracle minimization problem (see Robins, 2004):

τ*(·) = argmin_τ E[ ( (Yᵢ − m*(xᵢ)) − (Tᵢ − e*(xᵢ)) τ(xᵢ) )² ]

Then we estimate the heterogeneous treatment effect function τ*(·) by empirical loss minimization:

τ̂(·) = argmin_τ { (1/n) Σᵢ ( (Yᵢ − m̂(xᵢ)) − (Tᵢ − ê(xᵢ)) τ(xᵢ) )² + Λₙ(τ(·)) }

where Λₙ(τ(·)) is a regularizer on the complexity of τ.
Sharing more details: We divide the data into K (e.g., 5 or 10) evenly sized folds. Let q(·) be a mapping from the sample indices i = 1, …, n to the K folds, and fit m̂ and ê with cross-fitting over the K folds via methods tuned for optimal predictive accuracy. Then we can estimate the treatment effect via a plug-in version of the second formula:

τ̂(·) = argmin_τ { (1/n) Σᵢ ( (Yᵢ − m̂^(−q(i))(xᵢ)) − (Tᵢ − ê^(−q(i))(xᵢ)) τ(xᵢ) )² + Λₙ(τ(·)) }

where m̂^(−q(i)) and ê^(−q(i)) denote models fit without using the fold that contains the iᵗʰ observation.
Forest-based estimators

Forest-based methods fit very flexible non-linear models of the heterogeneous treatment effect and typically perform well with many features. In addition, these methods can provide valid confidence intervals despite being data-adaptive and non-parametric, so they are recommended when you have many features, want to explore effect heterogeneity, and need confidence intervals. A few commonly used estimators are orthogonal random forests, causal forests (forest double machine learning), and the forest doubly robust learner. For more details, please refer to the EconML API documentation.
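As one illustration, a causal forest can be fit in a few lines via EconML. Treat this as a hedged sketch: the class name and module path shown here (econml.dml.CausalForestDML) and its arguments may differ across EconML versions, so check the documentation for the version you install:

```python
import numpy as np
from econml.dml import CausalForestDML  # module path may differ by version

rng = np.random.default_rng(0)

# Synthetic data: many features, heterogeneous effect driven by the first one.
n, p = 5000, 20
X = rng.normal(size=(n, p))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = (1 + X[:, 0]) * T + X[:, 1] + rng.normal(size=n)

# Fit a causal forest and get CATE point estimates with confidence intervals.
est = CausalForestDML(discrete_treatment=True, random_state=0)
est.fit(Y, T, X=X)
cate = est.effect(X)                         # point estimates
lb, ub = est.effect_interval(X, alpha=0.05)  # 95% confidence intervals
print(cate.mean(), lb.mean(), ub.mean())
```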

Stratification

To adjust for selection bias, stratification (also known as subclassification or blocking) splits the entire sample into homogeneous subgroups, where within each subgroup the treated group and the control group are ideally similar under certain measurements over the covariates. The treatment effect within each subgroup can then be calculated as if it came from a randomized experiment, and the treatment effect over the group of interest is obtained by combining the CATEs of the subgroups belonging to that group. For example, if we separate the whole dataset into J blocks, the ATE for stratification is estimated as:

ATE_strat = Σⱼ q(j) · ( Ȳₜ(j) − Ȳc(j) )

where Ȳₜ(j) and Ȳc(j) are the averages of the treated and control outcomes in the jᵗʰ block, respectively, and q(j) = N(j)/N is the proportion of units that fall in the jᵗʰ block. (See: A Survey on Causal Inference.)
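A minimal sketch of propensity-score stratification on synthetic data (our own toy example, using quintile blocks) shows the within-block-then-combine pattern:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic confounded data; true ATE = 1.0.
n = 20000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 1.0 * T + X[:, 0] + rng.normal(size=n)

# Stratify on the estimated propensity score into 5 quintile blocks.
e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
df = pd.DataFrame({"T": T, "Y": Y,
                   "block": pd.qcut(e_hat, 5, labels=False)})

# Within-block effect, then a weighted average with weights q(j) = N(j)/N.
ate = 0.0
for _, blk in df.groupby("block"):
    effect = (blk.loc[blk["T"] == 1, "Y"].mean()
              - blk.loc[blk["T"] == 0, "Y"].mean())
    ate += (len(blk) / n) * effect
print(f"Stratified ATE estimate: {ate:.2f}")
```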

Double machine learning

Double machine learning is a method for estimating heterogeneous treatment effects when all potential confounders are observed but are either too many for classical statistical approaches to be applicable or their effect on the treatment and outcome cannot be satisfactorily modeled by parametric functions. Unlike the doubly robust learner and the meta-learners, this method can be applied to both discrete and continuous treatments. The method reduces the problem to first estimating two predictive tasks:

  • Predicting the outcome from the controls
  • Predicting the treatment from the controls

Then the method combines these two predictive models in a final stage estimation to create a model of the heterogeneous treatment effect. The approach allows for arbitrary machine learning algorithms to be used for the two predictive tasks, while maintaining many favorable statistical properties related to the final model (e.g., small mean squared error, asymptotic normality, and construction of confidence intervals). Mathematically, the double machine learning model is constructed as follows:

  • Stage 1: Fit the two predictive models with arbitrary ML methods:

q(X, W) = E[ Y | X, W ] and f(X, W) = E[ T | X, W ]

  • Stage 2: Form the residuals Ỹ = Y − q̂(X, W) and T̃ = T − f̂(X, W).

Linear regression on residuals:

Ỹ = θ(X) · T̃ + ε

Note: for constant θ, θ is simply the coefficient from the OLS regression of Ỹ on T̃,
where Y is the observed outcome for the chosen treatment, T is the treatment, X represents the covariates used for heterogeneity, and W represents other observable covariates that we believe are affecting the potential outcome Y and potentially also the treatment T. We refer to variables W as controls. The variables X can also be thought of as control variables, but they are special in the sense that they are a subset of the controls with respect to which we want to measure treatment effect heterogeneity. We will refer to them as features.
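For the constant-θ case, the whole procedure fits in a short from-scratch sketch (our own toy setup with a continuous treatment; EconML's LinearDML packages the same idea with inference):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic data with continuous treatment; constant true effect theta = 0.7.
n = 10000
W = rng.normal(size=(n, 3))
T = W[:, 0] + rng.normal(size=n)  # treatment depends on the controls
Y = 0.7 * T + W[:, 0] + 0.5 * W[:, 1] + rng.normal(size=n)

# Stage 1: cross-fitted predictions of Y and T from the controls; residualize.
y_res = Y - cross_val_predict(GradientBoostingRegressor(), W, Y, cv=5)
t_res = T - cross_val_predict(GradientBoostingRegressor(), W, T, cv=5)

# Stage 2: for a constant effect, theta is the OLS coefficient of the
# outcome residuals on the treatment residuals.
theta = LinearRegression(fit_intercept=False).fit(
    t_res.reshape(-1, 1), y_res).coef_[0]
print(f"DML estimate of theta: {theta:.2f}")  # should be close to 0.7
```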

If you are working with a continuous heterogeneous treatment space, double machine learning is the recommended default algorithm to start with. If you are working with a discrete heterogeneous treatment space, meta-learners and the doubly robust learner are also worth checking out. Table 1 compares more fundamental details of meta-learners and double machine learning; the meta-learner formulas are shown on two response surfaces, Y(0) and Y(1), for illustration purposes, even though they can model multiple response surfaces, Y(0) through Y(K), as well.

Table 1: Meta learners vs double machine learning.

Estimation methods with instruments

An instrumental variable is a random variable that influences the treatment but has no direct effect on the outcome other than through the treatment. Even when there are unobserved confounders (factors that simultaneously have a direct effect on the treatment decision in the collected data and on the observed outcome), an instrumental variable (if one exists) makes it possible to estimate causal effects despite the presence of confounding latent variables. Instrumental variables (IV) can be used to remove correlation between the error term and the independent variables in a model, thereby addressing endogeneity. The assumptions made are weaker than the unconfoundedness assumption needed by many other algorithms, such as double machine learning or meta-learners; the cost is that when unconfoundedness does hold, instrumental variable estimators are less efficient. In the complex environment of the real world, the unconfoundedness assumption can rarely be fully satisfied, and unobserved confounding is a critical challenge of causal inference. Whenever instruments exist, it is better to prioritize this research design, but we need to be careful in model validation to make sure we have proper and strong instruments; otherwise the estimates are very sensitive to violations.

In addition to the IV methods discussed below, there are a few new and important orthogonal IV methods for heterogeneous treatment effects like DMLIV (double machine learning instrumental variables), DRIV (doubly robust IV), and Intent To Treat DRIV available in EconML, which implement new methods for non-parametric estimation of CATE with instruments and arbitrary ML. For more details, please refer to Vasilis 2019.

Double least square (also known as two-stage least squares, or 2SLS): The ordinary least squares (OLS) method makes the basic assumption that the value of the error term is independent of the predictor variables. When this assumption is broken, the two-stage least squares technique helps solve the problem. The analysis assumes that there is a secondary predictor (the instrument) that is correlated with the problematic endogenous predictor but not with the error term. Given the existence of the instrumental variable, the following two stages are used (a minimal sketch follows this list):

  • In the first stage, we obtain the predicted endogenous variable Ŷ₂ from an OLS regression on all exogenous variables, including all of the instruments. We should conduct an F-test to check whether the instruments are jointly significant in the first-stage regression for the endogenous variable Y₂; the instruments should be strongly correlated with Y₂ for the 2SLS estimators to be reliable.
  • In the second stage, the model-estimated values Ŷ₂ from the first stage are used in place of the actual values Y₂ of the problematic predictor to compute an OLS model for the response of interest Y₁. All exogenous independent variables are included in the second-stage regression, but not the instruments.
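Here is a minimal sketch of the two stages on synthetic data (our own toy setup with one instrument and no other exogenous variables). Note that running the two OLS regressions by hand gives correct point estimates but incorrect naive standard errors; a packaged 2SLS implementation, such as IV2SLS in the linearmodels package, handles that adjustment:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic endogenous setup: unobserved u confounds T and Y; Z is a valid
# instrument (moves T, affects Y only through T). True effect = 2.0.
n = 20000
u = rng.normal(size=n)                 # unobserved confounder
Z = rng.normal(size=n)                 # instrument
T = 0.8 * Z + u + rng.normal(size=n)
Y = 2.0 * T + u + rng.normal(size=n)

# Stage 1: regress the endogenous treatment on the instrument(s),
# keeping the fitted values.
t_hat = LinearRegression().fit(Z.reshape(-1, 1), T).predict(Z.reshape(-1, 1))

# Stage 2: regress the outcome on the fitted values from stage 1.
effect = LinearRegression().fit(t_hat.reshape(-1, 1), Y).coef_[0]
print(f"2SLS estimate: {effect:.2f}")  # close to 2.0

# For comparison, naive OLS of Y on T is biased by the confounder u.
naive = LinearRegression().fit(T.reshape(-1, 1), Y).coef_[0]
print(f"Naive OLS estimate: {naive:.2f}")
```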

Sieve two-stage least squares (sieve 2SLS): In contrast to the parametric nature of the double least squares model, sieve 2SLS is a nonparametric model. We must specify the sieve basis for T, X, and Y (e.g., Hermite polynomials or a set of indicator functions) and the number of elements of the basis expansion to include. For more details, please refer to Bruce 2014.

Deep instrumental variables (IV): Deep IV applies the instrumental variables approach to estimate causal effects despite the presence of confounding latent variables. The setup of the model is as follows:

Y = g(T, X, W) + ε

where E[ ε | X, W, Z ] = h(X, W), so that the expected value of Y depends only on (T, X, W). This is known as the exclusion restriction. We also assume that the conditional distribution F(T | X, W, Z) varies with Z; this is known as the relevance condition. The heterogeneous treatment effect that we want to learn is:

τ(t₀, t₁, x, w) = g(t₁, x, w) − g(t₀, x, w)

The deep IV module learns the heterogeneous causal effects by minimizing the “reduced-form” prediction error:

min_g Σᵢ ( yᵢ − ∫ g(t, xᵢ, wᵢ) dF̂(t | xᵢ, wᵢ, zᵢ) )²

where the hypothesis class for g consists of neural nets with a given architecture.

This estimate is obtained by modeling F as a mixture of normal distributions, where the parameters of the mixture model are the output of a “first-stage” neural net whose inputs are (xᵢ, wᵢ, zᵢ). Optimization of the “first-stage” neural net is done by stochastic gradient descent on the (mixture-of-normals) likelihood, and optimization of the “second-stage” model for the treatment effects is also done by stochastic gradient descent. The output is an estimated function ĝ. To obtain an estimate of τ, we take the difference of the estimated function at t₁ and t₀, replacing the expectation with the empirical average over all observations with the specified x. For more details, please refer to Hartford 2017.

Conclusion

In this post, we went through details of various algorithms and provided insights into algorithm selection for different problem settings. One thing that we need to keep in mind is that for causal inference, we are aiming for robust and valid causal estimates. A more complicated model doesn’t always result in a more accurate treatment effect estimate in a causal context. Business goals and available data are the main drivers of algorithm selection. Then we can try and compare multiple estimation methods.

To validate the model, we do encourage trying multiple algorithms and comparing the estimates. We will introduce more details about model validation and application in the next article in our series. We hope this series helps you conquer your business problem through causal thinking. Please leave a comment to share your application scenarios and the techniques you are using today. We look forward to hearing from you.

We’d like to thank the Microsoft Advanced AI school, Microsoft Research ALICE team, Finance, and Customer Program teams for being great partners in the research design and adoption of this work. We also would like to thank Ron Sielinski, Casey Doyle, and Deepsha Menghani for helping review the work.
