Causal Inference: An Introduction

Siddhant Haldar
Published in Analytics Vidhya · 7 min read · May 18, 2020

Lately, the concept of causality has been gaining popularity in machine learning and artificial intelligence due to its inherent relation to how the world works. Causality refers to the relation between a cause and its effect, i.e., it rests on the assumption that every effect arises from a specific cause or set of causes. In this blog, we will cover the basics of causal inference, familiarise ourselves with the terminology of the domain, and delve deeper into the science of counterfactuals and counterfactual inference, which has become increasingly popular in recent times.

Prerequisite : This blog expects the readers to be familiar with the basic concept of neural networks and how they are trained with the help of loss functions and optimizers.

Causal Inference

The domain of causal inference is based on the simple principle of cause and effect, i.e., our actions directly cause an immediate effect. With causal inference, we can directly find out how changes in policy (or actions) create changes in real-world outcomes. Let us familiarise ourselves with the terminology used in the domain.

Unit : An individual sample in the data.

Variable : A characteristic of the unit of analysis in the dataset. For example, if the units are people, variables could be age, income or education.

Population : Collected set of all units of analysis.

Outcome Variable : The particular variable that we want to affect.

Policy / Treatment Variable : The variable used to create changes (i.e. actions).

Intervention : Changing the value of the treatment variable keeping other variable values constant in order to study the effect of changes in the treatment variable.

Confounder : Variables that affect both the input and the outcome variables.

Counterfactual Outcome : The outcome that would have occurred had the unit been exposed to a different treatment (cannot be observed).

Unit Level Causal Effect : Causal effect on a single unit in the data due to a change in the value of a particular variable keeping all other variables constant.

Interaction Effects : This refers to the fact that other variables may also influence the value of a causal effect.

Heterogeneity : Refers to the concept that different units exhibit different causal effects on being subjected to the same treatment.

Given a distribution of input variables X and a distribution of outcome variables (effects) Y, causal calculus differentiates between two kinds of conditional distributions - the observational distribution P(y|x) and the interventional distribution P(y|do(x)). P(y|x) represents the distribution of Y given that we observe variable X takes value x, whereas P(y|do(x)) describes the distribution of Y that would be observed by artificially forcing the variable X to take a value x, keeping the rest of the variables in accordance with the process that generated the data. Such interventional distributions can be obtained using randomized controlled trials (RCTs). However, it might not always be possible to carry out such experiments due to ethical or practical concerns. More about causal inference and its relation with do-calculus can be found here.
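The gap between the two distributions can be seen in a small simulation with a hypothetical confounder Z that affects both X and Y. The data-generating process and all numbers below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process: a confounder Z influences both
# the treatment X and the outcome Y (true causal effect of X on Y is 1).
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))   # Z makes X more likely
y = x + 2 * z + rng.normal(0, 0.1, n)

# Observational: condition on X as it occurred in the data.
obs = y[x == 1].mean() - y[x == 0].mean()

# Interventional: force X for every unit, keeping Z as generated.
y_do1 = 1 + 2 * z + rng.normal(0, 0.1, n)
y_do0 = 0 + 2 * z + rng.normal(0, 0.1, n)
interv = y_do1.mean() - y_do0.mean()

print(f"E[Y|X=1] - E[Y|X=0]         = {obs:.2f}")     # inflated by confounding
print(f"E[Y|do(X=1)] - E[Y|do(X=0)] = {interv:.2f}")  # close to the true effect, 1
```

Conditioning on X mixes in the effect of Z (units with X = 1 tend to also have Z = 1), while the do-intervention breaks the Z → X link and recovers the causal effect, which is exactly what an RCT achieves by randomizing X.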

The Fundamental Problem of causal inference is that in the real world, each unit can be subjected to just one of the multiple treatments and only the outcome corresponding to that treatment can be observed. The effect of other treatments on the same unit cannot be directly observed and that is what we want to predict using causal inference techniques.

Note : Correlation does not imply causation. Correlation refers to how two variables are related, whereas causation studies whether one variable affects the other. Two variables might vary in a correlated fashion even though neither is causing changes in the other.

Frameworks for Causal Inference

Structural Equation Modelling (SEM). Structural equation modelling is a methodology for representing, estimating, and testing a network of relationships between variables (measured variables and latent constructs). The purpose of the model, in the most common form of SEM, is to account for variation and covariation of the measured variables.

Structural Equation Model — Relationship between academic and job constructs

Potential Outcomes Framework. Also known as the Rubin causal model (RCM), the potential outcomes framework is based on the idea of potential outcomes. For example, a person would have a particular income at age 40 if she had attended college, whereas she would have a different income at age 40 if she had not attended college. To measure the causal effect of going to college for this person, we need to compare the outcome for the same individual in both alternative futures. Since it is impossible to see both potential outcomes at once, one of the potential outcomes is always missing.
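The missing-outcome problem can be made concrete with a toy simulation in which the simulator knows both potential outcomes for every person, but only one is ever observed. All numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical potential incomes at age 40 (in $1000s) for 5 people:
# y1 with college, y0 without. Both are known only to the simulator.
y0 = np.array([40., 55., 35., 60., 45.])
y1 = y0 + np.array([15., 5., 20., 2., 10.])   # individual causal effects

t = rng.binomial(1, 0.5, 5)                   # who actually attended college
y_observed = np.where(t == 1, y1, y0)         # factual outcome, in the data
y_missing = np.where(t == 1, y0, y1)          # counterfactual, never observed

print("treatment:", t)
print("observed: ", y_observed)
print("missing:  ", y_missing)
```

For each person exactly one of the two columns reaches the dataset; estimating the missing column is the task of the causal inference methods discussed below.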

Counterfactual Inference

Counterfactual inference can, in a way, be defined as the prediction of an alternate reality. Given a pair of a cause and its effect, counterfactual inference focuses on answering the question — “What would have been the effect of a different treatment applied to the unit, keeping all other conditions constant?”. For example, say a patient suffering from a particular ailment is administered a drug D1 and they show signs of improvement. A counterfactual enquiry in this scenario would be to ask whether the patient would still have shown a similar improvement had they been given a different drug D2.

There are multiple ways of doing counterfactual inference but since this is an introductory blog, we will just be mentioning a few very basic modelling techniques.

Bayesian Structural Modelling. In this method, a causal graph is designed and Bayesian techniques are used to estimate outcomes corresponding to a specific initialisation. This paper proposes a similar model for counterfactual inference on time series data, wherein a variable value is intervened on at a particular time step and the effect of this intervention is observed in the subsequent time steps.

Graphical model proposed in the paper

Deep Learning based Generative Modelling. This paper introduces a method based on Variational Autoencoders (VAE) which follows the causal structure of inference with proxies. A specific causal graph is considered as shown in the figure below.

Overall architecture of the model and inference networks for the Causal Effect Variational Autoencoder (CEVAE). White nodes correspond to parametrized deterministic neural network transitions, gray nodes correspond to drawing samples from the respective distribution and white circles correspond to switching paths according to the treatment t.

Deep Learning based Deterministic Approaches. In this technique, neural prediction networks are used such that, given an input and the treatment the unit has been subjected to, a set of prediction values can be obtained. This paper proposes TARNet, which follows a similar approach: a set of shared layers learns a representation common to all treatments, followed by an individual prediction branch for each treatment the unit might be subjected to.

TARNet proposed in the paper.
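The shared-plus-branched structure can be sketched as a forward pass in plain NumPy. This is a minimal illustration of the architecture's shape, with invented layer sizes and randomly initialised weights — not the paper's exact configuration or training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

def tarnet_forward(x, t, params):
    """Forward pass of a TARNet-style model, sketched in NumPy.

    A shared layer builds a representation phi(x) common to all
    treatments; each treatment then has its own outcome head.
    """
    phi = relu(x @ params["W_shared"])            # shared representation
    y0 = relu(phi @ params["W0"]) @ params["v0"]  # head for treatment t=0
    y1 = relu(phi @ params["W1"]) @ params["v1"]  # head for treatment t=1
    # Only the factual branch is compared to observed outcomes during
    # training; the other head supplies the counterfactual prediction.
    y_factual = np.where(t == 1, y1, y0)
    return y_factual, y0, y1

x_dim, h_dim, n = 10, 16, 8                       # illustrative sizes
params = {
    "W_shared": rng.normal(0, 0.1, (x_dim, h_dim)),
    "W0": rng.normal(0, 0.1, (h_dim, h_dim)), "v0": rng.normal(0, 0.1, h_dim),
    "W1": rng.normal(0, 0.1, (h_dim, h_dim)), "v1": rng.normal(0, 0.1, h_dim),
}
x = rng.normal(size=(n, x_dim))
t = rng.integers(0, 2, n)
y_f, y0, y1 = tarnet_forward(x, t, params)
print(y_f.shape)  # (8,)
```

In a real implementation the heads and shared layers would be deeper networks trained end-to-end with a factual-outcome loss, but the branching idea is the same.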

These are very basic approaches targeted at performing counterfactual inference. Other complications, such as disjoint distributions of the source and target samples and the non-availability of sufficient training data for certain treatments, among others, are a topic for another blog.

Evaluation Metrics for Causal Inferences

In order to train the models discussed above, several evaluation metrics are used.

Individual Treatment Effect (ITE). Individual treatment effect estimation aims to examine whether a treatment T affects the outcome of a specific unit i. Y₀(i) denotes the potential outcome of unit i under control (T(i) = 0) and Y₁(i) its potential outcome under treatment (T(i) = 1). The individual treatment effect on unit i is defined as the difference between the potential treated and control outcomes :

ITE(i) = Y₁(i) − Y₀(i)

The challenge in estimating the ITE lies in estimating the missing counterfactual outcome.

Average Treatment Effect (ATE). ATE is the average of all values of unit level causal effects in a population. The average outcome when all units are affected by the policy is called average outcome under the policy and the average outcome when none of the units are affected by the policy is called average outcome without the policy. The average treatment effect is the difference between the average outcome under the policy and the average outcome without the policy.

The average treatment effect (ATE) ψ is given by

ψ = E[Y₁] − E[Y₀]

Precision in Estimation of Heterogeneous Effect (PEHE). In the binary setting, PEHE measures the ability of a predictive model to estimate the difference in effect between two treatments t₀ and t₁ for samples X. To compute the PEHE, we measure the mean squared error between the true difference in effect y₁(n) − y₀(n), drawn from the noiseless underlying outcome distributions µ₁ and µ₀, and the predicted difference in effect ŷ₁(n) − ŷ₀(n), indexed by n over N samples :

PEHE = (1/N) Σₙ ((y₁(n) − y₀(n)) − (ŷ₁(n) − ŷ₀(n)))²  (equation from https://arxiv.org/pdf/1810.00656.pdf)

When the underlying noiseless distributions µ are not known, the true difference in effect y₁(n) − y₀(n) can be estimated using the noisy ground truth outcomes y.
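Given arrays of true and predicted potential outcomes (available in simulated benchmarks, where both outcomes are known), the ATE and PEHE definitions above translate directly into code. The toy numbers here are invented for illustration:

```python
import numpy as np

def ate(y1_pred, y0_pred):
    """Estimated average treatment effect: mean of predicted unit-level effects."""
    return np.mean(y1_pred - y0_pred)

def pehe(y1_true, y0_true, y1_pred, y0_pred):
    """Precision in Estimation of Heterogeneous Effect: mean squared error
    between true and predicted individual treatment effects. Papers often
    report the square root; the raw MSE is shown here."""
    true_effect = y1_true - y0_true
    pred_effect = y1_pred - y0_pred
    return np.mean((true_effect - pred_effect) ** 2)

# Toy example with known (simulated) potential outcomes for 3 units.
y0_true = np.array([1.0, 2.0, 3.0])
y1_true = np.array([2.0, 4.0, 3.5])
y0_pred = np.array([1.1, 2.2, 2.9])
y1_pred = np.array([2.1, 3.6, 3.6])

print("ATE: ", ate(y1_pred, y0_pred))
print("PEHE:", pehe(y1_true, y0_true, y1_pred, y0_pred))
```

Note how a model can have a small ATE error while still scoring poorly on PEHE: averaging hides per-unit mistakes, which is exactly the heterogeneity PEHE is designed to expose.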

These three metrics have primarily been used for training and evaluating models performing counterfactual inference. Several other metrics, aimed at maintaining balanced distributions in the latent space and at nearest-neighbour-matched counterfactuals, exist, but those are a topic for another blog.

Conclusion

This blog has briefly touched upon the basics of causal inference and counterfactual studies in the domain. As mentioned previously, there is a lot more to explore here; more advanced topics will be covered in future blogs. I hope this blog serves as a worthy introduction to the domain of causal inference.

Thank you for your time in reading this blog. Kindly contact me if you have any queries about the topic.
