Causal Inference

Bowen Jin
Published in AI Skunks · Apr 28, 2023

Authors

Peichen Han, Bowen Jin

What is Causal Inference

Causal inference, a powerful tool in statistics and data science, is the process of determining the actual effect of a particular phenomenon within a larger system. It reveals the causal relationships between variables and uncovers the underlying reasons behind an observed phenomenon. Causal inference is widely used in fields such as biomedicine, economics, and the social sciences, where it can determine the impact of smoking on cancer, detect gender discrimination in recruitment, or estimate the quantitative causal effect of an event. It supports better-informed decisions and interventions, for example by exploring how education level affects future income, or by analyzing how an outcome variable responds when its causes change.

Why Causal Inference

Causation and Correlation

Correlation is a very common concept in statistics and data science, and much of everyday data analysis is built on it. However, correlation has its limits. The graph below shows the relationship between margarine consumption and the divorce rate in Maine.

The graph shows a high correlation between these two variables, yet a causal relationship between margarine consumption and divorce rates is clearly implausible. Correlation does not necessarily mean causation. Correlation is symmetric, while causation is asymmetric; correlation can exist without causation, although causation usually produces correlation at the statistical level. In practice, causal relationships are easier to interpret and act on than mere correlations, so as data science advances it is natural to consider causality and related principles.
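A small simulation makes the point concrete: when two variables share a common cause, they correlate even though neither causes the other (a minimal sketch; the variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)        # common cause
x = z + rng.normal(size=10_000)    # z -> x
y = z + rng.normal(size=10_000)    # z -> y
print(np.corrcoef(x, y)[0, 1])     # ~0.5, yet x does not cause y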

Simpson's Paradox

The table below shows the success rates, and the numbers of successful cases out of totals, for two treatments for kidney stones (the classic figures from Charig et al., 1986, on which this example is based):

                Small stones      Large stones      Overall
Treatment A     93% (81/87)       73% (192/263)     78% (273/350)
Treatment B     87% (234/270)     69% (55/80)       83% (289/350)

Based on this information, which treatment would you select? If you look at large stones or small stones separately, Treatment A is the better choice; but if you aggregate over both groups, Treatment B appears more effective. Although this may seem like an insurmountable paradox, doctors rarely face this issue in practice, because they know the causal structure underlying the data. In reality, the causal relationships among treatment, stone size, and failure rate may look like the following:

The causal graph above can be interpreted as follows. Patients with larger stones are inherently more likely to experience treatment failure. In addition, doctors tend to choose Treatment B for small stones and Treatment A for large stones, so Treatment A is more frequently associated with failure. Stone size is therefore a common cause of both the choice of treatment and the outcome. To avoid the confusion created by such a confounder, we should compare treatments among stones of the same size. Doing so reveals which treatment is actually superior: the one with the higher success rate on both large and small stones. Reasoning about causality in this way lets us address practical problems correctly and improve outcomes.
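The arithmetic behind the paradox is easy to reproduce (a minimal pandas sketch using the classic figures above):

import pandas as pd

stones = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "size":      ["small", "large", "small", "large"],
    "success":   [81, 192, 234, 55],
    "total":     [87, 263, 270, 80],
})
# within each stone size, Treatment A has the higher success rate
print(stones.assign(rate=stones.success / stones.total))
# aggregated over sizes, Treatment B appears better (0.78 vs 0.83)
agg = stones.groupby("treatment")[["success", "total"]].sum()
print(agg["success"] / agg["total"])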

Examples

DoWhy

DoWhy is a Python library released by Microsoft for causal inference, built on both graphical models and the potential-outcomes framework. It provides a principled way to express a given problem as a causal graph, offers a unified interface to many commonly used causal inference methods, and thereby combines the two main causal inference frameworks. It can also automatically check the validity of assumptions and the robustness of estimates. DoWhy is available on PyPI and can be installed with pip install dowhy.

Simulate Dataset

For illustration, we first use the dowhy library to simulate a dataset with a known causal structure.

Import the libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import dowhy
import dowhy.datasets
from dowhy import CausalModel

Generate the simulated data; in this case the relationships between the variables are linear. In the generated dataset, v0 is the treatment, y is the outcome, W0-W4 are common causes (confounders), Z0 and Z1 are instruments, and X0 is an effect modifier, as the dot graph printed below confirms.

data = dowhy.datasets.linear_dataset(
    beta=10,                      # true causal effect
    num_common_causes=5,          # confounders
    num_instruments=2,            # instrumental variables
    num_effect_modifiers=1,       # effect modifiers
    num_samples=10000,            # number of samples
    treatment_is_binary=True,
    num_discrete_common_causes=1)
df = data["df"]
print(data["dot_graph"])
digraph {v0->y;W0-> v0; W1-> v0; W2-> v0; W3-> v0; W4-> v0;Z0-> v0; Z1-> v0;W0-> y; W1-> y; W2-> y; W3-> y; W4-> y;X0-> y;}
print(df.head())
         X0   Z0        Z1        W0        W1        W2        W3  W4    v0          y
0  1.047721  1.0  0.946104 -0.956627 -0.584224  0.660085  0.608764   3  True  23.689074
1  2.538196  1.0  0.317835 -1.204281 -0.826717  1.662989  1.070910   0  True  24.733789
2 -0.374068  0.0  0.672964 -0.850264 -0.255517  1.665433 -0.699174   2  True  17.815255
3 -1.150634  1.0  0.375710 -0.909048 -0.437055 -0.127924  1.596841   3  True  16.351174
4  1.368289  1.0  0.990431 -0.566798 -0.166837  1.614959  1.530499   3  True  36.342800

Model

DoWhy creates a causal graphical model for each question to keep causal assumptions explicit. The causal graph does not need to be complete: you can provide a partial graph representing prior knowledge about certain variables, and DoWhy automatically treats the remaining variables as potential confounders. Currently, DoWhy supports causal assumptions in two forms:
1. Graph: provide a causal graph in GML or DOT format, as a file or a string.
2. Named variable sets: directly specify the role of each variable, including confounders, instrumental variables, effect modifiers, front-door variables, etc.
We will use option 1 in this article; a sketch of option 2 appears after the model-building code below.

model = CausalModel(
    data=df,
    treatment=data["treatment_name"],
    outcome=data["outcome_name"],
    graph=data["gml_graph"])
model.view_model()
plt.show()
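For contrast, the same model specified with named variable sets (option 2) would look roughly like this (a sketch; it is not used in the rest of the article):

model_alt = CausalModel(
    data=df,
    treatment=data["treatment_name"],
    outcome=data["outcome_name"],
    common_causes=["W0", "W1", "W2", "W3", "W4"],  # confounders by name
    instruments=["Z0", "Z1"])                      # instruments by name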

Identify

Based on the constructed causal graph, DoWhy identifies causal effects in all possible ways. Specifically, it uses graph-based criteria and do-calculus to find expressions that identify the causal effect. The supported identification criteria are:
1. Back-door criterion
2. Front-door criterion
3. Instrumental variables
4. Mediation (direct and indirect effect identification)
In this article, we will focus on the back-door criterion.

identified_estimand = model.identify_effect()
print(identified_estimand)
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W2,W0,W3,W4,W1])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W2,W0,W3,W4,W1,U) = P(y|v0,W2,W0,W3,W4,W1)

### Estimand : 2
Estimand name: iv
Estimand expression:
 ⎡                            -1⎤
 ⎢    d        ⎛    d          ⎞⎥
E⎢─────────(y)⋅⎜─────────([v₀])⎟⎥
 ⎣d[Z₁ Z₀]     ⎝d[Z₁ Z₀]       ⎠⎦
Estimand assumption 1, As-if-random: If U→→y then ¬(U →→{Z1,Z0})
Estimand assumption 2, Exclusion: If we remove {Z1,Z0}→{v0}, then ¬({Z1,Z0}→y)

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

Back-door Criterion

Given an ordered pair of variables (X, Y) in a directed acyclic graph (DAG), a set of variables Z (possibly empty) satisfies the back-door criterion relative to (X, Y) if:

1. No node in Z is a descendant of X.
2. Z blocks every path between X and Y that contains an arrow pointing into X.

In the kidney-stone example above, what we want is the causal effect of treatment on the failure rate. Here, stone size satisfies the back-door criterion, so we can obtain the correct causal effect by blocking the non-causal association treatment <--- size ---> failure rate.
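In do-calculus notation, adjusting for a set Z that satisfies the back-door criterion gives the standard adjustment formula P(y | do(x)) = Σ_z P(y | x, z) · P(z). In the kidney-stone example this is exactly the stratified comparison above: compare the treatments within each stone size, then average over the distribution of sizes.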

Estimate

DoWhy supports estimation methods based on all of the identification criteria described above, and additionally provides nonparametric confidence intervals and permutation tests to assess the statistical significance of the resulting estimates.
In this article, we will mainly use a method based on the back-door criterion: propensity score stratification.
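For intuition, here is a minimal sketch of what propensity-score stratification does (an illustration, not DoWhy's internal implementation; the helper function and the use of scikit-learn are our own assumptions):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def ps_stratification_ate(df, treatment, outcome, confounders, n_strata=10):
    # 1) model the propensity P(treatment = 1 | confounders)
    X, t = df[confounders], df[treatment].astype(int)
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    # 2) bin units into strata of similar propensity
    strata = pd.qcut(ps, q=n_strata, labels=False, duplicates="drop")
    # 3) average within-stratum outcome differences, weighted by stratum size
    effect, n = 0.0, 0
    for s in np.unique(strata):
        mask = strata == s
        treated = df.loc[mask & (t == 1), outcome]
        control = df.loc[mask & (t == 0), outcome]
        if len(treated) and len(control):
            effect += (treated.mean() - control.mean()) * mask.sum()
            n += mask.sum()
    return effect / n

# e.g. ps_stratification_ate(df, "v0", "y", ["W0", "W1", "W2", "W3", "W4"])

DoWhy provides this method behind a single estimator name: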

estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_stratification")
print(estimate)
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
─────(E[y|W2,W0,W3,W4,W1])
d[v₀]
Estimand assumption 1, Unconfoundedness: If U→{v0} and U→y then P(y|v0,W2,W0,W3,W4,W1,U) = P(y|v0,W2,W0,W3,W4,W1)

## Realized estimand
b: y~v0+W2+W0+W3+W4+W1
Target units: ate

## Estimate
Mean value: 15.994236372775601

Refute

DoWhy supports a variety of refutation methods to verify the robustness of an estimate:

1. Add random confounder: does the estimated causal effect change after adding an independent random variable as a confounder? (expected result: no)
2. Placebo intervention: does the estimated effect change after replacing the true treatment variable with an independent random variable? (expected result: the effect goes to zero)
3. Dummy outcome: does the estimated effect change after replacing the true outcome variable with an independent random variable? (expected result: the effect goes to zero)
4. Simulated outcome: does the estimate change when the dataset is replaced by one generated from a simulator that approximates the original data-generating process? (expected result: it matches the effect parameter of the data-generating process)
5. Add unobserved confounder: how sensitive is the estimate to an additional confounder correlated with both treatment and outcome? (expected result: not overly sensitive)
6. Data subset validation: does the estimate change when the dataset is replaced by a random subset? (expected result: no)
7. Bootstrap validation: does the estimate change when the dataset is replaced by a bootstrap sample of itself? (expected result: no)

In this article, we will use methods 1 and 6.

# add random confounder
res_random = model.refute_estimate(
    identified_estimand, estimate, method_name="random_common_cause")
print(res_random)
Refute: Add a random common cause
Estimated effect:15.994236372775601
New effect:15.994236372775594
p value:2.0
# data subset validation
res_subset = model.refute_estimate(
    identified_estimand, estimate,
    method_name="data_subset_refuter", subset_fraction=0.9)
print(res_subset)
Refute: Use a subset of data
Estimated effect:15.994236372775601
New effect:16.013462963117345
p value:0.92
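In both refutations the new effect is essentially unchanged from the original estimate, so the estimate passes these robustness checks.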

Another Example With Real Data

Dataset

The dataset we used describes graduation rates at a four-year college.
Import the dataset first.

graduate = pd.read_csv('../content/sample_data/graduation_rate.csv')
graduate

The column names are long, so we rename them:

graduate.rename(columns={
    'ACT composite score': 'act',
    'SAT total score': 'sat',
    'parental level of education': 'pa_edu',
    'parental income': 'pa_in',
    'high school gpa': 'h_gpa',
    'college gpa': 'c_gpa',
    'years to graduate': 'years'}, inplace=True)
graduate

Our objective with this dataset is to estimate the causal effect of high school GPA on college GPA. We ignore years to graduate, as it is not relevant to this question. Our hypothesis is that students whose parents have higher education or income levels tend to have access to better resources, resulting in higher GPAs. We also include a node, U, to account for any other unobserved confounders that may affect the results.

causal_graph = """digraph {
act;
sat;
h_gpa;
c_gpa;
pa_edu;
pa_in;
U[label="Unobserved Confounders"];
act -> h_gpa;
sat -> h_gpa;
h_gpa -> c_gpa;
pa_edu -> h_gpa; pa_edu -> c_gpa;
pa_in -> h_gpa; pa_in -> c_gpa;
U -> h_gpa; U -> c_gpa;
}"""

Build the model based on the causal graph above. Note that in this graph act and sat affect c_gpa only through h_gpa, so they can serve as instrumental variables; since U is unobserved, no back-door adjustment set is available.

model = dowhy.CausalModel(
    data=graduate,
    graph=causal_graph.replace("\n", " "),
    treatment='h_gpa',
    outcome='c_gpa')
model.view_model()
plt.show()

Identify

identified_estimand = model.identify_effect()
print(identified_estimand)
### Estimand : 1
Estimand name: backdoor
No such variable(s) found!
### Estimand : 2
Estimand name: iv
Estimand expression:
 ⎡                                       -1⎤
 ⎢     d             ⎛     d              ⎞⎥
E⎢───────────(c_gpa)⋅⎜───────────([h_gpa])⎟⎥
 ⎣d[act sat]         ⎝d[act sat]          ⎠⎦
Estimand assumption 1, As-if-random: If U→→c_gpa then ¬(U →→{act,sat})
Estimand assumption 2, Exclusion: If we remove {act,sat}→{h_gpa}, then ¬({act,sat}→c_gpa)
### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

Estimate (Based on Instrumental Variables)

estimate = model.estimate_effect(
    identified_estimand,
    method_name="iv.instrumental_variable",
    test_significance=True)
print(estimate)

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: iv
Estimand expression:
 ⎡                                       -1⎤
 ⎢     d             ⎛     d              ⎞⎥
E⎢───────────(c_gpa)⋅⎜───────────([h_gpa])⎟⎥
 ⎣d[act sat]         ⎝d[act sat]          ⎠⎦
Estimand assumption 1, As-if-random: If U→→c_gpa then ¬(U →→{act,sat})
Estimand assumption 2, Exclusion: If we remove {act,sat}→{h_gpa}, then ¬({act,sat}→c_gpa)
## Realized estimand
Realized estimand: Wald Estimator
Realized estimand type: EstimandType.NONPARAMETRIC_ATE
Estimand expression:
 ⎡   d           ⎤  -1⎡   d           ⎤
E⎢────────(c_gpa)⎥⋅E  ⎢────────(h_gpa)⎥
 ⎣dact,sat       ⎦    ⎣dact,sat       ⎦
Estimand assumption 1, As-if-random: If U→→c_gpa then ¬(U →→{act,sat})
Estimand assumption 2, Exclusion: If we remove {act,sat}→{h_gpa}, then ¬({act,sat}→c_gpa)
Estimand assumption 3, treatment_effect_homogeneity: Each unit's treatment ['h_gpa'] is affected in the same way by common causes of ['h_gpa'] and c_gpa
Estimand assumption 4, outcome_effect_homogeneity: Each unit's outcome c_gpa is affected in the same way by common causes of ['h_gpa'] and c_gpa
Target units: ate

## Estimate
Mean value: 0.9084360711049295
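For intuition, with a single instrument the Wald estimator reduces to Cov(outcome, instrument) / Cov(treatment, instrument). A minimal numpy sketch of the idea (illustrative only, not DoWhy's estimator):

import numpy as np

def wald_iv(outcome, treatment, instrument):
    # effect = instrument-induced change in outcome divided by
    # instrument-induced change in treatment
    y, t, z = map(np.asarray, (outcome, treatment, instrument))
    return np.cov(y, z)[0, 1] / np.cov(t, z)[0, 1]

# e.g. wald_iv(graduate["c_gpa"], graduate["h_gpa"], graduate["sat"])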

Finally, we check the robustness of the estimate with a placebo refutation test.

ref = model.refute_estimate(
    identified_estimand, estimate,
    method_name="placebo_treatment_refuter", placebo_type="permute")
print(ref)
Refute: Use a Placebo Treatment
Estimated effect:0.9084360711049295
New effect:0.9058590125918934
p value:0.0

The two values are very close, and the results are consistent with common sense.

References:

[1] Graduation Rate: https://www.kaggle.com/datasets/rkiattisak/graduation-rate
[2] Simpson's Paradox: https://en.wikipedia.org/wiki/Simpson%27s_paradox
[3] Causality: https://zhuanlan.zhihu.com/p/269625734
[4] DoWhy: https://github.com/py-why/dowhy
[5] Causal Effects: https://docs.google.com/presentation/d/1yKx_z5aXk6tEQZsSbKI5LfRR4Sa6XDHYyMqg9Zgztts/edit#slide=id.p
