Crash Course in Causality — A Simplified Guide to Causal Inference

Cibaca Khandelwal · Published in AI Skunks · Apr 28, 2023 · 13 min read

This article explains the concept of causality and the terminology around causal effects, and then takes a deep dive into the area of causal inference.

Image by rawpixel.com

The image above shows one of the most common examples we come across in day-to-day life: global warming results from pollution produced by vehicles and industries, along with changes in the weather.

Here, pollution and weather are the causes, while global warming is the effect.

Table of contents

  1. Introduction
  2. Terminology
  3. Basics of Causality
  4. Potential Outcomes and Counterfactuals
  5. Causal Assumptions
  6. Methods of Causal Inference
  7. Common Pitfalls to Avoid
  8. Evaluation Metrics for Causal Inference
  9. Conclusion
  10. References

Introduction

A cause is something that produces or occasions an effect. Causality describes ideas about the nature of the relations of cause and effect. Causal inference is the thought process that tests whether a relationship of cause to effect exists.

Causal inference is the process of determining the causal relationship between two variables. It is the study of how one variable affects another variable. For example, in a medical study, causal inference would be used to determine whether a particular treatment caused a particular outcome.

Causal inference is an essential tool in many fields, including economics, political science, and epidemiology. In this article, we will provide a crash course in causal inference, covering the basics of causality, methods of causal inference, and some common pitfalls to avoid.

Terminology

  1. Causal effect: The difference in outcome between a group that receives a treatment and a group that does not receive the treatment.
  2. Treatment: The intervention that is being studied or evaluated.
  3. Outcome: The variable that represents the outcome of interest in the study.
  4. Confounding: The presence of a third variable that is associated with both the treatment and the outcome, making it difficult to determine the true causal effect of the treatment.
  5. Counterfactual: The hypothetical outcome that would have occurred if a subject had received a different treatment.
  6. Randomization: The process of randomly assigning subjects to treatment and control groups to minimize the impact of confounding variables.
  7. Propensity score: A measure of the likelihood of receiving a treatment based on a set of observed confounding variables (a short sketch of estimating one with logistic regression appears just after this list).
  8. Standardization: A statistical technique that adjusts for confounding variables by comparing treatment and control groups with similar distributions of the confounding variables.
  9. Stratification: A statistical technique that divides the study population into subgroups based on levels of a confounding variable to compare the treatment and control groups within each subgroup.
  10. Mediation: A causal mechanism by which the treatment affects the outcome through an intermediate variable or set of variables.
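
As a quick illustration of the propensity score mentioned above, here is a minimal sketch that estimates one with a logistic regression from scikit-learn. The column names in the usage example (treated, age, sex, income) are placeholders, not columns from any dataset used in this article.

import pandas as pd
from sklearn.linear_model import LogisticRegression

def add_propensity_score(df: pd.DataFrame, treatment: str, confounders: list[str]) -> pd.DataFrame:
    """Append the estimated probability of treatment given the observed confounders."""
    model = LogisticRegression(max_iter=1000)
    model.fit(df[confounders], df[treatment])
    out = df.copy()
    out["propensity_score"] = model.predict_proba(df[confounders])[:, 1]
    return out

# Example usage (column names are placeholders for your own data):
# df = add_propensity_score(df, treatment="treated", confounders=["age", "sex", "income"])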

The Basics of Causality

Before we dive into the methods of causal inference, it’s essential to understand the basics of causality. Causality is the relationship between two events in which one event brings about the other. There are three essential criteria for causality:

1. Temporal Order

The first criterion, temporal order, is relatively straightforward. The cause must come before the effect. For example, if we are studying the effect of smoking on lung cancer, smoking must occur before lung cancer to be considered a cause.

2. Covariation

This principle states that there must be a relationship between the cause and effect. This means that as the cause changes, the effect should change as well. This principle is important because it helps to establish a connection between the cause and effect, and to rule out the possibility that the relationship between the two variables is simply due to chance.

3. Non-spuriousness

The third criterion, non-spuriousness, is perhaps the most challenging to understand. Non-spuriousness means that the association between the cause and effect cannot be explained by a third variable. For example, if we find an association between smoking and lung cancer, we must consider the possibility that a third variable, such as age, could be driving the association. Perhaps older people are more likely to smoke and more likely to develop lung cancer. In this case, age would be a spurious variable, meaning that it explains the association between smoking and lung cancer. To determine causality, we must control for spurious variables.
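
To see how controlling for a spurious variable works in practice, here is a minimal sketch on simulated, made-up data in which age drives both smoking and cancer while smoking itself has no effect. The raw comparison suggests an association, but it largely disappears once we compare within age bands.

import numpy as np
import pandas as pd

# Hypothetical simulated data: age drives both smoking and cancer risk, while
# smoking itself has no effect on cancer in this toy world.
rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(20, 80, size=n)
smokes = rng.random(n) < age / 100          # older people smoke more often
cancer = rng.random(n) < age / 200          # risk depends on age only, not on smoking

df = pd.DataFrame({"age": age, "smokes": smokes, "cancer": cancer})

# Crude comparison: smokers appear to have a higher cancer rate
print(df.groupby("smokes")["cancer"].mean())

# Controlling for the confounder: within age bands the gap largely disappears
df["age_band"] = pd.cut(df["age"], bins=[19, 40, 60, 80])
print(df.groupby(["age_band", "smokes"], observed=True)["cancer"].mean().unstack())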

Potential Outcomes and Counterfactuals

Potential outcomes refer to the different outcomes or results that could occur under different conditions or treatments in a research study or experiment.

  • Specifically, potential outcomes are the hypothetical values that an outcome variable (such as a health outcome or a test score) could take for an individual or a group of individuals under different conditions.
  • For example, in a medical study comparing the effectiveness of two treatments for a certain condition, potential outcomes would refer to the possible health outcomes (e.g., symptom improvement or worsening, side effects) that could occur under each treatment for each individual in the study.
  • The concept of potential outcomes is central to the framework of causal inference in statistics and research methodology, as it helps researchers identify the causal effects of different interventions or treatments on outcomes of interest.
  • Here are some examples of potential outcomes:
  1. In a study investigating the effect of a new drug on blood pressure, the potential outcomes for each participant could be the blood pressure level they would have if they received the drug (treatment) and the blood pressure level they would have if they did not receive the drug (control).
  2. In a study examining the effect of a parenting intervention on child behavior, the potential outcomes for each child could be the behavior they would exhibit if their parents received the intervention (treatment) and the behavior they would exhibit if their parents did not receive the intervention (control).
  3. In a study investigating the impact of a new teaching method on student test scores, the potential outcomes for each student could be the test score they would achieve if they were taught using the new method (treatment) and the test score they would achieve if they were taught using the traditional method (control).

In each of these examples, potential outcomes refer to the different outcomes that could occur under different conditions or treatments.

Counterfactuals refer to hypothetical scenarios or situations that could have occurred, but did not actually happen.

  • In the context of research methodology and causal inference, counterfactuals are used to assess the causal effect of an intervention or treatment by comparing what actually happened to what could have happened under different conditions.
  • For example, imagine a study investigating the effect of a new medication on patient outcomes. The counterfactual scenario for a patient who received the medication would be what would have happened to the patient if they had not received the medication. The comparison of the actual outcome to the counterfactual outcome allows researchers to estimate the causal effect of the medication on the patient’s outcome.
  • Counterfactuals are important in determining causality because they provide a way to compare what happened to what could have happened. By considering counterfactual scenarios, researchers can make inferences about the causal effect of an intervention or treatment, even if that treatment was not randomly assigned or if the outcome cannot be directly observed under both treatment and control conditions.

Example of potential outcomes and counterfactuals: A study is conducted to evaluate the effect of a new teaching method on student test scores. The potential outcomes for each student are the test score they would achieve if they were taught using the new method (treatment) and the test score they would achieve if they were taught using the traditional method (control). The counterfactual scenario for a student who received the new teaching method would be what would have happened to the student if they had been taught using the traditional method instead. By comparing the actual outcome to the counterfactual outcome, researchers can estimate the causal effect of the new teaching method on test scores.
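
The following minimal sketch uses simulated, hypothetical data to make the potential-outcomes idea concrete: every unit has two potential outcomes, but only the one matching its actual treatment is observed, and the other remains the counterfactual.

import numpy as np
import pandas as pd

# Hypothetical simulated data: Y0 is the score under the traditional method,
# Y1 the score under the new method. The true effect is roughly +5 points.
rng = np.random.default_rng(42)
n = 1_000
y0 = rng.normal(70, 10, n)
y1 = y0 + rng.normal(5, 2, n)

treated = rng.random(n) < 0.5            # random assignment to the new method
observed = np.where(treated, y1, y0)     # only one potential outcome is ever observed;
                                         # the unobserved one is the counterfactual

df = pd.DataFrame({"treated": treated, "observed_score": observed})

true_ate = (y1 - y0).mean()              # knowable only because this is a simulation
estimated_ate = (df.loc[df["treated"], "observed_score"].mean()
                 - df.loc[~df["treated"], "observed_score"].mean())
print(f"true ATE: {true_ate:.2f}, estimated ATE: {estimated_ate:.2f}")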

Causal Assumptions

Causal assumptions in causal effects models are assumptions made about the underlying causal relationships between variables in a study or experiment. These assumptions are necessary to estimate the causal effect of an intervention or treatment on an outcome of interest.

Causal assumptions typically include two key components: (1) the treatment assignment mechanism and (2) the ignorability assumption.

  1. The treatment assignment mechanism refers to the process by which participants are assigned to the treatment or control group. It is assumed that the assignment mechanism is independent of the potential outcomes, meaning that treatment assignment is not related to the outcomes themselves.
  2. The ignorability assumption, also known as the unconfoundedness assumption, states that the potential outcomes for each participant are independent of the treatment assignment, given a set of covariates or confounding variables. In other words, it is assumed that the treatment and control groups are comparable with respect to all relevant covariates, so that any differences in outcome can be attributed to the treatment itself.

Other causal assumptions may be made depending on the specific study design and research question, such as the stable unit treatment value assumption (SUTVA), which assumes that the treatment effect on an individual does not depend on the treatment status of other individuals, or the consistency assumption, which assumes that the potential outcomes are consistent with the treatment assignment for each individual.

Causal assumptions are important in causal effects models because they guide the selection of appropriate statistical methods and the interpretation of study results. Violations of these assumptions can lead to biased estimates of causal effects and incorrect conclusions about the effectiveness of interventions or treatments.
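
One practical way to probe the ignorability assumption is to check whether the treated and control groups look comparable on observed covariates. The minimal sketch below computes standardized mean differences; the column names in the commented usage are placeholders only.

import numpy as np
import pandas as pd

def standardized_mean_difference(df: pd.DataFrame, treatment: str, covariate: str) -> float:
    """Standardized mean difference of a covariate between treated and control units."""
    treated = df.loc[df[treatment] == 1, covariate]
    control = df.loc[df[treatment] == 0, covariate]
    pooled_sd = np.sqrt((treated.var() + control.var()) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

# Example usage (column names are placeholders for your own data):
# for cov in ["age", "income", "sex"]:
#     print(cov, round(standardized_mean_difference(df, "treated", cov), 3))
# Absolute values above roughly 0.1 are commonly read as meaningful imbalance.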

Methods of Causal Inference

There are several methods of causal inference, each with its strengths and weaknesses. We will discuss three methods of causal inference: randomized controlled trials, natural experiments, and regression analysis.

1. Randomized Controlled Trials

Randomized controlled trials (RCTs) are considered the gold standard for causal inference. In an RCT, participants are randomly assigned to a treatment or control group. The treatment group receives the intervention, and the control group does not. The two groups are then compared to determine the effect of the intervention.

RCTs are powerful because they control for spurious variables. Because participants are randomly assigned to treatment or control groups, any differences between the groups can be attributed to the intervention. For example, in a medical study, RCTs can control for age, gender, and other demographic factors that might affect the outcome.

However, RCTs can be expensive and time-consuming, and they may not be feasible in all situations. For example, it would not be ethical to conduct an RCT to determine the effect of smoking on lung cancer because it would require intentionally exposing participants to a harmful substance.
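
Here is a minimal sketch of an RCT on simulated, hypothetical data: because assignment is a coin flip, the simple difference in means recovers the true effect (set to -8 in the simulation) up to sampling noise.

import numpy as np

# Hypothetical simulated trial: the true treatment effect is -8 (e.g. a drop
# in blood pressure), and assignment is a pure coin flip.
rng = np.random.default_rng(7)
n = 2_000
treated = rng.random(n) < 0.5
baseline = rng.normal(120, 15, n)
outcome = baseline - 8 * treated + rng.normal(0, 5, n)

# Because assignment is random, the difference in means is an unbiased estimate
diff_in_means = outcome[treated].mean() - outcome[~treated].mean()
se = np.sqrt(outcome[treated].var(ddof=1) / treated.sum()
             + outcome[~treated].var(ddof=1) / (~treated).sum())
print(f"estimated effect: {diff_in_means:.2f} (95% CI roughly +/- {1.96 * se:.2f})")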

2. Natural Experiments

Natural experiments are situations in which the treatment and control groups are not assigned randomly but are created by natural circumstances. For example, if a new law is passed in one state but not another, the two states can be compared to determine the effect of the law.

Natural experiments can be useful when RCTs are not feasible, but they have limitations. Unlike RCTs, natural experiments cannot control for spurious variables, making it more challenging to determine causality. Researchers must carefully select natural experiments to ensure that the treatment and control groups are comparable and that any differences between the groups can be attributed to the intervention.
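
One common way to analyse such a natural experiment, not covered in the list above, is a difference-in-differences comparison: the change in the state that passed the law minus the change in the comparison state over the same period. A minimal sketch with made-up numbers:

# Made-up average outcomes before and after the law in each state
outcomes = {
    "State A (law passed)": {"before": 54.0, "after": 48.0},
    "State B (no law)":     {"before": 52.0, "after": 51.0},
}

change_a = outcomes["State A (law passed)"]["after"] - outcomes["State A (law passed)"]["before"]
change_b = outcomes["State B (no law)"]["after"] - outcomes["State B (no law)"]["before"]

# The effect attributed to the law, assuming both states would otherwise
# have moved in parallel (the key, untestable assumption of this design)
did_estimate = change_a - change_b
print(f"difference-in-differences estimate: {did_estimate}")   # -5.0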

3. Regression Analysis

Regression analysis is a statistical method used to determine the relationship between two or more variables. In causal inference, regression analysis can be used to control for spurious variables. For example, if we are studying the effect of education on income, we might use regression analysis to control for factors such as age, gender, and occupation.

Regression analysis is a powerful tool for causal inference, but it has limitations. It can only control for variables that are included in the analysis. If important spurious variables are left out of the analysis, the results can be misleading.
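
Here is a minimal sketch of regression adjustment on simulated, hypothetical data using statsmodels: leaving the confounder (age) out of the model inflates the education coefficient, while including it recovers the true value used in the simulation.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical simulated data: age drives both education and income, and the
# true effect of one extra year of education on income is 2,000.
rng = np.random.default_rng(1)
n = 5_000
age = rng.uniform(25, 65, n)
education = 10 + 0.1 * age + rng.normal(0, 2, n)
income = 20_000 + 2_000 * education + 500 * age + rng.normal(0, 5_000, n)

df = pd.DataFrame({"income": income, "education": education, "age": age})

naive = smf.ols("income ~ education", data=df).fit()            # omits the confounder
adjusted = smf.ols("income ~ education + age", data=df).fit()   # controls for age
print(f"naive education coefficient:    {naive.params['education']:.0f}")
print(f"adjusted education coefficient: {adjusted.params['education']:.0f}")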

Common Pitfalls to Avoid

Causal inference is a complex process, and there are several common pitfalls to avoid. Here are three common pitfalls to keep in mind:

1. Confounding Variables

Confounding variables are variables that affect both the cause and the effect. For example, in a study of the effect of exercise on weight loss, age could be a confounding variable. Older people may be less likely to exercise and more likely to gain weight, making it difficult to determine the effect of exercise.

To avoid confounding variables, researchers must carefully control for all variables that could affect the outcome.

2. Selection Bias

Selection bias occurs when participants in a study are not representative of the population being studied. For example, if a study of the effect of a new drug is conducted only on healthy adults, the results may not apply to people with underlying health conditions.

To avoid selection bias, researchers must carefully select participants to ensure that they are representative of the population being studied.

3. Reverse Causality

Reverse causality occurs when the cause and effect are reversed. For example, if a study finds an association between depression and heart disease, it is possible that heart disease is causing depression, not the other way around.

To avoid reverse causality, researchers must carefully consider the temporal order of events and control for all possible confounding variables.

Evaluation Metrics for Causal Inference

There are several evaluation metrics for causal inference, depending on the specific method used. Here are some commonly used metrics:

  1. Average Treatment Effect (ATE): This is the most basic and commonly used evaluation metric in causal inference. ATE is defined as the difference in the expected outcome between the treatment and control groups. It represents the average causal effect of the treatment on the outcome variable. ATE can be estimated using various methods, such as regression models, propensity score matching, and inverse probability weighting.
  2. Treatment Effect Heterogeneity (TEH): TEH is a measure of how the causal effect of the treatment varies across different subgroups of the population. It can be estimated using methods such as stratification, interaction models, and subgroup analysis.
  3. Counterfactual Evaluation Metrics: These metrics compare the observed outcome to the counterfactual outcome that would have occurred if the treatment had not been given. Common counterfactual evaluation metrics include the Average Treatment Effect on the Treated (ATT), which measures the causal effect of the treatment on the treated individuals, and the Average Treatment Effect on the Control (ATC), which measures the causal effect of the treatment on the control group.
  4. Causal Inference Performance Metrics: These metrics evaluate the performance of the causal inference method itself, rather than the accuracy of the causal effect estimate. Common performance metrics include balance diagnostics, which assess whether the treatment and control groups are balanced on confounding variables, and sensitivity analysis, which assesses the robustness of the causal effect estimate to various assumptions and model specifications.

Overall, the choice of evaluation metric depends on the specific research question and the causal inference method used. It is important to carefully select the appropriate metric and interpret the results in the context of the research question and the assumptions underlying the causal inference method.

Example of Causal Inference on the Titanic Dataset

  1. Install the causalinference library
! pip install causalinference

2. Import libraries, load and clean the dataset

import pandas as pd
import numpy as np
import causalinference as ci

# Load the "Titanic" dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
data = pd.read_csv(url)

# Remove columns that are not relevant for causal inference
data = data[[ 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived', 'Pclass']]

# Convert categorical variables to binary indicators
data = pd.get_dummies(data, columns=['Pclass'])

# Encode Sex as a binary indicator: 1 for female, 0 for male
data['Sex'] = data['Sex'].apply(lambda x: 1 if x == 'female' else 0)
# Remove missing values
data = data.dropna()

# Rename columns
df = data.rename({'Pclass_3': 'treatment_Pclass_3', 'Survived': 'outcome_Survived'}, axis=1)

df = df[['Age', 'SibSp', 'Parch', 'Fare', 'Sex','Pclass_1', 'Pclass_2', 'treatment_Pclass_3', 'outcome_Survived']]

3. Let's see what the data looks like now

df.head()

4. Divide the data into treatment and control groups

TREATMENT = 'treatment_Pclass_3'
OUTCOME = 'outcome_Survived'
df.groupby(TREATMENT)[OUTCOME].describe()

5. Run the causal model

# Run the causal model using the treatment and outcome columns defined above
causal = ci.CausalModel(
    Y=df[OUTCOME].values,
    D=df[TREATMENT].values,
    X=df[['Age', 'SibSp', 'Parch', 'Fare', 'Sex', 'Pclass_1', 'Pclass_2']].values,
)
# Print summary statistics
print(causal.summary_stats)

Here we can see that there are 359 observations in the control group and 355 in the treated group.

6. Calculate the propensity score

# Automated propensity score estimation
causal.est_propensity_s()
# Propensity model results
print(causal.propensity)
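
7. Estimate the treatment effects

This step is not part of the original walk-through. The same fitted CausalModel object can also report the ATE, ATT and ATC discussed in the evaluation-metrics section; est_via_ols and est_via_matching are estimators provided by the causalinference package. A minimal sketch:

# Estimate treatment effects via regression adjustment and matching
causal.est_via_ols()                      # regression adjustment
causal.est_via_matching(bias_adj=True)    # nearest-neighbour matching with bias adjustment
print(causal.estimates)                   # reports ATE, ATT and ATC for each estimator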

Conclusion

Causal inference is a powerful tool for understanding the relationships between variables, and it is essential for making informed decisions in many fields, including medicine, economics, and political science. By understanding the basics of causality, the methods of causal inference, and the common pitfalls to avoid, researchers can draw accurate causal conclusions. While there is no one-size-fits-all approach to causal inference, understanding the strengths and limitations of different methods helps researchers choose the best approach for their specific research question.

References

  1. ChatGPT
  2. Morgan, S. L., & Winship, C. (2015). Counterfactuals and Causal Inference: Methods and Principles for Social Research (2nd ed.). Cambridge University Press.
  3. “Causal Inference for Observational Studies” by MIT OpenCourseWare (https://www.youtube.com/watch?v=zS1jBwfoYuk)
  4. “Causal Inference 101” by Andrew Gelman (https://www.youtube.com/watch?v=3R3PL5qraGw)
  5. “Causal Inference in Statistics” by Jamie Robins (https://www.youtube.com/watch?v=7bRJ8tWFO7U)
