Causality

Not Correlation.

Arnab Chakraborty
AI Skunks
Apr 23, 2023


Causality refers to the relationship between two events or variables, where one event (the cause) directly brings about or influences the occurrence of another event (the effect). Establishing causality involves demonstrating that a change in the cause leads to a change in the effect, and that no other factors confound this relationship.

How’s it Different from Correlation?

Correlation is a statistical measure that describes the strength and direction of the association between two variables. A strong correlation indicates that the two variables are closely related, but it does not necessarily imply that one variable causes the other.

Let’s look at an example

Does consuming ice cream cause sunburn?

We observe that during the summer months, as ice cream sales increase, so does the incidence of sunburn. This suggests a positive correlation between ice cream sales and sunburn. However, we cannot assume that increased ice cream sales directly cause more sunburns.

In this case, a lurking variable, or a confounding factor, is likely responsible for the observed correlation.

The lurking variable here is the warm, sunny weather. As the temperature rises, people are more likely to spend time outdoors, increasing the risk of sunburn.

Simultaneously, the hot weather also drives up the demand for ice cream as people seek to cool down. The warm weather influences both ice cream sales and sunburns, but there is no direct causal link between the two.
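A quick simulation makes this concrete. Below is a minimal sketch (the variable names and numbers are made up for illustration) in which warm weather drives both ice cream sales and sunburns; the two outcomes come out strongly correlated even though neither causes the other.

import numpy as np

rng = np.random.default_rng(0)

# Daily temperature over one year (the lurking variable)
temperature = rng.normal(loc=25, scale=5, size=365)

# Temperature drives both outcomes; they never influence each other
ice_cream_sales = 20 * temperature + rng.normal(0, 50, size=365)
sunburns = 3 * temperature + rng.normal(0, 10, size=365)

# Strong positive correlation despite no causal link
print(np.corrcoef(ice_cream_sales, sunburns)[0, 1])  # roughly 0.7-0.8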

Another Spurious Correlation

Here’s a pretty absurd example: there is a strong correlation between the number of Nicolas Cage films released each year and the number of people who drown in swimming pools.

Causation, by contrast, means that the movement of one number actually results from the movement of the other, a.k.a. “cause and effect.”

Are you curious to explore strange correlations? You can do so on Google Trends or try Tyler Vigen’s page.

Causal Inference Techniques

Causal inference is crucial in various fields such as economics, public policy, and healthcare, as it helps establish the causal relationship between variables rather than just their correlation. In this section, we will introduce five main causal inference techniques, explain each method briefly, and discuss their assumptions.

1. Randomized Controlled Trials (RCTs)

Randomized Controlled Trials involve randomly assigning subjects to a treatment group or a control group. The treatment group receives the intervention of interest, while the control group does not. By randomizing the assignment, researchers can isolate the causal effect of the treatment on the outcome variable.

Assumptions:

  • The random assignment ensures that both treatment and control groups are, on average, identical in all aspects except for the treatment, eliminating any confounding factors.
  • Participants must adhere to their assigned group (treatment or control) to maintain the randomization.
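As a minimal sketch of why this works (with simulated data, not a real trial), random assignment lets a simple difference in group means recover the true treatment effect:

import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Randomly assign each subject to treatment (1) or control (0)
treatment = rng.integers(0, 2, size=n)

# Simulate an outcome with a true treatment effect of 2.0
outcome = 5.0 + 2.0 * treatment + rng.normal(0, 1, size=n)

# Because assignment is random, the difference in means is an
# unbiased estimate of the causal effect
effect = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(f"Estimated treatment effect: {effect:.2f}")  # close to 2.0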

2. Propensity Score Matching (PSM)

Propensity Score Matching is a technique used to estimate the causal effect of a treatment when random assignment is not possible. It involves matching treated and untreated subjects based on their propensity scores, which are the estimated probabilities of receiving treatment given their observed characteristics.

Assumptions:

  • Unconfoundedness: All confounding factors are observed, and there are no unmeasured confounders.
  • Common support: There is a sufficient overlap in the propensity scores between treated and untreated subjects, ensuring that each treated subject has a comparable untreated subject.

3. Instrumental Variables (IV)

Instrumental Variables is a technique used to estimate causal effects when there is an unmeasured confounding variable or an endogeneity problem. It involves using an external variable (instrument) that is correlated with the treatment variable but not correlated with the outcome variable, except through its effect on the treatment variable.

Assumptions:

  • Relevance: The instrument must be correlated with the treatment variable.
  • Exogeneity: The instrument must not be correlated with the unmeasured confounding variables or the error term in the outcome equation.
  • Exclusion restriction: The instrument affects the outcome variable only through its effect on the treatment variable.
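Here is a minimal two-stage least squares (2SLS) sketch on simulated data, run with statsmodels OLS for both stages; the instrument, confounder, and effect sizes are all illustrative assumptions:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000

instrument = rng.normal(size=n)   # affects the outcome only through the treatment
confounder = rng.normal(size=n)   # unobserved by the analyst
treatment = 0.8 * instrument + confounder + rng.normal(size=n)
outcome = 2.0 * treatment + 3.0 * confounder + rng.normal(size=n)  # true effect = 2.0

# Naive OLS is biased because the confounder is omitted
naive = sm.OLS(outcome, sm.add_constant(treatment)).fit()

# Stage 1: predict the treatment from the instrument
stage1 = sm.OLS(treatment, sm.add_constant(instrument)).fit()
treatment_hat = stage1.fittedvalues

# Stage 2: regress the outcome on the predicted treatment
# (standard errors from this manual two-step are not valid without correction)
stage2 = sm.OLS(outcome, sm.add_constant(treatment_hat)).fit()

print(f"Naive OLS estimate: {naive.params[1]:.2f}")   # biased upward
print(f"2SLS estimate:      {stage2.params[1]:.2f}")  # close to 2.0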

4. Regression Discontinuity Design (RDD)

Regression Discontinuity Design is a quasi-experimental design used to estimate the causal effect of a treatment when subjects are assigned to treatment or control groups based on a threshold value of an assignment variable. In RDD, the treatment effect is estimated by comparing outcomes just above and just below the threshold value.

Assumptions:

  • Continuity: The potential outcomes are continuous functions of the assignment variable around the threshold.
  • No manipulation: Subjects cannot manipulate the assignment variable to choose their treatment status.
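A minimal sharp-RDD sketch on simulated data: treatment switches on at a cutoff of the assignment variable, and a local linear regression in a narrow window around the cutoff estimates the jump in the outcome (the cutoff, bandwidth, and effect size are illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000

running = rng.uniform(-1, 1, size=n)   # assignment variable, cutoff at 0
treated = (running >= 0).astype(int)   # treatment determined by the threshold
outcome = 1.5 * treated + 2.0 * running + rng.normal(0, 1, size=n)  # true jump = 1.5

# Keep only observations close to the cutoff (simple fixed bandwidth)
bandwidth = 0.2
window = np.abs(running) <= bandwidth

# Local linear regression: treatment dummy plus the running variable
X = sm.add_constant(np.column_stack([treated[window], running[window]]))
rdd_fit = sm.OLS(outcome[window], X).fit()
print(f"Estimated jump at the cutoff: {rdd_fit.params[1]:.2f}")  # close to 1.5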

5. Fixed Effects Models and Difference-in-Differences (DiD)

Fixed Effects Models and Difference-in-Differences are panel data techniques used to estimate causal effects when there is unobserved heterogeneity between subjects or groups. Fixed Effects Models control for unobserved, time-invariant subject-specific factors, while DiD estimates the causal effect of a treatment by comparing the changes in outcomes before and after treatment between treated and untreated groups.

Assumptions:

  • Parallel trends (for DiD): In the absence of treatment, the treated and control groups would have followed parallel trends in the outcome variable over time.
  • Time-invariant unobserved heterogeneity (for Fixed Effects Models): Any unobserved confounding factors are constant over time and can be differenced out or controlled for by including subject-specific fixed effects.
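Below is a minimal difference-in-differences sketch on simulated panel data, estimated with an OLS interaction term; the group sizes, time trend, and true effect of 1.0 are illustrative assumptions:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2_000

# Two groups (treated / control), each observed in a pre and a post period
df = pd.DataFrame({
    "treated": np.repeat([0, 1], n // 2).tolist() * 2,
    "post":    np.repeat([0, 1], n),
})

# Outcome: group difference + common time trend + true treatment effect of 1.0
df["outcome"] = (
    0.5 * df["treated"] + 0.8 * df["post"]
    + 1.0 * df["treated"] * df["post"]
    + rng.normal(0, 1, size=len(df))
)

# The coefficient on treated:post is the DiD estimate of the causal effect
did = smf.ols("outcome ~ treated * post", data=df).fit()
print(f"DiD estimate: {did.params['treated:post']:.2f}")  # close to 1.0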

Let’s see how causal inference works on a real dataset

Dataset

The World Happiness Dataset is an annual report that ranks countries based on their citizens’ subjective well-being, or happiness levels. The report is published by the United Nations Sustainable Development Solutions Network, and it has been released since 2012. The dataset is based on data from the Gallup World Poll and other sources, aiming to provide insights into how social, economic, and political factors can influence a country’s happiness.

Link:-https://github.com/chakraborty-arnab/DataSphere/blob/main/TEH_World_Happiness_2015_2019.csv

Link to Notebook:-https://colab.research.google.com/drive/1e-OyqSLgl0MexW8d9BTp2No_wmFV-UmD#scrollTo=vpt3_ubKsxtf

Correlation Matrix

A correlation matrix is a square, symmetrical matrix that represents the pairwise correlation coefficients between multiple variables. A value of 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no correlation.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the World Happiness dataset (CSV linked above)
data = pd.read_csv('TEH_World_Happiness_2015_2019.csv')

# Calculate the correlation between all numeric variables
correlation_matrix = data.corr(numeric_only=True)

# Create a heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, square=True, linewidths=0.5)
plt.title('Correlation Matrix of World Happiness Report Variables')
plt.show()

OLS Regression with Control Variables

Ordinary Least Squares regression is a popular statistical method for estimating the relationship between a dependent variable and one or more independent variables. By including control variables in the regression model, we can account for the confounding factors that might influence the dependent variable but are not of primary interest. This approach helps to isolate the causal effect of the main independent variable on the dependent variable.

The general form of an OLS regression model with control variables is:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

Where:

  • Y is the dependent variable
  • X1, X2, …, Xk are the independent variables (including the main independent variable of interest and the control variables)
  • β0 is the intercept
  • β1, β2, …, βk are the coefficients of the independent variables
  • ε is the error term

import statsmodels.api as sm

# Prepare the data for regression
X = data[['GDP per capita', 'Social support', 'Healthy life expectancy','Freedom to make life choices','Generosity','Perceptions of corruption']]
X = sm.add_constant(X)
y = data['Happiness Score']

# Fit the linear regression model
model = sm.OLS(y, X).fit()
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:        Happiness Score   R-squared:                       0.764
Model:                            OLS   Adj. R-squared:                  0.762
Method:                 Least Squares   F-statistic:                     418.0
Date:                Sun, 23 Apr 2023   Prob (F-statistic):          4.88e-239
Time:                        23:24:34   Log-Likelihood:                -638.44
No. Observations:                 782   AIC:                             1291.
Df Residuals:                     775   BIC:                             1324.
Df Model:                           6
Covariance Type:            nonrobust
================================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const                            2.1779      0.080     27.279      0.000       2.021       2.335
GDP per capita                   1.1504      0.083     13.923      0.000       0.988       1.313
Social support                   0.6392      0.081      7.933      0.000       0.481       0.797
Healthy life expectancy          1.0016      0.131      7.621      0.000       0.744       1.260
Freedom to make life choices     1.4812      0.163      9.063      0.000       1.160       1.802
Generosity                       0.5957      0.176      3.391      0.001       0.251       0.940
Perceptions of corruption        0.8424      0.223      3.782      0.000       0.405       1.280
==============================================================================
Omnibus:                       16.182   Durbin-Watson:                   1.468
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               18.176
Skew:                          -0.286   Prob(JB):                     0.000113
Kurtosis:                       3.481   Cond. No.                         23.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Limitations of using OLS regression for causal inference:

  • Unobserved confounding: If there are unobserved confounding variables that are not included as control variables in the regression model, the causal interpretation of the coefficients may still be biased.
  • Reverse causality: OLS regression cannot address the possibility of reverse causality, where the dependent variable influences the independent variable.
  • Direction of causality: The estimated coefficients describe the direction and strength of the association, not the direction of causality.

Despite these limitations, OLS regression with control variables can be a valuable tool for causal inference, especially when combined with other techniques and robustness checks.

Propensity Score Matching

Propensity score matching is typically used in observational studies with binary treatment variables to balance the distribution of observed covariates between treatment and control groups, which helps reduce potential confounding. However, in the World Happiness Report, GDP is a continuous variable rather than a binary treatment, so countries cannot simply be split into treatment and control groups; we first need to discretize GDP.

Let’s create a binary treatment variable (high vs. low GDP per capita) and estimate propensity scores using logistic regression:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Create a binary treatment variable based on the 75th percentile of GDP per capita
gdp_boundary = data['GDP per capita'].quantile(0.75)
data['high_gdp'] = (data['GDP per capita'] >= gdp_boundary).astype(int)

# Fit a logistic regression model to estimate propensity scores
X = data[['Social support', 'Healthy life expectancy','Freedom to make life choices','Generosity','Perceptions of corruption']]
y = data['high_gdp']
propensity_model = LogisticRegression(random_state=42).fit(X, y)

# Calculate propensity scores
data['propensity_score'] = propensity_model.predict_proba(X)[:, 1]

Finally, let’s perform propensity score matching and compare the average happiness scores of the high-GDP and low-GDP groups:

from sklearn.neighbors import NearestNeighbors

# Split the data into high-GDP and low-GDP groups
high_gdp_data = data[data['high_gdp'] == 1]
low_gdp_data = data[data['high_gdp'] == 0]

# Perform propensity score matching using nearest neighbors
nn = NearestNeighbors(n_neighbors=1).fit(low_gdp_data['propensity_score'].values.reshape(-1, 1))
distances, indices = nn.kneighbors(high_gdp_data['propensity_score'].values.reshape(-1, 1))

# Create a DataFrame with matched pairs of high-GDP and low-GDP countries
matched_pairs = pd.concat([
high_gdp_data.reset_index(drop=True),
low_gdp_data.iloc[indices.flatten()].reset_index(drop=True)
], axis=1, keys=['high_gdp', 'low_gdp'])

# Compare the average happiness scores of the high-GDP and low-GDP groups
high_gdp_mean = matched_pairs['high_gdp']['Happiness Score'].mean()
low_gdp_mean = matched_pairs['low_gdp']['Happiness Score'].mean()
print(f"Average happiness score of high-GDP countries: {high_gdp_mean:.2f}")
print(f"Average happiness score of low-GDP countries: {low_gdp_mean:.2f}")
print(f"Difference in average happiness scores: {high_gdp_mean - low_gdp_mean:.2f}")
Average happiness score of high-GDP countries: 6.70
Average happiness score of low-GDP countries: 6.06
Difference in average happiness scores: 0.65
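Because matching produces paired observations, a natural follow-up (a quick check, not part of the original analysis) is a paired t-test on the matched happiness scores to see whether the gap is statistically distinguishable from zero; this reuses the matched_pairs DataFrame built above:

from scipy import stats

# Paired t-test on the matched high-GDP / low-GDP happiness scores
t_stat, p_value = stats.ttest_rel(
    matched_pairs['high_gdp']['Happiness Score'],
    matched_pairs['low_gdp']['Happiness Score']
)
print(f"Paired t-statistic: {t_stat:.2f}, p-value: {p_value:.4f}")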

Let’s visualize

The first visualization shows the distribution of happiness scores for the high-GDP and low-GDP groups. The second visualization displays the distribution of propensity scores for the two groups.

import matplotlib.pyplot as plt
import seaborn as sns
# Plot the distribution of happiness scores for high-GDP and low-GDP groups
sns.histplot(data=high_gdp_data, x='Happiness Score', color='blue', alpha=0.5, kde=True, label='High GDP')
sns.histplot(data=low_gdp_data, x='Happiness Score', color='red', alpha=0.5, kde=True, label='Low GDP')
plt.xlabel('Happiness Score')
plt.ylabel('Frequency')
plt.legend(title='GDP Group')
plt.title('Happiness Score Distribution by GDP Group')
plt.show()

# Plot the distribution of propensity scores for high-GDP and low-GDP groups
sns.histplot(data=high_gdp_data, x='propensity_score', color='blue', alpha=0.5, kde=True, label='High GDP')
sns.histplot(data=low_gdp_data, x='propensity_score', color='red', alpha=0.5, kde=True, label='Low GDP')
plt.xlabel('Propensity Score')
plt.ylabel('Frequency')
plt.legend(title='GDP Group')
plt.title('Propensity Score Distribution by GDP Group')
plt.show()

Key Findings

  • Correlation between GDP and happiness scores does not necessarily imply causation. It is essential to account for potential confounding factors and biases when analyzing the relationship between two variables.
  • OLS regression with control variables is a useful technique for estimating the causal effect of GDP on happiness scores. It helps isolate the effect of GDP by controlling for potential confounding factors, such as social support, life expectancy, freedom, generosity, and perceptions of corruption.
  • Propensity score matching is an alternative technique for causal inference that can help minimize biases arising from observed confounding factors. By matching treatment and control groups based on their propensity scores, it simulates the random assignment of a treatment, which aids in establishing causal relationships.

