Crash course in Causality

Harshit Mittal
Published in AI Skunks
Apr 23, 2023 · 21 min read

Imagine that you are walking down the street and you see a puddle of water. You can see that the puddle is caused by the rain, but you cannot see the rain itself. This is because the rain is a hidden variable.

Causal inference is the process of reasoning from observed data back to the causes that produced it. In this case, we are trying to figure out what caused the puddle of water: we know that rain can cause puddles, but we cannot observe the rain itself. Causal inference is a challenging task, but an important one. By understanding causality, we can make better decisions and have a real impact on the world.


In data science, we often aim to answer questions about causality: Does X cause Y? If so, how strong is the causal effect of X on Y? These questions are important for understanding the world around us and making data-driven decisions. In this article, we’ll explore the concepts of causality and causal inference in data science, and discuss some of the challenges and techniques for inferring causal relationships from data.

Table of contents

  • What is Causality?
  • Correlation vs. Causation
  • Causal inference fundamentals
  • The language of causation
  • Treatment effects
  • Challenges in inferring causality
  • Inferring causality in air pollution
  • Conclusion
  • References

What is Causality?

Causality is a fundamental concept in science and philosophy, referring to the relationship between an event (the cause) and a second event (the effect), where the second event is understood as a consequence of the first. Causality implies that there is a direct and meaningful relationship between two events, where the first event is necessary for the second to occur.

In data science, we often want to understand causal relationships between variables. For example, we may want to know if smoking causes lung cancer, or if a marketing campaign causes an increase in sales. However, inferring causality from observational data can be difficult, as there are often confounding variables that can affect both the cause and effect.

Correlation vs. Causation

While the terms correlation and causation are often used interchangeably, they have very different meanings.

Correlation refers to a statistical relationship between two variables. When two variables are correlated, it means that there is a pattern or tendency for one variable to change in a predictable way when the other variable changes.

For example, there may be a positive correlation between ice cream sales and temperature, which means that as temperature increases, ice cream sales tend to increase as well. Correlation can be measured using a variety of statistical methods, such as Pearson’s correlation coefficient or Spearman’s rank correlation coefficient.

Causation, on the other hand, refers to a relationship where one variable directly causes another variable to change. In other words, causation implies that there is a mechanism or process that links the two variables together. Causation is much more difficult to establish than correlation, and typically requires experimental or quasi-experimental designs that can control for confounding variables.

One way to remember the difference between correlation and causation is the phrase “correlation does not imply causation.” Just because two variables are correlated, it does not necessarily mean that one variable causes the other. There may be other factors or variables that influence both variables, or the relationship may be purely coincidental.

Here’s an example to illustrate the difference between correlation and causation: Suppose we are interested in studying the relationship between ice cream sales and crime rates. We collect data on ice cream sales and crime rates in a particular city over several months, and find that there is a strong positive correlation between the two variables. In other words, as ice cream sales increase, crime rates tend to increase as well. However, it would be a mistake to conclude that ice cream sales cause crime. There are likely other variables that influence both variables, such as temperature or time of day. It’s also possible that the relationship is purely coincidental, and there is no causal link between the two variables at all.
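To make the ice-cream-and-crime example concrete, here is a small simulation (hypothetical numbers, not real data) in which temperature drives both variables, producing a strong correlation with no causal link between them:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: temperature drives both variables; neither causes the other.
temperature = rng.normal(25, 5, n)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 3, n)
crime_rate = 1.5 * temperature + rng.normal(0, 3, n)

# Sales and crime are strongly correlated despite having no causal link.
r = np.corrcoef(ice_cream_sales, crime_rate)[0, 1]
print(f"corr(sales, crime) = {r:.2f}")

# Removing the common cause (here we know the true coefficients, so we can
# subtract its contribution exactly) leaves essentially zero correlation.
r_partial = np.corrcoef(ice_cream_sales - 2.0 * temperature,
                        crime_rate - 1.5 * temperature)[0, 1]
print(f"corr with temperature removed = {r_partial:.2f}")
```

The first correlation is large even though, by construction, ice cream sales and crime have no effect on each other; once the common cause is accounted for, the association vanishes.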

Correlation VS Causation

Causal inference fundamentals

  • Counterfactuals:

Counterfactuals refer to the notion of what would have happened if a particular event or intervention had not occurred or if a different action had been taken. In other words, it is the hypothetical scenario of what would have been observed if the treatment or exposure was different from what it actually was.

For example, consider a study that investigates the effectiveness of a new medication on reducing blood pressure. The group of participants who received the medication is referred to as the treatment group, while the group that did not receive the medication is the control group. The outcome of interest is the change in blood pressure after a certain period.

The counterfactual outcome for a participant in the treatment group is the hypothetical outcome if they did not receive the medication. Similarly, the counterfactual outcome for a participant in the control group is the hypothetical outcome if they did receive the medication.
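The potential-outcomes idea can be sketched with a toy simulation (hypothetical blood-pressure numbers): each person has two potential outcomes, but we only ever observe the one corresponding to the treatment they actually received.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical potential outcomes for five patients: y0 = blood pressure
# without the drug, y1 = with the drug (true individual effect: -10 mmHg).
y0 = rng.normal(150, 5, 5)
y1 = y0 - 10

# Each patient is either treated or not, so only one outcome is observed;
# the other is the unobservable counterfactual.
treated = np.array([1, 0, 1, 0, 1])
observed = np.where(treated == 1, y1, y0)
counterfactual = np.where(treated == 1, y0, y1)

print("observed:      ", observed.round(1))
print("counterfactual:", counterfactual.round(1))
```

This "missing half" of the data is exactly why causal inference is harder than ordinary prediction: the counterfactual column can never be measured, only estimated.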

  • Instrumental variables:

Instrumental variables (IV) estimation is a statistical technique used to estimate causal effects when there is unobserved confounding between the treatment and the outcome variable. It uses a third variable, known as an instrument, that affects the treatment variable but has no direct effect on the outcome variable, to estimate the causal effect.

An example of instrumental variables could be in estimating the causal effect of education on earnings. Let’s say we are interested in knowing if a college degree causes higher earnings. However, there could be confounding variables that affect both education and earnings, such as natural abilities or socioeconomic status. An instrumental variable could be the distance to the nearest college. Distance to the nearest college is related to the likelihood of attending college, but has no direct effect on earnings. The idea is that people who live closer to a college are more likely to attend college, but their proximity to a college does not affect their earnings directly. By using distance to the nearest college as an instrument, we can estimate the causal effect of education on earnings.

The IV estimation consists of two equations:

Y = a + bX + e
X = c + zD + u

Where:

Y is the outcome variable (e.g., earnings)

X is the treatment variable (e.g., education)

D is the instrument variable (e.g., distance to the nearest college)

a, b, c, and z are coefficients to be estimated

e and u are error terms
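The two equations above can be estimated with two-stage least squares (2SLS). The sketch below uses simulated data (the variable names and coefficients are illustrative assumptions, not estimates from any real study) and plain NumPy least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical data: unobserved 'ability' confounds education and earnings.
ability = rng.normal(0, 1, n)
distance = rng.normal(0, 1, n)  # instrument D: shifts education, not earnings
education = 1.0 - 0.5 * distance + ability + rng.normal(0, 1, n)
earnings = 2.0 * education + 3.0 * ability + rng.normal(0, 1, n)  # true b = 2

def ols_slope(x, y):
    """Slope from an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Naive OLS of Y on X is biased upward by the confounder.
b_naive = ols_slope(education, earnings)

# Stage 1: regress X on the instrument D and keep the fitted values.
X1 = np.column_stack([np.ones(n), distance])
coef = np.linalg.lstsq(X1, education, rcond=None)[0]
education_hat = X1 @ coef

# Stage 2: regress Y on the fitted values; the slope estimates the causal b.
b_iv = ols_slope(education_hat, earnings)

print(f"naive OLS: {b_naive:.2f}, 2SLS: {b_iv:.2f} (true effect: 2.0)")
```

The naive regression absorbs the ability confounding, while the 2SLS estimate recovers a value close to the true coefficient, because the fitted values from stage 1 vary only through the instrument.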

  • Randomized Control Trials (RCTs):

Randomized Control Trials (RCTs) are experiments in which individuals or groups are randomly assigned to either a treatment group or a control group to determine the causal effect of the treatment. This method is widely used in medical research, psychology, education, and social sciences.

Let’s look into an example of an RCT in the medical field:

Suppose a pharmaceutical company has developed a new drug for treating high blood pressure. To determine whether the drug is effective, the company conducts an RCT. They randomly select 500 patients with high blood pressure and divide them into two groups — a treatment group and a control group. The treatment group is given the new drug, and the control group is given a placebo (a harmless pill that looks identical to the drug). Both groups are unaware of whether they are receiving the drug or the placebo.

After a certain period, the company measures the blood pressure of both groups and compares the results. They find that the treatment group’s blood pressure has decreased significantly compared to the control group. Thus, they can conclude that the new drug is effective in treating high blood pressure.

The key advantage of RCTs is that they eliminate the possibility of confounding variables, which are factors that could affect the outcome but are not controlled for in the study. By randomly assigning participants to the treatment and control groups, any confounding variables are equally distributed between the two groups, making it easier to determine the causal effect of the treatment.
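A minimal simulation of such a trial (hypothetical numbers) shows why randomization works: the simple difference in group means recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Hypothetical trial: 500 patients, half randomly assigned the new drug.
treated = rng.permutation(np.repeat([1, 0], n // 2))
baseline = rng.normal(150, 10, n)            # pre-existing differences
effect = -12.0                               # true drug effect in mmHg
blood_pressure = baseline + effect * treated + rng.normal(0, 5, n)

# Because assignment is random, baseline factors balance out across groups
# and the difference in means is an unbiased estimate of the causal effect.
ate_hat = blood_pressure[treated == 1].mean() - blood_pressure[treated == 0].mean()
print(f"estimated effect: {ate_hat:.1f} mmHg (true: {effect})")
```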

  • Propensity score matching:

Propensity score matching is a statistical technique used in observational studies to reduce the effects of confounding variables and estimate treatment effects. It involves creating a score, known as the propensity score, which predicts the likelihood of a participant receiving a treatment based on their observed covariates. Participants who received the treatment are then matched with those who did not receive the treatment, but have similar propensity scores.

For example, suppose a study is investigating the effect of a new drug on reducing cholesterol levels. The researchers recruit 100 participants, 50 of whom are randomly assigned to receive the drug and 50 who are assigned to receive a placebo. However, the researchers notice that the group receiving the drug has a higher percentage of males, older participants, and participants with a family history of high cholesterol. These variables are known to be associated with cholesterol levels, so the researchers decide to use propensity score matching to adjust for these variables.

The researchers use logistic regression to create a propensity score for each participant, which takes into account their age, sex, and family history of high cholesterol. They then match each participant who received the drug with a participant who did not receive the drug, but has a similar propensity score. The researchers can then compare the cholesterol levels between the matched pairs to estimate the treatment effect of the drug.
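A rough sketch of this procedure on simulated data (using scikit-learn's LogisticRegression for the propensity model and simple nearest-neighbor matching on the score; the covariates and effect size are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 1000

# Hypothetical observational data: age and sex raise both the chance of
# receiving the drug AND cholesterol, confounding the naive comparison.
age = rng.normal(50, 10, n)
male = rng.integers(0, 2, n).astype(float)
X = np.column_stack([age, male])

p_treat = 1 / (1 + np.exp(-(0.05 * (age - 50) + 0.5 * (male - 0.5))))
treated = rng.binomial(1, p_treat)
cholesterol = 200 + 0.5 * age + 5 * male - 15 * treated + rng.normal(0, 5, n)

# Naive difference in means is biased: treated units are older, more male.
naive = cholesterol[treated == 1].mean() - cholesterol[treated == 0].mean()

# 1. Estimate each unit's propensity score from the observed covariates.
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 2. Match every treated unit to the control with the closest score.
treat_idx = np.where(treated == 1)[0]
ctrl_idx = np.where(treated == 0)[0]
closest = np.abs(ps[ctrl_idx][None, :] - ps[treat_idx][:, None]).argmin(axis=1)
matched_ctrl = ctrl_idx[closest]

# 3. The average outcome difference over matched pairs estimates the ATT.
att = (cholesterol[treat_idx] - cholesterol[matched_ctrl]).mean()
print(f"naive: {naive:.1f}, matched: {att:.1f} (true effect: -15)")
```

The matched estimate lands much closer to the built-in effect than the naive comparison, because each treated unit is compared to a control with a similar likelihood of treatment.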

The language of causation

The language of causation refers to the set of tools and concepts used in data science and statistics to understand the relationships between variables and to establish causal effects. This includes the use of causal models, which are mathematical representations of the mechanisms by which variables influence each other, and the study of causal effects, which involve the investigation of the effects of interventions on variables of interest.

  • Causal Models:

Causal models are representations of the causal relationships between variables, often expressed as directed acyclic graphs (DAGs). In a DAG, variables are represented as nodes, and arrows indicate the direction of influence between variables.

For example, if we were interested in understanding the causal relationship between smoking and lung cancer, we could represent this relationship as a DAG with smoking as the cause and lung cancer as the effect.

DAG representation

Causal models are useful because they allow us to make predictions about the effects of interventions on variables of interest. By simulating the effects of different interventions on a causal model, we can explore the potential outcomes of different courses of action and make informed decisions.
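A DAG can be represented directly in code. Here is a toy sketch as an adjacency list (in practice, libraries such as graphviz, networkx, or the lingam package used later in this article handle this), plus a check that the graph really is acyclic:

```python
# Toy DAG as an adjacency list: each edge points from cause to effect.
dag = {
    "Age": ["Smoking", "Lung cancer"],  # Age influences both variables
    "Smoking": ["Lung cancer"],
    "Lung cancer": [],
}

def is_acyclic(graph):
    """Depth-first search that returns False if a directed cycle exists."""
    WHITE, GREY, BLACK = 0, 1, 2    # unvisited / on current path / finished
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GREY
        for w in graph[v]:
            if color[w] == GREY or (color[w] == WHITE and not visit(w)):
                return False        # a back edge means a cycle
        color[v] = BLACK
        return True

    return all(color[v] != WHITE or visit(v) for v in graph)

print(is_acyclic(dag))  # True: a valid DAG
```

The acyclicity requirement is what makes causal reasoning on the graph well defined: no variable can be (directly or indirectly) its own cause.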

  • Causal Effects:

Causal effects refer to the changes in a variable that result from a specific intervention. In other words, the causal effect of an intervention is the difference between what happens when the intervention is applied and what would have happened if the intervention had not been applied.

There are several different types of causal effects, including average causal effects, conditional causal effects, and counterfactual effects. Average causal effects refer to the overall effect of an intervention on a population, while conditional causal effects refer to the effect of an intervention on a specific subpopulation. Counterfactual effects refer to the difference in outcome that would have occurred if a different intervention had been applied.

  • Model checking:

To check whether a causal model is consistent with the data, researchers use a process called model checking. This involves comparing the predictions of the causal model to the observed data to determine whether they match.

For example, if we have a causal model that predicts that smoking causes lung cancer, we can test this by comparing the rate of lung cancer in smokers versus non-smokers. If the data shows a higher rate of lung cancer in smokers, this supports the causal model.

However, there are many factors that can affect the relationship between smoking and lung cancer, such as age, gender, and genetics. Therefore, researchers must take these factors into account when constructing the causal model and performing model checking.

Treatment effects

Treatment effects refer to the causal effect of a treatment or intervention on an outcome. In other words, it is the difference in outcome that can be attributed to the treatment itself, as opposed to other factors.

There are several types of treatment effects:

  • Average treatment effect (ATE):

The ATE is the average difference in outcome between a treatment group and a control group. It provides an estimate of the overall effect of the treatment on the population.

Example: A study is conducted to evaluate the effectiveness of a new medication for reducing blood pressure. Participants are randomly assigned to either receive the medication or a placebo. The ATE would be the average difference in blood pressure between the two groups.

  • Treatment effect heterogeneity:

Treatment effect heterogeneity refers to the variation in treatment effects across different subgroups of the population. This can be useful for understanding which groups of people benefit most from the treatment.

Example: In the same study as above, treatment effect heterogeneity could be evaluated by examining whether the treatment is more effective for men or women, or for people with different levels of baseline blood pressure.

  • Intention-to-treat effect (ITT):

The ITT refers to the effect of being assigned to a treatment group, regardless of whether the participant actually received the treatment. This is important because in some cases, participants may not comply with the treatment or may drop out of the study.

Example: In a study evaluating the effectiveness of a weight loss program, some participants may not adhere to the program and may not lose weight. The ITT analysis would still include all participants in the treatment group, regardless of their level of adherence.
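A small simulation (hypothetical weight-loss numbers) contrasts the ITT estimate with a naive "as-treated" comparison, which is confounded by unobserved motivation:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

# Hypothetical weight-loss trial: assignment is random, but only motivated
# participants assigned to the program actually follow it (noncompliance).
assigned = rng.permutation(np.repeat([1, 0], n // 2))
motivation = rng.normal(0, 1, n)                    # unobserved
follows = assigned * (motivation > 0).astype(int)

# True effect for those who follow: -4 kg; motivated people also lose
# more weight regardless, which confounds any "as-treated" comparison.
weight_change = -4.0 * follows - 2.0 * motivation + rng.normal(0, 1, n)

# ITT: compare by *assignment*; randomization keeps this comparison fair.
itt = weight_change[assigned == 1].mean() - weight_change[assigned == 0].mean()

# "As-treated": compare by actual behavior; confounded by motivation.
as_treated = weight_change[follows == 1].mean() - weight_change[follows == 0].mean()

print(f"ITT: {itt:.1f} kg, as-treated: {as_treated:.1f} kg")
```

The ITT estimate (about -2 kg here) is the per-complier effect diluted by the 50% compliance rate, while the as-treated comparison overstates the effect because the followers were more motivated to begin with.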

  • Causal mediation effect:

The causal mediation effect refers to the extent to which the treatment effect is explained by changes in a mediator variable (i.e., a variable that is affected by the treatment and in turn affects the outcome).

Example: In a study evaluating the effectiveness of a smoking cessation program, the mediator variable could be the number of cigarettes smoked per day. The causal mediation effect would estimate the extent to which the treatment effect is explained by the reduction in cigarette smoking.

  • Average Treatment Effect on the Treated (ATT):

The Average Treatment Effect on the Treated (ATT) is the average treatment effect for the subset of the treatment group who actually received the treatment. This measure tells us what is the effect of the treatment for those individuals who actually received it. ATT is calculated by comparing the average outcome of the treated group with the average outcome of the control group for the subset of individuals who actually received the treatment.

Example: In a study of a job training program, we may be interested in the effect of the program for those who actually complete the training (the treated group). In this case, we would estimate the average treatment effect on the treated (ATT), which is the difference in outcomes between those who completed the program and those who did not, taking the average only among those who completed the program. For example, if the program completers had an average increase in earnings of $5000 per year, while the non-completers had an average increase of $2000 per year, the ATT would be $3000 per year.

  • Conditional Average Treatment Effect (CATE):

The Conditional Average Treatment Effect (CATE) is a measure of treatment effect for a specific subgroup within the treatment group based on their characteristics. This measure tells us the effect of treatment for individuals with specific characteristics. For example, we may want to know the effect of a new drug for individuals with a certain genetic mutation. CATE is calculated by comparing the average outcome of the treated group with the average outcome of the control group for individuals with a specific characteristic or set of characteristics.

Example: In a study of class size and academic performance, we may be interested in the effect of class size for certain subgroups of students, such as those who are low-income or have learning disabilities. In this case, we may estimate the conditional average treatment effect (CATE) for each subgroup separately. For example, we may find that for low-income students the effect of class size is much larger, with a CATE of 10 points, while for students without low-income status the CATE is only 3 points.
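These subgroup estimates reduce to differences in means within each subgroup. A sketch with simulated data matching the class-size example (the 10-point and 3-point effects are assumptions built into the simulation):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4000

# Hypothetical class-size study: the benefit of a small class is 10 points
# for low-income students and 3 points for everyone else (set below).
low_income = rng.integers(0, 2, n)
small_class = rng.permutation(np.repeat([1, 0], n // 2))
effect = np.where(low_income == 1, 10.0, 3.0)
score = 70 + effect * small_class + rng.normal(0, 5, n)

def diff_in_means(mask):
    """Treated-minus-control mean outcome within the masked subgroup."""
    t = mask & (small_class == 1)
    c = mask & (small_class == 0)
    return score[t].mean() - score[c].mean()

ate = diff_in_means(np.ones(n, dtype=bool))
cate_low = diff_in_means(low_income == 1)
cate_other = diff_in_means(low_income == 0)
print(f"ATE: {ate:.1f}, CATE low-income: {cate_low:.1f}, CATE other: {cate_other:.1f}")
```

The overall ATE sits between the two subgroup effects, which is exactly why averaging can hide treatment effect heterogeneity.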

Overall, understanding different types of treatment effects can help researchers to better evaluate the effectiveness of interventions and to identify which subgroups of the population may benefit most from the treatment.

Challenges in inferring causality

Inferring causality is a challenging task in data science and statistics. Some of the key challenges include:

  • Confounding variables:

One of the biggest challenges in inferring causality from observational data is the presence of confounding variables. Confounding variables are variables that are related to both the cause and the effect, making it difficult to determine whether the cause is directly responsible for the effect.

For example, suppose we want to determine whether smoking causes lung cancer. We might collect data on smoking behavior and lung cancer rates and find a strong correlation between the two variables. However, there are many confounding variables that could be responsible for this correlation, such as age, gender, and exposure to other toxins. If we don’t take these variables into account, we may incorrectly conclude that smoking causes lung cancer when in fact it does not.

Confounder representation

To address the problem of confounding variables, we can use a technique called regression analysis. Regression analysis allows us to control for the effects of other variables and isolate the effect of the cause on the effect. By controlling for confounding variables, we can better infer causality from observational data.
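A quick illustration of regression adjustment (simulated data; the smoking "effect" here is an assumption of the simulation, not a medical estimate): including the confounder as a covariate moves the estimated slope from the biased naive value back toward the true one.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Hypothetical data: age drives both smoking and a lung-disease risk score,
# so the raw smoking-risk association overstates the true effect (set to 2).
age = rng.normal(50, 10, n)
smoking = 0.1 * age + rng.normal(0, 2, n)
risk = 2.0 * smoking + 0.3 * age + rng.normal(0, 2, n)

def lstsq_coef(X, y, col):
    """Coefficient of column `col` in an OLS fit of y on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0][col]

# Regressing risk on smoking alone picks up the confounded association.
b_naive = lstsq_coef(np.column_stack([np.ones(n), smoking]), risk, 1)

# Adding age as a covariate "controls for" the confounder.
b_adj = lstsq_coef(np.column_stack([np.ones(n), smoking, age]), risk, 1)

print(f"naive slope: {b_naive:.2f}, age-adjusted slope: {b_adj:.2f} (true: 2.0)")
```

Note this only works for confounders we have measured; unobserved confounding is why techniques like instrumental variables exist.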

  • Reverse causality:

Reverse causality is a situation in which the presumed cause of an effect may actually be the result of that effect. In other words, the direction of the causation is reversed from what is initially assumed. This can lead to misleading or incorrect conclusions when attempting to establish causal relationships between variables.

An example of reverse causality is the relationship between exercise and weight loss. Initially, it is commonly assumed that exercise causes weight loss. However, in some cases, the direction of the causation can be reversed, and weight loss can actually cause exercise. For example, if a person loses a significant amount of weight due to an illness or other factor, they may subsequently engage in more exercise as a result of feeling better and having more energy. In this case, the weight loss is the cause and the exercise is the effect, even though the initial assumption was the opposite.

Reverse causality representation

  • Selection bias:

Selection bias is a type of bias that occurs when the sample or population being studied is not representative of the entire population, leading to incorrect or misleading conclusions. It arises when the selection of participants into a study or experiment is not random, resulting in a non-representative sample.

For example, consider a study that aims to investigate the effectiveness of a new medication for a certain disease. The study only recruits patients from a single hospital, which is a specialized center for treating the disease. The study finds that the medication is highly effective in treating the disease. However, the conclusion is biased because the sample only includes patients who are receiving specialized care at the hospital and may not be representative of the larger population with the disease. Patients who are not receiving specialized care may have different characteristics and experiences that could affect the effectiveness of the medication.

Selection bias representation

  • Sample size:

Sample size refers to the number of individuals or units that are included in a study or experiment. In statistical analysis, sample size plays an important role in determining the reliability and generalizability of the findings. Small sample sizes can lead to low statistical power and increase the risk of Type II errors, meaning that a true effect is not detected. This can impact the ability to establish causality. In general, larger sample sizes are more desirable as they provide greater statistical power and precision, but they also come with increased costs and resources. Determining the appropriate sample size for a study depends on various factors, including the research question, study design, and available resources.
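The link between sample size and power can be illustrated with a short Monte-Carlo simulation (an approximate z-test on hypothetical data; packages such as statsmodels provide exact power calculations):

```python
import numpy as np

rng = np.random.default_rng(8)

def power(n_per_group, effect=5.0, sd=10.0, sims=2000):
    """Monte-Carlo power of an approximate two-sided z-test at alpha = 0.05."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, sd, n_per_group)       # control group
        b = rng.normal(effect, sd, n_per_group)    # treated group
        se = np.sqrt(a.var(ddof=1) / n_per_group + b.var(ddof=1) / n_per_group)
        if abs(b.mean() - a.mean()) / se > 1.96:   # 5% two-sided critical value
            hits += 1
    return hits / sims

# A medium effect (0.5 sd) is usually missed with 10 per group (a Type II
# error) but is almost always detected with 100 per group.
for n in (10, 30, 100):
    print(f"n = {n:3d} per group -> power ~ {power(n):.2f}")
```

With small samples the same true effect goes undetected most of the time, which is exactly the Type II error risk described above.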

Inferring causality in air pollution

Dataset description

Global Air Pollution Dataset: https://www.kaggle.com/datasets/hasibalmuzdadid/global-air-pollution-dataset

This dataset provides geolocated information about major air pollutants, including Nitrogen Dioxide, Ozone, Carbon Monoxide, and Particulate Matter. These pollutants are harmful to human health and can cause respiratory and other diseases, as well as contribute to morbidity and mortality. The dataset also highlights the sources of these pollutants, such as cars, trucks, power plants, and household combustion devices.

The dataset includes the following features:

  • Country: name of the country
  • City: name of the city
  • AQI Value: overall air quality index value of the city
  • AQI Category: overall air quality index category of the city
  • CO AQI Value: air quality index value of carbon monoxide of the city
  • CO AQI Category: air quality index category of carbon monoxide of the city
  • Ozone AQI Value: air quality index value of ozone of the city
  • Ozone AQI Category: air quality index category of ozone of the city
  • NO2 AQI Value: air quality index value of nitrogen dioxide of the city
  • NO2 AQI Category: air quality index category of nitrogen dioxide of the city
  • PM2.5 AQI Value: air quality index value of particulate matter with a diameter of 2.5 micrometers or less of the city
  • PM2.5 AQI Category: air quality index category of particulate matter with a diameter of 2.5 micrometers or less of the city
# Importing required packages
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import Image
import category_encoders as ce

import graphviz
import lingam
from lingam.utils import make_dot, make_prior_knowledge

# Path to the downloaded Kaggle CSV
dataset = "C:\\Users\\hmitt\\Downloads\\archive (1)\\global air pollution dataset.csv"
df = pd.read_csv(dataset)
df

# Identify the total no. of countries in the dataset
df['Country'].nunique()

# Identify the total no. of cities in the dataset
df['City'].nunique()

# Checking for any city
df[df['City'] == 'Boston']

The code below creates a set of five subplots, each showing the distribution of one air quality index value: AQI Value, CO AQI Value, Ozone AQI Value, NO2 AQI Value, and PM2.5 AQI Value. Seaborn's histplot() function (the modern replacement for the deprecated distplot()) draws a histogram together with a kernel density estimate on the same axis.

my_column = ['AQI Value', 'CO AQI Value', 'Ozone AQI Value', 'NO2 AQI Value', 'PM2.5 AQI Value']
fig = plt.figure(figsize=(15, 5))

for i in range(len(my_column)):
    plt.subplot(1, 5, i + 1)
    plt.title(my_column[i])
    sns.histplot(data=df, x=my_column[i], kde=True)

plt.tight_layout()
plt.show()

Each subplot shows the distribution of one air quality index value. From the plot, we can observe the shape of the distribution, which gives an idea of the skewness and kurtosis of the variable, helps identify outliers, and shows the range of values the variable takes.

Boxplots give us a visual representation of the distribution of the air quality index values, broken down by the AQI Category. This allows us to compare the distribution of the variable across categories and identify any patterns or differences in the data.

fig = plt.figure(figsize=(15, 5))

for i in range(len(my_column)):
    plt.subplot(2, 3, i + 1)
    plt.title(my_column[i])
    sns.boxplot(data=df, x=my_column[i], y='AQI Category')

plt.tight_layout()
plt.show()

The boxplot for each air quality index value is broken down by the AQI Category, allowing us to compare the distribution of the variable across different categories. From the plot, we can observe the spread and central tendency of the data for each category, as well as any outliers that may be present. It can also help us identify if there are any systematic differences in the distribution of the variable across categories.

The pairplot() function creates a grid of scatterplots and histograms that allows us to visualize the relationships between pairs of variables in the DataFrame. This can be useful in identifying patterns and relationships between variables

sns.pairplot(df)

When applied to the given DataFrame, the pairplot() function generates a grid of scatterplots for all possible combinations of the numeric columns. The plot for each pair of columns displays a scatterplot along with the histograms for both variables. We can observe there is a positive correlation between AQI Value and PM2.5 AQI, as the data points in the scatterplot form a roughly upward sloping pattern. Conversely, if there is a negative correlation, the data points would form a roughly downward sloping pattern. We can also use the histograms to identify the distribution of each variable and detect any outliers.

The sns.heatmap() function in Seaborn library creates a colored matrix where the color of each cell represents the correlation between two variables. The darker colors indicate stronger correlations, and lighter colors indicate weaker correlations or no correlation.

sns.heatmap(df.corr(numeric_only=True), cbar=False, annot=True, fmt='.2f', cmap='Blues')

The heatmap can be useful in identifying which variables are strongly correlated with each other. With the 'Blues' colormap used here, a darker cell between two variables means a stronger positive correlation (a diverging colormap such as 'coolwarm' would also make strong negative correlations stand out). Identifying these patterns can help us in selecting variables for our causal models. Here, we can observe that AQI Value and PM2.5 AQI Value have a high correlation of 0.98.

Lmplot between AQI Value and PM2.5 AQI Value with AQI Category

sns.lmplot (data=df, x='AQI Value', y='PM2.5 AQI Value', hue="AQI Category")

This lmplot creates a scatter plot between two variables, 'AQI Value' and 'PM2.5 AQI Value', with the data points colored based on their 'AQI Category'. It also fits a linear regression line to the data points, which helps in understanding the relationship between the two variables. The line of best fit can also be used to estimate the value of one variable given the value of the other.

updated_df = df.drop(["Country", "City", "AQI Value", "AQI Category", "CO AQI Category", "Ozone AQI Category", "NO2 AQI Category", "PM2.5 AQI Category"], axis=1)
updated_df = updated_df.dropna(how='any')

df_list = updated_df.columns.to_list()
my_data_dictionary = {}

# Map each remaining column name to its positional index
for i, column in enumerate(df_list):
    my_data_dictionary[column] = i

print("Elements in the dictionary:", len(df_list))
print("Dictionary:", my_data_dictionary)

PyGAM

PyGAM is a Python package for building and estimating generalized additive models (GAMs). GAMs are a flexible extension of linear models that allow for non-linear relationships between the dependent variable and independent variables. PyGAM provides a high-level interface for building GAMs, making it easier for researchers and data scientists to fit complex models and explore non-linear relationships in their data. It supports a variety of GAMs, including smoothing spline models, thin plate regression splines, and cubic regression splines. PyGAM also supports a wide range of distribution families and link functions, allowing for flexible specification of the response variable distribution.

LiNGAM

LiNGAM (Linear Non-Gaussian Acyclic Model) is a statistical method for inferring causal relationships from multivariate observational data. Rather than assuming Gaussian noise, it exploits non-Gaussianity in the data to identify the direction of causation, something correlation alone cannot do.

The basic idea behind LiNGAM is to model the causal relationships between variables as a directed acyclic graph (DAG) and use a linear model to estimate the causal effect of each variable on the others. The algorithm searches for the DAG that best represents the causal structure, under the assumption that the error terms are non-Gaussian and mutually independent.

One advantage of the LiNGAM family is that time-series extensions such as VAR-LiNGAM can handle both instantaneous and lagged causal effects. It also works well with high-dimensional data, where there are many variables to consider. The resulting causal graph can be useful for identifying key drivers of a system, predicting the effect of interventions, and understanding the underlying mechanisms of complex processes.

This code creates a LiNGAM model and fits it to the input data ‘updated_df’ which is assumed to contain variables that are causally related to each other. The ‘prior_knowledge’ parameter specifies any prior knowledge about the causal relationships that is available to the model.

prior_knowledge = make_prior_knowledge(
    n_variables=4,
    paths=[[my_data_dictionary['CO AQI Value'], my_data_dictionary['NO2 AQI Value']]],
)

model = lingam.DirectLiNGAM(
    random_state=10,
    prior_knowledge=prior_knowledge,
    measure='pwling',
)
model.fit(updated_df)

dot = make_dot(
    model.adjacency_matrix_,
    labels=updated_df.columns.to_list(),
)
dot.format = 'png'
dot.render('PM2.5_Graph')

Image("PM2.5_Graph.png")

After fitting the model, the code generates a causal graph visualization using the make_dot utility from the lingam package (which builds a graphviz Digraph). The graph represents the causal relationships inferred by the LiNGAM algorithm: the nodes are the variables in the input data, the edges are the inferred causal relationships, and each arrowhead points from the cause to the effect.

We can observe from the graph that CO AQI Value has the largest estimated causal coefficient on PM2.5 AQI Value, followed by NO2 AQI Value.

Conclusion

In conclusion, understanding causality is crucial in many fields, including public health, economics, and social sciences. Correlation and causation can often be confused, but causal inference methods help us distinguish between the two. Causal diagrams and model checking help us identify potential confounding factors and design appropriate experiments to infer causality. Treatment effects are a key component of causal inference, and different types of treatment effects can be estimated depending on the research question. However, there are many challenges in inferring causality, such as selection bias, reverse causality, and unobserved confounding. Applying causal inference methods to real-world problems, such as air pollution, can help us identify interventions that can improve public health. By incorporating these concepts into research and decision-making, we can make more informed and impactful decisions.

References

