Crash Course in Causality

Cao Jianzhen
Published in AI Skunks · 12 min read · Apr 23, 2023

What is causality?

Causality is the relationship between a cause and its effect. It is a fundamental concept in many fields, including philosophy, science, and statistics. In science, causality is the foundation of our understanding of how the natural world operates. In statistics, causality is concerned with identifying causal relationships between variables, and it is a critical tool for making predictions and decisions based on data.
Take smoking and lung cancer as an example from the medical field. This is a well-established causal relationship where smoking is the cause and lung cancer is the effect. Studies have consistently found a strong association between smoking and lung cancer, and experimental studies have demonstrated that smoking can cause cancer.
We accept this conclusion naturally, but here comes a question: why can we conclude a causal relationship between smoking and lung cancer? What are the characteristics of this relationship?
In fact, there are several key components to causality that must be understood in order to establish a causal relationship between two variables:

  1. Temporal precedence: This refers to the idea that the cause must occur before the effect. In other words, the cause must come first in time, and the effect must follow.
  2. Empirical association: This refers to the idea that there must be a statistical association between the cause and the effect. In other words, the cause and effect must be observed to occur together more often than would be expected by chance.
  3. The absence of alternative explanations: This refers to the idea that the causal relationship cannot be explained by other factors. In other words, the observed association between the cause and effect must not be due to other factors that could explain the relationship.
  4. Plausible mechanism: This refers to the idea that there must be a plausible mechanism by which the cause can produce the effect. In other words, it must be possible to explain how the cause leads to the effect.
  5. Coherence: This refers to the idea that the causal relationship must be consistent with other known facts about the world. In other words, the causal relationship must fit into a broader understanding of how the world works.

You may have noticed that many other causal relationships are established based on the components above, such as lower interest rates leading to higher consumer spending, or eating too much sugar causing obesity.

Research on causality proceeds in two directions:

  1. Testing how strong a causal relationship between variables is
  2. Modeling causal relationships between variables

Treatment

A treatment is a specific intervention, action, or exposure that is being studied for its causal effect on an outcome. If we want to know whether there is a causal relationship between variables X and Y, a direct and simple way is to change the value of X, say setting X=0 or X=1, and observe the value of Y. This action of setting X=0 or X=1 is the treatment.

For example, in a medical study, a treatment might be a drug or a surgical procedure that is being evaluated for its effectiveness in treating a particular disease or condition. In the study, individuals can be segmented into two groups: the treatment group (takes the drug, X=1) and the control group (no drug, X=0).
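To make the grouping concrete, here is a minimal sketch with hypothetical trial records (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical trial records: X = 1 means the patient received the drug
df = pd.DataFrame({
    "X":         [1, 0, 1, 0, 1, 0],
    "recovered": [1, 0, 1, 1, 1, 0],
})

# Segment individuals by treatment status
treatment_group = df[df["X"] == 1]   # received the drug (X=1)
control_group   = df[df["X"] == 0]   # no drug (X=0)

print(len(treatment_group), len(control_group))  # prints "3 3"
```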

This raises another question: how do we measure the effectiveness of a drug? How do we measure the impact of a teaching method?

In causal inference, we call this effectiveness or impact the treatment effect. A common way to measure the treatment effect is the ATE (average treatment effect), which quantifies the causal effect of a treatment on an outcome. It is defined as the average difference in the outcome between two groups: one group that receives the treatment and another group that does not.

Mathematically, the ATE is calculated as:

ATE = E[Y(1) - Y(0)] = E[Y(1)] - E[Y(0)]

where Y(1) is the potential outcome if an individual receives the treatment, Y(0) is the potential outcome if they do not, and E[·] denotes the expected value, taken as an average over the population.

ATE is a useful measure because it provides a single summary statistic that quantifies the causal effect of the treatment on the outcome. However, it assumes that the causal effect of the treatment is the same for all individuals in the population, which may not always be true.
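When treatment is randomly assigned, E[Y(1)] - E[Y(0)] can be estimated simply as the difference in group means. Here is a small sketch on synthetic data with a known true effect of 2 (the data is simulated, not from any real study):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated randomized experiment: treatment D raises the outcome by 2 on average
n = 10_000
D = rng.integers(0, 2, size=n)           # 1 = treated, 0 = control, assigned at random
Y = 5 + 2 * D + rng.normal(0, 1, n)      # true treatment effect is 2

# Under randomization, the difference in group means estimates the ATE
ate_hat = Y[D == 1].mean() - Y[D == 0].mean()
print(ate_hat)
```

With 10,000 samples the estimate lands very close to the true effect of 2; with observational data (no randomization) this simple difference would be biased by confounders, which is exactly the problem the rest of the article addresses.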

Treatment effect estimation

Let’s take an example with a dataset from Kaggle. This dataset is about breast cancer; it records ‘Age at Diagnosis’, ‘Patient’s Vital Status’, and some other medical indicators. We want to estimate the ATE of the treatment PR Status on the outcome Patient’s Vital Status.

import pandas as pd
from causalinference import CausalModel

data = pd.read_csv("https://raw.githubusercontent.com/rrRIOoo/data_cache/main/Breast%20Cancer%20METABRIC.csv")
data.dropna(axis=0, how='any',inplace=True)
data.replace('Living',1,inplace=True)
data.replace('Died of Other Causes',1,inplace=True)
data.replace('Died of Disease',0,inplace=True)
data.replace('Positive',1,inplace=True)
data.replace('Negative',0,inplace=True)

Y = data['Patient\'s Vital Status'].values
D = data['PR Status'].values
X = data[['Age at Diagnosis', 'Neoplasm Histologic Grade','Lymph nodes examined positive','Mutation Count','Nottingham prognostic index']].values
model = CausalModel(Y,D,X)

# Use OLS
model.est_via_ols()
print(model.estimates)
Treatment Effect Estimates: OLS

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      0.035      0.030      1.171      0.241     -0.023      0.093
           ATC      0.021      0.032      0.662      0.508     -0.041      0.083
           ATT      0.048      0.031      1.510      0.131     -0.014      0.109

In this snippet we construct the model and estimate the ATE via OLS. The ATE value is relatively small and not statistically significant (p = 0.241), which implies a weak causal relationship.

However, there is still a problem. Our model has several covariates: ‘Age at Diagnosis’, ‘Neoplasm Histologic Grade’, and so on. If a twenty-year-old patient is still alive while a sixty-year-old has died, can we conclude that all of the effect comes from the treatment ‘PR Status’?

Of course not. A younger woman usually has a stronger immune system than an older woman, which may keep her alive longer after diagnosis. We want to remove the effect of these other factors, and a common method for doing so is propensity score matching.

Propensity score matching

Propensity score matching (PSM) is a method used in causal inference to reduce selection bias. It involves matching treatment and control units based on their propensity score, which is the probability of receiving the treatment given their observed covariates.

The goal of PSM is to create a balance between the treatment and control groups in terms of observed covariates, so that the groups are comparable and any differences in the outcome variable can be attributed to the treatment. The PSM approach is based on the assumption that, conditional on the propensity score, the treatment and control units are exchangeable.

We can do PSM in 4 steps:

  1. Estimate the propensity score: Use a logistic regression model to estimate the propensity score for each individual in the sample. The propensity score is the predicted probability of receiving the treatment, given the individual’s observed covariates.
  2. Match treatment and control units: Match treatment and control units based on their propensity scores. This can be done using various matching algorithms, such as nearest-neighbor matching, caliper matching, or kernel matching.
  3. Assess balance: Evaluate the balance between the treatment and control groups in terms of observed covariates. This can be done by comparing the means or distributions of covariates in the two groups before and after matching.
  4. Estimate treatment effect: Estimate the causal effect of the treatment on the outcome variable using the matched data. This can be done using various methods, such as a t-test or regression analysis.

import pandas as pd
from causalinference import CausalModel

data = pd.read_csv("https://raw.githubusercontent.com/rrRIOoo/data_cache/main/Breast%20Cancer%20METABRIC.csv")
data.dropna(axis=0, how='any',inplace=True)
data.replace('Living',1,inplace=True)
data.replace('Died of Other Causes',1,inplace=True)
data.replace('Died of Disease',0,inplace=True)
data.replace('Positive',1,inplace=True)
data.replace('Negative',0,inplace=True)

Y = data['Patient\'s Vital Status'].values
D = data['PR Status'].values
X = data[['Age at Diagnosis', 'Neoplasm Histologic Grade','Lymph nodes examined positive','Mutation Count','Nottingham prognostic index']].values
model = CausalModel(Y,D,X)

# Use PSM
model.est_via_matching()
print(model.estimates)
Treatment Effect Estimates: Matching

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      0.043      0.047      0.908      0.364     -0.050      0.136
           ATC      0.058      0.054      1.070      0.285     -0.048      0.163
           ATT      0.030      0.055      0.537      0.591     -0.079      0.138

We can see the ATE value is larger (0.043 vs. 0.035), which suggests the treatment has more effect than the OLS estimate indicated at the beginning.
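The est_via_matching call above performs these steps internally. To make the mechanics visible, here is a minimal hand-rolled sketch of the four PSM steps on synthetic data, using scikit-learn for the logistic regression and the nearest-neighbor search (all variable names and data here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Synthetic data: one confounder X drives both treatment assignment and outcome
n = 5_000
X = rng.normal(0, 1, (n, 1))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))      # treatment more likely when X is large
Y = 1.0 * D + 2.0 * X[:, 0] + rng.normal(0, 1, n)    # true treatment effect is 1.0

# Step 1: estimate propensity scores with logistic regression
ps = LogisticRegression().fit(X, D).predict_proba(X)[:, 1]

# Step 2: match each treated unit to the control with the nearest propensity score
treated = np.where(D == 1)[0]
control = np.where(D == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
matched_control = control[idx[:, 0]]

# Step 3: assess balance -- confounder means should be close after matching
print(X[treated, 0].mean(), X[matched_control, 0].mean())

# Step 4: estimate the ATT as the mean outcome difference over matched pairs
att = (Y[treated] - Y[matched_control]).mean()
print(att)
```

Matching here is with replacement, so a single control can serve several treated units; a production analysis would also trim poor matches and check covariate overlap before trusting the estimate.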

Structural equation modeling

Now let’s move on to the other direction: modeling causal relationships. In this part we will use SEM.

Structural Equation Modeling (SEM) is a statistical technique used to model relationships between variables, including both direct and indirect effects. SEM can be used to test complex theoretical models and hypotheses by estimating the relationships between latent variables and their observed indicators.

The basic idea behind SEM is to represent a complex system of relationships as a set of linear equations. The equations are represented as a set of variables, which can be either latent (unobserved) or observed.

In summary, there are three types of variables:

  1. Latent variables: cannot be observed directly (such as confidence or happiness) but can be represented by other indicators.
  2. Manifest variables: can be observed directly.
  3. Error term variables: represent the degree of measurement error or other sources of unexplained variance in the manifest variables.

There are two types of models:

Measurement model

Measurement model is a statistical model that specifies the relationships between the latent variables and the observed (manifest) variables.

The purpose of the measurement model is to describe how the latent variables are measured by the manifest variables. The measurement model assumes that the manifest variables are imperfect indicators of the underlying latent constructs, and that there is some amount of measurement error associated with each manifest variable.

The measurement model includes the specification of the factor loadings, which represent the strength and direction of the relationship between each manifest variable and its corresponding latent variable. The factor loadings can be thought of as the regression coefficients that represent the amount of variance in the manifest variable that is accounted for by the corresponding latent variable.

The measurement model also includes the specification of the residual variances, which represent the amount of measurement error or unexplained variance that is associated with each manifest variable. The residual variances are typically represented as error terms in the SEM.

Structural model

The structural model is a statistical model that specifies the relationships between the latent variables, based on a set of theoretical hypotheses about the causal relationships among the variables.

The structural model describes the causal relationships between the latent variables, as well as the direct and indirect effects of one variable on another. It specifies the regression coefficients that represent the strength and direction of the relationships between the latent variables, as well as any hypothesized direct effects of one manifest variable on another.

The structural model includes the specification of the path coefficients, which represent the direct and indirect effects of one variable on another. The path coefficients can be thought of as the regression coefficients that represent the amount of variance in the dependent variable that is accounted for by the independent variable.

The relationships between the variables are represented as arrows between the nodes. The arrows indicate the direction of causality between the variables, and can be either direct or indirect.
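As an illustration, the measurement model and the structural model can be written together in the lavaan-style syntax that semopy accepts; the latent and manifest variable names below are purely hypothetical:

```python
# A hypothetical SEM specification in semopy's lavaan-style syntax
spec = '''
# Measurement model: each latent variable is measured by manifest indicators
Satisfaction =~ survey_q1 + survey_q2 + survey_q3
Loyalty      =~ repeat_purchase + referral_count

# Structural model: hypothesized causal path between the latent variables
Loyalty ~ Satisfaction
'''
```

The `=~` operator defines how a latent variable is measured, while `~` defines a regression (causal) path; error terms are added implicitly by the estimator.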

The basic steps of SEM are as follows:

  1. Develop a theoretical model: The first step is to develop a theoretical model that represents the hypothesized causal relationships between variables. This model should be based on prior knowledge and research in the field.
  2. Collect data: The next step is to collect data on the variables of interest. The data should be sufficient to estimate the parameters of the SEM model.
  3. Specify the model: The SEM model should be specified by specifying the relationships between the variables, including direct and indirect effects. The model should also include error terms to account for measurement error in the observed variables.
  4. Estimate the model: The SEM model can be estimated using maximum likelihood estimation or Bayesian methods. The goodness of fit of the model should be assessed using fit indices.
  5. Evaluate the model: Once the model is estimated, it can be evaluated by examining its goodness of fit, model parameters, and causal effects. Common fit indices include the chi-square statistic and RMSEA; the smaller they are, the better the model fits.

Here we can use another dataset from Kaggle as an example. This dataset is about wine. It contains several indicators such as ‘volatile acidity’, ‘residual sugar’, ‘pH’, and so on; the last column is ‘quality’. What we want to do is construct a model representing the causal relationships between quality and the indicators.

import semopy
import pandas as pd
import graphviz
# Load the data
data = pd.read_csv('https://raw.githubusercontent.com/rrRIOoo/data_cache/main/winequality-red.csv')
data.rename(columns={'fixed acidity':'fixed_acidity',
'volatile acidity':'volatile_acidity',
'citric acid':'citric_acid',
'residual sugar':'residual_sugar',
'free sulfur dioxide':'free_sulfur_dioxide',
'total sulfur dioxide':'total_sulfur_dioxide'},inplace=True)

# Specify the SEM model
model = '''
# Define latent variables
Latent1 =~ fixed_acidity + volatile_acidity
Latent2 =~ free_sulfur_dioxide + total_sulfur_dioxide
Latent3 =~ residual_sugar
Latent4 =~ alcohol
Latent5 =~ quality

# Define relationships

Latent5 ~ Latent1 + Latent2 + Latent3 + Latent4

'''

# Estimate the model
model_obj = semopy.Model(model)
result = model_obj.fit(data)

# Evaluate the model
print(result)

# Visualize the model
semopy.semplot(model_obj, filename='model.png')
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 21.114
Number of iterations: 92
Params: -0.096 0.160 2.319 -0.084 -0.040 -279.984 -27.315 -0.000 -0.000 0.436 0.388 -0.515 -0.141 0.029 0.007 0.000 540.776 30.980 0.038 0.316 0.021 0.023 0.309 0.002 1.639 1.833 55.285 1.408 -0.635 0.032 0.515 0.638 0.289

For the theoretical SEM model we pick some useful variables by trial and error, group variables that probably have similar effects together as latent variables (for example, the two types of acidity), and then connect the latent variables.

semopy.calc_stats(model_obj)
                      Value
DoF                       4
DoF Baseline             21
chi2             227.414391
chi2 p-value            0.0
chi2 Baseline   2052.564238
CFI                0.890028
GFI                0.889205
AGFI               0.418325
NFI                0.889205
TLI                0.422649
RMSEA              0.186955
AIC               47.715554
BIC              176.766763
LogLik             0.142223

Conclusion

In this article we talked about the characteristics of causality and two directions of causality research: testing and modeling. For testing we introduced PSM, which matches treatment and control units based on their propensity scores. For modeling we used SEM, which involves two types of models and three types of variables and represents relationships as a set of linear equations. We also used two datasets from Kaggle to make the explanation more concrete.

Causality is very important because it helps us to understand the world around us and make informed decisions about how to shape it. By identifying causal relationships, we can explain why certain events occur and how we can intervene to produce desired outcomes. Keep exploring!

Quiz

  1. List three key components of causality.
    Any three of: temporal precedence, empirical association, the absence of alternative explanations, plausible mechanism, coherence.
  2. What is a treatment?
    A treatment is a specific intervention, action, or exposure that is being studied for its causal effect on an outcome.
  3. What is the ATE?
    The ATE (average treatment effect) quantifies the causal effect of a treatment on an outcome:
    ATE = E[Y(1) - Y(0)]
  4. Why PSM?
    To create balance between the treatment and control groups in terms of observed covariates, so that the groups are comparable and any differences in the outcome variable can be attributed to the treatment.
  5. PSM steps?
    1. Estimate the propensity score
    2. Match treatment and control units
    3. Assess balance
    4. Estimate treatment effect
  6. Basic idea behind SEM?
    To represent a complex system of relationships as a set of linear equations over variables that can be either latent (unobserved) or observed.
  7. Three types of variables in SEM?
    Manifest variables, latent variables, error term variables.
  8. Two types of models in SEM?
    The measurement model and the structural model.
  9. What is a latent variable?
    A variable that cannot be observed directly but can be represented by other indicators.
  10. How to do SEM?
    1. Develop a theoretical model
    2. Collect data
    3. Specify the model
    4. Estimate the model
    5. Evaluate the model

Exercise

There is a dataset about diabetes from Kaggle for exercise. This dataset contains some body signs, such as hypertension, for each individual, and whether the individual has diabetes.

PSM

  1. Estimate the propensity score
  2. Match treatment and control units
  3. Assess balance
  4. Estimate treatment effect

import pandas as pd
from causalinference import CausalModel

data = pd.read_csv("https://raw.githubusercontent.com/rrRIOoo/data_cache/main/diabetes_prediction_dataset.csv")
data.dropna(axis=0, how='any',inplace=True)


Y = data['diabetes'].values
D = data['hypertension'].values
X = data[['age','heart_disease','bmi','blood_glucose_level']].values
model = CausalModel(Y,D,X)

model.est_via_matching()
print(model.estimates)
Treatment Effect Estimates: Matching

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      0.054      0.031      1.749      0.080     -0.006      0.114
           ATC      0.053      0.033      1.589      0.112     -0.012      0.117
           ATT      0.067      0.006     10.385      0.000      0.054      0.079

SEM

  1. Develop a theoretical model
  2. Collect data
  3. Specify the model
  4. Estimate the model
  5. Evaluate the model

import semopy
import pandas as pd

# Load the data
data = pd.read_csv('https://raw.githubusercontent.com/rrRIOoo/data_cache/main/diabetes_prediction_dataset.csv')


# Specify the SEM model
model = '''
Latent1 =~ hypertension
Latent2 =~ heart_disease
Latent3 =~ blood_glucose_level + bmi + HbA1c_level
Latent4 =~ diabetes
# Define relationships

Latent4 ~ Latent1 + Latent2 + Latent3

'''

# Estimate the model
model_obj = semopy.Model(model)
result = model_obj.fit(data)

# Evaluate the model
print(result)
print()
print()

# Visualize the model
semopy.semplot(model_obj, filename='model.png')
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully
Objective value: 0.051
Number of iterations: 379
Params: 0.081 0.026 14.057 15.214 -0.877 0.028 0.726 2.387 43.950 1638.169 0.015 1.136 0.054 0.006 0.959 0.009 0.252 15.371
semopy.calc_stats(model_obj)
                      Value
DoF                       3
DoF Baseline             15
chi2            5051.298404
chi2 p-value            0.0
chi2 Baseline   50613.22897
CFI                0.900228
GFI                0.900198
AGFI                0.50099
NFI                0.900198
TLI                0.501139
RMSEA              0.129722
AIC               35.898974
BIC              207.131632
LogLik             0.050513
