Crash Course in Causality

Ananthakrishnan Harikumar
Published in AI Skunks

Apr 27, 2023

What is Causality?

Causality is the relationship between an event (the cause) and a second event (the effect), where the second event is understood as a consequence of the first. In other words, causality refers to the principle that an event or action produces a certain response or outcome. It is the idea that one event is responsible for bringing about another event or a particular result.

Causality is an important concept in many fields, including philosophy, science, and statistics. In science, causality is often established through experiments or empirical observations that demonstrate a clear cause-and-effect relationship between variables. In philosophy, causality is studied within metaphysics, which seeks to understand its fundamental nature and its implications for reality.

Understanding causality is important in many practical contexts as well. For example, in medicine, researchers investigate the causes of diseases in order to develop effective treatments. In law, causality is a critical element in determining liability and responsibility for damages in legal disputes.

Causality in terms of Data

In data science and statistics, causality refers to the relationship between two variables where one variable (the cause) directly influences the other variable (the effect). The cause-and-effect relationship can be established through experiments, where a researcher manipulates the cause variable and observes the effect variable, or through observational studies, where the researcher observes the relationship between the cause and effect variables.

Establishing causality in data is important because it allows us to make accurate predictions and identify the factors that drive certain outcomes. However, establishing causality can be challenging because there may be confounding variables or other factors that influence the outcome.

To establish causality in data, researchers often use methods such as randomized controlled trials, natural experiments, or quasi-experimental designs that attempt to isolate the effect of the cause variable from other factors that might influence the outcome. Additionally, statistical methods such as regression analysis can be used to estimate the strength and direction of the relationship between the cause and effect variables.

Algorithms to identify Causal Relationships

There are several types of algorithms used to identify causal relationships, including:

  1. Bayesian Networks: Bayesian Networks are graphical models that represent the probabilistic relationships between variables. They can be used to identify the causal relationships between variables and to estimate the strength and direction of the causal effect.
  2. Structural Equation Modeling (SEM): SEM is a statistical method that allows researchers to estimate the relationships between variables in a causal model. It can be used to test hypotheses about causal relationships between variables, and to estimate the magnitude and direction of the causal effect.
  3. Propensity Score Matching: Propensity Score Matching is a statistical technique that is used to estimate the causal effect of a treatment or intervention. It involves matching individuals who received the treatment to individuals who did not receive the treatment based on their propensity score, which is a measure of the likelihood of receiving the treatment based on their observed characteristics.
  4. Granger Causality: Granger Causality is a statistical method that is used to test whether one time series variable helps predict another time series variable. It is based on the idea that if past values of X improve the prediction of Y beyond what Y’s own past values provide, then X is said to Granger-cause Y.

These algorithms and techniques are important in causality because they allow researchers to identify the causal relationships between variables and to estimate the causal effect of interventions or treatments. However, it is important to use caution when interpreting the results of these algorithms, as there may be confounding variables or other factors that influence the outcome.

Bayesian Networks

Suppose we want to understand the factors that influence whether a student will pass a test or not. We hypothesize that the student’s study habits, the difficulty of the test, and the student’s prior knowledge of the subject all play a role in whether the student passes the test or not.

We can represent this causal model using a Bayesian Network, which is a graphical model that represents the probabilistic relationships between variables. Here’s what the Bayesian Network might look like:

             +--------------+
             | Study Habits |
             +--------------+
               /         \
              v           v
+-----------------+   +-----------------+
| Test Difficulty |   | Prior Knowledge |
+-----------------+   +-----------------+
               \         /
                v       v
             +--------------+
             |  Pass Test?  |
             +--------------+

In this Bayesian Network, each variable is represented as a node, and the arrows between the nodes indicate the causal relationships between the variables.

We can use this Bayesian Network to make predictions about whether a student will pass the test or not. For example, if we know the student’s study habits, the difficulty of the test, and their prior knowledge of the subject, we can use the Bayesian Network to estimate the probability that the student will pass the test.

Additionally, we can use the Bayesian Network to identify the most important factors that influence whether a student passes the test or not. By examining the strength and direction of the arrows in the network, we can see which variables have the strongest causal influence on the outcome variable (i.e., whether the student passes the test or not). This information can be used to develop interventions or treatments to improve students’ performance on tests.

Let S represent the student’s study habits, T represent the difficulty of the test, K represent the student’s prior knowledge of the subject, and P represent whether the student passes the test or not. We can represent the causal relationships between the variables using conditional probability distributions. For example, the conditional probability of passing the test given the student’s study habits, the difficulty of the test, and their prior knowledge of the subject can be written as:

P(P | S, T, K)

Similarly, we can write conditional probability distributions for the other variables in the Bayesian Network:

P(T | S)

P(K | S)

P(P | T, K)

These conditional probability distributions represent the strength and direction of the causal relationships between the variables in the Bayesian Network. For example, the conditional probability distribution P(P | T, K) represents the causal effect of the difficulty of the test and the student’s prior knowledge on whether the student passes the test or not.

To make predictions using the Bayesian Network, we combine these conditional distributions. The network implies the following factorization of the joint distribution, and the posterior probability of the outcome is obtained by conditioning on the observed variables:

P(S, T, K, P) = P(S) * P(T | S) * P(K | S) * P(P | T, K)

P(P | S, T, K) = P(S, T, K, P) / P(S, T, K)

Here, P(S) represents the prior probability of the student’s study habits, P(T | S) and P(K | S) represent the conditional distributions of the test difficulty and the prior knowledge given the study habits, and P(P | T, K) represents the conditional distribution of passing the test given the test difficulty and the prior knowledge. Because Test Difficulty and Prior Knowledge are the direct parents of the outcome in this network, the posterior simplifies to P(P | S, T, K) = P(P | T, K) once T and K are observed; Bayes’ rule is needed when some of these parents are unobserved and must be summed out.

By using this equation to calculate the posterior probability of the outcome variable, we can make predictions about whether the student will pass the test or not, given their study habits, the difficulty of the test, and their prior knowledge.
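To make this concrete, here is a minimal sketch of the student example using the pgmpy library (assumed to be installed; the class names below follow recent pgmpy releases). The network structure matches the conditional distributions above, and every probability value is invented purely for illustration.

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Structure implied by the distributions above: S -> T, S -> K, and (T, K) -> P.
model = BayesianNetwork([("S", "T"), ("S", "K"), ("T", "P"), ("K", "P")])

# Illustrative probabilities (0/1 encode e.g. poor/good habits, easy/hard test,
# weak/strong prior knowledge, fail/pass).
cpd_s = TabularCPD("S", 2, [[0.4], [0.6]])
cpd_t = TabularCPD("T", 2, [[0.5, 0.7], [0.5, 0.3]],
                   evidence=["S"], evidence_card=[2])
cpd_k = TabularCPD("K", 2, [[0.7, 0.3], [0.3, 0.7]],
                   evidence=["S"], evidence_card=[2])
cpd_p = TabularCPD("P", 2,
                   [[0.4, 0.2, 0.8, 0.5],   # P(fail | T, K)
                    [0.6, 0.8, 0.2, 0.5]],  # P(pass | T, K)
                   evidence=["T", "K"], evidence_card=[2, 2])
model.add_cpds(cpd_s, cpd_t, cpd_k, cpd_p)
assert model.check_model()

# Posterior probability of passing, given good study habits and a hard test.
infer = VariableElimination(model)
print(infer.query(["P"], evidence={"S": 1, "T": 1}))

The query returns the posterior distribution over P given the evidence; adding or changing evidence (for example, also observing K) shows how each variable shifts the predicted probability of passing.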

Structural Equation Modeling (SEM)

Structural Equation Modeling (SEM) is a statistical technique used to test and estimate the relationships between multiple variables. It is a type of causal modeling that can help us understand the complex interrelationships between variables and their effects on a particular outcome.

Here is an example of how SEM can be used:

Suppose we want to understand the factors that influence a student’s academic performance. We hypothesize that a student’s performance is affected by their level of intelligence, their motivation, their study habits, and their socio-economic background.

We can use SEM to test this hypothesis and estimate the strength and direction of the causal relationships between these variables. To do this, we would start by specifying a theoretical model that represents our hypothesis about the relationships between the variables.

Here’s what the theoretical model might look like:

+--------------+  +------------+  +--------------+  +----------------+
| Intelligence |  | Motivation |  | Study Habits |  | Socio-economic |
|              |  |            |  |              |  | Background     |
+--------------+  +------------+  +--------------+  +----------------+
        \               |                |                  /
         v              v                v                 v
            +----------------------------------+
            |       Academic Performance       |
            +----------------------------------+

In this SEM model, each variable is represented as a latent variable (i.e., a variable that is not directly observed) with multiple indicators or observed variables (e.g., test scores, survey responses) that are used to measure the latent variable. The arrows between the variables represent the hypothesized causal relationships.

We can estimate the parameters of this model using SEM software. The estimated parameters represent the strength and direction of the relationships between the variables, as well as the amount of variance in each variable that is accounted for by the other variables in the model.

For example, the estimated parameter for the relationship between motivation and academic performance would indicate how much of the variance in academic performance is explained by variation in motivation, after accounting for the effects of the other variables in the model.

SEM can also be used to test hypotheses about the model’s fit to the data. For example, we can use goodness-of-fit statistics to assess whether the model fits the data well or whether there are significant deviations from the hypothesized relationships between the variables.

Overall, SEM is a powerful technique that can help us test complex hypotheses about the relationships between variables and understand the mechanisms underlying a particular outcome.

In SEM, the relationships between latent variables and observed variables are modeled using structural equations. The goal is to estimate the coefficients of these equations that best fit the observed data.

Here’s an example of how we can represent the structural equations mathematically for the SEM model we discussed earlier:

Let I represent intelligence, M represent motivation, SH represent study habits, A represent academic performance, and SE represent socio-economic background. Let X1, X2, and X3 represent observed indicators (e.g., test scores, survey responses) of I, M, and SH, respectively, and let Y1 and Y2 represent observed indicators of A and SE, respectively.

The measurement and structural equations for this model can be written as:

X1 = λ1 I + ε1

X2 = λ2 M + ε2

X3 = λ3 SH + ε3

Y1 = λ4 A + ε4

Y2 = γ1 SE + ε5

A = β1 I + β2 M + β3 SH + β4 SE + ζ

where λ1, λ2, λ3, and λ4 are the factor loadings of the measurement models for I, M, SH, and A, γ1 is the loading that relates SE to its indicator Y2, and ε1 through ε5 are the measurement error terms for the observed variables.

The last equation is the structural part of the model: β1, β2, and β3 represent the effects of intelligence, motivation, and study habits on academic performance, β4 represents the effect of socio-economic background on academic performance, and ζ is the structural disturbance term capturing the variance in academic performance that is not explained by these four variables.

The parameters of these equations can be estimated using maximum likelihood estimation or other estimation techniques. Once the parameters are estimated, we can use various fit indices to evaluate how well the model fits the data.
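As a concrete illustration, here is a minimal sketch using the semopy package (assumed available), which accepts lavaan-style model descriptions and fits them by maximum likelihood. The data are simulated, the indicator names are hypothetical, and, to keep the sketch identified, each latent variable is given two indicators rather than the single indicators used in the equations above.

import numpy as np
import pandas as pd
import semopy  # assumed available: pip install semopy

rng = np.random.default_rng(0)
n = 500

# Simulate latent factors, an observed socio-economic variable, and the
# hypothesized structural relationship (all coefficients are made up).
intel, motiv, habits, ses = rng.normal(size=(4, n))
perf = 0.4 * intel + 0.3 * motiv + 0.3 * habits + 0.2 * ses + rng.normal(scale=0.5, size=n)

# Two noisy indicators per latent variable (hypothetical measurements).
def two_indicators(factor, prefix):
    return {f"{prefix}1": factor + rng.normal(scale=0.5, size=n),
            f"{prefix}2": factor + rng.normal(scale=0.5, size=n)}

data = pd.DataFrame({**two_indicators(intel, "x_i"), **two_indicators(motiv, "x_m"),
                     **two_indicators(habits, "x_s"), **two_indicators(perf, "y"),
                     "ses": ses})

# lavaan-style description: "=~" defines measurement models, "~" the structural model.
desc = """
intel  =~ x_i1 + x_i2
motiv  =~ x_m1 + x_m2
habits =~ x_s1 + x_s2
perf   =~ y1 + y2
perf ~ intel + motiv + habits + ses
"""

model = semopy.Model(desc)
model.fit(data)
print(model.inspect())           # factor loadings and structural path coefficients
print(semopy.calc_stats(model))  # goodness-of-fit indices (CFI, RMSEA, ...)

model.inspect() reports the estimated loadings and path coefficients, and calc_stats reports goodness-of-fit indices such as CFI and RMSEA, mirroring the model-fit assessment described above.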

By estimating the coefficients of these structural equations, SEM can provide insights into the relationships between latent variables and observed variables and how they jointly influence the outcome of interest.

Propensity Score Matching

Suppose we want to estimate the effect of a new medication on a particular health outcome. We have a sample of 100 patients, 50 of whom have been prescribed the medication and 50 who have not. However, we suspect that the patients who received the medication are systematically different from those who did not, which could bias our estimate of the treatment effect.

To control for this selection bias, we can use propensity score matching (PSM) to match each patient in the treatment group to a similar patient in the control group based on their propensity scores. Here’s how we might do this:

  1. We first need to estimate the propensity scores for each patient, which are the probabilities of receiving the medication given a set of observed covariates, such as age, sex, medical history, and severity of the health condition. We can estimate the propensity scores using logistic regression, where the dependent variable is whether the patient received the medication or not, and the independent variables are the observed covariates.
  2. Next, we can use a matching algorithm, such as nearest-neighbor matching, to match each patient in the treatment group to a similar patient in the control group based on their propensity scores. The matching algorithm ensures that each patient in the treatment group is matched to a control patient with a similar propensity score.
  3. After matching, we can compare the health outcome of interest between the matched pairs of patients to estimate the treatment effect. This estimate is less biased than simply comparing the outcomes of the treatment and control groups, as it controls for the selection bias due to differences in observed covariates.

For example, suppose we find that the average health outcome of the treatment group is 2 points higher than that of the control group. However, after PSM, we find that the average health outcome of the matched pairs of patients is only 1 point higher in the treatment group than in the control group. This suggests that the true treatment effect may be smaller than what we initially estimated, and that the difference in outcomes between the treatment and control groups may be partially explained by differences in observed covariates.

Overall, PSM is a powerful tool for reducing bias in observational studies, and can help us make more accurate inferences about the effects of treatments or interventions on outcomes of interest.

Propensity Score Matching (PSM) involves estimating the probability of receiving a treatment (e.g., medication) given a set of observed covariates, known as the propensity score, and then matching individuals who received the treatment with individuals who did not receive the treatment based on similar propensity scores.

Mathematically, the propensity score can be estimated using a logistic regression model, which can be written as:

logit(P(T=1|X)) = β0 + β1X1 + β2X2 + … + βkXk

where T is a binary treatment indicator (T=1 for treated, T=0 for untreated), X1, X2, …, Xk are the observed covariates, and β0, β1, β2, …, βk are the coefficients to be estimated.

The logistic regression model estimates the log odds of receiving the treatment as a linear combination of the observed covariates. The estimated propensity score for each individual is then the predicted probability of receiving the treatment based on their observed covariates, which is obtained by applying the inverse logit function to the estimated log odds:

P(T=1|X) = exp(β0 + β1X1 + β2X2 + … + βkXk) / [1 + exp(β0 + β1X1 + β2X2 + … + βkXk)]

Once the propensity scores are estimated for each individual, we can use a matching algorithm to match individuals who received the treatment with individuals who did not receive the treatment based on similar propensity scores. The choice of matching algorithm can vary, but a common approach is nearest-neighbor matching, which matches each treated individual with the untreated individual with the closest propensity score.

After matching, we can compare the outcomes of the treated and untreated individuals who were matched on similar propensity scores to estimate the treatment effect. The treatment effect estimate can be calculated using a variety of methods, such as difference-in-means or regression models that account for the matched design.
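As a minimal sketch of this workflow, the example below uses scikit-learn (assumed available) with simulated data: the covariates, the treatment-assignment rule, and the true treatment effect of +1 are all invented for illustration.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 1000

# Simulated observational data: two covariates, biased treatment assignment,
# and an outcome with a true treatment effect of +1 (all values invented).
age = rng.normal(50, 10, n)
severity = rng.normal(0, 1, n)
p_treat = 1 / (1 + np.exp(-(-0.05 * (age - 50) + 0.8 * severity)))
treated = rng.binomial(1, p_treat)
outcome = 1.0 * treated - 0.02 * age - 0.5 * severity + rng.normal(0, 1, n)
df = pd.DataFrame({"age": age, "severity": severity,
                   "treated": treated, "outcome": outcome})

# Step 1: estimate propensity scores with logistic regression.
X = df[["age", "severity"]]
ps_model = LogisticRegression().fit(X, df["treated"])
df["ps"] = ps_model.predict_proba(X)[:, 1]

# Step 2: nearest-neighbor matching of treated patients to controls on the score.
treated_df = df[df["treated"] == 1]
control_df = df[df["treated"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control_df[["ps"]])
_, idx = nn.kneighbors(treated_df[["ps"]])
matched_controls = control_df.iloc[idx.ravel()]

# Step 3: compare outcomes within the matched sample.
naive = df.loc[df["treated"] == 1, "outcome"].mean() - df.loc[df["treated"] == 0, "outcome"].mean()
matched = treated_df["outcome"].mean() - matched_controls["outcome"].mean()
print(f"Naive difference in means:  {naive:.2f}")
print(f"Matched estimate of effect: {matched:.2f}")

Because treatment assignment depends on the covariates in this simulation, the naive difference in group means is biased, while the matched comparison should land closer to the true effect.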

Overall, PSM is a powerful statistical technique that helps to reduce selection bias in observational studies, and it relies on the estimation of the propensity score to identify similar individuals who received different treatments.

Granger Causality

Suppose we want to investigate whether the stock prices of two companies, Company A and Company B, are causally related. We have daily stock price data for both companies over a period of one year.

To test for Granger causality, we can follow these steps:

We first need to specify a model that captures the dynamics of the two time series variables. A common approach is to use a Vector Autoregressive (VAR) model, which models each variable as a function of its own past values and the past values of the other variable. For example, we could specify a VAR(2) model as follows:

A_t = α_0 + α_1 A_{t-1} + α_2 A_{t-2} + β_1 B_{t-1} + β_2 B_{t-2} + ε_{1t}
B_t = γ_0 + γ_1 A_{t-1} + γ_2 A_{t-2} + δ_1 B_{t-1} + δ_2 B_{t-2} + ε_{2t}

where A_t and B_t are the stock prices of Company A and Company B at time t, respectively, α_i, β_i, γ_i, and δ_i are the coefficients to be estimated, ε_{1t} and ε_{2t} are the error terms, and the subscripts t-1 and t-2 denote lagged values.

Next, we need to estimate the coefficients of the VAR model using the available data. We can use techniques such as Maximum Likelihood Estimation (MLE) or Bayesian methods to estimate the coefficients.

After estimating the coefficients, we can test for Granger causality between the two variables. A common approach is to use an F-test, which checks whether the lagged values of one variable add predictive information about the other variable beyond what that variable’s own lagged values already provide. For example, to test whether Company A Granger-causes Company B, we can perform the following F-test:

H_0: γ_1 = γ_2 = 0 (i.e., Company A does not Granger-cause Company B)
H_1: at least one γ_i ≠ 0 (i.e., Company A Granger-causes Company B)

If the F-test statistic exceeds the critical value at a given significance level, we reject the null hypothesis and conclude that Company A Granger-causes Company B.

For example, suppose we estimate the VAR(2) model as described above and perform the F-test for Granger causality from Company A to Company B. If the F-test statistic exceeds the critical value at the 5% significance level, we can conclude that Company A Granger-causes Company B, indicating that past values of Company A provide additional information beyond past values of Company B in predicting the future values of Company B.
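To make the stock-price example concrete, here is a minimal sketch using statsmodels (assumed available). The two series are simulated so that Company B partly reacts to the previous day’s move in Company A, and the prices are differenced before fitting so that the VAR is applied to roughly stationary data.

import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(2)
n = 250  # roughly one year of trading days

# Simulated daily prices: Company B partly follows yesterday's move in Company A.
a = np.cumsum(rng.normal(0, 1, n))
b = np.zeros(n)
for t in range(2, n):
    b[t] = b[t - 1] + 0.4 * (a[t - 1] - a[t - 2]) + rng.normal(0, 1)

# Difference the prices so the VAR is fitted to (roughly) stationary series.
data = pd.DataFrame({"A": np.diff(a), "B": np.diff(b)})

results = VAR(data).fit(2)  # VAR(2), matching the example in the text

# F-type test of H0: lagged values of A do not help predict B.
test = results.test_causality(caused="B", causing=["A"], kind="f")
print(test.summary())

test_causality reports an F-type statistic and p-value for the null hypothesis that lagged values of A add no predictive information about B beyond B’s own lags.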

Overall, Granger Causality is a powerful statistical method for investigating causal relationships between time series variables, and it relies on fitting a VAR model and testing for the significance of the coefficients to identify the direction of causality.

Suppose we have two time series variables, X and Y, and we want to test for Granger causality from X to Y.

We can represent the two time series as follows:

X = {x1, x2, x3, …, xn}

Y = {y1, y2, y3, …, yn}

where xi and yi represent the values of X and Y at time i, respectively, and n is the total number of observations.

We can model the relationship between X and Y using a Vector Autoregression (VAR) model of order p, which takes the form:

Xt = α_0 + α_1 X_{t-1} + α_2 X_{t-2} + … + α_p X_{t-p} + β_1 Y_{t-1} + β_2 Y_{t-2} + … + β_p Y_{t-p} + ε_{1t}

Yt = γ_0 + γ_1 X_{t-1} + γ_2 X_{t-2} + … + γ_p X_{t-p} + δ_1 Y_{t-1} + δ_2 Y_{t-2} + … + δ_p Y_{t-p} + ε_{2t}

where Xt and Yt are the values of X and Y at time t, respectively, α_i, β_i, γ_i, and δ_i are the coefficients to be estimated, ε_{1t} and ε_{2t} are the error terms, and the subscripts t-1, t-2, …, t-p denote the lagged values of the variables up to the order p.

We estimate the coefficients of the VAR model using techniques such as Maximum Likelihood Estimation (MLE) or Bayesian methods.

Once we have estimated the coefficients, we can test for Granger causality from X to Y using an F-test, which tests whether the lagged values of X provide additional information beyond the lagged values of Y in predicting the future values of Y.

The null hypothesis for the F-test is:

H_0: γ_1 = γ_2 = … = γ_p = 0 (i.e., X does not Granger-cause Y)

The alternative hypothesis is:

H_1: at least one γ_i ≠ 0 (i.e., X Granger-causes Y)

If the F-test statistic exceeds the critical value at a given significance level, we reject the null hypothesis and conclude that X Granger-causes Y.
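For the generic X and Y case, statsmodels also offers a convenience function, grangercausalitytests, which fits the restricted and unrestricted regressions and reports the F-test for every lag order up to maxlag. A minimal sketch with simulated data (all numbers are illustrative):

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(3)
n = 500

# Simulate X and a Y that depends on the previous value of X.
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.3 * x[t - 1] + rng.normal(scale=0.5)

# The function expects a two-column array [Y, X] and tests whether the second
# column Granger-causes the first, reporting an F-test (ssr_ftest) per lag order;
# a small p-value leads us to reject H0 that X does not Granger-cause Y.
data = pd.DataFrame({"Y": y, "X": x})
grangercausalitytests(data[["Y", "X"]], maxlag=2)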

In summary, Granger causality is a statistical method that involves estimating the coefficients of a VAR model and testing for the significance of the coefficients to determine the direction of causality between time series variables.
