Crash Course on Causality

bhargavi sikhakolli
Published in AI Skunks
15 min read · Apr 23, 2023

What is causality?

Causality is the relationship between cause and effect. In other words, it’s the idea that one thing (the cause) leads to another thing (the effect). Understanding causality is important in many fields, including medicine, economics, and machine learning.

Why is causality important in machine learning?

Causality is important in machine learning because it helps us understand the underlying relationships between variables in our data. Without understanding causality, we might make incorrect conclusions or predictions based on correlations that don’t actually represent causal relationships.

Correlation vs. causation

Correlation refers to a statistical relationship between two variables. When two variables are correlated, they tend to move together in some way. However, correlation does not necessarily imply causation. Just because two variables are correlated, it doesn’t mean that one causes the other.

Let’s start with correlation. As mentioned earlier, correlation refers to the degree of linear relationship between two variables, and the most commonly used measure of correlation is the Pearson correlation coefficient, denoted by “r”. The formula for calculating the Pearson correlation coefficient is as follows:

r = (nΣxy − ΣxΣy) / sqrt((nΣx² − (Σx)²) * (nΣy² − (Σy)²))

where:

  • n is the number of observations
  • Σxy is the sum of the products of each observation of x and y
  • Σx and Σy are the sums of x and y, respectively
  • Σx² and Σy² are the sums of the squares of x and y, respectively

The Pearson correlation coefficient ranges from -1 to +1, with a value of -1 indicating a perfect negative correlation, a value of +1 indicating a perfect positive correlation, and a value of 0 indicating no correlation.
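As a quick sketch, the snippet below evaluates the Pearson formula directly on a small made-up dataset (the x and y values are illustrative, not real measurements) and cross-checks the result against NumPy's built-in correlation:

```python
import numpy as np

# Toy data: two positively related variables (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

n = len(x)
# Pearson r computed term by term from the formula above.
r = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x**2) - np.sum(x) ** 2) * (n * np.sum(y**2) - np.sum(y) ** 2)
)

# Cross-check against NumPy's built-in correlation matrix.
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
print(round(r, 4))
```

For this near-linear toy data, r comes out very close to +1, as expected for a strong positive relationship.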

Now, let’s move on to causation. Causation is more complex and difficult to quantify mathematically, but one way to approach it is through the use of conditional probabilities. The conditional probability of an event A given an event B is denoted by P(A|B) and represents the probability of A occurring given that B has occurred.

In the context of causation, we can define the causal effect of one variable (X) on another variable (Y) as the difference between the probability of Y when we intervene to make X occur and the probability of Y when we intervene to prevent X. Mathematically, this can be expressed as follows:

P(Y|do(X)) − P(Y|do(~X))

where:

  • P(Y|do(X)) represents the probability of Y occurring if we intervene to set X to a specific value (denoted by “do(X)”), regardless of the actual value of X in the observed data
  • P(Y|do(~X)) represents the probability of Y occurring if we intervene to prevent X from occurring (denoted by “do(~X)”), regardless of the actual value of X in the observed data

This formula represents the difference in the outcome of Y when we actively intervene to change the value of X, as opposed to simply observing the relationship between X and Y in the data. This difference can be interpreted as the causal effect of X on Y.

In summary, while correlation can be quantified using the Pearson correlation coefficient formula, causation is a more complex concept that can be approached through the use of conditional probabilities and the concept of causal effect.
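To make the interventional difference concrete, the simulation below estimates P(Y|do(X)) − P(Y|do(~X)) for a toy structural model in which X genuinely causes Y. All probabilities (0.5, 0.8, 0.3) are assumed values chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def simulate(do_x=None):
    """Simulate a toy structural model; optionally force X via do()."""
    x = rng.random(n) < 0.5 if do_x is None else np.full(n, do_x, dtype=bool)
    # Y depends on X: P(Y=1|X=1)=0.8 and P(Y=1|X=0)=0.3 by construction.
    y = rng.random(n) < np.where(x, 0.8, 0.3)
    return y.mean()

# Causal effect = P(Y|do(X)) - P(Y|do(~X)), estimated by intervening.
effect = simulate(do_x=True) - simulate(do_x=False)
print(round(effect, 2))  # close to the true effect 0.8 - 0.3 = 0.5
```

Because the model has no confounding, the interventional difference here matches the difference in conditional probabilities; the next sections show what happens when it does not.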

Types of causality

There are two main types of causality: direct causality and indirect causality. Direct causality occurs when one thing directly causes another thing. Indirect causality occurs when one thing indirectly causes another thing through a chain of intermediate causes.

Causal inference

Causal inference is the process of determining whether one variable causes another variable. In machine learning, we can use causal inference techniques to identify causal relationships between variables in our data.

One approach to causal inference is through the use of graphical models, which can be represented using directed acyclic graphs (DAGs). In this context, the mathematical formulae for causal inference involve the use of conditional probabilities and the structure of the DAG.

Let’s start with the concept of a DAG. A DAG is a graphical representation of the causal relationships between variables. The nodes in the graph represent variables, and the edges represent the causal relationships between them. For example, if variable A causes variable B, there would be an arrow pointing from A to B in the graph.

The structure of the DAG can be used to determine conditional independence relationships between variables. In particular, two variables are conditionally independent given a set of variables Z if Z blocks every path between them in the DAG. A path is blocked if it contains a chain or a fork whose middle node is in Z, or a collider whose middle node is not in Z (and has no descendant in Z). This is known as the d-separation criterion.

Using the DAG and the d-separation criterion, we can derive mathematical formulas for inferring causal effects. One commonly used result, derivable with the rules of Pearl’s do-calculus, is the adjustment formula, which expresses the causal effect of an intervention on a variable of interest in terms of observational quantities:

P(Y|do(X)) = Σ_z P(Y|X, Z=z)P(Z=z)

where:

  • P(Y|do(X)) represents the probability of Y occurring if we intervene to set X to a specific value, regardless of the actual value of X in the observed data
  • P(Y|X, Z=z) represents the probability of Y occurring given the values of X and Z
  • P(Z=z) represents the probability that Z takes the value z

Here Z is a set of confounding variables that blocks every backdoor path between X and Y (and contains no descendants of X). By averaging the conditional distribution of Y over the distribution of Z, we remove the spurious association due to confounding and recover the causal effect of X on Y. Note that Z must consist of upstream variables: including mediators that lie on the causal path from X to Y would bias the estimate.
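As an illustrative sketch with made-up probabilities, the simulation below applies the adjustment formula P(Y|do(X)) = Σ_z P(Y|X, Z=z)P(Z=z) to data in which Z confounds X and Y, and compares it with the naive conditional estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Z confounds X and Y: Z -> X and Z -> Y, plus a direct X -> Y effect of 0.3.
z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.8, 0.2)
y = rng.random(n) < 0.2 + 0.3 * x + 0.3 * z

def p(a):  # empirical probability of a boolean event
    return a.mean()

# Naive (confounded) contrast: P(Y=1|X=1) - P(Y=1|X=0)
naive = p(y[x]) - p(y[~x])

# Backdoor adjustment: sum over z of [P(Y|X=1,Z=z) - P(Y|X=0,Z=z)] P(Z=z)
adjusted = sum(
    (p(y[x & (z == zv)]) - p(y[~x & (z == zv)])) * p(z == zv)
    for zv in (True, False)
)

print(round(naive, 2), round(adjusted, 2))  # adjusted ≈ true effect 0.3
```

The naive contrast overstates the effect because Z pushes X and Y in the same direction; adjusting for Z recovers the true effect of 0.3 built into the simulation.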

Another commonly used formula is the front-door adjustment formula, which can be used when the confounders of X and Y are unobserved but a mediating variable M transmits the entire effect of X on Y. The front-door adjustment formula is as follows:

P(Y|do(X=x)) = Σ_m P(M=m|X=x) Σ_x′ P(Y|X=x′, M=m)P(X=x′)

where:

  • P(Y|do(X=x)) represents the probability of Y occurring if we intervene to set X to the value x
  • P(M=m|X=x) represents the probability that the mediator M takes the value m given X=x
  • Σ_x′ P(Y|X=x′, M=m)P(X=x′) represents the effect of the mediator on Y, averaged over the observed distribution of X in order to deconfound the link from M to Y

This formula allows us to estimate the causal effect of X on Y in observational studies even when the confounders are unobserved, provided that M satisfies the front-door criterion described later in this article.

In summary, causal inference involves the use of graphical models and conditional probabilities to infer causal relationships between variables. The backdoor and front-door adjustment formulas, both derivable with the do-calculus, are commonly used mathematical tools in the field of causal inference.

Counterfactuals

A counterfactual is a hypothetical scenario that explores what would have happened if something had been different. In causal inference, we often use counterfactuals to explore what would have happened if the cause had not occurred.

Counterfactual inference is a mathematical framework used to reason about what would have happened if a particular event had not occurred. In the context of causal inference, counterfactual inference is used to estimate the causal effect of an intervention, by comparing what actually happened to what would have happened if the intervention had not been performed. The mathematics behind counterfactual inference involves the use of potential outcomes and the conditional probability framework.

Let’s start with the concept of potential outcomes. A potential outcome is the value that a variable would take under a specific treatment condition. For example, if we are interested in the effect of a new drug on a patient’s blood pressure, the potential outcome under the drug treatment would be the patient’s blood pressure if they received the drug, while the potential outcome under the control condition would be the patient’s blood pressure if they did not receive the drug.

Using the potential outcomes framework, we can define the counterfactual outcome as the value a variable would have taken if a particular event had not occurred. For example, for a patient who actually received the drug, the counterfactual outcome would be the blood pressure they would have had if they had not received it.

The mathematics behind counterfactual inference involves the use of conditional probabilities. A useful starting point is the law of total probability, which relates the overall probability of an outcome Y to its probabilities under the treatment and control conditions:

P(Y) = P(Y|X=1)P(X=1) + P(Y|X=0)P(X=0)

where:

  • P(Y) represents the overall probability of outcome Y
  • P(Y|X=1) represents the probability of outcome Y under the treatment condition
  • P(X=1) represents the probability of receiving the treatment
  • P(Y|X=0) represents the probability of outcome Y under the control condition
  • P(X=0) represents the probability of not receiving the treatment

If treatment assignment is randomized, so that there is no confounding, we can estimate the causal effect of the treatment on the outcome by comparing P(Y|X=1) to P(Y|X=0).

In the context of counterfactual inference, we can rearrange this formula to recover the control-condition probability from the overall probability and the treatment-condition probability:

P(Y|X=0) = (P(Y) − P(Y|X=1)P(X=1)) / P(X=0)

This allows us to estimate the probability of the outcome that would have been observed if the treatment had not been given, again under the assumption of no confounding.
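A small numeric check, using assumed probabilities rather than real data, confirms that the rearrangement recovers the control-condition probability exactly:

```python
# Illustrative probabilities (assumed values, not from a real study).
p_x1 = 0.4                  # P(X=1): probability of receiving treatment
p_y_given_x1 = 0.7          # P(Y=1|X=1): outcome probability under treatment
p_y_given_x0 = 0.3          # P(Y=1|X=0): outcome probability under control

# Law of total probability: P(Y) = P(Y|X=1)P(X=1) + P(Y|X=0)P(X=0)
p_y = p_y_given_x1 * p_x1 + p_y_given_x0 * (1 - p_x1)

# Rearranging recovers the control-condition probability:
recovered = (p_y - p_y_given_x1 * p_x1) / (1 - p_x1)
assert abs(recovered - p_y_given_x0) < 1e-9
print(p_y, recovered)
```

The recovered value matches P(Y|X=0) = 0.3 up to floating-point error, as the algebra requires.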

In summary, counterfactual inference is a mathematical framework used to reason about what would have happened if a particular event had not occurred. The mathematics behind counterfactual inference involves the use of potential outcomes and conditional probabilities, which allow us to estimate the causal effect of an intervention and the counterfactual outcome that would have occurred in the absence of the intervention.

Confounding variables

A confounding variable is a variable that is related to both the cause and the effect, making it difficult to determine whether the cause is actually responsible for the effect. In machine learning, we can use techniques like propensity score matching to control for confounding variables.

In statistics and causal inference, confounding refers to a situation where an extraneous variable is associated with both the dependent variable and the independent variable, and is not accounted for in the analysis. The result is a spurious association between the independent variable and the dependent variable, which may be mistaken for a causal effect. The mathematics behind confounding involves the use of regression analysis and the concept of conditional independence.

Let’s start with the concept of conditional independence. Two variables X and Y are conditionally independent given a third variable Z if, once Z is known, learning the value of Y provides no additional information about X.

Mathematically, we can express conditional independence as:

P(X|Y,Z) = P(X|Z)

This formula means that the probability of X given Y and Z is equal to the probability of X given Z, which implies that X and Y are independent given Z.
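The condition P(X|Y,Z) = P(X|Z) can be checked empirically. In the sketch below (all probabilities are assumed for the example), Z is a common cause of X and Y, so X and Y are correlated marginally but become independent once Z is held fixed:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# X <- Z -> Y: X and Y are dependent marginally, independent given Z.
z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.9, 0.1)
y = rng.random(n) < np.where(z, 0.8, 0.2)

def p_x_given(mask):
    """Empirical P(X=1 | mask)."""
    return x[mask].mean()

# Check P(X|Y,Z) ≈ P(X|Z) within each stratum of Z:
for zv in (True, False):
    lhs = p_x_given((z == zv) & y)     # P(X=1 | Y=1, Z=zv)
    rhs = p_x_given(z == zv)           # P(X=1 | Z=zv)
    print(round(lhs, 2), round(rhs, 2))  # approximately equal
```

Within each stratum of Z the two estimates agree up to sampling noise, which is exactly what conditional independence predicts.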

Using the concept of conditional independence, we can define a confounding variable as a variable that influences both the independent variable and the dependent variable, so that X and Y remain associated even in the absence of any causal effect of X on Y.

Mathematically, we can express the presence of a confounding variable as:

P(Y|X,Z) ≠ P(Y|X)

This formula means that the probability of Y given X and Z is not equal to the probability of Y given X alone: Z carries information about Y beyond what X provides, which is a symptom of the relationship between X and Y being confounded by Z.

To address confounding, we can use regression analysis to control for the confounding variable. Specifically, we can use multiple regression analysis to model the relationship between the independent variable, the dependent variable, and the confounding variable. This involves estimating the coefficients of the regression equation, which allow us to adjust for the effect of the confounding variable on the dependent variable.

Mathematically, the multiple regression equation can be expressed as:

Y = β0 + β1X + β2Z + ε

where:

  • Y is the dependent variable
  • X is the independent variable of interest
  • Z is the confounding variable
  • β0 is the intercept term
  • β1 is the coefficient of X, representing the effect of X on Y after adjusting for Z
  • β2 is the coefficient of Z, representing the effect of Z on Y
  • ε is the error term

By estimating the coefficients β1 and β2, we can control for the effect of the confounding variable Z on the relationship between X and Y.
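The regression adjustment can be sketched numerically. Below, data are generated from Y = 1.0 + 2.0·X + 3.0·Z + ε with Z confounding X (all coefficients are assumed values for the example); a simple regression of Y on X is biased, while the multiple regression recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Data generated from Y = 1.0 + 2.0*X + 3.0*Z + noise, with Z confounding X.
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)

# Simple regression of Y on X alone: slope is biased by the confounder Z.
b_naive = np.polyfit(x, y, 1)[0]

# Multiple regression Y = b0 + b1*X + b2*Z via least squares.
design = np.column_stack([np.ones(n), x, z])
b0, b1, b2 = np.linalg.lstsq(design, y, rcond=None)[0]

# b1 recovers the true effect (~2.0); the naive slope is biased upward.
print(round(b_naive, 1), round(b1, 1), round(b2, 1))
```

The naive slope absorbs part of Z’s effect on Y (because X and Z are correlated), while β1 from the multiple regression matches the true coefficient of 2.0.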

In summary, confounding is a situation where an extraneous variable is associated with both the dependent variable and the independent variable, and is not accounted for in the analysis. The mathematics behind confounding involves the use of regression analysis and the concept of conditional independence, which allows us to model the relationship between the independent variable, the dependent variable, and the confounding variable, and control for the effect of the confounding variable on the dependent variable.

Causal graphs

A causal graph is a graphical representation of the causal relationships between variables in our data. Causal graphs can help us visualize and understand the complex causal relationships in our data.

Causal graphs, also known as causal Bayesian networks, are graphical representations of causal relationships between variables. They use directed acyclic graphs (DAGs) to represent the causal relationships between variables, where arrows represent causal links. The mathematics behind causal graphs involves the use of probability theory and graph theory.

The probability theory involved in causal graphs is based on the concept of conditional probability. The probability of an event A given an event B is written as P(A|B), which represents the probability of A occurring given that B has occurred.

Conditional probability can be calculated using the formula:

P(A|B) = P(A and B) / P(B)

This formula means that the probability of A given B is equal to the probability of A and B occurring together, divided by the probability of B occurring.

In causal graphs, the conditional probability of a variable given its parents in the graph is represented by a conditional probability table (CPT). The CPT lists the probabilities of each possible value of the variable given each combination of values of its parents.

The graph theory involved in causal graphs is based on the Markov condition, which states that a variable is conditionally independent of its non-descendants in the graph given its parents.

Mathematically, this can be expressed as:

P(X | parents(X), non-descendants(X)) = P(X | parents(X))

This formula means that, once the parents of X are known, its non-descendants provide no additional information about X. In other words, X is conditionally independent of its non-descendants given its parents.

Using the Markov condition, we can use a causal graph to identify the set of variables that must be controlled for in order to estimate the causal effect of one variable on another. A valid adjustment set blocks all backdoor paths between the cause and the outcome and contains no descendants of the cause.

Mathematically, we can use the do-calculus rules to estimate the causal effect of a variable on another variable in the presence of confounding variables. The do-calculus rules allow us to calculate the causal effect of an intervention on a variable, which involves setting the value of the variable to a specific value and calculating the resulting change in the outcome variable.

In summary, the mathematics behind causal graphs involves the use of probability theory and graph theory to represent the causal relationships between variables. Probability theory is used to represent the conditional probability of a variable given its parents in the graph, while graph theory is used to identify the set of variables that need to be controlled for in order to estimate the causal effect of a variable on another variable. The do-calculus rules can be used to estimate the causal effect of an intervention on a variable in the presence of confounding variables.

Below are some more Advanced Techniques in Causation:

Interventions

An intervention is a deliberate change to one variable that is intended to cause a change in another variable. In causal inference, we can use interventions to test whether a variable is actually causing another variable.

The mathematical framework for interventions is based on the do-calculus rules, which allow us to estimate the causal effect of an intervention on a variable while controlling for confounding variables.

The do-calculus rules are based on the concept of counterfactuals, which are hypothetical outcomes that would have occurred if a variable or set of variables had taken a different value. The notation for a counterfactual is denoted as Y_x, which represents the value of Y if X had been set to x.

The do-operator is used to represent an intervention, which changes the value of a variable to a specific value. The notation for an intervention is denoted as do(X=x), which represents the value of X being set to x.

The do-calculus rules provide a set of mathematical rules for estimating the causal effect of an intervention on a variable of interest, while controlling for confounding variables. The rules are based on the concept of “backdoor paths” and “frontdoor paths” in the causal graph.

The backdoor criterion states that a set of variables Z can control for confounding if it satisfies two conditions: (1) Z blocks all backdoor paths between the intervention variable X and the outcome variable Y, and (2) Z does not contain any descendants of X.

The frontdoor criterion states that a set of variables M can be used to identify the causal effect if it satisfies three conditions: (1) M intercepts all directed paths from the intervention variable X to the outcome variable Y, (2) there are no unblocked backdoor paths from X to M, and (3) all backdoor paths from M to Y are blocked by X.

The mathematical formula for estimating the causal effect of an intervention on a variable of interest Y, while controlling for confounding variables Z, is given by:

P(Y|do(X=x), Z) = Σ_w P(Y|X=x, Z, W=w) P(W=w|Z)

where P(Y|do(X=x), Z) represents the probability of Y given the intervention do(X=x) and covariates Z, P(Y|X=x, Z, W=w) represents the probability of Y given X=x, the covariates Z, and any additional variables W that must be controlled for, and P(W=w|Z) represents the probability of W given Z.

In summary, the mathematics behind interventions involves the use of the do-calculus rules to estimate the causal effect of an intervention on a variable of interest while controlling for confounding variables. The rules involve identifying backdoor and frontdoor paths in the causal graph and using them to estimate the effect of the intervention.
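The difference between intervening and merely conditioning can be sketched as “graph surgery”: do(X=x) deletes the equation that normally generates X and forces its value. The toy model below (all probabilities assumed) shows that the two regimes give different answers when a backdoor path exists:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300_000

def observe():
    """Observational regime: Z -> X, Z -> Y, and X -> Y."""
    z = rng.random(n) < 0.5
    x = rng.random(n) < np.where(z, 0.7, 0.3)
    y = rng.random(n) < 0.1 + 0.4 * x + 0.4 * z
    return z, x, y

def intervene(xv):
    """do(X=xv): delete the Z -> X edge and force X to the value xv."""
    z = rng.random(n) < 0.5
    x = np.full(n, xv, dtype=bool)
    y = rng.random(n) < 0.1 + 0.4 * x + 0.4 * z
    return y.mean()

z, x, y = observe()
conditioned = y[x].mean()            # P(Y=1 | X=1): inflated by the backdoor path
interventional = intervene(True)     # P(Y=1 | do(X=1))
print(round(conditioned, 2), round(interventional, 2))
```

Conditioning on X=1 selects units that tend to have Z=1 as well, so P(Y|X=1) exceeds P(Y|do(X=1)); the intervention breaks that selection by cutting the Z → X edge.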

Structural equation modeling

Structural equation modeling is a statistical technique for modeling the relationships between variables in our data. It can be used for both causal inference and prediction.

In SEM, the observed variables are modeled as linear combinations of latent variables and error terms, and the relationships between variables are represented by paths or arrows. The latent variables are unobserved and are estimated using the observed variables.

The mathematical formula for a structural equation model can be represented as:

Σ = ΛΦΛ’ + Ψ

where Σ is the covariance matrix of the observed variables, Λ is the matrix of factor loadings (i.e., the coefficients that relate the latent variables to the observed variables), Φ is the matrix of factor covariances and variances, and Ψ is the matrix of unique variances (i.e., the variances of the error terms).

The SEM model can be estimated using maximum likelihood estimation, which involves finding the values of the parameters (i.e., the factor loadings, factor covariances and variances, and unique variances) that maximize the likelihood of observing the data.

SEM can also be used to test hypotheses about the relationships between variables, including the direct and indirect effects of variables on each other. The direct effect represents the effect of one variable on another, while controlling for other variables in the model. The indirect effect represents the effect of one variable on another, through one or more intervening variables.

The mathematical formula for estimating direct and indirect effects in SEM is based on the concept of path coefficients, which represent the strength and direction of the relationships between variables.

In a path model, the direct effect of variable X on variable Y is the structural path coefficient that links them:

Y = β_XY · X + (other terms in the model)

where β_XY is the path coefficient for the direct effect of X on Y, estimated while holding the other variables in the model constant.

The formula for estimating the indirect effect of variable X on variable Y through variable Z can be represented as:

β_XY.Z = β_XZ * β_ZY

where β_XZ is the path coefficient for the effect of X on Z, and β_ZY is the path coefficient for the effect of Z on Y, while controlling for X.
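As a simplified sketch (path analysis with observed variables only, not a full latent-variable SEM), the simulation below generates a mediation model X → Z → Y with assumed path coefficients 0.5 and 0.8, estimates the paths by least squares, and multiplies them to get the indirect effect:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Mediation model: X -> Z -> Y with true path coefficients 0.5 and 0.8.
x = rng.normal(size=n)
z = 0.5 * x + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)

# Path coefficient for X -> Z, by simple regression.
b_xz = np.polyfit(x, z, 1)[0]

# Path coefficient for Z -> Y, controlling for X.
design = np.column_stack([np.ones(n), z, x])
_, b_zy, _ = np.linalg.lstsq(design, y, rcond=None)[0]

indirect = b_xz * b_zy   # indirect effect of X on Y through Z
print(round(indirect, 2))  # close to 0.5 * 0.8 = 0.4
```

The product of the estimated path coefficients recovers the indirect effect of 0.4 built into the simulation, matching the β_XZ · β_ZY formula above.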

In summary, the mathematics behind SEM involves modeling the relationships between observed and latent variables using a set of linear equations, estimating the parameters of the model using maximum likelihood estimation, and testing hypotheses about the relationships between variables using path coefficients and direct and indirect effects.

Bayesian networks

A Bayesian network is a probabilistic graphical model that represents the conditional dependencies between variables in our data. Bayesian networks can be used for causal inference and prediction, and they can be learned from data using various algorithms.

Bayesian networks are probabilistic graphical models that represent the dependencies between a set of random variables using a directed acyclic graph (DAG). The nodes of the DAG represent random variables, and the edges represent probabilistic dependencies between them.

In a Bayesian network, each node is associated with a conditional probability distribution that describes the probability of the node given its parent nodes. The joint probability distribution over all the nodes in the network can be calculated using the product rule of probability:

P(X_1, X_2, …, X_n) = ∏ P(X_i | parents(X_i))

where X_1, X_2, …, X_n are the random variables represented by the nodes of the network, and parents(X_i) denotes the set of parent nodes of node X_i.

The probability of a particular configuration of the nodes in the network can also be expanded using the chain rule of probability:

P(X_1, X_2, …, X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) … P(X_n | X_1, X_2, …, X_{n-1})

The conditional probabilities in a Bayesian network can be represented using conditional probability tables (CPTs), which specify the probability of each possible value of the node given each possible combination of values of its parent nodes.

Bayesian networks can be used for a variety of tasks, including probabilistic inference, parameter estimation, and model selection. Inference involves calculating the probability of a particular configuration of the nodes given evidence (i.e., observed values of some of the nodes). This can be done with algorithms such as enumeration, variable elimination, or belief propagation, which combine the CPTs with the evidence to update the probabilities of the remaining nodes.
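A minimal sketch of inference by enumeration, using the classic three-node sprinkler network (the CPT numbers are illustrative):

```python
from itertools import product

# CPTs for the network Rain -> Sprinkler, and (Sprinkler, Rain) -> WetGrass.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: {True: 0.01, False: 0.99},   # P(S | R=True)
               False: {True: 0.4, False: 0.6}}    # P(S | R=False)
p_wet = {(True, True): 0.99, (True, False): 0.9,  # P(W=True | S, R)
         (False, True): 0.8, (False, False): 0.0}

def joint(r, s, w):
    """Product rule: P(R, S, W) = P(R) P(S|R) P(W|S, R)."""
    pw = p_wet[(s, r)]
    return p_rain[r] * p_sprinkler[r][s] * (pw if w else 1 - pw)

# Inference by enumeration: P(Rain=True | WetGrass=True).
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(round(num / den, 3))  # → 0.358
```

Each joint probability is assembled as a product of CPT entries, exactly as in the factorization formula above, and the posterior is obtained by summing out the unobserved variables.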

Parameter estimation involves estimating the parameters of the CPTs from data, using techniques such as maximum likelihood estimation or Bayesian estimation. Model selection involves comparing the fit of different Bayesian network structures to the data, using metrics such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).

In summary, the mathematics behind Bayesian networks involves representing conditional probability distributions using CPTs, calculating joint probabilities using the product rule and Bayes’ theorem, and performing inference, parameter estimation, and model selection using various algorithms and techniques.

Conclusion

Causality is an important concept in machine learning that can help us understand the underlying relationships between variables in our data. By using causal inference techniques and tools like causal graphs and Bayesian networks, we can discover causal relationships and make more accurate predictions. However, causal inference is a challenging problem, and there is still much work to be done in this area.
