Crash Course in Causality

Akhilesh Dongre, Nik Bear Brown

Published in AI Skunks · 11 min read · Apr 22, 2023

Does the wind fly the kite, or does the kite blow the wind? Image generated by DALL-E with the prompt “Beautiful scenic view of 2 boys over a cliff flying a kite, modern art vibrant colors”

Causal Inference: What Is It?

Causal inference seeks to answer questions of causation, and it has numerous applications. Each of the following questions, for example, can be addressed with causal inference.

  • Did individuals who received the therapy actually benefit from it?
  • Was it the marketing initiative or the holiday that stimulated more sales this month?
  • How much of an impact would higher salaries have on output?

Essentially, it is the study of treatments and their outcomes.

Bayesian Inference vs. Causal Inference

The distinction between Bayesian networks and causal networks was unclear to me when I first started researching this topic, so let me quickly draw it.

On the surface, causal networks and Bayesian networks look similar; they diverge in their interpretation. Consider the following example.

Suppose a researcher is interested in studying the relationship between exercise and heart disease. They collect data on a group of individuals and record their exercise habits and whether or not they have heart disease.

Using Bayesian inference, the researcher might develop a model to estimate the probability of an individual having heart disease given their exercise habits, age, sex, and other factors. They might use prior information on the prevalence of heart disease and the distribution of exercise habits in the population to inform their model.

Using causal inference, the researcher might instead be interested in estimating the causal effect of exercise on heart disease. They might design a study in which some individuals are randomly assigned to an exercise intervention while others are not, and then compare the incidence of heart disease between the two groups. This would allow them to estimate the causal effect of exercise on heart disease while controlling for other factors that could influence the relationship, such as diet, smoking, or genetics.

In this example, Bayesian inference is focused on making predictions based on observed data, while causal inference is focused on identifying the causal effect of an intervention or treatment.
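This contrast can be made concrete with a small simulation (all numbers here are hypothetical): age acts as a confounder that drives both exercise and heart disease, so the observed difference in disease rates (the predictive quantity) exaggerates the effect that randomizing exercise (the causal quantity) would reveal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# hypothetical data-generating process: age is a common cause of
# both exercise and heart disease
age = rng.uniform(20, 80, n)
exercise = (rng.random(n) < (80 - age) / 80).astype(int)  # younger people exercise more
risk = 0.10 + 0.004 * (age - 20) - 0.08 * exercise        # true causal effect: -0.08
disease = (rng.random(n) < risk).astype(int)

# Bayesian / predictive question: how do observed disease rates differ by exercise?
p_obs = disease[exercise == 1].mean() - disease[exercise == 0].mean()

# Causal question: what if exercise were assigned at random (an idealized trial)?
exercise_rct = rng.integers(0, 2, n)
risk_rct = 0.10 + 0.004 * (age - 20) - 0.08 * exercise_rct
disease_rct = (rng.random(n) < risk_rct).astype(int)
p_causal = disease_rct[exercise_rct == 1].mean() - disease_rct[exercise_rct == 0].mean()

print(f"Observed difference:   {p_obs:.3f}")    # exaggerated by the age confounder
print(f"Randomized difference: {p_causal:.3f}") # close to the true -0.08
```

Because exercisers skew young (and therefore low-risk), the observed gap overstates the benefit; randomization breaks the age-to-exercise link and recovers the effect built into the simulation.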

Treatment and Outcome

To estimate the causal effect of exercise on heart disease, we need to compare what happens when people receive the intervention (exercise) to what would have happened if they had not received the intervention. This comparison involves the concept of counterfactuals.

A counterfactual is a hypothetical scenario that describes what would have happened if something had been different. In the case of our example, the counterfactual is what would have happened if individuals who exercised had not exercised, or if individuals who did not exercise had exercised.

To estimate the causal effect of exercise on heart disease, we need to compare the actual outcome (heart disease) among those who received the intervention (exercise) to the counterfactual outcome among those who did not receive the intervention (no exercise). This comparison involves estimating the difference between the two groups, which is the causal effect of exercise on heart disease.

In contrast, Bayesian inference is focused on using prior knowledge and data to make statistical predictions. In the example I gave, Bayesian inference might be used to predict the probability of an individual having heart disease given their exercise habits and other factors. This prediction does not necessarily involve the concept of treatment or counterfactuals, as it is based on observed data and does not involve comparing what would have happened if the individual had or had not received a particular intervention.
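The potential-outcomes framing can be shown with a toy table (all values hypothetical): each person carries two potential outcomes, one per treatment level, but the data only ever reveal the one matching the treatment actually received.

```python
import pandas as pd

df = pd.DataFrame({
    'person':           ['Ann', 'Bob', 'Cara', 'Dan'],
    'exercised':        [1, 0, 1, 0],   # treatment actually received
    'Y_if_exercise':    [0, 0, 0, 1],   # heart disease if they exercise
    'Y_if_no_exercise': [1, 0, 1, 1],   # heart disease if they do not
})

# individual causal effect: the difference between the two potential outcomes
df['effect'] = df['Y_if_exercise'] - df['Y_if_no_exercise']

# the observed outcome is the potential outcome matching the actual treatment;
# the other one is the (unobservable) counterfactual
df['observed'] = df['Y_if_exercise'].where(df['exercised'] == 1, df['Y_if_no_exercise'])

print(df)
print("Average treatment effect:", df['effect'].mean())  # -0.5 in this toy table
```

The average of the `effect` column is the average treatment effect; in practice it must be estimated, since only the `observed` column is ever available.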

Now, we will walk through some basic concepts of causal inference with Python code.

Demonstration of confounding in causal inference

Suppose we are interested in investigating the effect of a binary treatment variable (T) on an outcome variable (Y), and we suspect that a continuous covariate (X) may be a confounder of this relationship. We can use the following code to simulate a dataset with these variables:

In this code, we first set the random seed so that the results are reproducible. We then simulate a small sample (n = 10) of the continuous covariate (X) from a standard normal distribution, a binary treatment variable (T) with success probability 0.5, and an outcome variable (Y) that depends linearly on T and X plus standard normal noise. Finally, we collect the simulated data in a pandas dataframe.

To investigate the potential confounding effect of X, we can first calculate the unadjusted treatment effect as the difference in means between the treatment and control groups:

This code subsets the dataframe by treatment group and calculates the mean outcome variable for each group, then subtracts the mean for the control group from the mean for the treatment group to obtain the unadjusted treatment effect.

import numpy as np
import pandas as pd

np.random.seed(1234)

# Simulate data
n = 10
X = np.random.normal(loc=0, scale=1, size=n)
T = np.random.binomial(n=1, p=0.5, size=n)
Y = 2*T + 0.5*X + np.random.normal(loc=0, scale=1, size=n)

# Create dataframe
df = pd.DataFrame({'X': X, 'T': T, 'Y': Y})

# Unadjusted treatment effect
te_unadj = df.loc[df['T'] == 1, 'Y'].mean() - df.loc[df['T'] == 0, 'Y'].mean()

print('Unadjusted treatment effect: ', round(te_unadj, 2))

Next, we can adjust for the potential confounding effect of X by fitting a linear regression model with T and X as predictor variables:

# Adjusted treatment effect
import statsmodels.api as sm

model = sm.OLS.from_formula('Y ~ T + X', data=df)
results = model.fit()

te_adj = results.params['T']

print('Adjusted treatment effect: ', round(te_adj, 2))

Unadjusted treatment effect: 1.63

Adjusted treatment effect: 2.22

We fit a linear regression model with Y as the response variable, and T and X as predictor variables. We then extract the coefficient for T from the model results to obtain the adjusted treatment effect.

Comparing the unadjusted and adjusted treatment effects shows the potential impact of confounding: if the two differ substantially, the covariate X needs to be accounted for in the causal inference. A caveat about this particular simulation: T was drawn independently of X, so X influences only the outcome, and with n = 10 the gap between 1.63 and 2.22 mostly reflects chance imbalance of X across the two small groups rather than structural confounding.
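As a sketch of what systematic confounding looks like, the simulation can be modified so that X also drives treatment assignment: the unadjusted estimate is then biased upward by a predictable amount, not just by chance. To keep the snippet self-contained, the adjustment here uses numpy's least squares in place of statsmodels.

```python
import numpy as np

np.random.seed(1234)
n = 5000

X = np.random.normal(size=n)
p_treat = 1 / (1 + np.exp(-X))     # treatment is more likely when X is high
T = np.random.binomial(1, p_treat)
Y = 2*T + 0.5*X + np.random.normal(size=n)   # true treatment effect is 2

# unadjusted: raw difference in mean outcomes
te_unadj = Y[T == 1].mean() - Y[T == 0].mean()

# adjusted: regress Y on an intercept, T, and X via ordinary least squares
A = np.column_stack([np.ones(n), T, X])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
te_adj = beta[1]

print('Unadjusted:', round(te_unadj, 2))  # noticeably above 2
print('Adjusted:  ', round(te_adj, 2))    # close to 2
```

Because treated units now tend to have high X (which also raises Y), the unadjusted difference absorbs part of X's effect; regressing on X removes that bias.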

Demonstration of Stratification in causal inference

Suppose we are interested in investigating the effect of a binary treatment variable (T) on an outcome variable (Y), and we suspect that a categorical covariate (Z) may be a confounder of this relationship. We can use the following code to simulate a dataset with these variables:

The causal relationship we are exploring between Z, T, and Y

DAG for explaining Stratification

Again, we first set the random seed so that the results are reproducible. We then simulate 120 values of the categorical covariate (Z) by sampling uniformly from the set [‘A’, ‘B’, ‘C’], a binary treatment variable (T) with success probability 0.5, and an outcome variable (Y) that depends linearly on T and on an indicator for Z == ‘B’, plus standard normal noise. Finally, we collect the simulated data in a pandas dataframe.

To investigate the potential confounding effect of Z, we can first calculate the unadjusted treatment effect separately for each level of Z:

np.random.seed(283)

# Simulate data
n = 120
Z = np.random.choice(['A', 'B', 'C'], size=n)
T = np.random.binomial(n=1, p=0.5, size=n)
Y = 2*T + 0.5*(Z == 'B') + np.random.normal(loc=0, scale=1, size=n)

# Create dataframe
df = pd.DataFrame({'Z': Z, 'T': T, 'Y': Y})

# Unadjusted treatment effect by strata
strata = df.groupby('Z')

te_unadj = strata.apply(lambda x: x.loc[x['T'] == 1, 'Y'].mean() - x.loc[x['T'] == 0, 'Y'].mean())

print('Unadjusted treatment effect by strata: \n', te_unadj)

Result:

Unadjusted treatment effect by strata:
Z
A    1.584659
B    1.800243
C    1.946186

This code groups the dataframe by the categorical covariate Z and applies a function to each group to calculate the unadjusted treatment effect separately for each stratum.

Next, we can adjust for the potential confounding effect of Z by stratifying the data by Z and fitting a linear regression model with T as a predictor variable in each stratum:

# Adjusted treatment effect by strata
te_adj = []

for z in ['A', 'B', 'C']:
    model = sm.OLS.from_formula('Y ~ T', data=df.loc[df['Z'] == z])
    results = model.fit()
    te_adj.append(results.params['T'])

te_adj = pd.Series(te_adj, index=['A', 'B', 'C'])

print('Adjusted treatment effect by strata: \n', te_adj)

Result:

Adjusted treatment effect by strata:
A    1.584659
B    1.800243
C    1.946186

This code loops over each level of the categorical covariate Z, fits a linear regression with Y as the response and T as the predictor within that stratum, and extracts the coefficient on T as the stratum’s adjusted treatment effect.

Note that the per-stratum values match the unadjusted stratified estimates exactly. This is expected: with a single binary predictor, the OLS coefficient on T is precisely the difference in group means, so within a stratum there is nothing left to adjust — the stratification itself does the adjusting. Confounding by Z would instead show up as a gap between these within-stratum estimates and the pooled estimate computed on all the data at once.

Unadjusted values refer to the raw or observed association between the treatment and the outcome, without accounting for any potential confounding variables that may be affecting the relationship between the two. In other words, unadjusted values represent the “naive” effect of the treatment on the outcome.

Adjusted values, on the other hand, refer to the effect of the treatment on the outcome after accounting for potential confounding variables. Adjusted values are obtained by using statistical methods, such as regression modeling, to adjust for the effects of confounding variables that may be associated with both the treatment and the outcome.
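The stratum-specific effects above can also be pooled into one overall number. A minimal sketch, re-simulating the same data and weighting each stratum’s effect by its share of the sample (so-called standardization):

```python
import numpy as np
import pandas as pd

np.random.seed(283)
n = 120
Z = np.random.choice(['A', 'B', 'C'], size=n)
T = np.random.binomial(n=1, p=0.5, size=n)
Y = 2*T + 0.5*(Z == 'B') + np.random.normal(loc=0, scale=1, size=n)
df = pd.DataFrame({'Z': Z, 'T': T, 'Y': Y})

# effect within each stratum (difference in means, as before)
per_stratum = df.groupby('Z').apply(
    lambda g: g.loc[g['T'] == 1, 'Y'].mean() - g.loc[g['T'] == 0, 'Y'].mean()
)

# weight each stratum's effect by its share of the sample (standardization)
weights = df['Z'].value_counts(normalize=True).sort_index()
ate = (per_stratum * weights).sum()
print('Stratified (standardized) ATE:', round(ate, 3))
```

Weighting by stratum size standardizes the effect to the sample’s covariate distribution; other weighting schemes (e.g. precision weighting) answer slightly different questions.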

Let’s explore a sample dataset to observe causality

The dataset consists of web-analytics data capturing the user click rate: the tag name, the visibility of the tag, the number of clicks, and the page version. Let’s see how causal inference can be applied to this dataset in Python using Pyro.

The DAG we are considering for the causal inference:

import pyro
import pyro.distributions as dist
from pyro.infer import Importance, EmpiricalMarginal
import torch
import numpy as np
import matplotlib.pyplot as plt

# V_prob, T_prob, S_prob, C_prob are the CPT tensors obtained from the
# bnlearn fit; they are assumed to be defined before this point
def model():
    V = pyro.sample("V", dist.Categorical(probs=V_prob))
    T = pyro.sample("T", dist.Categorical(probs=T_prob[V]))
    S = pyro.sample("S", dist.Categorical(probs=S_prob[V]))
    C = pyro.sample("C", dist.Categorical(probs=C_prob[T][S]))
    return {'V': V, 'S': S, 'T': T, 'C': C}

# Aliases for the category indices (the CPTs themselves come from the bnlearn fit)
V_alias = ['v1','v2','v3', 'v4', 'v5']
T_alias = ['a','area', 'button', 'center', 'div', 'font', 'form', 'img', 'input', 'li', 'object', 'p', 'span', 'strong', 'ul']
S_alias = ['False','True']
C_alias = ['HIGH','LOW', 'MEDHIGH', 'MEDIUM', 'MEDLOW']

Query 1: Find the probability of each version, given evidence. This is an interesting query, as we reason against the direction of the DAG.

Evidence: Click through is MEDHIGH and visibility of tag is TRUE

conditioned_model_1 = pyro.condition(model, data={'C':torch.tensor(2), 'S': torch.tensor(1)})

V_posterior = Importance(conditioned_model_1, num_samples=5000).run()
V_marginal = EmpiricalMarginal(V_posterior,"V")
V_samples = [V_marginal().item() for _ in range(5000)]
V_unique, V_counts = np.unique(V_samples, return_counts=True)

plt.bar(V_unique, V_counts/5000, align='center', alpha=0.5)
plt.xticks(V_unique, V_alias)
plt.ylabel('Posterior Probability')
plt.xlabel('Versions')
plt.title('P(V | C = MEDHIGH, S= Visible) - Importance Sampling')
# Now the same query with an intervention (do-operator) instead of conditioning
intervention_condition = pyro.do(model, data={'C': torch.tensor(2), 'S': torch.tensor(1)})

V_posterior = Importance(intervention_condition, num_samples=5000).run()
V_marginal = EmpiricalMarginal(V_posterior,"V")
V_samples = [V_marginal().item() for _ in range(5000)]
V_unique, V_counts = np.unique(V_samples, return_counts=True)

plt.bar(V_unique, V_counts/5000, align='center', alpha=0.5)
plt.xticks(V_unique, V_alias)
plt.ylabel('Posterior Probability')
plt.xlabel('Versions')
plt.title('P(V | do(C = MEDHIGH, S = Visible)) - Importance Sampling')

Change in visibility brings change in version

Query 2: To find the probability of tag names given evidence about click rate.

Evidence: Click through is HIGH

conditioned_model_2 = pyro.condition(model, data={'C':torch.tensor(0)})

T_posterior = Importance(conditioned_model_2, num_samples=5000).run()
T_marginal = EmpiricalMarginal(T_posterior,"T")
T_samples = [T_marginal().item() for _ in range(5000)]
T_unique, T_counts = np.unique(T_samples, return_counts=True)
plt.figure(figsize=(15,10))
plt.bar(T_unique, T_counts/5000, align='center', alpha=0.5)
plt.xticks(T_unique, T_alias)
plt.ylabel('Posterior Probability')
plt.xlabel('Tag-Names')
plt.title('P(T | C = HIGH) - Importance Sampling')

An interesting observation: in the data, the marginal probability of the area tag occurring is very low, around 0.001. But given the evidence that the click rate is HIGH, the probability of the area tag increases roughly tenfold.

intervention_model_visible = pyro.do(model, data={"S": torch.tensor(1)})
# use do() here as well: we want to compare the two interventional distributions
intervention_model_visible_no = pyro.do(model, data={"S": torch.tensor(0)})


C_posterior = Importance(intervention_model_visible, num_samples=5000).run()
C_marginal = EmpiricalMarginal(C_posterior,"C")
C_samples = [C_marginal().item() for _ in range(5000)]
C_unique, C_counts = np.unique(C_samples, return_counts=True)
plt.bar(C_unique, C_counts/5000, align='center', alpha=0.5)
plt.xticks(C_unique, C_alias)
plt.ylabel('Posterior Probability')
plt.xlabel('Click Rate')
plt.title('P(C | do(Visible=True)) - Importance Sampling')

Causal Effect Query

Does setting visibility to TRUE for all elements have any effect?

def causal_effect(val):
    # proportion of samples with C == val under each intervention
    c_samples_visible = [
        1 if intervention_model_visible()['C'] == val else 0
        for _ in range(5000)
    ]
    c_samples_not_visible = [
        1 if intervention_model_visible_no()['C'] == val else 0
        for _ in range(5000)
    ]
    return np.mean(c_samples_visible) - np.mean(c_samples_not_visible)

for lvl in C_alias:
    diff = causal_effect(C_alias.index(lvl))
    print(f"E(Click = {lvl} | do(Visible = True)) - E(Click = {lvl} | do(Visible = False)) is {diff}")

E(Click = HIGH | do(Visible = True)) - E(Click = HIGH | do(Visible = False)) is -0.01179

E(Click = LOW | do(Visible = True)) - E(Click = LOW | do(Visible = False)) is -0.129

E(Click = MEDHIGH | do(Visible = True)) - E(Click = MEDHIGH | do(Visible = False)) is -0.012

E(Click = MEDIUM | do(Visible = True)) - E(Click = MEDIUM | do(Visible = False)) is -0.00639

E(Click = MEDLOW | do(Visible = True)) - E(Click = MEDLOW | do(Visible = False)) is 0.1582

As the output shows, making all tags visible raises the probability of MEDLOW while drastically decreasing the probability of LOW. Concretely, the intervention shifts roughly 15 percentage points of probability mass from the 0–10 click range (LOW) up to the 10–100 range (MEDLOW).
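Because the article’s CPTs come from the fitted model and are not listed here, a self-contained check has to use hypothetical numbers: on a small discrete network with the same shape (V -> T, V -> S, and T, S -> C), the interventional distribution P(C | do(S = s)) can be computed exactly by cutting the V -> S edge and summing over the CPTs, which is what the importance-sampling estimates above approximate.

```python
import numpy as np

# hypothetical CPTs for a miniature version of the network
# (2 versions, 2 tags, 2 click levels); only the DAG shape matches the article
V_prob = np.array([0.6, 0.4])                   # P(V)
T_prob = np.array([[0.7, 0.3], [0.2, 0.8]])     # P(T | V), rows indexed by V
C_prob = np.array([[[0.9, 0.1], [0.5, 0.5]],    # P(C | T, S), axes [T][S][C]
                   [[0.6, 0.4], [0.1, 0.9]]])

def p_C_do_S(s):
    """P(C | do(S=s)): fix S=s (the V->S CPT becomes irrelevant), sum out V and T."""
    p = np.zeros(2)
    for v in range(2):
        for t in range(2):
            p += V_prob[v] * T_prob[v, t] * C_prob[t, s]
    return p

effect = p_C_do_S(1) - p_C_do_S(0)
print("P(C | do(S=1)) - P(C | do(S=0)) =", effect)
```

The two differences always sum to zero, since each interventional distribution sums to one; with more samples, the importance-sampling estimates converge to exactly this kind of quantity.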

Conclusion

Through the click-through-rate use case above, we observed causality with respect to tag visibility: intervening on visibility shifts the distribution of the click-through rate across its HIGH, MEDIUM, and LOW levels.

Link to github notebook


License

All code in this notebook is available as open source through the MIT license.

All text and images are free to use under the Creative Commons Attribution 3.0 license. https://creativecommons.org/licenses/by/3.0/us/

These licenses let people distribute, remix, tweak, and build upon the work, even commercially, as long as they give credit for the original creation.

Copyright 2023 AI Skunks https://github.com/aiskunks

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
