Crash Course in Causality

Published in AI Skunks

Apr 28, 2023

Causality is the relationship between an event (the cause) and a second event (the effect), where the second event is understood as a consequence of the first. Causality is a fundamental concept in many areas of science, including data science, and is essential for making decisions based on data. In this crash course, we will discuss the basics of causality in data science.

1. Correlation vs. causation

One of the biggest challenges in causality is distinguishing between correlation and causation. Correlation is a statistical measure of how strongly two variables are related. For example, ice cream sales and temperature are correlated, but this does not mean that ice cream sales cause temperature changes; if anything, the causal arrow runs the other way. Correlation does not necessarily imply causation.
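The ice cream example can be sketched as a small simulation. This is a minimal illustration under made-up assumptions: the coefficients, the sample size, and the idea of "setting" sales by hand are all hypothetical and chosen only to make the point visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process: temperature drives ice cream sales.
temperature = rng.normal(25, 5, size=n)
sales = 10 * temperature + rng.normal(0, 20, size=n)

# Observational data: the two variables are strongly correlated.
corr_observed = np.corrcoef(sales, temperature)[0, 1]

# Intervention: we set sales ourselves, ignoring the weather.
# Temperature was generated without reference to sales, so it cannot respond.
sales_intervened = rng.uniform(sales.min(), sales.max(), size=n)
corr_intervened = np.corrcoef(sales_intervened, temperature)[0, 1]

print(f"correlation in observational data: {corr_observed:.2f}")
print(f"correlation after intervening on sales: {corr_intervened:.2f}")
```

The observed correlation is strong, yet it disappears once sales are set independently of temperature, which is what we would expect if sales have no causal effect on temperature.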

2. Counterfactuals

In order to establish causality, it is necessary to consider counterfactuals. A counterfactual is a hypothetical scenario in which the cause did not occur, but everything else remained the same. In other words, a counterfactual is a “what if” scenario that allows us to determine the causal effect of an intervention.
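One way to make the "what if" idea concrete is to simulate both versions of the world for every unit. The sketch below assumes a hypothetical constant effect of 2; the names `y0` and `y1` (outcome without and with the cause) are illustrative and not from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical potential outcomes for every unit:
# y0 = outcome without the cause, y1 = outcome with it.
y0 = rng.normal(0, 1, size=n)
y1 = y0 + 2.0  # assumed constant effect of 2 for every unit

# In a simulation we can see both worlds, so the true average effect is known.
true_ate = np.mean(y1 - y0)

# In real data each unit reveals only one of its two potential outcomes.
treated = rng.binomial(1, 0.5, size=n).astype(bool)
y_observed = np.where(treated, y1, y0)

# With random assignment, the difference in group means estimates the average effect.
estimated_ate = y_observed[treated].mean() - y_observed[~treated].mean()

print(f"true average effect: {true_ate:.2f}")
print(f"estimate from observed outcomes: {estimated_ate:.2f}")
```

The individual counterfactual is never observed in practice; what we can hope to recover from data is an average over units.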

3. Randomized controlled trials

Randomized controlled trials (RCTs) are the gold standard for establishing causality. In an RCT, participants are randomly assigned to either a treatment group or a control group. The treatment group receives the intervention, while the control group does not. Because assignment is random, the two groups are comparable on average in both measured and unmeasured characteristics, so comparing their outcomes gives an estimate of the causal effect of the intervention.
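A quick way to see why randomization helps is to check that a baseline characteristic ends up balanced across the two arms. The sketch below uses a hypothetical `age` covariate and an arbitrary sample size; neither comes from a real study.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

# Hypothetical baseline characteristic measured before assignment.
age = rng.normal(50, 10, size=n)

# Random assignment: a coin flip that ignores every participant characteristic.
treated = rng.binomial(1, 0.5, size=n).astype(bool)

# Standardized mean difference of the baseline covariate between the two arms.
smd = (age[treated].mean() - age[~treated].mean()) / age.std(ddof=1)

print(f"standardized mean difference in age: {smd:.3f}")
```

A standardized mean difference close to zero suggests the arms are similar on that covariate. Randomization gives the same protection, on average, for characteristics that were never measured.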

4. Causal inference

In cases where RCTs are not possible or ethical, causal inference methods can be used to estimate causal effects. Causal inference methods aim to replicate the counterfactual scenario using observational data. These methods include propensity score matching, instrumental variable analysis, and regression discontinuity design, among others.
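As a rough sketch of how such methods work, the example below uses inverse probability weighting, a close relative of propensity score matching, on simulated observational data. Everything here is an assumption made for illustration: a single binary confounder `z`, treatment probabilities of 0.8 and 0.2, and a true effect of 2.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical observational data: a binary confounder z raises both the
# chance of treatment and the outcome itself.
z = rng.binomial(1, 0.5, size=n)
p_treat = np.where(z == 1, 0.8, 0.2)
t = rng.binomial(1, p_treat)
y = 2.0 * t + 3.0 * z + rng.normal(0, 1, size=n)  # assumed true effect: 2

# Naive comparison mixes the treatment effect with the effect of z.
naive = y[t == 1].mean() - y[t == 0].mean()

# Estimate the propensity score as the treated share within each level of z.
propensity = np.where(z == 1, t[z == 1].mean(), t[z == 0].mean())

# Weight each unit by the inverse probability of the treatment it received.
w = np.where(t == 1, 1 / propensity, 1 / (1 - propensity))
ipw = (np.average(y[t == 1], weights=w[t == 1])
       - np.average(y[t == 0], weights=w[t == 0]))

print(f"naive difference in means: {naive:.2f}")
print(f"inverse-probability-weighted estimate: {ipw:.2f}")
```

The weighting only works here because the confounder is measured. With an unmeasured confounder, this approach would remain biased, which is where designs such as instrumental variables come in.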

5. Causal diagrams

Causal diagrams are graphical representations of causal relationships between variables. Confounding variables are variables that influence both the cause and the effect and can therefore distort the estimated causal effect. A causal diagram makes these variables visible, so that they can be identified and adjusted for in the analysis.
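A causal diagram can be stored as a simple mapping from each cause to its direct effects. The sketch below uses a hypothetical diagram (the variable names are invented) and a simplified rule: a candidate confounder is any variable that leads to the treatment and also leads to the outcome by a path that does not pass through the treatment. This is a heuristic for illustration, not a full implementation of the back-door criterion.

```python
# Hypothetical causal diagram, stored as cause -> list of direct effects.
dag = {
    "age": ["exercise", "blood_pressure"],
    "gym_discount": ["exercise"],
    "exercise": ["weight", "blood_pressure"],
    "weight": ["blood_pressure"],
}


def ancestors(dag, node, skip=None):
    """Variables with a directed path to `node` that does not pass through `skip`."""
    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent, children in dag.items():
            if current in children and parent != skip and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def common_causes(dag, treatment, outcome):
    """Candidate confounders: causes of the treatment that also reach the
    outcome without going through the treatment."""
    causes_of_treatment = ancestors(dag, treatment)
    causes_of_outcome = ancestors(dag, outcome, skip=treatment)
    return causes_of_treatment & causes_of_outcome


print(sorted(common_causes(dag, "exercise", "blood_pressure")))
```

In this diagram only `age` is flagged. `gym_discount` affects the outcome solely through the treatment, and `weight` lies on the path from treatment to outcome, so neither is a confounder under this rule.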

In conclusion, causality is a critical concept in data science that is necessary for making decisions based on data. Understanding the difference between correlation and causation, considering counterfactuals, using RCTs when possible, using causal inference methods when RCTs are not possible, and creating causal diagrams are all important steps in establishing causality in data science.

Now, let’s consider an example of causality. In this example, we will use data from a randomized controlled trial (RCT) to estimate the causal effect of a treatment on a continuous outcome variable.

First, let’s create some simulated data to work with. We will simulate data for a study with 1000 participants, randomly assigned to either a treatment or a control group.

import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(123)

# Simulate treatment assignment
treatment = np.random.binomial(1, 0.5, size=1000)

# Simulate baseline covariate X
X = np.random.normal(0, 1, size=1000)

# Simulate outcome Y
Y = 2 * treatment + 3 * X + np.random.normal(0, 1, size=1000)

# Combine data into a pandas DataFrame
df = pd.DataFrame({'treatment': treatment, 'X': X, 'Y': Y})

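In this simulated data, the treatment assignment is a binary variable (0 or 1), and the outcome variable Y is a continuous variable. The covariate X is a continuous baseline variable. Because treatment is assigned at random and independently of X, X is not a confounder here; we include it in the analysis because it explains much of the variation in Y, which makes the estimate of the treatment effect more precise.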

Next, let’s visualize the relationship between treatment and the outcome variable Y using a scatterplot.

import seaborn as sns

sns.scatterplot(x='treatment', y='Y', data=df)

This scatterplot suggests that outcomes are higher on average in the treatment group than in the control group, but the two groups overlap heavily because much of the variation in Y comes from the covariate X and random noise. The plot alone does not tell us how large the treatment effect is.

To estimate the causal effect of the treatment on the outcome variable Y, we will use a linear regression model that includes the treatment assignment variable and the covariate X as predictor variables. We will also include an intercept term in the model.

import statsmodels.api as sm

# Fit linear regression model
model = sm.OLS(df['Y'], sm.add_constant(df[['treatment', 'X']])).fit()

# Print model summary
print(model.summary())

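The output of the model summary shows that the treatment assignment variable has a coefficient of 1.98, close to the true effect of 2 used in the simulation, and a p-value of less than 0.05. Because treatment was randomly assigned, this coefficient can be interpreted as an estimate of the causal effect of the treatment on the outcome variable Y, and it is statistically significant after adjusting for the covariate X.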
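To see what the covariate adjustment contributes in a randomized design, the sketch below repeats the same kind of simulation many times and compares a plain difference in means with the regression coefficient. It uses NumPy's least-squares solver in place of statsmodels to stay dependency-light; the number of repetitions (300) and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)


def simulate_once(n=1000):
    """Draw one trial with the same structure as the simulated data above."""
    treatment = rng.binomial(1, 0.5, size=n)
    x = rng.normal(0, 1, size=n)
    y = 2 * treatment + 3 * x + rng.normal(0, 1, size=n)

    # Unadjusted estimate: difference in mean outcome between the two arms.
    unadjusted = y[treatment == 1].mean() - y[treatment == 0].mean()

    # Adjusted estimate: least-squares coefficient on treatment, controlling for x.
    design = np.column_stack([np.ones(n), treatment, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return unadjusted, coef[1]


estimates = np.array([simulate_once() for _ in range(300)])
mean_unadjusted, mean_adjusted = estimates.mean(axis=0)
sd_unadjusted, sd_adjusted = estimates.std(axis=0, ddof=1)

print(f"unadjusted: mean {mean_unadjusted:.2f}, spread {sd_unadjusted:.3f}")
print(f"adjusted:   mean {mean_adjusted:.2f}, spread {sd_adjusted:.3f}")
```

Both estimators center on the true effect, since randomization already removes confounding. The adjusted estimate varies less from trial to trial, which is the practical benefit of including X.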

Questions

1. What is causality in data science?

* Answer: Causality refers to the relationship between a cause and its effect, where the cause precedes the effect and there is a mechanism linking the cause and effect.

2. What is the difference between correlation and causation?

* Answer: Correlation refers to a statistical relationship between two variables, whereas causation means that a change in one variable produces a change in another variable.

3. What is confounding in the context of causal inference?

* Answer: Confounding occurs when there is a third variable that is related to both the treatment and the outcome, making it difficult to determine the true causal effect of the treatment.

4. What is selection bias in the context of causal inference?

* Answer: Selection bias occurs when the groups being compared are not equivalent at the outset of the study, due to factors such as self-selection or non-random assignment to treatment groups.

5. What is the difference between an observational study and an experimental study?

* Answer: An observational study observes individuals without any intervention, whereas an experimental study randomly assigns individuals to different treatments in order to determine the causal effect of the treatment.

6. What is a randomized controlled trial (RCT)?

* Answer: A randomized controlled trial is an experimental study in which individuals are randomly assigned to different treatment groups, allowing researchers to estimate the causal effect of the treatment.

7. What is a propensity score?

* Answer: A propensity score is a predicted probability of receiving a treatment, based on covariate information, that can be used to adjust for confounding in observational studies.

8. What is instrumental variable analysis?

* Answer: Instrumental variable analysis is a statistical technique used to estimate the causal effect of a treatment when there is confounding due to unmeasured variables. It relies on an instrument: a variable that affects the treatment, affects the outcome only through the treatment, and is unrelated to the unmeasured confounders.

9. What is a counterfactual outcome?

* Answer: A counterfactual outcome refers to what would have happened if an individual had received a different treatment than they actually received, and is used to estimate the causal effect of the treatment.

10. What is the difference between a direct effect and an indirect effect?

* Answer: A direct effect is the effect of a treatment on an outcome that is not mediated by any intermediate variables, whereas an indirect effect is the effect of a treatment on an outcome that is mediated by one or more intermediate variables.
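The instrumental variable idea from question 8 can be sketched with a small simulation. All quantities below are hypothetical: `u` stands for an unmeasured confounder, `z` for a binary instrument, and the true effect of the treatment is set to 2. The estimator shown is the simple ratio of covariances, sometimes called the Wald estimator.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

u = rng.normal(size=n)            # unmeasured confounder
z = rng.binomial(1, 0.5, size=n)  # instrument: affects y only through t

# The treatment depends on the instrument and on the confounder.
t = 0.8 * z + 0.5 * u + rng.normal(size=n)

# The outcome depends on the treatment (assumed true effect: 2) and the confounder.
y = 2.0 * t + 1.5 * u + rng.normal(size=n)

# Naive regression slope of y on t is biased because u is left out.
naive_slope = np.cov(t, y)[0, 1] / np.var(t, ddof=1)

# Instrumental variable estimate: effect of z on y divided by effect of z on t.
iv_estimate = np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]

print(f"naive slope: {naive_slope:.2f}")
print(f"instrumental variable estimate: {iv_estimate:.2f}")
```

The estimate is only as good as the instrument: if `z` affected `y` by any route other than `t`, or were related to `u`, the ratio would no longer recover the true effect.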

License

All code in this notebook is available as open source through the MIT license.

All text and images are free to use under the Creative Commons Attribution 3.0 license. https://creativecommons.org/licenses/by/3.0/us/

These licenses let people distribute, remix, tweak, and build upon the work, even commercially, as long as they give credit for the original creation.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.