Causal Data Science

adam kelleher
Aug 12, 2016

I started a series of posts aimed at helping people learn about causality in data science (and science in general), and wanted to compile them all together here in a living index. This list will grow as I post more:

1. If Correlation Doesn’t Imply Causation, Then What Does?

The goal of this post is to develop a basic understanding of the intuition behind causal graphs. It’s aimed at a general audience, and by the end you should be able to read causal diagrams intuitively and reason about ways the picture might be incomplete.
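For a concrete taste, here is a minimal sketch (my own toy example, not one from the post) of a causal diagram encoded as a directed graph with networkx:

```python
import networkx as nx

# Hypothetical diagram: diet -> exercise, diet -> weight loss,
# exercise -> weight loss. Edges mean "directly causes".
g = nx.DiGraph()
g.add_edges_from([
    ("diet", "exercise"),
    ("diet", "weight loss"),
    ("exercise", "weight loss"),
])

# Directed paths carry causal influence...
print(list(nx.all_simple_paths(g, "exercise", "weight loss")))
# [['exercise', 'weight loss']]
# ...while the fork through "diet" is a route for non-causal correlation.
print(list(g.successors("diet")))  # ['exercise', 'weight loss']
```

Asking whether the picture is incomplete then amounts to asking which nodes or edges (an unmeasured common cause, say) are missing from the graph.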

2. Understanding Bias: A Prerequisite For Trustworthy Results

This post is also aimed at a general audience. The goal is to understand what bias is, where it comes from, and how drawing a causal diagram can help you reason about it.
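As a toy illustration (my own example, not from the post): a common cause Z biases the naive comparison of treated and untreated outcomes, and adjusting for Z, the variable a causal diagram points you to, removes the bias.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
z = rng.random(n) < 0.5                      # confounder
x = rng.random(n) < np.where(z, 0.8, 0.2)    # Z pushes people into treatment
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true effect of X on Y is 1.0

naive = y[x].mean() - y[~x].mean()           # picks up Z's effect too
adjusted = np.mean([y[x & (z == v)].mean() - y[~x & (z == v)].mean()
                    for v in (True, False)])
print(round(naive, 2), round(adjusted, 2))   # ≈ 2.2 vs ≈ 1.0
```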

3. Speed vs. Accuracy: When Is Correlation Enough? When Do You Need Causation?

The goal of this article is to understand some common errors in data analysis, and to motivate balancing data resources between fast (correlative) and slow (causal) insights.

4. A Technical Primer on Causality

This is a very technical introduction to the material from the previous posts, aimed at practitioners with a background in regression analysis and probability.

5. The Data Processing Inequality

In order to understand observational, graphical causal inference, you need to understand conditional independence testing (CIT). CIT can be sensitive to how you encode your data, a problem that is sometimes swept under the rug. This article brings it into the spotlight, and is a precursor to our discussion of causal inference!
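As a flavor of the encoding problem, here is a minimal sketch (my own toy example with hypothetical variables, not code from the post): X and Y are independent given Z, but a crude stratified correlation test only sees that when Z is binned finely enough.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)
x = z + rng.normal(size=n)   # X depends only on Z
y = z + rng.normal(size=n)   # Y depends only on Z, so X ⟂ Y | Z

def mean_corr_within_bins(x, y, z, n_bins):
    """Average X-Y correlation within quantile bins of Z: a crude CI test."""
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(z, edges) - 1, 0, n_bins - 1)
    return np.mean([stats.pearsonr(x[idx == b], y[idx == b])[0]
                    for b in range(n_bins)])

# Coarse encoding leaves residual variation in Z inside each bin, so the
# test wrongly reports conditional dependence; finer bins fix it.
print(mean_corr_within_bins(x, y, z, 2))    # clearly positive
print(mean_corr_within_bins(x, y, z, 50))   # near zero
```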

6. Causal graph inference

If you can’t experiment on a system, is there any hope for establishing causality? In some cases, with certain assumptions (and not the usual “no latent variables” ones!), the answer is “yes”. In this post, I present a teaser of some relatively old work on the subject. Next time, we’ll dig deeply into how it works!
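As a taste of how this can work, here is a small sketch (my construction, not the algorithm from the work in question) of the statistical signature such methods exploit: a collider X → Z ← Y makes X and Y marginally independent but dependent once you condition on Z, which lets an algorithm orient edges from observational data alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
y = rng.normal(size=n)            # independent of x
z = x + y + rng.normal(size=n)    # collider: x -> z <- y

print(stats.pearsonr(x, y)[0])    # ~0: marginally independent
near_zero = np.abs(z) < 0.5       # crude conditioning on Z
print(stats.pearsonr(x[near_zero], y[near_zero])[0])  # negative: dependent given Z
```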

7. What do AB tests actually measure?

When we do web experiments, we assume the population we’re experimenting on is the one we want to be experimenting on. That assumption breaks down when you focus on website growth. In this article, I investigate this problem and its implications. The upshot is that your effect measurements can end up biased.
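Here is a toy simulation (my construction, with made-up lift numbers, not the article’s) of the problem: when the treatment effect differs between new and returning users, the same experiment measures a different average lift on today’s user mix than on the new-user-heavy mix a growing site is headed for.

```python
import numpy as np

rng = np.random.default_rng(2)

def measured_lift(frac_new, n=200_000, base=0.30,
                  lift_new=0.02, lift_returning=0.10):  # hypothetical lifts
    """Difference in conversion between treatment and control, on a
    population with the given share of new users."""
    is_new = rng.random(n) < frac_new
    lift = np.where(is_new, lift_new, lift_returning)
    treated = rng.random(n) < 0.5
    converted = rng.random(n) < base + treated * lift
    return converted[treated].mean() - converted[~treated].mean()

print(measured_lift(frac_new=0.2))  # today's mostly-returning mix: ~0.084
print(measured_lift(frac_new=0.8))  # post-growth mix: ~0.036
```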

8. Causal Inference with pandas.DataFrames

Should we make causal inference easy for non-experts? Misinterpretation of correlative results as causal has led to poor reporting on science stories. Won’t that happen even more as we enable non-experts to use causal language? I think we can do this in a way that prevents common mistakes and improves on our existing analyses, by encoding expertise as warnings and automated assumption checks in a Python package.
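To make the idea concrete, here is a hypothetical sketch of what such guardrails might look like; the CausalFrame class and its method names are mine, not an API from the post or any released package.

```python
import pandas as pd

class CausalFrame:
    """Wraps a DataFrame and refuses to emit a causal estimate until the
    user declares an adjustment set, surfacing assumptions as warnings."""

    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.adjustment_set = None

    def assume_backdoor(self, covariates):
        # The user, not the library, asserts this set blocks all backdoor paths.
        self.adjustment_set = list(covariates)
        print(f"WARNING: assuming {self.adjustment_set} blocks all backdoor paths.")
        return self

    def effect(self, cause, outcome):
        if self.adjustment_set is None:
            raise ValueError("Declare an adjustment set with assume_backdoor() "
                             "first; correlation is not causation.")
        # Backdoor adjustment for a binary cause (coded 0/1):
        # sum over z of [E[Y|X=1,Z=z] - E[Y|X=0,Z=z]] * P(z)
        strata = (self.df.groupby(self.adjustment_set + [cause])[outcome]
                  .mean().unstack(cause))
        weights = self.df.groupby(self.adjustment_set).size()
        weights = weights / weights.sum()
        return ((strata[1] - strata[0]) * weights).sum()

# Usage sketch: CausalFrame(df).assume_backdoor(["z"]).effect("x", "y")
```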

9. How do you correct selection bias?

Data sets are generated in some context, by some mechanism. In general, that simple fact can introduce spurious correlations and bias sample statistics like averages and variances. In this article, I give an overview of some recent work on selection bias, and connect it with common methods like post-stratification weighting. I detail a general solution using the s-backdoor criterion, which combines selection-bias adjustment and confounding adjustment into a single adjustment formula for causal inference.
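As a rough illustration (my own sketch with hypothetical column names, not the post’s code), the recovery recipe looks like post-stratification: estimate E[Y | X, Z] inside the biased sample, then reweight by an external, unbiased distribution over Z.

```python
import pandas as pd

def s_backdoor_effect(selected: pd.DataFrame, p_z: pd.Series,
                      cause="x", outcome="y", z="z"):
    """E[Y | do(X=x)] ≈ sum over z of E[Y | X=x, Z=z, selected] * P(z),
    where P(z) comes from an external, unbiased source (e.g. census
    shares), not from the selection-biased sample itself."""
    cond = selected.groupby([z, cause])[outcome].mean().unstack(cause)
    return cond.mul(p_z, axis=0).sum()  # one entry per level of X

# Usage sketch, for a binary cause:
# est = s_backdoor_effect(biased_df, population_z_shares)
# print(est[1] - est[0])   # selection-corrected effect estimate
```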
