Causal Data Science
I started a series of posts aimed at helping people learn about causality in data science (and science in general), and wanted to compile them all together here in a living index. This list will grow as I post more:
The goal of this post is to develop a basic understanding of the intuition behind causal graphs. It’s aimed at a general audience, and by the end of it, you should be able to intuitively understand causal diagrams, and reason about ways that the picture might be incomplete.
This post aims at a general audience. The goal is to understand what bias is, where it comes from, and how drawing a causal diagram can help you reason about bias.
The goal of this article is to understand some common errors in data analysis, and to motivate balancing data resources between fast (correlative) and slow (causal) insights.
This is a very technical introduction to the material from the previous posts, aimed at practitioners with a background in regression analysis and probability.
In order to understand observational, graphical causal inference, you need to understand “conditional independence testing” (CIT). CIT can be sensitive to how you encode your data, and it’s a problem that is sometimes swept under the rug. This article brings it into the spotlight, and is a precursor to our discussion on causal inference!
If you can’t experiment on a system, is there any hope for establishing causality? In some cases, with certain assumptions (and not the usual “no latent variables” ones!!), the answer is “yes”. In this post, I present a teaser on some relatively old work that has been done on the subject. Next time, we’ll dig deeply into how this works!
When we do web experiments, we assume the population we’re experimenting on is the one we want to be experimenting on. This assumption breaks down when you focus on website growth. In this article, I investigate this problem and its implications. The upshot is your effect measurements can end up biased.
Should we make causal inference easy for non-experts? Misinterpretation of correlative results as causal has led to poor reporting on science stories. Won’t it happen even more when we enable more non-experts to use causal language? I think we can do this in a way that prevents common mistakes and improves on our existing analyses by encoding expertise as warnings and automated assumption checks in a Python package.
Data sets are generated in some context by some mechanism. In general, that simple fact can introduce spurious correlations, and cause bias in sample statistics like averages and variances. In this article, I give an overview of some recent work on the subject of selection bias, and connect it with common methods like post-stratification weighting. I detail a general solution to the problem using the s-backdoor criterion, which combines selection bias adjustment with adjustment for confounding into a general adjustment formula for causal inference.
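To make the post-stratification idea concrete, here is a minimal sketch in Python. It is not code from the article: the function name `post_stratified_mean`, the two strata, and the toy numbers are all made up for illustration, and it assumes you already know each stratum's true share of the population and that every stratum shows up at least once in the sample.

```python
from collections import defaultdict


def post_stratified_mean(samples, population_shares):
    """Estimate a population mean from a sample with skewed stratum proportions.

    samples: list of (stratum, value) pairs (hypothetical toy format).
    population_shares: dict mapping stratum -> known share of the population.
    Assumes every stratum in population_shares appears in samples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for stratum, value in samples:
        totals[stratum] += value
        counts[stratum] += 1
    # Weight each stratum's sample mean by its population share,
    # rather than by how often it happened to be sampled.
    return sum(
        share * totals[stratum] / counts[stratum]
        for stratum, share in population_shares.items()
    )


# Toy data: the population is half A and half B, but the sample is 80% A.
samples = [("A", 1.0)] * 80 + [("B", 3.0)] * 20
population_shares = {"A": 0.5, "B": 0.5}

naive_mean = sum(value for _, value in samples) / len(samples)
adjusted_mean = post_stratified_mean(samples, population_shares)
print(naive_mean, adjusted_mean)
```

The naive average is pulled toward stratum A because A is over-sampled, while the weighted estimate recovers the half-and-half mix. This only handles selection on an observed stratum; it does not by itself address confounding, which is where the combined adjustment discussed in the article comes in.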