# A Technical Primer On Causality

What does “causality” mean, and how can you represent it mathematically? How can you encode causal assumptions, and what bearing do they have on data analysis? These types of questions are at the core of the practice of data science, but deep knowledge about them is surprisingly uncommon.

If you analyze data without regard to causality, you open your results up to enormous biases. This includes everything from recommendation system results, to post-hoc reports on observational data, to experiments run without proper holdout groups.

I’ve been blogging a lot recently about causality, and wanted to go through some of the material at a more technical level. Recent posts have been aimed at a more general audience. This one will be aimed at practitioners, and will assume a basic working knowledge of math and data analysis. To get the most from this post you should have a reasonable understanding of linear regression and probability (although we’ll review a lot of probability). Prior knowledge of graphical models will make some concepts more familiar, but is not required.

**How do you quantify causality?**

Judea Pearl, in his book *Causality,* constantly remarks that until very recently, causality was a concept in search of a language. Finally, there is a mathematically rigorous way of writing what we’ve wanted to write explicitly all along: X causes Y. We do this in the language of **graphs.**

This incredibly simple picture, a single arrow from X to Y (X → Y), actually has a precise mathematical interpretation. It’s trivial in the two-variable case, but what it implies is that the joint distribution of X and Y can be simplified in a certain way. That is, P(X,Y) = P(Y|X)P(X). Where it gets really interesting is when we build the theorems of causality on top of this mathematical structure. We’ll do that, but first let’s build some more intuition.

Let’s consider an example of three variables. X causes Y, which in turn causes Z. The picture looks like this: X → Y → Z.

This could be, for example, I turn on the light switch (X), current flows through the wires (Y), and the light comes on (Z). In this picture, X puts some information into Y (knowing there is current, we know the switch is on). Y contains all of the information about X that is relevant for determining the value of Z: knowing the value of Y makes X irrelevant. We only need to know that there is power flowing; the switch is now irrelevant. X only affects Z through its effect on Y. This picture is a very general way of summarizing these relationships, and can apply to any system with the same causal structure.

How does this causal relationship look mathematically? The distribution of a variable summarizes all of our statistical knowledge about it. When you know something about a variable (say, if we’re interested in Z, we know the value of Y), you can narrow down the possible values it can take to a smaller range. There’s another distribution that summarizes this new state of knowledge: the *conditional distribution*. We write the distribution Z takes on given a certain value of Y, say y, as P(Z|Y=y). We’ll usually suppress the extra notation, and just write it as P(Z|Y).

Here’s where this picture gets interesting. When we know Y, X doesn’t provide any additional information. Y already summarizes everything about X that is relevant for determining Z. If you look at the distribution of values that Z takes when we know the value of Y, P(Z|Y), it should be no different from the distribution that we get if we also know the value of X, P(Z|Y,X). In other words, P(Z|Y,X) = P(Z|Y). When knowing Y makes X irrelevant, we say that Y “blocks” X from Z. Another technical term is that Y **d-separates** X and Z. A final way of writing this is Z ⏊ X | Y, which you can read as “Z is independent of X, given Y”. These graphs summarize all statements about “blocking” in the system we’re interested in. This concept goes well beyond the 3-variable chain example.

There’s a nice sense (and it’s not the ordinary one!) in which you can say Z is independent of X. It’s a version of independence *in the context of other information*. This kind of independence amounts to something more like “the dependence has been explained away”. It’s not too hard to show (do it!) that when P(Z|Y,X) = P(Z|Y), it’s also true that P(Z,X|Y) = P(Z|Y)P(X|Y). This is the definition of **conditional independence**. It looks exactly like independence (which you’d write as P(Z,X) = P(Z)P(X)), except that you condition on Y everywhere. Conditional independence is neither necessary nor sufficient for independence. In our example, where X → Y → Z, X and Z are conditionally independent (given Y), yet they will indeed be correlated (more precisely, they will be statistically dependent).
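If you want to check your answer to that exercise, here is one short route. It uses nothing beyond the product rule for conditional probabilities, followed by the assumption P(Z|Y,X) = P(Z|Y):

$$P(Z, X \mid Y) = P(Z \mid X, Y)\,P(X \mid Y) = P(Z \mid Y)\,P(X \mid Y)$$

The first equality holds for any distribution; the second is where the chain structure enters.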

To summarize, the most important points up to here are that (1) we can quantify causation using graphs, and (2) these graphs have real implications for the statistical properties of a system (namely, they summarize independence and conditional independence).

Finally, it turns out there’s a very general rule when you have a picture like this. We saw that the direct cause of a variable contains all of the (measured) information relevant for determining its value. This generalizes: you can factor the joint distribution into one factor per variable, and that factor is just the probability of that variable given the values of its “parents,” or the other variables pointing in to it. For example, above, the parents of Z, written *par(Z)*, are *par(Z)* = {Y} (the set of variables containing only *Y*). To write it out generally,

$$P(X_1, X_2, \dots, X_n) = P(X_1 \mid par(X_1))\,P(X_2 \mid par(X_2)) \cdots P(X_n \mid par(X_n))$$

Or, more succinctly using product notation,

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid par(X_i))$$

Before we move on to an example, I want to make one point of clarification. We’re saying that causal structure can be represented as a graph. As it happens, there are plenty of graphs that represent factorizable joint distributions, but *don’t* have a causal interpretation. In other words, causal structure implies a graph, but a graph does not imply causal structure!
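To make the factorization concrete, here is a minimal sketch for a chain X → Y → Z with binary variables. The probability tables are made-up numbers chosen purely for illustration, and `prob` is a hypothetical helper. The sketch builds the joint distribution from one factor per variable, then checks that P(z|x,y) matches P(z|y) while P(z|x) still depends on x.

```python
import itertools

# Hypothetical conditional probability tables for binary X -> Y -> Z.
p_x = {0: 0.7, 1: 0.3}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # indexed [x][y]
p_z_given_y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # indexed [y][z]

# Joint distribution from the factorization P(x, y, z) = P(x) P(y|x) P(z|y).
joint = {
    (x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
    for x, y, z in itertools.product([0, 1], repeat=3)
}


def prob(**fixed):
    """Probability of the given assignments, e.g. prob(x=1, z=0)."""
    names = ("x", "y", "z")
    return sum(
        p for values, p in joint.items()
        if all(values[names.index(k)] == v for k, v in fixed.items())
    )


# Conditional independence: P(z=1 | x, y) should not depend on x.
for x, y in itertools.product([0, 1], repeat=2):
    p_z1_given_xy = prob(x=x, y=y, z=1) / prob(x=x, y=y)
    p_z1_given_y = prob(y=y, z=1) / prob(y=y)
    print(f"x={x} y={y}  P(z=1|x,y)={p_z1_given_xy:.3f}  P(z=1|y)={p_z1_given_y:.3f}")

# Marginal dependence: P(z=1 | x) does change with x.
p_z1_given_x0 = prob(x=0, z=1) / prob(x=0)
p_z1_given_x1 = prob(x=1, z=1) / prob(x=1)
print(f"P(z=1|x=0)={p_z1_given_x0:.3f}  P(z=1|x=1)={p_z1_given_x1:.3f}")
```

Using exact tables rather than sampled data keeps the check free of sampling noise, so the equalities hold up to floating-point error.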

**Example Time**

First, let’s look at this X → Y → Z example. It’s easy to generate a data set with this structure.

X is just random, normally distributed data. In real life, this would be something that’s generated by factors you’re not including in your model. Y is a linear function of X, and it has its own un-accounted-for factors. These are represented by the noise we’re adding to Y. Z is generated from Y similarly as Y is generated from X.
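A minimal sketch of a generator matching this description, assuming NumPy. The seed, the sample size, and the coefficients 2 and 3 are illustrative assumptions, not necessarily the values behind the plots in this post.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 1000                        # sample size (assumed)
a, b = 2.0, 3.0                 # X -> Y and Y -> Z coefficients (assumed)

x = rng.normal(size=n)              # exogenous cause
y = a * x + rng.normal(size=n)      # Y depends on X plus its own noise
z = b * y + rng.normal(size=n)      # Z depends on Y plus its own noise

# Pairwise correlation matrix for (X, Y, Z): every pair is correlated.
print(np.corrcoef([x, y, z]).round(2))
```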

We can see that X and Y are correlated, and are noisily, linearly related to each other. Similarly, Y and Z are related, and X and Z are related.

From this picture, you’d never know that Y can explain away X’s effect on Z. This won’t be obvious until we start doing regression analysis.

Now, let’s do a regression. Z should be related to Y (look at the formulae and convince yourself!), and should have a slope equal to its coefficient in the formula before. We can see this by just doing the regression.

This is great, and it’s what we expect. Similarly, you can regress Z on X, getting the result

As before, the result is what we expect: the coefficient is just the product of the X → Y coefficient and the Y → Z coefficient.

Now this is all as we expect. Here’s where it’s going to get interesting. If we regress Z on Y and X together, something weird happens. Before, they both had non-zero coefficients. Now, we get

The X coefficient goes away!! It’s not statistically significantly different from zero. Why is this!? (note: by chance, it’s actually relatively large, but still not stat. sig. I could have cherry-picked a dataset that had it closer to zero, but decided I wouldn’t encourage the representativeness heuristic). Notice there is also no significant improvement in the R^2!
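The three regressions above can be sketched with plain least squares. This assumes NumPy, regenerates toy chain data with assumed coefficients (2 for X → Y, 3 for Y → Z), and uses a hypothetical helper named `ols`. It reports coefficients and R^2 only, not significance tests, so it is an illustration of the pattern rather than a reproduction of the regression tables discussed here.

```python
import numpy as np


def ols(target, *predictors):
    """Least-squares fit with an intercept; returns (coefficients, R^2).

    coefficients[0] is the intercept; the rest follow the predictor order.
    """
    design = np.column_stack([np.ones_like(target), *predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r2 = 1.0 - residuals.var() / target.var()
    return coef, r2


rng = np.random.default_rng(0)
n, a, b = 1000, 2.0, 3.0            # assumed sample size and coefficients
x = rng.normal(size=n)
y = a * x + rng.normal(size=n)
z = b * y + rng.normal(size=n)

coef_zy, r2_zy = ols(z, y)          # Z on Y
coef_zx, r2_zx = ols(z, x)          # Z on X
coef_zxy, r2_zxy = ols(z, x, y)     # Z on X and Y together

print("Z ~ Y     slope:", coef_zy[1].round(3), " R^2:", round(r2_zy, 3))
print("Z ~ X     slope:", coef_zx[1].round(3), " R^2:", round(r2_zx, 3))
print("Z ~ X + Y slopes (x, y):", coef_zxy[1:].round(3), " R^2:", round(r2_zxy, 3))
```

With these assumed coefficients, the Z-on-X slope should land near the product 2 × 3 = 6, and the X coefficient in the joint regression should sit close to zero.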

You may have run into something similar to this in your work, where regressing on two correlated independent variables gives different coefficients than regressing on either independently. This is one of the reasons (that is, causal chains) for that effect. It’s not simply that the regression is unstable and giving you wrong estimates. You may try to get around the issue through regularization, but the problem is deeper than the empirical observation of some degeneracy in regression coefficients. One of your variables has been explained away. Let’s look at what this means quantitatively. We’ll see that the implications go way beyond our example. Let’s see how deep it goes.

**Down the rabbit hole…**

The regression estimator, *z(x)*, is really the expectation value of a distribution. For some value of *x*, it’s your best guess for the value of *z*: the average *z* at that value of *x*. The regression estimate could properly be written

$$z(x) = E[Z \mid X = x]$$

In this case, all of the relationships are linear (by construction), and so we’ll assume that the expectation takes on a linear form,

$$z = \beta x + \epsilon$$

where the epsilon term is the total noise when you use this equation, and the beta coefficient is derived using the data-generating formulas in the code above. Looking at our original formulae,

$$y = \beta_{yx} x + \epsilon_{yx}, \qquad z = \beta_{zy} y + \epsilon_{zy}$$

and plugging in the formula for the value of y (given the value of x),

$$z = \beta_{zy}\beta_{yx}\,x + \beta_{zy}\epsilon_{yx} + \epsilon_{zy}$$

So the coefficient is just the product of the Y → Z and X → Y coefficients. Looking at these formulae, if we write the estimator *z(x,y)* as a function of *x* and *y*, it’s clear that we now know the value that both *x* and the disturbance term take. The variable *y*, together with some noise, is all that determines *z*. If we don’t know *y*, then knowing *x* still leaves the fuzziness from the *y* noise term, ε_yx. If we know *y*, then this fuzziness is gone. The value that the random noise term takes on has already been decided. This is why the R^2 from regressing on *x* is so much smaller than that from regressing on *y*!

Now, what happens when we try to make a formula involving *x* and *y* together? We’ll see that the *x* dependence goes away! This is because of d-separation! If you write the regression estimate of *z* on both *y* and *x*, you get

$$E[Z \mid x, y] = \sum_z z\,P(z \mid x, y)$$

But we saw before that this distribution is the same without *x*! Plugging in P(z|x,y) = P(z|y), we get

$$E[Z \mid x, y] = \sum_z z\,P(z \mid y) = E[Z \mid y]$$

In other words, the regression is independent of *x*, and the estimate is the same as just regressing on *y* alone!

Indeed, comparing the coefficients of these two regressions, we see that the *y* coefficient is the same in both cases, and we’ve correctly estimated the effect of Y on Z.

When you’re thinking causally, it’s nice to keep in mind the distributions that generate the data. If we work at the level of the regression model, we lose the probability manipulations that make it clear why coefficients disappear. It’s only because we go back to the distributions, with no assumptions on the form of expectation values, that the coefficient’s disappearance is clearly attributable to the causal relationships in the data set.

**Direct vs. Total Effect**

Let’s examine what we’re measuring a little more closely. There is a difference between the direct effect one variable has on another, and the effect it has through long, convoluted chains of events. The first kind of effect is called the “direct” effect. The sum of the direct effect and all of the indirect chains of events is the “total” effect.

In our example above, regressing Z on X correctly estimates the total effect of X on Z, but incorrectly estimates the direct effect. If we interpreted the regression coefficient as the direct effect, we’d be wrong: X only affects Z through Y. There is no direct effect. This all gets very confusing, and we really need some guiding principles to sort it all out.

Depending on what you’re trying to estimate (the total or the direct effect), you have to be very careful what you regress on! The hardest part is that *the right variables to regress on depend on the graph.* If you don’t know the graph, then you don’t know what to regress on.

You might be thinking at this point “I’ll just see if a coefficient goes away when I add variables to my regression, so I’ll know whether it has a direct or indirect effect”. Unfortunately, it’s not so simple. Up to this point we’ve only been thinking about chain graphs. You can get large, complicated graphs that are much more difficult to sort out! We’ll see that the problem is that, in general, you have to worry about **bias** when you’re trying to measure the total effect of one variable on another.

There are two other 3-node graph structures, and they turn out to be very nice for illustrating where bias comes from. Let’s check them out!

**Bias**

Let’s go ahead and draw the other two graph structures (technically a third is Z → Y → X, but this is just another chain). Below, we have the fork (X ← Y → Z) on the left, and the collider (X → Y ← Z) on the right. The left graph might be something like online activity (Y) causes me to see an ad (X), and to make online purchases for an irrelevant product (Z). The one on the right might be something like having skills at math (X) and skill at art (Z) both have an effect on my admission to college (Y). The left graph shows two variables, X and Z, that are related by a *common cause*. The right graph shows X and Z with a *common effect*.

Neither of these two diagrams has a direct or indirect causal relationship between *X* and *Z*. Let’s go ahead and generate some data so we can look at these graphs in more detail!
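A minimal sketch of generating both data sets, assuming NumPy and linear relationships with Gaussian noise. The seed, sample size, and the coefficients 2 and 3 are illustrative assumptions; the variable names with `_fork` and `_col` suffixes are only there to keep the two graphs apart.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000  # sample size (assumed)

# Fork: Y is a common cause of X and Z (X <- Y -> Z).
y_fork = rng.normal(size=n)
x_fork = 2.0 * y_fork + rng.normal(size=n)
z_fork = 3.0 * y_fork + rng.normal(size=n)

# Collider: Y is a common effect of X and Z (X -> Y <- Z).
x_col = rng.normal(size=n)
z_col = rng.normal(size=n)
y_col = 2.0 * x_col + 3.0 * z_col + rng.normal(size=n)

corr_fork = np.corrcoef(x_fork, z_fork)[0, 1]
corr_col = np.corrcoef(x_col, z_col)[0, 1]
print("fork     corr(X, Z):", round(corr_fork, 2))
print("collider corr(X, Z):", round(corr_col, 2))
```

In the fork, X and Z come out strongly correlated despite neither causing the other; in the collider, their correlation is close to zero.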

Plotting this data gives the following. For the fork (the left graph):

Notice that everything is correlated. If you regressed Z on X, you would indeed find a non-zero regression coefficient, even though there’s no causal relationship between them at all! This is called **confounding** bias. Try this plotting yourself for the collider (the right graph)!

We would like to be able to estimate the right effect — is there a way we can use d-separation and conditional independence to do this?

It turns out there is! If you read the previous post on bias, then you have a strong intuitive grasp of how conditioning removes bias. If you haven’t, I strongly encourage you to read it now. I’ll still try to give a concise explanation here.

The fork causes Z and X to be correlated because as Y changes, it changes both X and Z. In the example I mentioned in passing: if you’re a more active internet user, you’re more likely to see an ad (because you’re online more). You’re also more likely to purchase a product (even one irrelevant to the ad) online. When one is higher, so is the other. The two end up correlated even though there’s no causal relationship between them.

More abstractly, in the case of linear relationships (like in our toy data), Y being larger makes both X and Z larger. If we fix the value of Y, then we’ve controlled any variation in X and Z caused by variation in Y. Conditioning is exactly this kind of controlling: it asks “What is the distribution of X and Z when we know the particular value of Y is y?” It’s nicer to write P(Z,X|Y) as P(Z,X|Y=y) to emphasize this point.

From this argument, it’s clear that in this diagram Z and X should be independent of each other when we know the value of Y. In other words, P(Z,X|Y) = P(Z|Y)P(X|Y), or, “Y d-separates X and Z”. You can actually derive this fact directly from the joint distribution and how it factorizes with this graph. It’ll be nice to see, so you can try the next one! Let’s do it:

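Applying the factorization formula above,

$$P(X, Y, Z) = P(X \mid Y)\,P(Z \mid Y)\,P(Y)$$

then, from the definition of conditional probability,

$$P(X, Z \mid Y) = \frac{P(X, Y, Z)}{P(Y)} = P(X \mid Y)\,P(Z \mid Y)$$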

Indeed, we can see this when we do the regression. The *x* coefficient is zero when we regress on *y* as well. Notice this is exactly the same result as in the previous case! Indeed, if you want to estimate the effect of X on Z when they are confounded, the solution is to “control for” (condition on) the confounder.
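Here is a minimal sketch of that comparison on toy fork data, assuming NumPy and the same illustrative coefficients as before (Y → X with 2, Y → Z with 3). The helper `ols_coefs` is a hypothetical name, and no significance tests are computed.

```python
import numpy as np


def ols_coefs(target, *predictors):
    """Least-squares coefficients with an intercept in position 0."""
    design = np.column_stack([np.ones_like(target), *predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


rng = np.random.default_rng(1)
n = 1000
y = rng.normal(size=n)               # confounder
x = 2.0 * y + rng.normal(size=n)     # caused by Y only
z = 3.0 * y + rng.normal(size=n)     # caused by Y only

naive = ols_coefs(z, x)              # Z on X, ignoring the confounder
controlled = ols_coefs(z, x, y)      # Z on X, controlling for Y

print("Z ~ X      x coefficient:", naive[1].round(3))
print("Z ~ X + Y  x coefficient:", controlled[1].round(3),
      " y coefficient:", controlled[2].round(3))
```

The naive regression picks up a clearly non-zero X coefficient that is pure confounding bias; once Y is in the regression, the X coefficient drops to roughly zero.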

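To make it clear that this isn’t just because we’re working with linear models, you could calculate this result directly from the expectation value, as well:

$$E[Z \mid x, y] = \sum_z z\,P(z \mid x, y) = \sum_z z\,\frac{P(z, x \mid y)}{P(x \mid y)} = \sum_z z\,\frac{P(z \mid y)\,P(x \mid y)}{P(x \mid y)} = \sum_z z\,P(z \mid y) = E[Z \mid y]$$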

so Z is indeed independent of X (given Y). This comes out directly as a result of the statistical properties of the causal structure of the system that generated the data.

Try doing this calculation yourself for the collider graph! Are X and Z dependent initially (consult your plots!)? Are they dependent when you condition on Y (try regressing!)? Spoilers below!

The other side of this calculation is that we can show X and Z, while having no direct causal relationship, should generally be dependent at the level of their joint distribution:

$$P(X, Z) = \sum_y P(X, Y = y, Z) = \sum_y P(X \mid y)\,P(Z \mid y)\,P(y)$$

This formula doesn’t factor any more. It’s clear here that the coupling between the two pieces that *want* to be factors comes through the summation over Y. Except in special cases, X and Z are coupled statistically through their relationship with Y. Y causes X and Z to become dependent.

So we’ve seen a few results from looking at this graph. X and Z are generally dependent, even though there’s no causal relationship between them. This is called “confounding,” and Y is called a “confounder”. We’ve also seen that conditioning on Y causes X and Z to become statistically independent. You can simply measure this as the coefficient on X when you regress Z on X, conditional on Y.

If you repeated this analysis on the collider (the right graph), you’d find that X and Z are generally independent. If you condition on Y, then they become dependent. The intuition there, as described in a previous post, is that knowing an effect of two possible causes, and knowing one of the causes, you learn something about the other. If the system you’re thinking of is a sidewalk, and the causes of it being wet or not (rain or a sprinkler), then knowing that the sidewalk is wet, and it didn’t rain, it’s more likely that the sprinkler was on.
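The collider behavior can be sketched the same way. This assumes NumPy and toy data where Y = 2X + 3Z + noise (illustrative coefficients, not from the original analysis); `ols_coefs` is a hypothetical helper.

```python
import numpy as np


def ols_coefs(target, *predictors):
    """Least-squares coefficients with an intercept in position 0."""
    design = np.column_stack([np.ones_like(target), *predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)                           # independent cause
z = rng.normal(size=n)                           # independent cause
y = 2.0 * x + 3.0 * z + rng.normal(size=n)       # common effect

marginal = ols_coefs(z, x)         # Z on X alone
conditioned = ols_coefs(z, x, y)   # Z on X while conditioning on Y

print("Z ~ X      x coefficient:", marginal[1].round(3))
print("Z ~ X + Y  x coefficient:", conditioned[1].round(3))
```

Without Y, the X coefficient is near zero, as it should be for two independent causes. Adding Y to the regression induces a clearly negative X coefficient (for these assumed coefficients the population value works out to -0.6): given the effect, a larger X makes a larger Z less necessary to explain it, which is the sprinkler-and-rain intuition in regression form.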

This raises a very interesting point. In both of these pictures, there was no direct causal relationship between X and Z. In one picture, you estimate the correct direct effect of X on Z by conditioning on Y, and the incorrect effect if you don’t. In the other picture, you estimate the correct effect when you don’t condition, and the incorrect effect if you do!

How do you know when to control and when not to?

**The “back door” criterion**

There is a criterion you can use to estimate the total causal effect of one variable on another variable. We saw that conditioning on the variable in between X and Z on the chain resulted in measuring no coefficient, while not conditioning on it resulted in measuring the correct total effect.

We saw that conditioning on the central variable in a collider estimated the wrong total effect, but not conditioning on it estimated the correct total effect.

We saw that conditioning on the central variable in a fork gave the correct total effect, but not conditioning on it gave the wrong one.

What happens when the graphs are more complicated? Suppose we had a fully connected graph on ten variables. What then?!

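It turns out that our examples above provide the complete intuition required to understand the more general result. The property for identifying the right variables to control is called the “back door criterion,” and it is *the* general solution for finding the set of variables you should control for. I’m going to throw the complete answer out there, then we’ll dissect it. Directly from Pearl’s *Causality, 2nd Ed.*:

> A set of variables Z satisfies the back-door criterion relative to an ordered pair of variables (X_i, X_j) in a DAG G if: (i) no node in Z is a descendant of X_i; and (ii) Z blocks every path between X_i and X_j that contains an arrow into X_i.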

This criterion is extremely general, so let’s pick it apart a little. Z are the variables he’s saying we’re controlling for. We want to estimate the effect of X_i on X_j. DAG stands for “directed acyclic graph,” and it’s just a mathematically precise necessary property for being a causal graph (technically you can have cyclic ones, but that’s another story). We’ll probably go into that in a later post. For now, let’s take a look at these criteria.

Criterion (i) says not to condition on “descendants” of the cause we’re interested in. A descendant is an effect, an effect of an effect, and so on. (The term comes from genealogy: you’re a descendant of your father, your grandfather, etc.). This is what keeps us from conditioning on the middle variable in a chain, or the middle variable of a collider.

Criterion (ii) says “if there is a confounding path between the cause and effect, we condition on something along the confounding path (but not violating criterion (i)!)”. This is what makes sure we condition on a confounder.
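The two criteria can be sketched as a brute-force check on small graphs. This is a plain-Python illustration that enumerates every path, so it is only suitable for toy DAGs and is not an efficient or standard implementation. The graph is assumed to be a set of `(parent, child)` pairs, and all function names (`descendants`, `simple_paths`, `is_blocked`, `satisfies_back_door`) are hypothetical.

```python
def descendants(edges, node):
    """All nodes reachable from `node` by following arrows."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def simple_paths(edges, start, end, path=None):
    """Yield every non-repeating path from start to end, ignoring arrow direction."""
    path = [start] if path is None else path
    if start == end:
        yield path
        return
    for a, b in edges:
        if start in (a, b):
            nxt = b if a == start else a
            if nxt not in path:
                yield from simple_paths(edges, nxt, end, path + [nxt])


def is_blocked(edges, path, controls):
    """True if conditioning on `controls` blocks this path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider:
            # A collider blocks unless it, or one of its descendants, is conditioned on.
            if node not in controls and not (descendants(edges, node) & controls):
                return True
        elif node in controls:
            # A chain or fork node blocks when it is conditioned on.
            return True
    return False


def satisfies_back_door(edges, cause, effect, controls):
    controls = set(controls)
    # Criterion (i): no control variable is a descendant of the cause.
    if controls & descendants(edges, cause):
        return False
    # Criterion (ii): every path that starts with an arrow into the cause is blocked.
    back_door_paths = [
        p for p in simple_paths(edges, cause, effect) if (p[1], cause) in edges
    ]
    return all(is_blocked(edges, p, controls) for p in back_door_paths)


chain = {("X", "Y"), ("Y", "Z")}
fork = {("Y", "X"), ("Y", "Z")}
collider = {("X", "Y"), ("Z", "Y")}

for name, edges in [("chain", chain), ("fork", fork), ("collider", collider)]:
    for controls in (set(), {"Y"}):
        ok = satisfies_back_door(edges, "X", "Z", controls)
        print(f"{name:9s} controls={sorted(controls)} -> {ok}")
```

For the effect of X on Z, the check agrees with the three cases summarized earlier: control for Y in the fork, and leave Y alone in the chain and the collider.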

There’s a lot of interesting nuance here, but we’re just working on learning the basics for now. Let’s see how all of this comes into play when we want to estimate an effect. Pearl gives *the* formula for controlling:

$$P(y \mid \hat{x}) = \sum_z P(y \mid x, z)\,P(z)$$

but how is this related to regression?

Let’s think of a “causal regression”. We want the expectation value of y given that X takes the value x (with the hat on top). Another way of writing this is *do(x)*, meaning that we intervene to fix the value of x. This is the true total causal effect of x on y.

$$E[Y \mid \hat{x}] = \sum_y y\,P(y \mid \hat{x}) = \sum_z E[Y \mid x, z]\,P(z)$$

So here, the expectation value is just our regression estimate! It behaves how we’ve seen above when we condition on various control variables, and does what we expect it to.

There’s an extra P(Z) term, and a summation over Z. All this is doing is weighting each regression, and taking an average regression effect over the values the control variable, Z, takes on. So now we have a “causal regression”!
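Here is a minimal sketch of that weighted average on simulated data, assuming NumPy, a single binary control variable Z that confounds a binary X, and an outcome Y whose true causal dependence on X is 1.0. All of these numbers are assumptions for illustration, and `adjusted_mean` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

z = rng.binomial(1, 0.5, size=n)              # control variable (a confounder)
x = rng.binomial(1, 0.2 + 0.6 * z)            # X is more likely when z = 1
y = 1.0 * x + 2.0 * z + rng.normal(size=n)    # true effect of X on Y is 1.0


def adjusted_mean(x_value):
    """Back-door adjustment: sum over z of E[Y | x, z] * P(z)."""
    total = 0.0
    for z_value in (0, 1):
        stratum = (x == x_value) & (z == z_value)
        total += y[stratum].mean() * (z == z_value).mean()
    return total


naive_effect = y[x == 1].mean() - y[x == 0].mean()
adjusted_effect = adjusted_mean(1) - adjusted_mean(0)

print("naive difference in means:", round(naive_effect, 3))
print("back-door adjusted effect:", round(adjusted_effect, 3))
```

The naive difference is inflated by the confounder, while the adjusted estimate should land close to the assumed true effect of 1.0. With a discrete control variable, averaging per-stratum means like this is the sum in the formula taken literally.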

**Problems in practice**

The main problem with implementing this approach in practice is that it assumes knowledge of the graph. Pearl argues that causal graphs are really very static things, so lend themselves well to being explored and measured over time. Even if the quantitative relationships between the variables change, it’s not likely that the causal structures do.

In my experience, it’s relatively difficult to build causal graphs from data. You can do it from a mixture of domain knowledge, statistical tests, logic, and experimentation. In some future posts, I’ll take some sample open-data data sets and see what we can say about causality based on observational data. For now, you’ll have to take my word for it: it’s hard.

In light of this, this approach is extremely useful for seeing where estimating causal effects based on observational data breaks down. This breakdown is the basis for “correlation does not imply causation” (i.e. P(Y|X) ≠ P(Y|do(X))). The underlying assumption that causality can be represented as a causal graph is the basis for the statement “there is no correlation without causation”. I haven’t yet seen a solid counter-example to this.

In the end, if you can show that a causal effect has no bias, then the observational approach is just fine. This may mean doing an experiment once to establish that there’s no bias, then taking it on faith thereafter. At the least, that approach avoids repeating an experiment over and over to make sure nothing has changed about the system.

This is an incredibly powerful framework. At the very least, it gives you a way of talking to someone else about what you think the causal structure of a system is. Once you’ve written it down, you can start testing it! I think there are some exciting directions for applications, and I’ll spend the remainder of the post pointing in one of them:

**Machine Learning vs. Social Science**

In machine learning, the goal is often just to reduce prediction error, and not to estimate the effects of interventions in a system. Social science is much more concerned with the effects of interventions, and how those might inform policy.

At IC2S2 recently, in Sendhil Mullainathan’s keynote, he called these the “beta-problem” (the focus on the regression coefficients, more common in social science) and the “y-hat problem” (the focus on the actual prediction, more common in machine learning). An extreme example is lasso regression, where the variables that are selected vary across random subsets of the data set! There is no consistent estimate of the coefficients in that case: it’s a pure y-hat problem.

Here, there’s a very interesting application to machine learning: knowing the causal graph, you could use the back-door criterion for subset selection. Once you’ve solved the “beta-problem,” you might be better at solving the “y-hat” problem. This wouldn’t be hard to test using an ensemble of random data sets: simply generate random graphs, and compare lasso in cross-validation against the back-door approach. I went ahead and did this for a slightly simplified approach to finding blocking sets. It turns out that the parents of a node d-separate it from its predecessors. Conditioning on the parents (X) of a node (Y), Z is the empty set. I used this simple rule along with this code for random data sets to calculate the cross-validation R^2 for a data set with 30 variables, and N=50. On average, the dependent variable had 9 parents. If we histogram the number of data sets vs. their cross-validation R^2, we can see that lasso is clearly an improvement over a naive linear regression.

Now, compare lasso with regression on the parents alone (a simple way of applying the back-door criterion)

That looks pretty nice! In fact, regression on the parents worked better than lasso! (I chose the lasso parameter using grid-search to initialize [and find a bounding range for] a function minimization, so it wasn’t an easy competitor to beat!). Discovering and regressing on the direct causes of Y (instead of dumping all of the data into a smart algorithm) solves the “y-hat” problem for free. In other words, in this context, solving the “beta-problem” (that is, finding the true causes and estimating their direct effects) solves the “y-hat” problem.
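For anyone who wants to try a stripped-down version of this kind of comparison, here is a rough sketch assuming NumPy. It is not the code behind the histograms above: it swaps lasso for a naive all-variables least-squares baseline, uses a single train/holdout split instead of cross-validation, and the random-graph recipe (`simulate_dataset`, with an assumed edge probability of 0.3 and weight scale of 0.5) is my own illustrative choice. Both function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)


def simulate_dataset(n_vars=30, n_rows=50, edge_prob=0.3):
    """Linear-Gaussian data from a random DAG; the last variable is the target."""
    # Strictly upper-triangular mask: an edge i -> j can only exist when i < j,
    # which guarantees the graph is acyclic.
    mask = np.triu(rng.random((n_vars, n_vars)) < edge_prob, k=1)
    weights = mask * rng.normal(scale=0.5, size=(n_vars, n_vars))
    data = np.zeros((n_rows, n_vars))
    for j in range(n_vars):
        data[:, j] = data[:, :j] @ weights[:j, j] + rng.normal(size=n_rows)
    parents = np.flatnonzero(mask[:, -1])  # true parents of the target
    return data, parents


def holdout_r2(features, target, n_train=40):
    """Fit least squares on the first n_train rows; report R^2 on the rest."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design[:n_train], target[:n_train], rcond=None)
    errors = target[n_train:] - design[n_train:] @ coef
    return 1.0 - np.mean(errors ** 2) / np.var(target[n_train:])


naive_scores, parent_scores = [], []
for _ in range(200):
    data, parents = simulate_dataset()
    target = data[:, -1]
    naive_scores.append(holdout_r2(data[:, :-1], target))      # all other variables
    parent_scores.append(holdout_r2(data[:, parents], target))  # true parents only

print("median holdout R^2, all variables:", round(float(np.median(naive_scores)), 3))
print("median holdout R^2, parents only: ", round(float(np.median(parent_scores)), 3))
```

Medians are reported because holdout R^2 on ten rows is noisy and occasionally very negative. Adding a tuned lasso as a third competitor would be the natural next step for matching the comparison described in this section.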