Speed vs. Accuracy: When is Correlation Enough? When Do You Need Causation?
Often, we need fast answers with limited resources. We have to make judgements in a world full of uncertainty. We can’t measure everything. We can’t run all the experiments we’d like. You may not have the resources to model a product or the impact of a decision. How do you find a balance between finding fast answers and finding correct answers? How do you minimize uncertainty with limited resources?
There are different types of uncertainty. There is uncertainty due to having a limited number of data points (noise, or random error), uncertainty due to the math tools in your analysis (which can be random or systematic, e.g. from using a biased estimator), and uncertainty due to a lack of knowledge of how the world operates, that is, of the structure of cause and effect. This third type can arise through modeling assumptions, e.g. assumptions about which variables to include or exclude from your analysis. The first two sources of error are intimately familiar to most data scientists, but I’ve heard much less about the third. That type will be the focus of this article.
It’s fast and easy to plot individuals’ education levels against their incomes and show that the more educated someone is, the more likely they are to have a high income. You may want to tell the story “When you educate people more, they do better in life”. How do you know if this is the right story? How do you know the correlation isn’t due to unobserved common causes, like parents’ income, or simply which career someone is interested in pursuing? If either of those is the real reason for the correlation, then the right story might be more like “You do well in life when your parents take an active role in your life, both through education, and helping to shape your career”.
Is the correlative result (without an interpretive story) useful for anything, independently of whether it’s causal? Causality is usually much harder to establish than correlation, since establishing it usually takes a controlled experiment. Causality is also much more powerful. If there’s a direct causal relationship between college and income, then we can act on it: you send more people to college, and they’ll make more money. If the relationship is due entirely to unobserved common causes, then sending more people to college won’t have an effect on income. Causation is hard to find, but very powerful. Correlation is easier to find, but less powerful.
In this post, we’re going to look in-depth at when correlation gives you the right answer, and when it fails to give the right answer. We’ll see exactly where it breaks down, and develop an understanding that will let us make informed decisions, so we can balance speed and accuracy.
First, let’s look at a toy example where correlation alone is exactly the right thing to use.
One of the “spurious correlations” on this blog (which is almost certainly due to a failure to de-trend as well as small sample size) is a strong correlation between “swimming pool drownings” and “US power consumption”. Let’s let this motivate our toy example, and assume that it came about through proper analysis of a large time-series of de-trended data. Instead of drownings, let’s look at pool ownership. For our toy example (fig. 1), let’s imagine that the correlation between pool ownership and energy consumption is entirely due to confounding by wealth, through the following mechanisms: (1) wealthy people are more likely to own swimming pools; (2) wealthy people are also more likely to consume more electricity. I have no idea if these are true, but in this hypothetical world they are. Wealth, in this world, confounds pool ownership and energy consumption.
Suppose we want to sell pool toys. We’d like to send out an ad, but have limited funds to do so. Is there a way we can use our knowledge of the world to make sure we get the ads to people who are more likely to buy pool toys?
It turns out that this correlation is sufficient for building a strategy: If we want to select a population that over-indexes for pool ownership, then all we need to do is target a correlated trait. In this case, we can partner with the power utility to send our ad to customers with higher electricity bills. Note that higher energy consumption doesn’t have to cause pool ownership for this to work. We’re using the fact that people who consume more energy over-index for pool ownership to target a population.
From this example, we can develop our first rule: If you want to select a population that can be observed to over-index for a trait, then selecting based on a correlated trait is okay.
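To make the first rule concrete, here is a minimal simulation sketch of the toy world. Everything in it is an assumption made up for illustration: the share of wealthy households, the pool-ownership probabilities, the energy figures, and the helper names `simulate_household` and `pool_rate`. It only shows that selecting on the correlated trait (high energy use) produces a group that over-indexes for pool ownership, even though energy use causes nothing here.

```python
import random

def simulate_household(rng):
    """Hypothetical toy world: wealth drives both pool ownership and energy use."""
    wealthy = rng.random() < 0.3
    owns_pool = rng.random() < (0.5 if wealthy else 0.05)
    # Monthly energy use in kWh; the numbers are invented for illustration.
    energy = rng.gauss(1500 if wealthy else 900, 200)
    return wealthy, owns_pool, energy

def pool_rate(rows):
    """Fraction of households in `rows` that own a pool."""
    return sum(owns_pool for _, owns_pool, _ in rows) / len(rows)

rng = random.Random(0)
households = [simulate_household(rng) for _ in range(50_000)]

# Target the top 20% of energy users, as if the utility shared a ranked list.
energies = sorted(energy for _, _, energy in households)
cutoff = energies[int(0.8 * len(energies))]
targeted = [h for h in households if h[2] >= cutoff]

base_rate = pool_rate(households)
targeted_rate = pool_rate(targeted)
print(f"pool ownership, everyone:        {base_rate:.3f}")
print(f"pool ownership, high energy use: {targeted_rate:.3f}")
```

Note that the code never uses the `wealthy` flag when choosing whom to target. The selection works purely off the observed correlation.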
What if we want to go further? What if we want to use the correlation to build a strategy where we intervene on one observed variable to drive another? Let’s look at what an intervention looks like in these pictures, and see where correlation breaks down.
Correlation When You Need Causation
In our example in fig. 1, there’s no direct causal relationship between energy use and pool ownership: it’s absurd to suggest that getting people to consume more electricity would cause more people to own a pool (and thus boost sales of pool toys). What does it look like to “intervene” in the system? An intervention, for this example, is when you fix the value of some variable irrespective of its usual causes (you can handle more general interventions in this framework, but for this discussion we’ll just look at the “atomic” intervention of fixing one variable, independent of its usual causes). You choose people’s energy consumption levels, then do some action to set them at the chosen values (ridiculous examples might be: cut off their power, or run an extension cord from their house to a power-consuming device). I drew a picture of the “interventional” world in fig. 2. In this world, we’ve intervened to change people’s power consumption. Doing this breaks the usual causal relationships that drive energy use: energy use now takes the value we’ve chosen for it. Those relationships are what drive the correlations we find in our non-interventional (observational) data. The result is that wealth no longer correlates with energy use in this world, and so energy use no longer correlates with pool ownership.
The correlation was due to the common cause, wealth, that was present in the observational data. After the intervention, wealth no longer causes energy use, so it is no longer a common cause of energy use and pool ownership. You can read directly from the second diagram that “if you intervene to set people’s energy use (cutting the in-edges to energy use), then energy use will be uncorrelated with pool ownership (and so also with propensity to buy pool toys), because there are no causal or confounding pathways connecting the two”.
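The difference between the two pictures can be sketched in a few lines of simulation. This is an illustration under made-up assumptions (the probabilities, the energy values, and the helper names `pearson` and `sample` are all hypothetical), not an analysis of real data. In the observational world, wealth sets energy use. In the interventional world, we set energy use ourselves by coin flip, which is the code equivalent of cutting the in-edges to energy use.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def sample(rng, n, intervene=False):
    """Draw (energy, pool) pairs from the toy world, optionally under do(energy)."""
    energy, pool = [], []
    for _ in range(n):
        wealthy = rng.random() < 0.3
        owns_pool = 1.0 if rng.random() < (0.5 if wealthy else 0.05) else 0.0
        if intervene:
            # Intervention: we choose the value, ignoring wealth entirely.
            e = rng.choice([600.0, 1800.0])
        else:
            # Observation: wealth is the usual cause of energy use.
            e = rng.gauss(1500 if wealthy else 900, 200)
        energy.append(e)
        pool.append(owns_pool)
    return energy, pool

rng = random.Random(1)
obs_corr = pearson(*sample(rng, 50_000))
do_corr = pearson(*sample(rng, 50_000, intervene=True))
print(f"observational correlation:  {obs_corr:.3f}")
print(f"interventional correlation: {do_corr:.3f}")
```

With these assumed numbers, the observational correlation comes out clearly positive, while the interventional one hovers around zero, matching what the second diagram says.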
The clear, simple way of modeling interventions in a system is one aspect of causal models (these pictures) that makes them so powerful. There’s a whole “calculus of intervention” that you can apply to these models to estimate the effects of policy interventions, as well as a logical system for modeling more complex interventions.
There is a second rule that we’ve found here: If your strategy involves intervening on one of the correlated variables to change the other, then correlation alone is not sufficient. You need causation.
When Correlation is Causal
Our goal is to know when a fast, easy answer is sufficient. Correlation is fast, causation is (usually) slower. The reason for this is that correlations are found with simple observational data, while causation (usually) comes from experimental data. We would love to have the power of a causal result with the speed of a correlative result. We only get that in a special case: when observation is equivalent to intervention.
We saw before that the presence of bias due to a confounder resulted in a correlation that went away when we intervened in the system. It turns out that this is illustrative of a more general principle, which is (a special case of) the second rule of the calculus of intervention: Correlation implies causation (and vice versa) whenever there is no bias. We’ve detailed what bias looks like in the last post in this series.
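For readers who want the formal statement, the rule referred to here can be written as follows. The notation is Pearl's standard do-calculus notation and is my addition, not something defined earlier in this post: $G_{\overline{X}\,\underline{Z}}$ is the causal graph with the edges into $X$ and the edges out of $Z$ removed.

```latex
% Rule 2 of the do-calculus (exchange of action and observation):
P(y \mid do(x),\, do(z),\, w) = P(y \mid do(x),\, z,\, w)
\quad \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}}

% Special case with no other interventions and no conditioning set:
P(y \mid do(z)) = P(y \mid z)
\quad \text{if } (Y \perp Z) \text{ in } G_{\underline{Z}}
```

In the pool example, take $Z$ to be energy use and $Y$ to be pool ownership. Removing the edges out of energy use changes nothing in fig. 1, and the path through wealth still connects the two, so the condition fails and observing is not the same as intervening. This special case only speaks to confounding paths. Selection bias is a separate concern.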
This brings us to our last rule, which is necessarily fuzzy in a fast-paced, high-uncertainty context: If you need a causal result, and all you have is observational data, it’s okay to act on correlation alone if you’re sure there’s no bias. That is, estimation problems aside, you’re sure that there’s no confounding, and no selection bias.
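Here is a hedged sketch of what the last rule buys you, using an invented linear toy model (the variables `x`, `y`, `z`, the true effect of 2.0, and the confounder strength of 3.0 are all assumptions for illustration). When there is no common cause, the slope you estimate from observational data matches the slope you would get by intervening. When there is a common cause, the two disagree.

```python
import random

def slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def sample(rng, n, confounded, intervene):
    """Toy linear model: x causes y with effect 2.0; z is a possible common cause."""
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        if intervene:
            x = rng.gauss(0, 1)  # do(x): set independently of z
        else:
            x = rng.gauss(0, 1) + (z if confounded else 0.0)
        y = 2.0 * x + (3.0 * z if confounded else 0.0) + rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    return xs, ys

rng = random.Random(2)
n = 50_000
clean_obs = slope(*sample(rng, n, confounded=False, intervene=False))
clean_do = slope(*sample(rng, n, confounded=False, intervene=True))
conf_obs = slope(*sample(rng, n, confounded=True, intervene=False))
conf_do = slope(*sample(rng, n, confounded=True, intervene=True))

print(f"no confounder:   observed {clean_obs:.2f}, intervened {clean_do:.2f}")
print(f"with confounder: observed {conf_obs:.2f}, intervened {conf_do:.2f}")
```

In the confounded case the observational slope overstates the effect, because part of what it picks up is the common cause rather than anything `x` does to `y`.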
That’s a big “if”. This might be less of a useful operational rule and more of a warning: If you’re making decisions in the context of uncertainty about how the world works, this is how they can go very wrong. The rule should probably read something more like: If you have to make a decision, then ask yourself if there are any significant confounders, and if there is any significant selection bias. If the answer to either is “Yes,” then your result isn’t causal. If it’s “No,” then while you may not be finished with all of the practical aspects you have to consider, you’ve at least done due diligence to try to test whether your decision is wrong due to a fundamental misunderstanding of how the world works.
A final note: in my personal opinion, knowing how fast decisions can fail is an excellent justification for building causal diagrams for products. If you have already done the groundwork to establish a causal diagram before you need to use it, then you have fast, causal answers. It’s the best of both worlds. The diagram amounts to decision-making power. You’re aware of biases, and you can attempt to use conditioning to remove them when reporting observational results.
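As a small illustration of conditioning, here is a sketch that goes back to the pool example and assumes, unlike in the story above, that wealth is actually observed. All numbers and the helper names `pearson` and `corr_for` are hypothetical. Once we look within a single wealth group, the confounder is held fixed and the correlation between energy use and pool ownership should largely disappear.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(3)
rows = []
for _ in range(50_000):
    wealthy = rng.random() < 0.3
    owns_pool = 1.0 if rng.random() < (0.5 if wealthy else 0.05) else 0.0
    energy = rng.gauss(1500 if wealthy else 900, 200)
    rows.append((wealthy, energy, owns_pool))

def corr_for(subset):
    """Correlation between energy use and pool ownership within `subset`."""
    return pearson([e for _, e, _ in subset], [p for _, _, p in subset])

overall = corr_for(rows)
within_wealthy = corr_for([r for r in rows if r[0]])
within_other = corr_for([r for r in rows if not r[0]])

print(f"overall correlation:          {overall:.3f}")
print(f"among wealthy households:     {within_wealthy:.3f}")
print(f"among non-wealthy households: {within_other:.3f}")
```

This only works because the diagram told us which variable to condition on. Conditioning on the wrong variable can introduce bias rather than remove it.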