Correlation is not causation: but why?

Jesse Jing
5 min read · Apr 11, 2022

We have heard many times that correlation != causation, but few people bother to ask why. By nature, we tend to assume that B changes when A changes only if A causes B. Let’s see why that assumption is wrong with the help of the example below: school bullying.

Photo by Martin Olsson on Unsplash

School Bullying Example

Student X bullied student Y.

Student Y got depressed.

Student Y committed suicide.

We label the three events above as nodes A, B, and C. We want to draw a graph that illustrates the qualitative causal relations among them. Qualitative, in this case, means that we know the following:

Student X bullying student Y will make student Y depressed. But we don’t know how much bullying it takes to cause depression. (Once a week? Once a day? Every time they encounter each other?)

By omitting the quantitative relations (pinning those down usually requires more data), we can form some hypotheses (stories) about the data, test them by conditioning on some of the variables, and mine the causal relationships.

The reason we do this (slicing the data instead of conducting a controlled experiment) is primarily cost. Sometimes a large-scale controlled experiment is also impossible because you have no control over some factors, such as age or physical condition. We have to live with the data at hand.

Graphical Model and Data

Qualitative graph: a story we assume is true based on the domain knowledge.

As we can see in the above figure, if there’s an arrow from node A to node B, we say that A causes B (but since the graph is only qualitative, we don’t know how strong the effect is). If A points to B and B points to C, we also know that A causes C to some extent.

Assume that we have a dataset like this from a research facility working on suicide attempts. It has three columns corresponding to the three events above.

Fake data
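For readers who want something concrete to play with, here is a minimal sketch that generates a dataset of the same shape. It assumes the chain story A → B → C is true, and every probability in it is invented for illustration; the function name `simulate_chain` is hypothetical and none of this is the data shown in the figure.

```python
import random

def simulate_chain(n, seed=0):
    """Draw n rows (a, b, c) from a chain A -> B -> C.

    All probabilities below are made up for illustration only.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = int(rng.random() < 0.3)                    # A: bullied or not
        b = int(rng.random() < (0.7 if a else 0.1))    # B: depressed, depends only on A
        c = int(rng.random() < (0.2 if b else 0.01))   # C: suicide, depends only on B
        rows.append((a, b, c))
    return rows

rows = simulate_chain(10_000)
for row in rows[:5]:
    print(row)  # each row is (A, B, C) with 0/1 entries
```

Because C is drawn using only B, any link between A and C in this fake data has to pass through B.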

We want to answer this question:

Is the qualitative graph valid based on the data we got?

Conditioning on variables is crucial

Data can be deceptive. Check out the cholesterol example below from Prof. Judea Pearl’s book, Causal Inference in Statistics: A Primer.

On the left: cholesterol study, unsegregated; On the right: cholesterol study, segregated by age.

When we condition on age, the data support an entirely different story. Here we know for a fact that exercise leads to lower cholesterol, so we can tell which story is right. In other cases, causality is not so apparent.
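To see how pooled and segregated data can point in opposite directions, here is a small sketch with made-up numbers (they are not taken from Pearl’s book). The `groups` dictionary and the `pearson` helper are assumptions introduced only for this illustration.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical (exercise hours per week, cholesterol) pairs, keyed by age.
# Older people here exercise more AND have higher cholesterol.
groups = {
    20: [(1, 180), (2, 175), (3, 170)],
    40: [(3, 200), (4, 195), (5, 190)],
    60: [(5, 220), (6, 215), (7, 210)],
}

# Correlation inside each age group (segregated view).
within = {age: pearson(*zip(*pairs)) for age, pairs in groups.items()}

# Correlation when all ages are thrown together (unsegregated view).
pooled_pairs = [p for pairs in groups.values() for p in pairs]
pooled = pearson(*zip(*pooled_pairs))

for age, r in within.items():
    print(f"age {age}: r = {r:.2f}")
print(f"pooled: r = {pooled:.2f}")
```

Inside every age group, more exercise goes with lower cholesterol, yet the pooled correlation comes out positive because age drives both variables upward.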

Back to our bullying example: if we condition on event B (the depression level of students), does bullying still directly cause suicide attempts? Counterintuitively, it does not if our qualitative graph is right.

By conditioning on B, we block the causal path that runs from A through B to C. Once B is held fixed, the only things left that vary A and C are their exogenous variables, which by definition are independent of each other. Hence, A does NOT cause C directly: given B, A and C should be independent.
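The same argument can be written as a short derivation. This is a sketch that assumes the chain A → B → C is the true graph, with lowercase letters standing for values of the corresponding variables:

```latex
\begin{align*}
P(a, b, c) &= P(a)\, P(b \mid a)\, P(c \mid b) \\
P(c \mid a, b) &= \frac{P(a, b, c)}{P(a, b)}
               = \frac{P(a)\, P(b \mid a)\, P(c \mid b)}{P(a)\, P(b \mid a)}
               = P(c \mid b)
\end{align*}
```

Once we know b, knowing a tells us nothing more about c.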

Then we go to the dataset and group the data points by depression level. If, within each group, variable A (bullied or not) is independent of variable C (committed suicide or not), the data are consistent with our story (qualitative graph). If the grouped data contradict our graph, we know that there’s something wrong with our structural causal graph. In that case, one of the graphs below could be our new story.
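Before turning to the alternative stories, here is a minimal sketch of the grouping check just described. It runs on simulated data in which the chain A → B → C holds by construction; the probabilities and the names `simulate_chain` and `suicide_rate_by_group` are assumptions made for illustration, not a real study.

```python
import random
from collections import defaultdict

def simulate_chain(n, seed=0):
    """Fake rows (a, b, c) from a chain A -> B -> C; probabilities are invented."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = int(rng.random() < 0.3)
        b = int(rng.random() < (0.7 if a else 0.1))
        c = int(rng.random() < (0.2 if b else 0.01))
        rows.append((a, b, c))
    return rows

def suicide_rate_by_group(rows):
    """Return {(b, a): fraction of rows in that cell with c == 1}."""
    counts = defaultdict(lambda: [0, 0])  # (b, a) -> [rows in cell, rows with c == 1]
    for a, b, c in rows:
        counts[(b, a)][0] += 1
        counts[(b, a)][1] += c
    return {key: hits / total for key, (total, hits) in counts.items()}

rates = suicide_rate_by_group(simulate_chain(200_000))
for b in (0, 1):
    print(f"B={b}: P(C=1 | A=0) = {rates[(b, 0)]:.3f}, "
          f"P(C=1 | A=1) = {rates[(b, 1)]:.3f}")
```

If the chain story holds, the two rates within each depression group should be close to each other, up to sampling noise. A large gap between them would be evidence against the graph.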

Two alternative stories: represented by confounder and collider structures

A confounder is a variable that causes two other variables, which would otherwise be independent of each other; a collider is a variable that is caused by two independent variables.

The exogenous variables are omitted for clarity. Can you tell which story each graph is trying to test?

We will use the collider to show why correlation does NOT imply causation. In the right graph, we assume that event A and event B are independent of each other because there is no arrow between them. And we believe that A causes C directly and B causes C directly.

Suppose we condition on event C, meaning that we only look at the data where students haven’t committed suicide. In that case, we will suddenly find that A and B are correlated: within that subgroup, their values have to combine in a way that produces the same outcome for C.

For example, take the cases where students haven’t committed suicide. Based on naive judgment, we know that if students are depressed and bullied at the same time, they are likely to be pushed over the edge. So in this “no suicide attempts” subgroup, students are most likely either bullied or depressed, but not both, which is why they can handle the situation. (Negatively correlated!)

Thus, if we only collect data on event A and event B in this subgroup, we will see that A and B are negatively correlated, but there is NO causal relation between them.
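The collider effect is easy to reproduce in a simulation. The sketch below assumes the collider story (A → C ← B) with A and B drawn independently; the probabilities and the names `simulate_collider` and `pearson` are invented for illustration.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate_collider(n, seed=0):
    """Fake rows (a, b, c) where A and B are independent and both cause C."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = int(rng.random() < 0.3)   # A: bullied, drawn independently of B
        b = int(rng.random() < 0.3)   # B: depressed, drawn independently of A
        # Made-up risk of C: low with neither cause, high with both.
        p_c = {0: 0.01, 1: 0.10, 2: 0.80}[a + b]
        c = int(rng.random() < p_c)
        rows.append((a, b, c))
    return rows

rows = simulate_collider(100_000)

# Correlation of A and B over everyone.
r_all = pearson([a for a, _, _ in rows], [b for _, b, _ in rows])

# Correlation of A and B after conditioning on C = 0.
survivors = [(a, b) for a, b, c in rows if c == 0]
r_given_c0 = pearson([a for a, _ in survivors], [b for _, b in survivors])

print(f"corr(A, B) overall:   {r_all:.3f}")
print(f"corr(A, B) given C=0: {r_given_c0:.3f}")
```

Over the whole population the correlation should sit near zero, while inside the C = 0 subgroup it turns clearly negative, even though neither A nor B causes the other anywhere in the code.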

Reference:

http://bayes.cs.ucla.edu/jp_home.html

