Correlation vs causation — a data analyst’s perspective

5 min read · Jul 6, 2019

If something works in theory, a data analyst would be interested in whether it works in reality.

You have probably heard the phrase ‘correlation is not causation’. It is a widely used disclaimer in social science whenever a researcher dares to generalize an association observed in data into a theory about our reality. That generalization is incredibly difficult, hence the need for a disclaimer.

A quick Google search will turn up plenty of reasons why correlation does not mean causation. The most apparent is ‘coincidence’: things that happened together in one dataset will not necessarily occur together again. Another is the ‘omitted variable’: there may be an unobserved common cause that is truly responsible for the correlation. A third is ‘reverse causation’: the uncertainty about whether A caused B, B caused A, or both. The problem with these arguments is that they assume everyone already agrees on the definition of ‘causation’. They do not. If you ask the people around you what it means to say that A causes B, you will be surprised by the variety of answers. Some might say that A causes B if B is bound to happen after A happens, or if A contributes to B, or if the probability of B happening is affected by A. Then you run into similar problems with the definitions of “probability,” “affected,” “contribute,” and so on.
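The ‘omitted variable’ case is easy to reproduce with made-up numbers. The sketch below is only an illustration under stated assumptions: the variables `z`, `a`, and `b`, the noise levels, and the sample size are all hypothetical. A hidden factor Z drives both A and B, which never influence each other, and the partial correlation formula is used to see what is left once Z is accounted for.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(r_ab, r_az, r_bz):
    """Correlation between A and B after removing the linear effect of Z."""
    return (r_ab - r_az * r_bz) / math.sqrt((1 - r_az ** 2) * (1 - r_bz ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical data: Z drives both A and B; A and B never affect each other.
z = [rng.gauss(0, 1) for _ in range(n)]
a = [zi + rng.gauss(0, 1) for zi in z]
b = [zi + rng.gauss(0, 1) for zi in z]

r_ab = pearson(a, b)
r_ab_given_z = partial_corr(r_ab, pearson(a, z), pearson(b, z))

print(f"corr(A, B)             = {r_ab:.2f}")
print(f"corr(A, B) holding Z   = {r_ab_given_z:.2f}")
```

With this setup the raw correlation should land near 0.5, while the partial correlation should sit close to zero. An analyst who never observes Z sees only the first number.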

What do we mean by causation?

It is not that complicated: causation can only be assumed, while correlation can be estimated. That might sound useless at first. Of course every piece of human knowledge is assumed, because we can never be completely sure about this world, so what? Well, it is still essential to understand whether we are dealing with the uncertainty or the validity of something, because you cannot deal with both at the same time. The real world has uncertainty built in; we cannot just assume it away. That leaves us with one option: assume the validity of a theory until evidence suggests otherwise. This type of thinking is what made René Descartes the father of modern Western philosophy. Science is ultimately the process of proving or disproving hypotheses about this world using evidence and inference. Causation, therefore, is no more than a hypothesis waiting to be proved or disproved by evidence. However, we have to be careful about how we come up with the hypothesis and how we establish the evidence.

The structural approach

To come up with the hypothesis properly, we need what is called a structural approach. It starts with general assumptions that are as indisputable as possible and uses mathematical logic to deduce the relationship between X and Y. The deductive logic guarantees the validity of the result as long as the assumptions are valid. A typical example is geometry. We start with simple assumptions like “the shortest path between two points is a straight line” or “two parallel lines never cross,” and arrive at complicated and useful results with many real-life applications. In other words, the structural approach establishes rules based on rules. Its purpose is to 1) reduce the number of assumptions we need to establish a rule; and 2) extend or generalize the established rule to a wider range of subjects. The caveat is that you always need to start with some assumptions, and the validity of these starting assumptions is crucial. Unfortunately, not many assumptions can be regarded as self-evident in the real world.

The structural approach is deductive, i.e. it goes from general to specific. So it has global validity. Its validity relies on the validity of its starting assumptions and the correctness of the mathematical deduction/proof. The structural approach emphasizes the existence of a causal mechanism in theory — whether the relationship can be accurately quantified is secondary. A typical mistake with the structural approach is to use the mathematical validity of the deductive proof to endorse the existence of the causal relationship, without acknowledging the dependence on the initial assumptions.

A structural model, no matter how complex, can fall apart completely if the starting assumptions are proved wrong

The empirical approach

Next, we need the empirical approach to ‘test’ whether the hypothesized relationship or rule can be reproduced in a controlled environment, usually a lab. However, there are many factors one cannot control, especially in social science. So people came up with quasi-experimental designs that make all other factors ‘random’. The argument is that if the average relationship between X and Y holds when everything else varies randomly, then the same association would be observed if we were able to control the uncontrollable factors in a lab setup. For that reason, the gold standard for causation is the randomized controlled experiment. The term ‘randomized’ needs to apply to everything. For example, in a typical test of the effectiveness of a medical treatment, the doctors cannot know which patients are treated, and the patients cannot know whether they are treated, to make sure that the assignment to treatment and control groups is random.
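The argument for randomization can be sketched with a small simulation. Everything here is an assumption made for illustration: the hidden factor `health`, the built-in `TRUE_EFFECT` of 1.0, the coefficients, and the sample size. In the observational version, healthier people opt into treatment more often; in the randomized version, a coin flip decides.

```python
import random

TRUE_EFFECT = 1.0  # assumed treatment effect built into the simulation


def mean(values):
    return sum(values) / len(values)


def difference_in_means(treated_flags, outcomes):
    """Average outcome of the treated minus average outcome of the untreated."""
    treated = [y for t, y in zip(treated_flags, outcomes) if t]
    control = [y for t, y in zip(treated_flags, outcomes) if not t]
    return mean(treated) - mean(control)


def simulate(randomized, n=20000, seed=1):
    """Return the naive treated-vs-untreated estimate for one simulated study."""
    rng = random.Random(seed)
    flags, outcomes = [], []
    for _ in range(n):
        health = rng.gauss(0, 1)  # hidden factor the analyst never observes
        if randomized:
            treated = rng.random() < 0.5  # coin flip, unrelated to health
        else:
            treated = health + rng.gauss(0, 1) > 0  # healthier people opt in more often
        outcome = TRUE_EFFECT * treated + 2.0 * health + rng.gauss(0, 1)
        flags.append(treated)
        outcomes.append(outcome)
    return difference_in_means(flags, outcomes)


observational_estimate = simulate(randomized=False)
randomized_estimate = simulate(randomized=True)

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"observational estimate: {observational_estimate:.2f}")
print(f"randomized estimate:    {randomized_estimate:.2f}")
```

Under these assumptions the observational estimate should be inflated well above the true effect, because it mixes the treatment with the health advantage of those who opted in. The randomized estimate should land close to the true effect without `health` ever being measured.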

The empirical approach is inductive, i.e. it goes from specific to general. It has only local validity. Its validity relies on the statistical significance of the estimated relationship, conditional on all third-party factors being fully controlled or ‘randomized’. The empirical approach focuses on whether we can obtain an accurate measure of the relationship in a controlled (randomized) environment; whether the causal mechanism can be derived from theory is secondary. A typical mistake with the empirical approach is to use the accuracy of the estimated statistical relationship to prove the existence of causality, without showing that all third-party factors were controlled or randomized and that reverse causation was ruled out.

A piece of empirical evidence, no matter how stable and strong it appears, cannot be extrapolated too far away from the experiment domain
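A toy example of what ‘local validity’ means in practice, under assumptions chosen only for illustration: suppose the true relationship is quadratic (the hypothetical `true_relationship` below), but the experiment only covers x between 0 and 1, where a straight line fits reasonably well.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def true_relationship(x):
    # Assumed for illustration; the analyst fitting the line does not know this.
    return x ** 2


# The "experiment domain": 101 evenly spaced points between 0 and 1.
xs = [i / 100 for i in range(101)]
ys = [true_relationship(x) for x in xs]
intercept, slope = fit_line(xs, ys)


def predict(x):
    return intercept + slope * x


# Worst-case error inside the experiment domain vs. error far outside it.
error_inside = max(abs(predict(x) - true_relationship(x)) for x in xs)
error_outside = abs(predict(5.0) - true_relationship(5.0))

print(f"largest error for x in [0, 1]: {error_inside:.2f}")
print(f"error at x = 5:                {error_outside:.2f}")
```

The fitted line is a decent summary inside the range that was actually observed, and a poor one at x = 5. Nothing in the in-domain fit warns about that; only a structural argument about the shape of the relationship could.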

Summary

Ideally, we would require both a structural hypothesis and empirical evidence to establish a causal relationship. The blackbody radiation formula, for example, was initially obtained empirically (i.e. a formula derived to fit experimental outcomes) and only later shown to be a consequence of quantum theory. In real-world data analytics, most assumptions are formed ad hoc, i.e. not from a structural point of view, and most so-called ‘empirical evidence’ has poor control over third-factor variables or a small sample size. In these scenarios, a clear understanding of the structural and empirical approaches is very helpful for a data analyst navigating between causation and correlation.

Written by Tony Liu