One of the most commonly repeated maxims in science is that correlation is not causation. And there is no shortage of examples demonstrating why. One of the most famous is the case of hormone replacement therapy, which was studied by numerous epidemiologists at the end of the last century.
These studies showed that women who took hormone replacement therapy had less chance of developing heart disease. Naturally, doctors suggested that hormone replacement therapy somehow protected against heart disease.
That turned out to be an erroneous conclusion. Later studies showed that women who took hormone replacement therapy were more likely to be from higher socio-economic groups with higher incomes, better diets and generally healthier outcomes. It was this that caused the correlation the earlier studies had found. By contrast, proper randomised controlled trials showed that hormone replacement therapy actually increased the risk of heart disease.
In the absence of controlled trials, statisticians have widely assumed that it is impossible to determine cause and effect from an observed correlation alone. Does Y cause X or X cause Y? The apparent symmetry of this scenario seems to exclude the possibility that any statistical test could tease them apart.
But in the last few years, statisticians have begun to explore a number of ways to solve this problem. They say that in certain circumstances it is indeed possible to determine cause and effect based only on the observational data.
At first sight, that sounds like a dangerous statement. But today Joris Mooij at the University of Amsterdam in the Netherlands and a few pals show just how effective this new approach can be by applying it to a wide range of real and synthetic datasets. Their remarkable conclusion is that it is indeed possible to separate cause and effect in this way.
Mooij and co confine themselves to the simple case of data associated with two variables, X and Y. A real-life example might be a set of data of measured wind speed, X, and another set showing the rotational speed of a wind turbine, Y.
These datasets are clearly correlated. But which is the cause and which the effect? Without access to a controlled experiment, it is easy to imagine that it is impossible to tell.
The basis of the new approach is to assume that the relationship between X and Y is not symmetrical. In particular, Mooij and co say that in any set of measurements there will always be noise from various causes. The key assumption is that the pattern of noise in the cause will be different to the pattern of noise in the effect. That’s because any noise in X can have an influence on Y but not vice versa.
So the datasets should reflect this. The task for a statistician is to develop a statistical test that can tell the difference.
Just such a test exists and is known as the additive noise model. This assumes that each dataset is made up of the relevant data as well as various sources of noise, with the effect given by some function of the cause plus noise that is independent of the cause. Statisticians have shown that when this process is nonlinear, it can allow them to determine the direction of cause-and-effect.
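In symbols, the additive noise model for two variables is usually written along the following lines. The function f and the noise terms N_Y and N_X are notation assumed here for illustration and do not appear in the text above.

```latex
% If X causes Y, the effect is a function of the cause plus independent noise:
Y = f(X) + N_Y, \qquad N_Y \perp X

% For nonlinear f there is, in general, no function g that gives a matching
% model in the reverse direction with noise independent of Y:
X = g(Y) + N_X, \qquad N_X \perp Y
```

The asymmetry is what makes the direction identifiable. A well-known exception is a linear f with Gaussian noise, where both directions fit equally well and the test cannot decide.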
Now Mooij and co have tested how well this works on a wide variety of different datasets that they have compiled for just this purpose. Each dataset consists of samples of a pair of statistically dependent random variables where one variable is known to cause the other. The challenge is to identify which of the variables is the cause and which the effect.
In total, Mooij and co have collected 88 datasets of cause-and-effect pairs from more than 30 different areas of science. For example, they have measurements of altitude and mean annual temperature for more than 300 weather stations in Germany. Mooij and co say it is obvious that altitude causes temperature rather than the other way round but determining this from the data is far from straightforward.
Another dataset relates to the daily snowfall at Whistler in Canada and contains measurements of temperature and the total amount of snow. Obviously temperature is one of the causes of the total amount of snow rather than the other way round.
Another dataset relates to the cost of rooms and apartments for students and consists of the size of the apartment and the monthly rent. Again it is obvious that the size causes the rent and not vice versa. In total there are 88 different datasets along with the ground truth knowledge of which variable is the cause and which the effect.
Mooij and co then use the additive noise model to work out which of the variables in each case is the cause and which is the effect.
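A rough idea of how such a test can work is sketched below in Python. This is a minimal illustration under stated assumptions, not the procedure Mooij and co actually use: it swaps in a polynomial fit for the regression step and a simple kernel dependence score (HSIC with Gaussian kernels) for the independence test, and the names hsic, residual_dependence and infer_direction are hypothetical. The idea is to fit the model in both directions and prefer the direction in which the leftover noise looks less dependent on the supposed cause.

```python
import numpy as np


def hsic(a, b):
    """Biased HSIC dependence score with Gaussian kernels (larger = more dependent)."""
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        # Median heuristic for the kernel bandwidth (an assumption of this sketch).
        return np.exp(-d2 / np.median(d2[d2 > 0]))

    n = len(a)
    centering = np.eye(n) - np.ones((n, n)) / n
    return np.trace(gram(a) @ centering @ gram(b) @ centering) / n**2


def residual_dependence(cause, effect, degree=5):
    """Fit effect = f(cause) + noise, then score how dependent the noise is on the cause."""
    # A polynomial stands in for a more flexible nonparametric regression.
    coeffs = np.polyfit(cause, effect, degree)
    residuals = effect - np.polyval(coeffs, cause)
    return hsic(cause, residuals)


def infer_direction(x, y):
    """Return 'X->Y' or 'Y->X', whichever leaves the less dependent residuals."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    score_xy = residual_dependence(x, y)
    score_yx = residual_dependence(y, x)
    return "X->Y" if score_xy < score_yx else "Y->X"


# Synthetic example in which X is the cause by construction.
rng = np.random.default_rng(0)
x_demo = rng.uniform(-2.0, 2.0, size=500)
y_demo = x_demo**3 + rng.uniform(-1.0, 1.0, size=500)
print(infer_direction(x_demo, y_demo))
```

In the synthetic example the noise added to Y is independent of X, so the forward fit leaves residuals that carry little information about X. In the reverse direction the spread of the residuals changes with Y, which the dependence score picks up. Real data are rarely this clean, which is why a benchmark of pairs with known ground truth is needed to judge the method.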
The results make for interesting reading. They say the additive noise model is up to 80 per cent accurate in correctly determining cause-and-effect. And they say that the method is robust against small perturbations of the data that can arise from the way it is handled statistically. “Our empirical results provide evidence that additive-noise methods are indeed able to distinguish cause from effect using only purely observational data,” they conclude.
That’s a fascinating outcome. It means that statisticians have good reason to question the received wisdom that it is impossible to determine cause and effect from observational data alone.
It’s worth pointing out that this applies only in the very simple situation in which one variable causes the other. There are, of course, plenty of much more complex scenarios where this method will not be so fruitful.
Nevertheless, this is likely to be a powerful new addition to a statistician’s armoury. There are many situations in science where controlled experiments are simply not possible because they are too expensive, unethical or technically impossible. In those situations, the additive noise model could be a game-changer.
Ref: arxiv.org/abs/1412.3773 : Distinguishing Cause From Effect Using Observational Data: Methods And Benchmarks