Analytics. Statistics. Data Science. Medical Research.
Wet Pavement Causes Rain
Association and Causation
Background
There is a famous maxim that is used quite frequently in the world of data analytics:
Correlation is not causation.
Or is it?! … Or is it close enough?!
You may think that this is a well-worn topic, and it is, but I wanted to add my voice on Medium about this dictum because misconceptions about it, and misuses of it, surface on a regular basis. In fact, perhaps there is a growing movement that does not believe this maxim to be true. For example, a Partner and Data Science expert from a major consulting company visited Lilly when I worked there just a couple of years ago and stated emphatically, “We do not need causation anymore. Correlation is enough with big data.”
In my first article (Analytics, Data Science and Statistics) I stated that
Statistics is the science of quantifying what is likely to be true.
Who doesn’t want to know the truth?!
- Do amyloid-beta protein tangles in the brain cause Alzheimer’s Disease?
- Did an advertising message cause an increase in sales last month?
- Are these the correct variables/descriptors to put into a model to give reliable predictions of flu outbreaks (or in a more contemporary setting COVID-19, though I will resist jumping on that bandwagon)?
So, let’s dig in with a thoughtful examination of this issue. I will first start with a story.
A Personal Illustration
This story starts with an enormous database at my house. I like to measure all sorts of things — as much as I possibly can — about my house and my surroundings. Everything imaginable. The database includes digital readings of temperature inside and out, energy consumption, individual lights on and off, a motion feed from my Nest, how my lawn is doing, whether my downspouts are leaking, etc. Everything digital I have and more. I then pull in other contemporaneous information on local weather and environmental factors (e.g. sunrise/sunset, humidity, pollen counts, etc.). Finally, I include information about my side business (a small local retail store) and some other neighborhood information on crime, car accidents, etc. OK, you get the idea. It turns out that there are 4723 measurements, variables, features (or whatever your branch of Analytics likes to call them) in the database. OK, now here’s the nerdy part … I have done these measurements every 5 minutes for every hour of every day for the last 10 years. That’s 1,051,200 independent observations in the database!
When analyzing that data, I find many very interesting correlations (i.e. associations), patterns and relationships in the data. For this article I will share one cluster of such correlations/associations with you.
- Green grass on my lawn is associated with wet pavements.
- Umbrella sales at my store are associated with my green grass.
- Umbrella sales are also associated with car accidents around my neighborhood.
- In fact, all these observations — wet pavement, green grass, umbrella sales and car accidents — are associated with each other.
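This whole cluster of associations is easy to reproduce in a simulation. Here is a minimal sketch in Python, with entirely made-up effect sizes, in which a single hidden “rain” variable drives four otherwise unrelated measurements:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # hypothetical number of observations

# The hidden common cause: did it rain? (deliberately NOT in our "database")
rain = (rng.random(n) < 0.3).astype(float)

# Each observed variable is driven by rain plus its own independent noise.
# All coefficients are invented purely for illustration.
wet_pavement   = 1.0 * rain + rng.normal(0, 0.3, n)
green_grass    = 0.8 * rain + rng.normal(0, 0.3, n)
umbrella_sales = 5.0 * rain + rng.normal(0, 2.0, n)
car_accidents  = 2.0 * rain + rng.normal(0, 1.0, n)

# Every pair is strongly correlated, yet none of these variables
# causes any other -- rain drives them all.
corr = np.corrcoef([wet_pavement, green_grass, umbrella_sales, car_accidents])
print(corr.round(2))
```

Every off-diagonal correlation comes out strong, even though intervening on any one of the four variables would leave the others completely unchanged.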
So, armed with these very clear and compelling associations, I decided to put umbrellas on sale at my store and take a loss on that item as a public service to reduce the number of car accidents in my neighborhood.
Furthermore, I petitioned our local government to put in an ordinance that would require me and my neighbors to use less water on our lawns in the summer so that the grass would turn brown and the sidewalks would be dry … thus reducing car accidents in our neighborhood. The local government was suspicious of my idea and recommendation, but they invited me to the next meeting to explain.
At the next Township Council meeting I showed the Council that I did not come by these associations lightly. I have a very large database, so I explained the power of splitting up large datasets so as to weed out the spurious findings from the real (i.e. “true”) findings.
- 600,000 randomly selected observations for exploring my data
- 200,000 for testing and
- 200,000 for validation of my findings
And, yep, sure enough the associations from my quick and dirty analysis of the exploratory data set held up throughout validation!
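Of course they held up. Here is a hedged sketch of why (the effect sizes are invented): fabricate umbrella sales and car accidents from the same hidden rain variable, perform the 600k/200k/200k split, and the spurious correlation “replicates” in every split.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # stand-in for the 1,051,200 rows in the story

rain = (rng.random(n) < 0.3).astype(float)       # hidden common cause
umbrella_sales = 5.0 * rain + rng.normal(0, 2.0, n)
car_accidents  = 2.0 * rain + rng.normal(0, 1.0, n)

# Shuffle and split 600k / 200k / 200k, as in the story
idx = rng.permutation(n)
splits = {"explore":  idx[:600_000],
          "test":     idx[600_000:800_000],
          "validate": idx[800_000:]}

for name, part in splits.items():
    r = np.corrcoef(umbrella_sales[part], car_accidents[part])[0, 1]
    print(f"{name:>8}: r = {r:.3f}")
# The correlation "replicates" in every split because all three splits
# share the same hidden cause; replication within one dataset is not
# evidence of causation.
```

Splitting protects against overfitting noise, not against a common cause that permeates the entire dataset.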
Moreover, and here’s the amazing thing, my neighbors on either side of me agreed with me 10 years ago to start collecting this data. So, I went to their independent databases and looked at the associations. Amazingly, I found the same associations with the same correlation coefficient in BOTH of their datasets. After such in-depth and compelling arguments — and much to the chagrin of all my neighbors — the water restriction ordinance was passed by our Township Council to improve public safety.
NOT SURPRISINGLY, I lost money on my umbrella sales at my store and the neighbors’ lawns looked a mess by August … but SURPRISINGLY, the accident rate in our neighborhood remained the same!!!
What the …?
OK, I should have written this story on April 1st! Perhaps the fact that Nest is not quite 10 years old was a tip-off that the story was … well, just a story.
The above story is trite. It is obvious that the underlying cause for green grass, wet pavements, umbrella sales and car accidents is rain. Tinkering with green grass, wet pavement and umbrella sales and expecting a different outcome on car accidents is ludicrous. No matter how fictitious, this story does highlight why I am skeptical whenever I read about an association. There is an abundance of such articles, even in esteemed scientific journals, which have, presumably, as their raison d’etre the discernment of truth — cause and effect.
Side Note 1: Understanding “independent.” I have seen articles published that make statements like, “6 million independent observations on 70,000 patients.” Obviously, there are repeated measures on the same patients and these observations are correlated, not independent — WHICH MATTERS A WHOLE LOT in how to do the analysis and how to interpret the findings. Also, the “independent” databases from my neighbors are clearly not independent since we share the same local environment etc. Again, using a separate database for confirmation of a finding may not really be an independent confirmation of that finding.
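To see how much this matters, here is a back-of-the-envelope calculation using the standard design-effect formula for clustered data, n_eff = n / (1 + (m − 1)ρ), where m is the number of observations per patient and ρ is the within-patient correlation. The ρ values below are assumed purely for illustration:

```python
# Design effect for repeated measures: "6 million observations on
# 70,000 patients" is about 86 observations per patient. With
# within-patient correlation rho, the effective sample size is
# n_eff = n / (1 + (m - 1) * rho). The rho values are illustrative.
n_obs, n_patients = 6_000_000, 70_000
m = n_obs / n_patients  # about 86 observations per patient

for rho in (0.1, 0.3, 0.5):
    n_eff = n_obs / (1 + (m - 1) * rho)
    print(f"rho = {rho}: effective sample size ~ {n_eff:,.0f}")
```

Even a modest within-patient correlation of 0.1 shrinks the “6 million” to roughly 600,000 effectively independent observations, and ρ = 0.5 shrinks it to well under 150,000.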
Finding Associations
During a visit to my company, a very reputable Professor at a major US university, widely recognized in data science (especially as applied to healthcare), made the following statement: “Give me a large enough dataset and I guarantee I can find the patterns in it.” Some of my colleagues shrugged their shoulders, and rightfully so. Lots of us can do that … and quite easily. But that is neither the question nor the answer we seek.
Playing off my first article in Medium, I will use the following couplet:
Finding associations is very easy.
Proving cause and effect is very hard.
There is no way I can investigate all of the “associations” I see published, but on occasion I do perform a deep dive just out of my own curiosity and frankly out of my own amazement with “how does this stuff get published?” What I also see is that somewhere in such “association” articles there is a rightful mention that the study findings may not imply cause and effect. Yet, I also see vast numbers of such publications with an often (not so) subtle conclusion of cause and effect.
Side Note 2: If you are interested in some of my deep dives, they can be found at my blog AnalytixThinking.Blog. I will note that they are not highly mathematical with equations and complex derivations, but they are more statistical in terms of their inferential logic, which I recognize can stretch some people’s brains more than a dizzying array of mathematical formulas.
Example 1
The Lancet Psychiatry published an article entitled “Association of disrupted circadian rhythmicity with mood disorders, subjective wellbeing, and cognitive function: a cross-sectional study of 91,105 participants from the UK Biobank.” https://www.thelancet.com/journals/lanpsy/article/PIIS2215-0366(18)30139-1/fulltext
The key findings of the study and the concise summary provided by the authors states:
“Interpretation: Circadian disruption is reliably associated with various adverse mental health and wellbeing outcomes, including major depressive disorder and bipolar disorder. Lower relative amplitude might be linked to increased susceptibility to mood disorders.”
Poor sleep is reliably associated with adverse mental health.
Which comes first?
My (brief) perspective after studying this article:
Let A = circadian disruption (quantified by activity/motion/amplitude) and B = adverse mental health and wellbeing (they examined a long list of mental health conditions and performance measures). So, A is associated with B. However, the use of the word “outcomes” with B gives an implication that A comes first and then B is the “outcome,” which connotes a causal link. The Interpretation could have just as reasonably been written to say B is associated with A; that is, adverse mental health and wellbeing is associated with circadian disruption (outcomes?). And in fact, there may be a third, unmeasured, separate factor (e.g. some personal trauma like a death in the family) that causes both mental problems and circadian disruption, and there is no causal link at all between A and B.
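The third-factor scenario is easy to simulate. In this minimal sketch (all coefficients hypothetical), an unmeasured factor C drives both A and B; marginally A and B look reliably associated, but regressing C out of both removes the association entirely:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# C: an unmeasured third factor (e.g. a personal trauma); the effect
# sizes are invented, chosen only to illustrate the logic.
C = rng.normal(0, 1, n)
A = 0.8 * C + rng.normal(0, 1, n)  # "circadian disruption"
B = 0.8 * C + rng.normal(0, 1, n)  # "adverse mental health score"

# Marginally, A and B look reliably associated...
r_marginal = np.corrcoef(A, B)[0, 1]
print(f"corr(A, B) = {r_marginal:.2f}")

# ...but removing the linear effect of C from both (slopes via polyfit;
# the means here are ~0, so the intercepts are negligible) leaves
# essentially zero association, because the only link between A and B
# runs through C.
A_resid = A - np.polyfit(C, A, 1)[0] * C
B_resid = B - np.polyfit(C, B, 1)[0] * C
r_partial = np.corrcoef(A_resid, B_resid)[0, 1]
print(f"corr(A, B | C) = {r_partial:.2f}")
```

A cross-sectional association between A and B is equally consistent with A→B, B→A, or C→A and C→B with no arrow between A and B at all.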
There are more statistical issues with this article that also apply to “association studies,” but for the sake of brevity I will defer them here and refer you to my Blog #2 “Association, Correlation and Causation.”
Example 2
PLoS Medicine published an article entitled “Association of moderate alcohol intake with in vivo amyloid-beta deposition in human brain: A cross-sectional study.”
https://www.ncbi.nlm.nih.gov/pubmed/32097439
[Note 3: SD = standard drinks, which is a way to equilibrate beer, wine and spirits into standard alcohol consumption units.]
[Note 4: Amyloid-beta (Aβ) is a protein that accumulates in the brains of Alzheimer’s Disease (AD) patients.]
Now, read on.
The main conclusion of the study is as follows: “A moderate lifetime alcohol intake (1–13 SDs/week) was significantly associated with a lower Aβ positivity rate compared to the no drinking group, even after controlling for potential confounders (odds ratio 0.341, 95% confidence interval 0.163–0.714, p=0.004).”
[Note 5: The NYT published this online headline: “Moderate Drinking Tied to Lower Levels of Alzheimer’s Brain Protein,” with their take-away message of, “Compared with abstainers, those who drank up to 13 standard drinks a week had a 66 percent lower rate of beta amyloid deposits in their brains.”]
Alcohol intake is significantly associated with Alzheimer’s brain plaques.
Or … Does Alzheimer’s drive me to drink?
My (brief) perspective after studying this article:
As with many authors of association studies, these authors state, “causal relationships cannot be inferred from the findings.” Yet they also state, “the present findings … suggest that moderate lifetime alcohol intake may have some beneficial influence on AD …” The phrase “beneficial influence” is a direct implication of cause and effect. And elsewhere they state or imply a cause and effect relationship: “the protective effects of moderate alcohol intake against Aβ pathology involve the chronic effects [of alcohol] associated with long-term exposure.” The phrase “protective effects of moderate alcohol intake” presumes cause and effect. Lastly on this point, even if there is a cause and effect relationship, we never know from such a study which direction the causal arrow points. Does better cognitive/brain health allow one to socialize more, resulting in more drinking?
There are many logical shortcomings in this article. For example, the authors conclude that there is no effect of 0 or 1 SD/wk, then a dramatic effect of 1–13 SD/wk, and then no effect of excessive alcohol, 14+ SD/wk. Seems unlikely … and by the way some definitions of alcohol abuse syndrome include 7 SD/wk for women and 14 SD/wk for men!
There are also statistical flaws; for example, the low alcohol consumption group has only 16 people in it! ANY scientific or statistical inference on such a small sample lacks credibility, let alone on such a complex phenomenon as AD. See my Blog #16: Beware Caesar for more details.
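To see just how fragile n = 16 is, consider the width of a simple 95% confidence interval for a proportion in a group that small. The 4-of-16 count below is hypothetical, chosen only for illustration:

```python
import math

# With only 16 people in a group, even a simple proportion comes with
# an enormous confidence interval. Suppose 4 of the 16 were
# amyloid-positive (25%); the count is hypothetical, for illustration.
n, x = 16, 4
p = x / n

# Normal-approximation (Wald) 95% confidence interval
se = math.sqrt(p * (1 - p) / n)
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"estimate {p:.0%}, 95% CI roughly ({lo:.0%}, {hi:.0%})")
```

The interval runs from roughly 4% to 46%: an estimate from 16 people is compatible with almost any underlying rate, which is why comparisons anchored on such a group lack credibility.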
Example 3
And then, in 2009, there is the article in the prestigious journal Nature entitled, “Detecting influenza epidemics using search engine query data.” https://www.nature.com/articles/nature07634
The primary conclusion of that publication was “… [by] analyzing large numbers of Google queries … the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with flu-like symptoms, we can accurately estimate the current level of weekly influenza activity …”
Google queries correlated with physician visits for flu.
Are Google queries predictive of flu outbreaks?
My perspective on this publication:
Well, I didn’t need to do an in-depth analysis. Others beat me to it. Ultimately, the Google Flu models did not do a good job of prediction. In the equally prestigious journal Science in 2014, some other researchers published, “The Parable of Google Flu: Traps in Big Data Analysis.” They state, “Large errors in flu predictions were largely avoidable, which offers lessons for the use of big data.” Furthermore, these authors noted, “GFT [Google Flu Trends] overlooks considerable information that could be extracted by traditional statistical methods.” It’s worth a read by anyone who is doing Analytics, whether in Statistics, Data Science, Econometrics, modeling of all types, etc. https://science.sciencemag.org/content/343/6176/1203
Interestingly, as of this writing, when doing a Google search of these articles, the reported citations are 3928 for the original article and 1892 for the latter. Apparently, there are 2036 researchers who still believe the original work.
Lastly, just to give some idea of how prevalent “association studies” are (at least in the medical literature), I will point you to the October 2019 issue of JAMA Pediatrics, in which half of the research articles and research letters have the word “association” in the title!
I could go on, but you get the idea.
Conclusion
Coffee consumption is associated with longer life. Or was it shorter? Or was it red wine? Or eggs? I’m confused. Anyway, …
There is absolutely a place in the Data Analytical Sciences for exploratory analysis and the freelance search for patterns in data. Such endeavors can help the scientific and business community gain potential insights into our world in the pursuit of “the truth.” However, they should be viewed very skeptically and carefully. Perhaps one of the most celebrated associations that turned out to be completely meaningless was the relationship between ice cream consumption and polio in the late 1940s (see AnalytixThinking.blog on Associations, Correlations and Causation). The modern version of this is playing out before our very eyes with the question as to whether the amyloid-beta plaques in the brain are the cause of Alzheimer’s Disease or the effect of some other underlying pathology. See this article on a very recently failed clinical trial and its implication for the beleaguered amyloid hypothesis.
If Data Analytical Scientists want to put data to good use, producing lots and lots of associations is not very helpful. Producing a few good leads as to what might be useful in the data is a start, but as I noted in my first Medium article (Analytics, Data Science and Statistics), quantifying (or at least qualifying) the likelihood that the associations are real and also actionable in terms of cause and effect is imperative. We should convey some understanding of potential cause and effect relationships or otherwise we will be advising our business or science partners/colleagues to put umbrellas on sale to reduce car accidents.
It is the pursuit of what is “true” that inextricably links Statistics, and I believe all areas of Analytical Sciences, to the scientific method, which is fundamentally about discovering cause and effect relationships in our natural world. If we call ourselves Data Analytical Scientists, then we should live up to the name. If we do not pursue cause and effect and settle for the easy way out of describing associations, then we might as well be relegated to the back office and called “number crunchers.”