Google — Spurious Correlations

Analytic Deception — Two Visually Correlated Lines

Or How One Of The Most Compelling Analytic Graphics Is Destroying Science

Published Sep 18, 2017 · 5 min read

Every day my Twitter and Facebook feeds are filled with “proof” in the form of two correlated lines on a graph. While this may be proof that I need better ‘friends’… actually, it is not even that. It may be anecdotal evidence that I need better ‘friends’. It is certainly anecdotal evidence that my ‘friends’ fall prey to anecdotal evidence. What it certainly is NOT is proof.

You Can Correlate Anything!

This is doubly true when you rely solely on visual methods of comparison. Seriously, the examples I found in just a few minutes with a Google image search for the keyword “Spurious Correlations” barely touch on this issue. Most of the examples above fall into the category of coincidence: compare enough unrelated series and some pair will line up by chance alone. They exemplify the problem with “big data” and people’s limited understanding of “statistical significance”.
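To make the coincidence point concrete, here is a minimal sketch, assuming nothing but random noise as input. The helper names (`pearson`, `strongest_chance_pair`) and the sizes (200 series of 10 points each) are my own illustrative choices, not taken from any of the charts above. It generates unrelated series and then goes looking for the best-matching pair, which is roughly what a search through a big enough data set does.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_pair(n_series=200, n_points=10, seed=0):
    """Build unrelated noise series; return (|r|, i, j) for the best-matching pair."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]
    return max(
        (abs(pearson(series[i], series[j])), i, j)
        for i, j in itertools.combinations(range(n_series), 2)
    )


best_r, i, j = strongest_chance_pair()
print(f"noise series {i} and {j} correlate at |r| = {best_r:.2f}")
```

With roughly twenty thousand pairs to choose from, a strong-looking match is almost guaranteed even though every series is pure noise. Ten yearly data points, like many of these charts have, make it even easier.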

The lower right graph “adds” a new twist. If I am “goal seeking”, I can start aggregating categories into one of my populations until the line fits. In this example, that means adding together steam, hot vapors, and hot objects. It seems reasonable enough on the surface, but did they exclude hot liquids? If so… still so reasonable?

People Are Easily Distracted… Especially Visually

Many of you will miss my last comment because you were too distracted by Miss America. In that case the distraction has a logical cause, but I can toss an image over here on the left and compound the problem.

Visual correlation is an abstraction. It is visually stimulating, but that is precisely because people either gloss over the details or focus on the wrong things. Data visualizations also have their own analogues of heels, make-up, and hair extensions.

Miss America is actually far more pure than most underlying data. We need a better analogy!

Digital Manipulation

Visualizations are data models. With today’s technology, so are the models paid to step in front of a camera. This is an old story. You have likely seen the before and after videos all over YouTube, Vimeo, and whatever else kids are watching these days. Photoshop can turn anyone into a supermodel… or Santa Claus. Time series data is even easier to alter!

How is it done?

When I present you with two lines on a graph, I am obscuring a world of potential manipulation. Check that: I am not. When someone presents you with two lines on a graph, they are likely obscuring a world of potential manipulation.

First, your focus (by default) is now on the pattern of the lines, not on the more important components of the analytic exercise, which are at best minor footnotes and labels and at worst left out entirely.

  • Someone has selected these two sets of underlying data.
  • They defined a population to measure.
  • They defined a measurement.
  • They defined a sampling or collection technique.
  • They filtered, weighted, averaged, or aggregated.
  • They chose a start and end point.
  • They chose units, rates, positive and negative positioning.
  • They chose the axis types and levels of distortion.
  • They chose whether to ignore or utilize lagging, smoothing, and curve fitting.
  • They chose the time series units.
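Take just one item from that list, the start and end point. The sketch below is an illustration under my own assumptions: two made-up, unrelated noise series and a hypothetical helper called `best_window`. It scans every contiguous window of at least eight points and keeps whichever one correlates best.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_window(xs, ys, min_len=8):
    """Return (r, start, end) for the contiguous slice with the highest correlation."""
    n = len(xs)
    return max(
        (pearson(xs[s:e], ys[s:e]), s, e)
        for s in range(n - min_len + 1)
        for e in range(s + min_len, n + 1)
    )


# Two unrelated, made-up series of 40 points each.
rng = random.Random(1)
a = [rng.gauss(0, 1) for _ in range(40)]
b = [rng.gauss(0, 1) for _ in range(40)]

full_r = pearson(a, b)
window_r, start, end = best_window(a, b)
print(f"all 40 points: r = {full_r:.2f}")
print(f"points {start} to {end - 1} only: r = {window_r:.2f}")
```

Because the full range is itself one of the candidate windows, the cherry-picked window can never look worse than the honest one. The graph you are shown rarely says how many windows were tried.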

The Question is How and Why?

And the answer is — you have no idea.

Let’s use one of our examples. Try not to be distracted by Nicolas Cage. I know it is hard.

It is easy to question why someone would even try to correlate drownings and Nicolas Cage films. Some will be tempted to go cynical here, but let’s try to go deeper. Ask how and why.

  • Why just pool drownings? And is that global?
  • Why ‘falling into a pool’, which is oddly specific? Were filter accidents and diving mishaps filtered out?
  • Why yearly? Lines imply continuous data, yet nearly all line graphs plot discrete points. If this were quarterly or monthly, what then?
  • Why ‘appeared in’? Why not ‘starred in’? Honestly, it is likely ‘credited for appearing in’. Remember data collection…
  • Why wouldn’t total film time or box office success figure in? It is not just how the metrics are defined but whether they make sense relative to each other and the logical connection being proposed (there is none here).
  • Why set the y-axis minimum to 80? Is this a magic minimum number of drownings or should we believe that if Hollywood removed Cage from films more people would live? Any graph axis with a minimum not equal to 0 is suspect.
  • Was it the year he filmed or the year it was released? That would be important if we gave this graph any merit…
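The y-axis question in the list above comes down to simple arithmetic. Here is a small sketch with hypothetical values (85 and 120 are numbers I made up, not figures read off the chart) and a helper name of my own, `drawn_height_ratio`. It compares how much taller one point is drawn than another when the axis starts at 0 versus at 80.

```python
def drawn_height_ratio(low, high, axis_min=0.0):
    """How many times taller `high` is drawn than `low` when the axis starts at axis_min."""
    if axis_min >= low:
        raise ValueError("axis_min must be below the smaller value")
    return (high - axis_min) / (low - axis_min)


# Hypothetical values, not taken from any chart in this article.
low, high = 85, 120

honest = drawn_height_ratio(low, high)                  # axis starts at 0
truncated = drawn_height_ratio(low, high, axis_min=80)  # axis starts at 80

print(f"axis from 0:  drawn {honest:.2f}x as tall")
print(f"axis from 80: drawn {truncated:.2f}x as tall")
```

With these made-up numbers, a value about 41% larger is drawn eight times as tall once the axis starts at 80. Nothing in the data changed, only the floor.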

This is just the beginning…

Analysts with poor competence or bad intentions have plenty more ways to lead you astray. Many graphs provide bold footnotes backing the source of data for one line, but oddly… not the other. Some change sampling methodologies midway through (a NEVER).

And don’t think this is all nimbly averted with some statistical or mathematical equations. I can manipulate a correlation score almost as easily; strike that, even more easily. This story goes on, but our articles aren’t allowed to.
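As one illustration of how a correlation score can mislead (a well-known mechanism, and only one of many): two series that share nothing except a tendency to grow over time will score a very high correlation on their raw levels. The series below are made up, and the trend and noise sizes are arbitrary assumptions. Comparing period-to-period changes instead of levels is one common way to check whether anything is left once the shared trend is gone.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def diff(xs):
    """First differences: the change from each point to the next."""
    return [later - earlier for earlier, later in zip(xs, xs[1:])]


# Two hypothetical series: independent noise, each riding its own upward trend.
rng = random.Random(2)
n = 200
a = [1.0 * t + rng.gauss(0, 5) for t in range(n)]
b = [2.0 * t + rng.gauss(0, 5) for t in range(n)]

r_levels = pearson(a, b)               # dominated by the shared upward drift
r_changes = pearson(diff(a), diff(b))  # trend removed, only the noise remains

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

Whoever reports the first number and not the second has a “statistically backed” story about two things that merely both went up.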

For more on this topic consider:

Or just stay tuned — more articles are on their way! Thanks for reading!

FKA Corsair's Publishing - Articles that engage, educate, and entertain through analogies, analytics, and … occasionally, pirates!