Confounders and lurkers: the not-so-silent killers

Emily Glauser
Aug 22, 2017 · 3 min read

“There are three kinds of lies: lies, damned lies, and statistics.” - Mark Twain

When you hear “ice cream consumption coincides with a higher incidence of shark attacks”, if you’re not skeptical, you’re maybe not listening. But the data back this statement up as worded. There are more shark attacks when more ice cream is consumed nationally. No doubt about it, these two variables increase in tandem.

Although this is technically true, it’s pretty clear we’re missing something. Sharks have very little to do with ice cream (at least as far as we know). What we’re dealing with is a confounding variable. When variables are associated with a response but are also related to each other, that’s when we need to be aware of confounding. A confounder is associated with both the supposed cause and the response, so it can make the two look linked even when neither affects the other. What is the confounder here? Summertime. Of course shark attacks increase in the summertime: more people are vacationing and swimming in the oceans, which just so happen to be where sharks call home. And what goes better with sharks (I mean, summertime)? Ice cream!
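To see how a confounder can manufacture a correlation on its own, here is a minimal Python sketch using entirely made-up data. The numbers, the variable names and the `simulate` helper are assumptions for illustration only, not real shark or ice cream figures. Daily temperature (standing in for summertime) drives both ice cream sales and shark attacks, and neither one has any effect on the other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n_days=2000, seed=0):
    """Made-up daily data: temperature drives both outcomes; neither drives the other."""
    rng = random.Random(seed)
    temp = [rng.uniform(0, 35) for _ in range(n_days)]      # hypothetical degrees C
    ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temp]  # depends only on temperature
    sharks = [0.1 * t + rng.gauss(0, 1) for t in temp]      # depends only on temperature
    return temp, ice_cream, sharks


temp, ice_cream, sharks = simulate()
r_raw = pearson(ice_cream, sharks)
print(f"ice cream vs. shark attacks: r = {r_raw:.2f}")
```

Even though ice cream never appears in the line that generates shark attacks, the printed correlation comes out clearly positive, because both series inherit the same summertime signal.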

This example is an easy one. No one in their right mind would think that shark attacks increase because of ice cream consumption. The argument that bloodthirsty sharks are actually just interested in a scoop of butter pecan is shaky at best. So how do you identify a confounder that’s not as blatantly obvious?

That’s when we deal with something called a lurking variable. A lurking variable is one that you didn’t measure or include in your analysis, meaning it can sneak up and ruin everything if you don’t identify it. Let’s take our same example of sharks being attracted to ice cream. What if we found out that the lurking variable is a type of fish that hates hot weather and migrates north in the summertime? This fish happens to feed the shark population along some of the busiest beaches on the coast. The sharks we observe are hungry because their usual prey has migrated for the season, leaving only humans whose bellies are full of ice cream to chow down on. A wild example, but an example of a lurking variable nonetheless.

So how do you avoid this madness that occurs under your nose during these analyses? You try to control for as many of these factors as possible. Look at only one beach, during the same 2 months of the year, every year. Take into account the weather patterns, the populations that visit the beach, whether there was a World Series during that time, how many new hotels were built, the local factories dumping their toxic waste in the ocean. Any and all variables that could impact your data deserve recognition, if not inclusion, in your findings. It’s not always possible to account and control for all of these variables. Predictions based on samples can always be wrong because of influencing relationships we don’t see. But staying alert to them is what gives us a chance to catch them later.
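One common way to "take into account" a variable you did measure is to remove its straight-line effect from both series and then correlate what is left over (a partial correlation). The sketch below is only an illustration under stated assumptions: the data are simulated, temperature is the sole confounder, and the `residuals` helper is a hypothetical name introduced here rather than anything from a particular library.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


# Made-up data: temperature drives both series; neither drives the other.
rng = random.Random(1)
temp = [rng.uniform(0, 35) for _ in range(2000)]        # hypothetical degrees C
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temp]
sharks = [0.1 * t + rng.gauss(0, 1) for t in temp]

r_raw = pearson(ice_cream, sharks)
r_adjusted = pearson(residuals(ice_cream, temp), residuals(sharks, temp))
print(f"raw correlation:          {r_raw:.2f}")
print(f"adjusted for temperature: {r_adjusted:.2f}")
```

In this toy setup the raw correlation is strong, while the adjusted one falls close to zero. That only works because temperature was measured. A true lurking variable, by definition, never makes it into the adjustment.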

When performing a regression, EDA (exploratory data analysis), or any other analysis, it is important to understand whether the variables you are using to explain a response or outcome are associated with each other. This will heavily impact your findings, and missing it could lead you to a mistake that gets ice cream banned from any beach that may have sharks around.

100,000 Hours

Mapping the road to “Data Scientist”
