Sticky-tar effect: What data scientists can learn from road repairs

Mohammed Noor
5 min read · Aug 17, 2019

There’s a lot more to this picture than meets the eye

This incident goes back to August 2018, when I was a student at Emory getting a degree in Analytics. Most of my MSBA class was in the early days of model building, and we had just learnt a thing or two about how highly correlated variables can lead to accurate but misleading results. My professor gave us a really interesting example to drive the concept home. Here’s how it unfolded.

One fine morning, during class, my statistics professor exclaimed, “Isn’t it a great morning? Warm sun and bright light all around!” Everyone nodded in approval. “But you see, the problem is, Atlanta doesn’t have this kind of weather all year round. During the winters it can get really cold. In fact, some winter mornings could be as cold as -5 degrees Celsius,” he added. For a South Indian, these weather conditions are unfathomable. I wondered how many layers of winter wear I’d have to put on. For those of you who might be wondering, yes, for us South Indian ‘freshers’ one might just not be enough.

A snowman with a sign that reads Florida
Surprisingly, Florida has a climate very similar to South India’s. I heard they even grow rice there.

Anyway, continuing with our story. “I’m not a big fan of the winter in Atlanta. I wish there was a giant thermostat somewhere in the middle of town,” the professor added. “In fact, one day I was very close to finding a solution to all the suffering that winter brings. You see, down the road near the parking deck, there’s a small pothole at the turn which is older than some of the people in this room. It’s been filled at least 100 times in the last 25–30 years and yet it still needs repairs every few months. As a result, there’s relatively fresh tar there most of the time.” I recalled seeing this pothole for myself. It was pretty decent in size and quite easy to spot.

And the professor continued, “So when you drive over this pothole on a hot summer afternoon, some of the tar sticks to your tires. I have a name for it: ‘Sticky Tar’. And whenever I have sticky tar on my car, I know today is a really hot day.” At that point, I bet most of my classmates were trying to work out what this example had to do with statistics. Well, at least I was. “One mildly cold morning,” the professor continued, “I drove over the pothole and felt the tar was solid. ‘Well, I guess summer is officially over,’ I said to myself. But then it hit me. What if I pull out a blowtorch and heat the tar? Every single time it’s hot outside, the tar is sticky… so if I can heat the tar and make it melt, will it be summer again?” And everyone burst into laughter.

“As stupid as it may seem, this is what some data scientists and statisticians can set out to do in the real world. They encounter sticky tar and get misled.”

What the professor was referring to is the age-old problem of correlation and causation. To explain it further, consider sticky tar: there is a high correlation between the temperature on a given day and the “stickiness” of the tar. The higher the temperature, the stickier the tar becomes. But it would be foolish to assume the reverse. No matter how hot or cold the tar becomes, it has no ability to regulate the temperature of the surrounding atmosphere.
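To make the professor's blowtorch experiment concrete, here is a small simulation sketch. Everything in it is my own assumption for illustration: the temperature range, the made-up rule that stickiness is roughly 0.1 times the temperature in Celsius plus noise, and the helper names `pearson` and `simulate_day`. It shows that the correlation looks the same in both directions, while heating the tar by hand leaves the air temperature untouched.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_day(temp_c, rng, blowtorch=False):
    """Return (air temperature, tar stickiness) for one day.

    Assumed causal story: temperature drives stickiness, never the reverse.
    The blowtorch overrides stickiness but has no path back to temp_c.
    """
    if blowtorch:
        stickiness = 4.0  # hypothetical "fully melted" value
    else:
        stickiness = max(0.0, 0.1 * temp_c + rng.gauss(0, 0.3))
    return temp_c, stickiness


rng = random.Random(42)

# A year of ordinary days, from cold winter mornings to hot summer afternoons.
days = [simulate_day(rng.uniform(-5, 38), rng) for _ in range(365)]
temps = [t for t, _ in days]
sticky = [s for _, s in days]

# Correlation has no direction: swapping the arguments changes nothing.
r_temp_sticky = pearson(temps, sticky)
r_sticky_temp = pearson(sticky, temps)
print(f"corr(temp, stickiness) = {r_temp_sticky:.3f}")
print(f"corr(stickiness, temp) = {r_sticky_temp:.3f}")

# The professor's intervention: torch the tar on a cold 2 °C morning.
temp_after, sticky_after = simulate_day(2.0, rng, blowtorch=True)
print(f"after blowtorch: stickiness = {sticky_after}, air temp = {temp_after} °C")
```

The correlation coefficient alone cannot tell you which way the arrow points. Only knowing how the data is generated (or actually running the experiment) tells you that melting the tar will not bring summer back.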

As obvious as this mistake may seem, it is one of the most fatal errors a statistical or machine learning model could suffer from. If I were building a model to predict the temperature on a given day and included “stickiness of tar” among my predictors, its high correlation with temperature would cause my model to assign it a large weight. In simple words, my model would learn to associate the ‘stickiness of tar’ with the average temperature of any given day. And because the tar really is stickier on a hot day, the accuracy of my model would be really high in most cases, so this fatal error can slip right under the nose of an unaware data scientist.

And if one day a construction worker finds a permanent fix for the pothole and it’s now as flat and sturdy as any other part of the road, then from that day on, even if it’s a blistering 95 degrees Fahrenheit outside, my model would think it’s December again. Or what if there has been a repair during winter and fresh tar has been laid? In such a scenario, my model would mistake a freezing cold winter morning for a hot summer afternoon.
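Here is a rough sketch of how that failure could play out, under assumptions I made up for illustration: a one-variable least-squares model, simulated days where stickiness is about 0.1 times the Celsius temperature plus noise, and two hand-picked scenarios (stickiness 0 after a permanent repair, stickiness 3.5 on fresh winter tar). The names `fit_line`, `observe_day` and `predict_temp` are hypothetical helpers, not part of any library.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x


def observe_day(rng):
    """One 'normal' day: the pothole exists and the tar reacts to the weather."""
    temp_c = rng.uniform(-5, 38)
    stickiness = max(0.0, 0.1 * temp_c + rng.gauss(0, 0.3))  # assumed relationship
    return stickiness, temp_c


rng = random.Random(0)
train = [observe_day(rng) for _ in range(300)]
holdout = [observe_day(rng) for _ in range(100)]

# Train a model that predicts temperature from tar stickiness alone.
slope, intercept = fit_line([s for s, _ in train], [t for _, t in train])


def predict_temp(stickiness):
    return slope * stickiness + intercept


# On days that look like the training data, the model seems accurate.
mae = sum(abs(predict_temp(s) - t) for s, t in holdout) / len(holdout)
print(f"mean absolute error on normal days: {mae:.1f} °C")

# Scenario 1: pothole permanently fixed, so stickiness is 0 on a 35 °C (95 °F) day.
repaired_prediction = predict_temp(0.0)
print(f"actual 35.0 °C, predicted {repaired_prediction:.1f} °C")

# Scenario 2: fresh tar laid in winter, so stickiness is high on a -5 °C morning.
fresh_tar_prediction = predict_temp(3.5)
print(f"actual -5.0 °C, predicted {fresh_tar_prediction:.1f} °C")
```

The holdout error stays small because the holdout days come from the same world as the training days. The moment the link between weather and tar changes, the same model is confidently wrong, and no accuracy metric computed beforehand would have warned me.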

A classic case of correlation being mistaken for causation. This is why business knowledge of the problem you are trying to solve is really important.

It is also imperative that each variable be carefully vetted before being included in the model. To see how misleading correlations can be, check out these interesting charts taken from this amazing blog.

95% correlation but 0% causality

Here’s another one

Damn…technology seems to depress a lot of people

If you are still here, I’m glad to know that I didn’t put you to sleep. I hope you found the article interesting. Please drop a comment below; I’d love to know your thoughts. Also, go ahead and share the article with your network. Who knows whose day sticky tar might save someday?
