Avoiding Data Blindness: Balancing Rigor and Common Sense To See The Big (Data) Picture
“And you’ll see from these data points that this organization is fundamentally fractured and significant cost savings can be found through resource consolidation and streamlining.” I stared at the assortment of rainbow colored bar graphs projected in front of me, while the smartly dressed presenter continued on. “Each color represents a different activity and each bar is a profile grouping of resources we identified through analytics and correlation.” Being color blind, I squinted and had a little trouble telling apart all the different shadings in the bars. “In a lean, streamlined organization, we would expect solid color bars, so you can see the amount of fragmentation and waste present here.”
I raised my hand at his next pause. “So what do these groupings and colors represent?” I asked. After he restated that they were groupings derived through analytics, I quickly replied. “No, I mean, what would you name these groupings? Who are these people? How truly different is each activity or color, and why, based on that context, do you think they can be streamlined?”
I don’t remember what his complete response was after that, but it didn’t answer my question. The one thing I do recall clearly was the refrain “well, this is what the data shows” repeated to exasperation. We didn’t proceed with further work in the org redesign space with that client. In my mind, it was yet another example of bottom-up data analysis seeking to match a top-down definition of a problem, but with not a whole lot in between.
Out of Insight
To be clear, I’m no data analysis expert. In fact, I’m a data autodidact. But I’ve been working long enough to recognize the traps that lay themselves before you all too often in real world application. While we all want to be “driven by data,” it can also be tempting to teach our data to drive — holding tight to a passenger-side steering wheel the whole ride. Or, worse, to hand over the keys without giving them the foggiest notion of the rules of the road. The focus should be on using data to drive towards insight. That journey can be winding and long. The difference between a meaningful pattern, a logical arrangement of actionable information, and a frequently occurring yet random stack of data is often not so clear.
How can you tell insight from coincidence? How can you “let the data speak for themselves” when they are fundamentally deaf, dumb, and blind?
And how can you yourself avoid becoming blind to what the data may or may not be showing you?
There are three major forms of “blindness” that seem to afflict the real world application of data science:
- Snow Blindness (Data Justification)
- Sand Blindness (Result Manipulation)
- Cloud Blindness (Confirmation Bias)
Beware the glare of too much data.
The most common topic I see flowing through my various business news (i.e. promotional) feeds is Big Data. So much data is available to us. So much insight just waiting to be uncovered, if you’d only buy this product, or watch this TED talk, or certify yourself as a data science professional. While it’s very encouraging to see data analysis taken seriously enough as to seemingly get its own branch of the social sciences, it’s also not completely new.
It’s science, remember? For me, it was 6th grade when we got into lab science classes. We learned to establish a hypothesis, develop an experiment to test the hypothesis, conduct the experiment, determine if the hypothesis was proven or disproven, and then repeat as necessary. The good old scientific method. Who knew then that we could’ve just sent it to the cloud, awaited a machine learning derived response from a public API, and called it a day?
The missing pieces of the Big Data puzzle are the hypothesis and the test. There has to be a logical outcome you are looking to challenge, and a decision that hangs in the balance. If your only solution is to crunch numbers until the hypothesis appears in conveniently proven form, you’re going to be waiting a while. Or you’re going to simply mold it into whatever you’d like it to be. Or use it to fill a convenient, already formed container.
But if you start with a hypothesis and seek to comprehensively test it, you may learn more than you first expected. Say you seek to validate the standard “no two snowflakes are alike.” You will eventually discover that it is true, even at a molecular level. But if you truly attempt to test that outcome, instead of simply finding the quickest way to prove it, you’ll likely uncover that while individual snowflakes are unique, by one commonly cited classification there are only 35 distinct shape patterns they can form. There are no identical situations, but there is a manageable number of common categories. Instead of standing, frozen in place, staring into a vast landscape of infinite variety reflecting sunlight into your burnt corneas, you can bound that variety and move forward, leveraging the insight that even in the unforgiving tundra of Big Data, there can be a discrete set of data patterns to manage.
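Here is a minimal sketch of why crunching numbers until a hypothesis appears is so tempting. Everything in it is an assumption for illustration: the data are pure random noise, the counts (500 metrics, 20 rows) are arbitrary, and the names `pearson` and `noise_correlations` are my own. It compares the one metric you named before looking against the best metric you could pick after looking at all of them.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def noise_correlations(n_metrics, n_rows, seed=0):
    """Correlate many pure-noise metrics against one pure-noise outcome."""
    rng = random.Random(seed)
    # The "outcome" is noise, so no metric can truly explain it.
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    strengths = []
    for _ in range(n_metrics):
        metric = [rng.gauss(0, 1) for _ in range(n_rows)]
        strengths.append(abs(pearson(metric, outcome)))
    return strengths


strengths = noise_correlations(n_metrics=500, n_rows=20)
stated_up_front = strengths[0]       # the one metric named before looking
found_by_crunching = max(strengths)  # the best metric picked after looking

print(f"hypothesis stated first:    |r| = {stated_up_front:.2f}")
print(f"best of 500 after the fact: |r| = {found_by_crunching:.2f}")
```

With enough candidate metrics and few enough rows, the after-the-fact winner tends to look like a strong relationship even though nothing real is there. That is the glare: the more data you stare at without a hypothesis, the more convincing the coincidences become.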
Through repetition and common sense, you can identify and test logical patterns and outcomes. But not just any hypothesis will hold. Which leads me to…
Any port in the storm, and any oasis in the sand.
I was listening to a podcast this morning where Abbi Jacobson from Broad City threw out a quote that “intonation is not the same as melody.” There is a difference between having perfect pitch and being a musician. There is a difference between methodically arranging data and uncovering insight.
When you’re desperate for insight, you’ll see your data the same way a stranded, weary traveler sees an endless desert in front of her. Wavy heat lines emanating from barren hills of pale brown sameness, suddenly forming a glorious display of lush greens and refreshing blues as her wineskin empties.
Once she’s regained her senses, she can even begin to build her oasis from the sand. Methodically forming, molding, and recreating the shape and scope of the oasis illusion. With ingenuity and diligence, an approximation can be created with each grain placed just so.
But ultimately, this is a replica, despite the talent with which it was formed. Strong desert winds uncover its illusion with every gust. With enough motivation and skill, we can build what we like out of our data — going so far as to strengthen it into concrete or fortify it with limestone and granite. Yet when those talented resources are gone, moving on to new career travels, your department or business is left with just a bunch of sand.
No, the other kind of cloud.
This has nothing to do with Hadoop or AWS or XYZaaS or anything of the sort. This is about our imaginations. Lying in the grass, staring up at the formations of atmosphere above us and seeing what we want to see. Or seeing what we think we see. And not seeing clouds.
The difference between this and the data desert oasis is that, in this case, there really is something here formed outside of your control. The data appear to represent something that your cognition can match to something else and assign a meaning. True meaning is there, and there is information being communicated (the size, shape, color, velocity, etc. of the cloud) that can be used to predict some future condition or state.
But if, instead, you want to see a sheep, then you’re going to see a white, fluffy sheep.
If the management consulting firm you just hired to find savings in your budget self-selects close-fitting data sets, believe me, they will find those savings. Even if it turns out to be just a bunch of warm condensed water vapor once they’re gone.
If it’s shaped like a duck, quacks like a duck, and walks like a duck, it might just be a grebe (sister species of the flamingo). If you aren’t truly testing or challenging your hypothesis, you’re always going to be left with what you want it to be or what you thought it would be from the beginning. And you’ll be left wondering why your feed budget is so overextended every quarter.
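To make the sheep in the clouds concrete, here is a small sketch of self-selection at work. It rests entirely on made-up assumptions: 200 hypothetical budget lines whose year-over-year "savings" are zero-mean random noise, and a selection rule (keep only the lines that fit the story) that I chose for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical data: percent "savings" per budget line, drawn as zero-mean noise,
# so by construction there is no systematic saving to find.
savings = [rng.gauss(0, 5) for _ in range(200)]

# Self-selection: keep only the lines that fit the conclusion we want.
close_fitting = [s for s in savings if s > 0]

all_mean = statistics.fmean(savings)
picked_mean = statistics.fmean(close_fitting)

print(f"all {len(savings)} budget lines:   mean saving = {all_mean:+.2f}%")
print(f"{len(close_fitting)} hand-picked lines: mean saving = {picked_mean:+.2f}%")
```

The hand-picked average is positive by construction, no matter what the full picture says. The honest test is the one run against all of the data, including the lines that refuse to look like a sheep.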
Avoiding Insight Agnosticism
I’ve been describing three common traps: waiting for the data to present a conclusion (Snow Blind), modeling your results to create a non-existent pattern used to form a conclusion (Sand Blind), and self-selecting your data sets and degree of validation to achieve a conclusion (Cloud Blind). In each, you are left with a conclusion. But those conclusions are merely based on data analysis; they are not validated insights from data analysis.
No one can rely wholly on a model or an algorithm to achieve success. But being completely agnostic of a top-down or formulaic view of data, or using a purist’s perspective to avoid theory in your models, simply comes at the expense of real insight. On another podcast (hey, I’ve got a long commute, so don’t judge me), Nobel Laureate James Heckman discussed the challenges of relying too heavily on theory to establish predictive models against the further challenges of ignoring even established theory. To ignore precedent, or to blind yourself completely to the possibility of bias, is to bias yourself into missing some or all insight.
The goal is not to rely on models, rigor, or precedent alone to establish a conclusion. Rather, it is to manage the balance of those ingredients, establish and test hypotheses, and control your data blindness bias as much as you can. Data science is more of an art than a science, after all.