You’ve heard it everywhere: “Garbage in, garbage out.” Models built on bad data lead to bad decisions. Yet it seems we can’t do away with bad data. How did we get here, and is the data really to blame?
Data, of course, does not materialize out of thin air. Yet few companies appear to be intentional about the data they collect, or to set off with a clear purpose in mind: which statistics or analytics they will apply to that data, and which questions those are meant to answer. Instead, most “big data” in existence seems to have originated as a by-product of some internal process, logged to allow later error detection, or simply stored for regulatory purposes. And a lot of it is never analyzed, or at least not right away. There is even a name for the phenomenon: dark data.
For some companies, the existence of dark data, after a certain point, creates a pressure to act. And that is when they start looking into data science for the first time. Two, maybe three years in, a majority of data science initiatives reportedly fail. Ask why, and the people involved will often point to limitations in the data itself.
Data is an easy target for blame; after all, it can hardly defend itself. But there is also something to be said about the process: could that data have been put to better use, and how?
What a lot of companies that are new to data science do wrong is that they ignore the number one rule of science: learning anything from any data at all starts with formulating questions. I will go ahead and reiterate that in case you missed it: learning anything from any data at all does not start with gathering data, it starts with formulating questions.
But what do you do when you’ve already gathered data before thinking through the questions you want answered? For one, resist the urge to jump in and formulate questions based on the data you already have. Why? Because there is a very high chance you will conclude that haphazard data is only good for answering questions nobody was interested in asking in the first place.
Instead, pretend no data is there. Make a list of the decisions you would like to be able to make as a company, or as a department within a company, and the questions you need answered to arrive at those decisions. Then try to anticipate the possible answers, and add those to the list. You will likely circle back a few times, adding more decisions, questions, and potential answers, before you have a list that feels meaningful. This list will look different for a bank, a retailer, and an educational institution, and it will look different for an IT department, a human resources department, or a marketing department. That specificity is how you know you are on the right track.
List in hand, ask yourself: in a world with unlimited time and resources, what data would I collect to find the answers to these questions? Put that on the list too. Then, and only then, look at the data you have. Could you use any of it to answer, at least partially, any of the pressing questions you’ve listed? Or perhaps to exclude some of the possible answers you’ve anticipated? This is what bad data is good for, and you would never have been able to use it this way if you had let it constrain your list of questions.
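To make the inventory above concrete, here is a minimal sketch of how it could be represented in code. All names, decisions, questions, and data sources below are hypothetical, invented purely for illustration; the point is the structure: decisions map to questions, questions to anticipated answers and to the data that would ideally answer them, and the data on hand is checked against that list last.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    text: str
    anticipated_answers: list[str]
    ideal_data: list[str]  # data we'd collect with unlimited time and resources


@dataclass
class Decision:
    text: str
    questions: list[Question] = field(default_factory=list)


def coverage(decisions: list[Decision], available_data: set[str]) -> dict[str, list[str]]:
    """For each question, report which of the ideal data sources already
    exist -- i.e. what the data on hand is actually good for."""
    report = {}
    for decision in decisions:
        for q in decision.questions:
            report[q.text] = [src for src in q.ideal_data if src in available_data]
    return report


# Hypothetical example: a retailer's marketing department.
decisions = [
    Decision(
        "Should we expand the loyalty program?",
        [Question(
            "Do loyalty members spend more per visit?",
            anticipated_answers=["yes", "no", "only in some segments"],
            ideal_data=["transaction logs", "loyalty enrollment records",
                        "customer surveys"],
        )],
    )
]

print(coverage(decisions, available_data={"transaction logs"}))
# → {'Do loyalty members spend more per visit?': ['transaction logs']}
```

Note that the existing data enters only at the very end, as an argument to `coverage`: the questions and ideal data are listed first, exactly as the process prescribes.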
Armed with partial answers, and having ruled out some of the possible ones, companies can move ahead. They can make (some) decisions. They can start implementing corrections to their data pipeline. They can make plans for supplementing the data they have with the data they’re missing. Finally, they can start collecting the data they had no idea they needed in order to answer the questions they now know are worth asking.
Where to start? The process I’ve described won’t run itself. Put someone in charge of seeing it through to success. Your best bet? Someone with at least master’s-level scientific training who also speaks fluent business.