Five Misconceptions about Data Science

Data science has made its way into practically all facets of society: from retail and marketing, to travel and hospitality, to finance and insurance, to sports and entertainment, to defense, homeland security, cyber, and beyond. It is clear that data science has successfully sold its claim of “actionable insights from data,” and truth be told, it often delivers on that claim, adding value that would otherwise go untapped. As a result, data science is often looked to as a panacea, a Swiss army knife, a silver bullet, a must-have, [insert your own cliché here]. This has implications for both data scientists and the organizations they work with. On one hand, data scientists are beginning to face a challenge that even the most advanced machine learning algorithms have yet to solve: managing expectations. On the other hand, many businesses and organizations are grappling with shifting learning curves, the latest shiny object, and the pressure to keep pace. As the data science bandwagon fills up, many of those climbing aboard do not fully, or even marginally, understand what data science is, what it can do, and when it is relevant. In what follows, I present what I have found to be five of the most common misconceptions about data science, misconceptions that will proliferate and morph as the data science wave rolls on. Recognizing these misconceptions, and avoiding the pitfalls associated with each, will go a long way toward empowering you (and your organization) when it comes to “deriving value from data.”

The interchangeable use of the terms “data science” and “big data” is not uncommon these days [1]. It could be argued that the so-called big data revolution provided the impetus for the field now labeled as data science. Regardless of the origins of their entanglement, big data and data science are quite different. Big data refers to the collecting, managing, and processing of incredibly large amounts of data (terabytes, at a minimum). But the idea of big data goes beyond just a lot of 1s and 0s, which is why it is more properly characterized by the “Three Vs”: volume, variety, and velocity. In addition to sheer quantity (volume), big data often consists of different types of data (structured, unstructured, numeric, textual, imagery, video, and so on) [2], which is variety. Data can also become “big” when the rate at which it is generated and must be handled becomes excessive, which is velocity. Take Twitter, for example: a single tweet is only a few hundred bytes (the 140 characters), but considering that 350,000 tweets are sent per minute (on average) [3], you quickly have a big data issue.
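To make the Twitter example concrete, here is a rough back-of-the-envelope sketch in Python. The rate of 350,000 tweets per minute comes from the text; the figure of 200 bytes per tweet is purely an assumption standing in for “a few hundred bytes,” and it counts tweet text only, not metadata or media. The function name `tweet_volume` is made up for illustration.

```python
# Back-of-the-envelope estimate of tweet volume (illustrative only).
TWEETS_PER_MINUTE = 350_000   # average rate cited in the text [3]
BYTES_PER_TWEET = 200         # ASSUMPTION: stands in for "a few hundred bytes", text only


def tweet_volume(tweets_per_minute, bytes_per_tweet):
    """Return (tweets per second, tweets per day, gigabytes per day)."""
    tweets_per_second = tweets_per_minute / 60
    tweets_per_day = tweets_per_minute * 60 * 24
    gigabytes_per_day = tweets_per_day * bytes_per_tweet / 1e9
    return tweets_per_second, tweets_per_day, gigabytes_per_day


per_second, per_day, gb_per_day = tweet_volume(TWEETS_PER_MINUTE, BYTES_PER_TWEET)
print(f"{per_second:,.0f} tweets per second")      # 5,833 tweets per second
print(f"{per_day:,} tweets per day")               # 504,000,000 tweets per day
print(f"{gb_per_day:,.1f} GB of tweet text per day")  # 100.8 GB of tweet text per day
```

Under these assumptions, each individual record is tiny, yet nearly 6,000 of them arrive every second, around the clock. That is the sense in which velocity, rather than the size of any one item, makes the data “big.”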

In contrast, data science deals with everything from the mining, transforming, modeling, and storing of data, to exploring and analyzing data, to building models and algorithms around data, to visualizing and interacting with the results. Big data should be thought of as an aspect of data science — it describes the situation where the data involved is characterized by one or more of the Three Vs.

When organizations talk about needing big data solutions or big data technology, what they often really need is a data science solution. Most businesses do not have petabytes of data. Quite the contrary: many businesses are able to work with their data using Excel, which has traditionally been a satisfactory mode of operation. With the explosion of data collection and data availability, however, the challenge we see today is not so much the amount of data as the variety of data. Data (much of it unstructured) is becoming more heterogeneous, and it is often scattered across various systems (some old, some new). There is also the reality that data can be incomplete, inconsistent, and even plain wrong. Rather than dealing with big data, organizations are wrangling with “non-traditional,” “messy,” or “difficult” data, and being able to work with these less-friendly data incarnations has become the real challenge.
