The Data Science Delusion

Prologue

Four years ago, having earned my living as a programmer/researcher for over a decade, I was co-opted into the data science movement. Since then I have witnessed technical misunderstandings between business leaders and data scientists (and among data scientists themselves) quite unlike anything I had seen earlier, and some projects have ended with so unexpected a notion of business or scientific success that I started wondering whether there is something wrong with data science as it is being practiced.

Discussions with several colleagues convinced me that I was not alone in my thoughts. And though not easy to find in the deluge of hype online, more than a few eminent voices have raised questions and concerns about data science. In this article, I have tried to tie together these strands of contrarian thinking with some perspectives of my own, in an attempt to explore the reasons for what could be termed a delusion in many parts of data science.

The Fundamental Problem

First, let’s get out of the way what is unexceptionable: being data-driven is beneficial [1]; it is necessary now to compete on analytics [2]; and there is a set of technologies (big data platforms, scientific software, machine-learning algorithms) that are converging in maturity, making it easy to travel the data-driven path. Data science, then, by this or another name, is indeed an idea whose time has come.

However, while much of the data science work in the industry is useful, and often ground-breaking, much is not. What are the circumstances in which data science succeeds, and where lie the delusionary traps?

Poor Definition

It can be understood too, but only dimly and in flashes. Not half a dozen men have ever been able to keep the whole equation of pictures in their heads. — Scott Fitzgerald, “The Last Tycoon”

Mathematical and statistical knowledge, advanced computing skills (including databases, high-performance computing and visualization) and substantive expertise (or application and domain knowledge) form the almost impossible intersection in the Data Science Venn diagram [3]. It is no doubt a great advantage if all these skills could somehow be concentrated in one human being — a new-age Spock [1], so to speak. One wonders, though, how many such unicorns could exist. It is also a laudable goal to move away from the extreme specialization of traditional or academic research [4], but has there been an overcompensation?

Figure 1: The Data Science Venn Diagram


It is interesting to note that DJ Patil’s conception of the term “data scientist” seems to suggest a generalist product manager or developer, rather than some kind of super-scientist [5]:

Yes. It’s good and bad. I think there’s this interesting question of, Well, what is a data scientist? Isn’t that just a scientist? Don’t scientists just use data? So what does that term even mean?
You’ve had one of my co-authors, Hilary Mason, on the show, and the thing we joke about and we wrote about together, is that the number one thing about data scientists’ job description is that it’s amorphous. There’s no specific thing that you do; the work kind of embodies all these different things. You do whatever you need to do to solve a problem.
— DJ Patil, “10 questions for the nation’s first data scientist”

This kind of broad definition seems to be the consensus in the industry: “a set of activities involved in transforming collected data into valuable insights, products, or solutions” [6].

To address this need for generalists with the right mix of competencies, numerous master’s programs in data science have been introduced in recent years, providing basic training in research methods, statistical modeling, applied machine learning and big data. Critics point out, however, that such courses, which are “spreading like mushrooms after the rain”, address only a part of what it means to be a data scientist [7].

A word about calling it a science: Is data science the study of data? That would be a strange claim, as Patil does suggest, for every empirical science is based on studying data. Moreover, a science of data disconnected from any domain already has a name: Statistics. And if it is the science of business, it should probably be called Business Science [8]. Of course, despite the poor coinage, data science may one day evolve into a real science by settling on sound foundational principles (as computer science did over its first couple of decades). For now, it seems to function more as a buzzword designed to attract talented scientists from various disciplines to work for business [9, 10].

One consequence of this vague definition is that experts have variously claimed data science to be Statistics 2.0 [8, 11], Computer Science 2.0 [12] and Business Analytics 2.0 [8]. This is partly because of greater interdisciplinary work between statistics and computer science, and a necessary coming together of two different ways of thinking [13, 14], but it also points towards a fundamental confusion.

I should also note that the term “data science” has a long history, as has the confusion surrounding the term. In fact, as early as the 1960s, Peter Naur suggested it as a better name for computer science [15, 16], and Jeff Wu, in the late 1990s, suggested that “statistics = data science” and “statisticians = data scientists”!

Easy-to-Fake

The last decade has seen many areas of research (parallel processing, machine learning, visualization, statistical programming languages) maturing into technology, which has made it possible for one person (perhaps an expert in one discipline) to take a project through the stages of data ingestion and manipulation, statistical modeling, and visualization entirely on his or her own. This democratization of algorithms and platforms, paradoxically, has a downside: the signaling properties of such skills have more or less been lost. Where earlier you needed to read and understand a technical paper or a book to implement a model, now you can just use an off-the-shelf model as a black box. While this phenomenon affects many disciplines, the vague and multidisciplinary definition of data science certainly exacerbates the problem.

These reasons make it easy to pose as a data scientist, and the huge demand and limited supply have made data science a fertile ground for poseurs [17, 7].

Deliberate faking aside, the difficulty in evaluating what is being done leads both business leaders and data scientists down blind alleys.

Addressing the Problem

One response to data science’s purported need to be “all things to all people” [18] has been to carve up the space: type A (analysis) vs. type B (building) data scientists [19], statistical vs. database vs. computer systems data scientists [18], and human-targeted vs. machine-targeted data scientists [20]. These are sound suggestions, essentially acknowledging the difficulty of finding unicorn data scientists, and urging the creation of teams with the full range of skills. Such teams then face the problem of melding a disparate set of skills.

The ease of faking omniscience has brought warnings about incomplete knowledge. For example, Drew Conway pointed out (in his article detailing the data science Venn diagram) that the three data science skills on their own or combined with only one other “are at best simply not data science, or at worst downright dangerous” [3]. Most problematic, according to him, is the combination of hacking skills (computing) and domain expertise, because it is in this intersection that people “know enough to be dangerous”.

However, it is not simply an intersection of skills that is lethal, but these skills applied to the wrong task. In fact, the very set of skills that work well in one arena could be dangerous in another. Moreover, the fact that these skills rarely come together in one person often leads to problems when data scientists work with each other.

So, what specifically goes wrong when different types of data scientists, possessing very different skill combinations, work with each other or with business stakeholders, on varied data science problems? An oversimplified model of the data science landscape can help us answer.

A Finer Scalpel: The Data Science Landscape

Figure 2: The Data Science Landscape


The data science landscape in Figure 2 has Modeling Difficulty on one axis (representing how tractable the problem is to simple statistical modeling or machine learning) and System Complexity on the other axis (representing complexity of the business processes being modeled, domain dependence of the data, scale of the data, timeliness requirements, and so on). This representation combines two of the components of data science, computing skills and domain expertise, into one dimension (system complexity), primarily for convenience of analysis. This quadrant view could be further refined into an octant view by separating out these components (more on this later). The terms simple and complex could be misunderstood here: to be clear, we are concerned not with the simplicity of the final model, but with how easily a task can be modeled using state-of-the-art technologies.

Movement along the Modeling-Difficulty axis brings an increasing need for a specialized data scientist (e.g., an expert in time-series analysis or text processing), while movement along the System-Complexity axis leads to greater need for domain expertise and systems understanding. For instance, a classification problem, where the domain already provides data in the required form (say, lots of historical data has been manually tagged, and the task is to automate the classification), probably slots into Quadrant-1 (Q1). Another example of a Q1 task would be sentiment-tagging when you have an off-the-shelf model pre-trained on data from the same domain. The task would move to Q3 if the data from the target domain is quite different and needs to be preprocessed and modeled anew. (The promise of deep learning is to reduce differences along this dimension, by automating the feature engineering involved [21].)

On the other hand, a move to Q2 or Q4 happens when formulating the problem is itself a challenge. Or when there are aspects of the system that cause assumptions to break — the data is updated in specific ways every week, or the value of the data stream decays perhaps due to changing user behaviour (e.g., Google Flu Trends [22]). Or if there are vast differences within a variable that are not represented in the data (e.g., state-level data aggregated at the country or continent level).
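The placement of tasks described above can be sketched in a few lines of Python. This is only a rough reading of the landscape: the `Task` class, its two boolean fields and the `quadrant` function are hypothetical names introduced for illustration, the quadrant layout (Q1 to Q3 along Modeling Difficulty, Q1 to Q2 and Q3 to Q4 along System Complexity) is inferred from the examples in the text, and reducing each axis to a yes/no flag is a deliberate oversimplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A data science task placed on the two axes of the landscape."""
    name: str
    hard_to_model: bool    # high on the Modeling Difficulty axis
    complex_system: bool   # high on the System Complexity axis


def quadrant(task: Task) -> str:
    """Map a task to Q1..Q4 using the layout assumed in the lead-in."""
    if task.complex_system:
        return "Q4" if task.hard_to_model else "Q2"
    return "Q3" if task.hard_to_model else "Q1"


# The two sentiment-tagging examples from the text.
in_domain = Task("sentiment tagging, model pre-trained on the same domain",
                 hard_to_model=False, complex_system=False)
new_domain = Task("sentiment tagging, target domain quite different",
                  hard_to_model=True, complex_system=False)

for task in (in_domain, new_domain):
    print(f"{quadrant(task)}: {task.name}")
```

In practice both axes are continuous, so the boundaries between quadrants are a matter of judgment rather than a hard rule.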

The Delusions

We can now locate the delusional circumstances in data science within the above landscape. Broadly, these delusions may be categorized as (i) between-quadrant effects: delusions due to a mismatch between the task and the people assigned to it, and (ii) within-quadrant effects: delusions due to confusion intrinsic to a quadrant.

The illustrations cited are drawn from cases I have encountered, but sharpened and simplified to make them clearer.

Between-Quadrant Effects: The Delusion Matrix

Labeling tasks and resources (people) by the quadrant they belong to (Q1–4 tasks and Q1–4 resources) brings us to a confusion matrix, or what could be called a “delusion matrix”.

Rough representation (see text for details)

Figure 3: The Data Science “Delusion Matrix”


1. Lipstick on a Pig: Q3/Q4 tasks and Q1/Q2 resources

This effect manifests itself when a generalist, often inadvertently, steps out of his or her zone of competence.

Illustration: Consider the sentiment-tagging task again. A Q1 resource uses an off-the-shelf model for movie reviews, and applies it to a new task (say, tweets about a customer service organization). Business is so blinded by spectacular charts [14] and anecdotal correlations (“Look at that spiteful tweet from a celebrity … so that’s why the sentiment is negative!”), that even questions about predictive accuracy are rarely asked until a few months down the road when the model is obviously floundering. Then too, there is rarely anyone to challenge the assumptions, biases and confidence intervals (Does the language in the tweets match the movie reviews? Do we have enough training data? Does the importance of tweets change over time?).

Overheard: “Survival analysis? Never heard of it … Wait … There is an R package for that!”

A variety of machine learning algorithms can be deployed nowadays at the click of a button, producing at least superficially appealing outputs. Robust testing practices can raise a red flag when a move from Q1/Q2 to Q3/Q4 has been made, at which stage a more specialized data scientist can be brought in. It may be too much to ask, but if the Q1/Q2 resource could say “I don’t know” when he or she is in uncertain territory, a more timely intervention may be possible. There should be no disgrace in saying so, but unfortunately there is, for a data scientist by definition knows nearly everything.
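One of the questions raised in the illustration ("Does the language in the tweets match the movie reviews?") can be given a crude numerical form. The sketch below, which is an assumption of this write-up and not a method from any cited source, measures the fraction of target-domain tokens that never occur in the source domain. The function names, the tokenizer and the sample sentences are all hypothetical, and a real check would also look at label distributions and held-out accuracy.

```python
from __future__ import annotations

import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word-like tokens (deliberately crude)."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def out_of_vocabulary_rate(source_docs: list[str], target_docs: list[str]) -> float:
    """Fraction of target-domain tokens never seen in the source domain."""
    source_vocab = {tok for doc in source_docs for tok in tokenize(doc)}
    target_tokens = [tok for doc in target_docs for tok in tokenize(doc)]
    if not target_tokens:
        raise ValueError("target_docs contains no tokens")
    unseen = sum(1 for tok in target_tokens if tok not in source_vocab)
    return unseen / len(target_tokens)


# Hypothetical samples: the model was trained on movie reviews,
# but is about to be applied to customer-service tweets.
movie_reviews = [
    "A moving film with a brilliant cast",
    "The plot was dull and the acting was wooden",
]
tweets = [
    "@support still no refund after 3 weeks #fail",
    "the agent was brilliant, sorted my bill in minutes",
]

rate = out_of_vocabulary_rate(movie_reviews, tweets)
print(f"{rate:.0%} of tweet tokens never appear in the movie reviews")
```

A high rate does not prove the model will fail, but it is a cheap signal that the task may have moved from Q1/Q2 to Q3/Q4 and that a specialist should take a look.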

2. The Tyranny of Low-Hanging Fruit: Q1/Q2 tasks and Q3/Q4 resources

A company hires specialized data scientists (Q3/Q4 resources) who expect to do at least some science, but all the business really needs, and has the data or appetite for, are simple heuristics or manual processes [23]. A bunch of frustrated data scientists is the result.

Illustration: A marketing team wants to rank prospects for a certain digital content subscription offering (lead scoring). The initial dataset, with around 50,000 leads, has only two attributes: the amount of data used by the lead in a trial (in MB) and the phone/device used (e.g., iPhone 6, iPad Mini). An analyst from the marketing team has fiddled with the ranges of the two attributes to come up with a model in Excel, which the head of marketing is thrilled with. This bliss is unaffected by warnings from the data scientists that the model is both brittle and non-scalable.

Overheard: “You want me to fit a model to these 12 data points?”

Of course there are times when a business only needs simple heuristics. And there are times when opportunity costs and time-to-market constraints mean that only simple heuristics can be afforded. Both the organization and data scientists need to be aware of such circumstances before the hiring is done. However, constraints or not, bad science remains bad science and has its own costs.

3. The Wonderful Wizards: Q4 tasks and Q3 resources

These situations generally occur when the company hires data scientists out of a fear of missing out rather than with a specific problem in mind [24]. A hands-off attitude (“here’s the data, now do some data science magic”) plagues these projects. When business managers fail to communicate crucial domain insights, the projects drag on in the exploratory stage for much longer than necessary, with the consequence that most of the initial enthusiasm for the project is lost. And if these managers do not participate in the exploration and definition of the problem, but are eager to play both Judge and Jury, one may be sure they are only waiting to don the Executioner’s robes.

Illustration: The asset management department of an organization wants to examine records of assets and inventory to identify anomalies — it is suspected that many records suffer from incorrect classification. For example, a desktop computer is occasionally classified as “office stationery”. The data science team is handed millions of records with the item name (something like “Dell Optiplex 2020”), description and classification. One obvious approach to the problem is to cluster the descriptions and look for classifications that seem out of place (another approach would be to search for item categories on the web, but this would work only when the item names are clean, and even then it could be quite complicated for smaller items). But the descriptions, as keyed in by the purchase assistant, have tremendous variety, with abbreviations and spelling mistakes in abundance (“desktop computer”, “dekstop”, “computer”, “pc”, “DT computer” etc.), and it’s easy to see how these could be confused with other item categories (“computer paper”, “desktop supplies”, etc.). The domain experts in the department neglect to inform the data scientists that all company items are linked through their item code to their prices, in another database which is accessible to them.
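To see why the withheld price link matters, consider a minimal sketch of how it could be used. Everything here is an assumption made for illustration: the record layout, the item codes, the prices, the function name and the tolerance `max_ratio` are hypothetical, and the illustration does not say how the team would actually have used the prices. The idea is simply that an item priced far from the median of its classification is a candidate for misclassification, without any need to parse the messy descriptions.

```python
from collections import defaultdict
from statistics import median


def flag_price_outliers(records, max_ratio=10.0):
    """Return records whose price is far from the median of their classification.

    records: iterable of (item_code, classification, price) tuples, price > 0.
    max_ratio: hypothetical tolerance; a record is flagged when its price is
    more than max_ratio times above or below the median for its class.
    """
    records = list(records)
    prices_by_class = defaultdict(list)
    for _, classification, price in records:
        prices_by_class[classification].append(price)
    medians = {c: median(p) for c, p in prices_by_class.items()}

    flagged = []
    for item_code, classification, price in records:
        ratio = price / medians[classification]
        if ratio > max_ratio or ratio < 1.0 / max_ratio:
            flagged.append((item_code, classification, price))
    return flagged


# Hypothetical records after joining on item code with the price database.
records = [
    ("A1", "office stationery", 2.5),
    ("A2", "office stationery", 4.0),
    ("A3", "office stationery", 3.0),
    ("A4", "office stationery", 650.0),   # a desktop filed as stationery
    ("B1", "computers", 700.0),
    ("B2", "computers", 820.0),
    ("B3", "computers", 640.0),
]

for item in flag_price_outliers(records):
    print("check classification:", item)
```

A rule this simple would miss cheap items filed under an expensive category with few members, so it complements clustering of the descriptions and does not replace it.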

Overheard: “Just give them a dump of all the data. They will use data science to make the insights pop out”

The excessive mathematization of finance [25] and projects such as Google Flu Trends [22] can be looked at as high-profile illustrations of the myth of the wizards. These failures also show that the blame doesn’t always attach to the domain experts, but often to the data scientists who fail to question their own assumptions about a complex system (such as financial markets, the weather, or human behaviour).

The broader point is that exploratory projects often involve significant leaps of faith. Domain experts are best placed to judge whether these are reasonable.
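The three between-quadrant effects can be condensed into a small lookup. The sketch below is only a reading of the descriptions above and not a transcription of Figure 3. The function name, the grouping of Q1/Q2 as generalist and Q3/Q4 as specialist quadrants, and the choice to return `None` for pairings the text does not discuss are assumptions.

```python
QUADRANTS = {"Q1", "Q2", "Q3", "Q4"}
GENERALIST = {"Q1", "Q2"}   # quadrants the text associates with generalists
SPECIALIST = {"Q3", "Q4"}   # quadrants the text associates with specialists


def between_quadrant_delusion(task, resource):
    """Name the delusion for a (task, resource) pairing, or None if the
    pairing is not one of the three mismatches described in the text."""
    if task not in QUADRANTS or resource not in QUADRANTS:
        raise ValueError("task and resource must each be one of Q1..Q4")
    if task == "Q4" and resource == "Q3":
        return "The Wonderful Wizards"
    if task in SPECIALIST and resource in GENERALIST:
        return "Lipstick on a Pig"
    if task in GENERALIST and resource in SPECIALIST:
        return "The Tyranny of Low-Hanging Fruit"
    return None


# Print every pairing the text flags as a mismatch.
for task in sorted(QUADRANTS):
    for resource in sorted(QUADRANTS):
        name = between_quadrant_delusion(task, resource)
        if name is not None:
            print(f"{task} task, {resource} resource: {name}")
```

Pairings on the diagonal, where task and resource share a quadrant, return `None` here; the confusions that arise there are the within-quadrant effects.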

Within-Quadrant Effects

4. Shibboleths: the Q3 Hodgepodge

The broad definition of data science has made teams much more heterogeneous, which is no doubt a good thing in many ways. The downside is that there are many misunderstandings due to the differing backgrounds. Firstly, disciplines differ in terminology: what is an observation model for one may be a sensor model for another; the term kernel may evoke a different concept for each; when a statistician mentions covariates a machine-learning scientist thinks of features, and when the former mentions hypothesis-testing the latter thinks of a quick escape. Secondly, disciplines differ in what they place emphasis on: system-building vs. prediction vs. inference, and so on [13]. These issues can affect not only day-to-day interactions but also hiring decisions.

Illustration: An economist applies to a data science team with a predominance of computer scientists. She has completed her postgraduate thesis on how certain trade policies affect the economy, followed by a couple of years of quantitative work in the industry. Her interview focuses almost entirely on how much coding effort her thesis involved, followed inevitably by a rejection.

Overheard: “I can code that entire thesis in two weeks”

5. Technology Giveth And Technology Taketh Away: Shrinking Q1

The opportunities created by technology are at risk of being eroded by the next stage of its evolution. Most Q1 problems can be solved today by push-button software (once the data is in the right place and in the right format). And as awareness of machine-learning techniques grows among business analysts and managers, greater automation will help them take over the data scientists’ shrinking role [26].

To what extent will data science be automated? Though opinion is divided on this question [27, 28], the consensus seems to be that expert-level (say Q3) roles will remain relevant for the next few years at least, though practitioners of deep learning may disagree [21].

Data scientists whose jobs have been reduced to chaperoning a tool may find it useful to reskill themselves.

6. They Get the Data, You Do the Science: Q1/Q3 Fragmentation

This delusion applies more to the computing skills dimension, which is not represented in the delusion matrix (these are the octants where strong computing skills are required but not too much domain or modeling expertise).

Once the realization sets in that hiring a unicorn data scientist is next to impossible, it is possible to go too far in the opposite direction: hire one person (or team) for data ingestion, one for data manipulation, one for data integration, one for data modeling, and finally one for statistical analysis and machine learning. While such compartmentalization may be essential for taking data science tools to production, it is downright harmful in the exploratory stage. If the hiring manager is looking for “Sqoop” and “Informatica” experts before there is a problem to be solved, he or she may be bringing back the very red tape which data and algorithm democratization were supposed to cut through [29]. The data scientist will spend half his or her time waiting for the right data to be ingested, and then waste most of the remaining time reprocessing the data according to the requirements of the algorithm. The reason is that the tools for data science often need to be matched carefully to the task at hand. A wide range of tools and libraries (e.g., on top of Hadoop/Spark) may have to be explored before choosing the best one. Investing in the full cycle of production activities for each would be very inefficient. For instance, while processing time-series (TS) data, whether we expect to keep one TS in memory at a time or slices of many TS together would drive the choice of library.

A Personal Takeaway

It is possible to derive a set of dos and don’ts based on the above delusions (e.g., do not hire data scientists unless you have a problem to solve, do guard against too much homogeneity in your data science group, etc.), though I would hesitate to frame these issues in such stark terms outside of any context. Hopefully awareness of these delusions can help reduce some of the teething trouble that data science as a discipline faces in the industry.

As someone still struggling to navigate the data science journey, I think three points are worth stressing:

(i) To a researcher, data science brings wonderful opportunities to do interdisciplinary work at a much faster rate than usual. However, the skills demanded of a data scientist can only be honed over a long period of time [7]. While technological advances make it easy to be lulled into a false sense of expertise, the truth is that each domain, each subfield and each tool demands a period of internalization before a data scientist can handle them with confidence — vita brevis, ars longa.

(ii) There is plenty of work in the data science industry that demands Data Jugglery, Jugaad and Jujitsu [30], and not much else. These skills are extremely valuable, but unless they are accompanied by some core scientific work, a researcher should look at such jobs with suspicion.

(iii) Great data science work is being done in various places by people who go by other names (analyst, software engineer, product head, or just plain old scientist). It is not necessary to be a card-carrying data scientist to do good data science work. Blasphemy it may be to say so, but only time will tell whether the label itself has value, or is only helping create a delusion.

  • Thanks to various colleagues, past and present, with whom I have shared enlightening conversations on this topic.

References

[1] DJ Patil and Hilary Mason. Data Driven: Creating a Data Culture. O’Reilly, https://www.oreilly.com/ideas/data-driven, 2015.

[2] Thomas H Davenport. Competing on analytics. Harvard Business Review, 84(1):98, https://hbr.org/2006/01/competing-on-analytics, 2006.

[3] Drew Conway. The data science venn diagram. http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram, 2010.

[4] Karl Broman. I am a data scientist. https://kbroman.wordpress.com/2016/04/08/i-am-a-data-scientist/, 2016.

[5] DJ Patil/Chau Tu. 10 questions for the nation’s first chief data scientist. http://www.sciencefriday.com/articles/10-questions-for-the-nations-first-chief-data-scientist/, 2016.

[6] Sophie Chou, William Li, and Ramesh Sridharan. Democratizing data science. KDD, http://blog.sophiechou.com/tag/democratizing-data-science/, 2014.

[7] Vincent Granville. Fake data science. http://www.analyticbridge.com/profiles/blogs/fake-data-science, 2013.

[8] Robin Bloor. A data science rant. http://insideanalysis.com/2013/08/a-data-science-rant/, 2013.

[9] Gil Press. Data science: What’s the half-life of a buzzword. http://www.forbes.com/sites/gilpress/2013/08/19/data-science-whats-the-half-life-of-a-buzzword/, 2013.

[10] Sophie Chou. What can be achieved by data science. https://mathbabe.org/2014/08/18/what-can-be-achieved-by-data-science/, 2014.

[11] Karl Broman. Data science is statistics. https://kbroman.wordpress.com/2013/04/05/data-science-is-statistics/, 2013.

[12] Andrew Gelman. Statistics is the least important part of data science. http://andrewgelman.com/2013/11/14/statistics-least-important-part-data-science/, 2013.

[13] Leo Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statist. Sci., 16(3):199–231, 08 2001. doi: 10.1214/ss/1009213726. http://dx.doi.org/10.1214/ss/1009213726.

[14] Larry Wasserman. Data science: The end of statistics. https://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/, 2013.

[15] Peter Naur. The science of datalogy. Commun. ACM, 9(7):485, July 1966. ISSN 0001-0782. doi: 10.1145/365719.366510. http://doi.acm.org/10.1145/365719.366510.

[16] Peter Naur (Wikipedia). https://en.wikipedia.org/wiki/Peter_Naur, 2016.

[17] Cathy O’Neil. Statisticians aren’t the problem for data science. The real problem is too many posers. https://mathbabe.org/2012/07/31/statisticians-arent-the-problem-for-data-science-the-real-problem-is-too-many-posers/, 2012.

[18] Michael Mout. What is wrong with definition of data science. http://www.kdnuggets.com/2013/12/what-is-wrong-with-definition-data-science.html, 2013.

[19] Michael Hochster. What is data science (quora). https://www.quora.com/What-is-data-science, 2014.

[20] Michael Li. Two types of data scientists: Which is right for your needs? http://data-informed.com/two-types-of-data-scientists-which-is-right-for-your-needs/, 2015.

[21] George Leopold. Machine learning tools to automate data science. https://www.datanami.com/2015/10/19/machine-learning-tool-seeks-to-automate-data-science/, 2015.

[22] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. The parable of google flu: Traps in big data analysis. Science, 343(6176):1203–1205, 2014.

[23] Greta Roberts. Stop hiring data scientists if you’re not ready for data science. http://www.talentanalytics.com/blog/stop-hiring-data-scientists-if-youre-not-ready-for-data-science/, 2015.

[24] Yanir Seroussi. You don’t need a data scientist yet. https://yanirseroussi.com/2015/08/24/you-dont-need-a-data-scientist-yet/, 2015.

[25] Nicolas Bouleau. On excessive mathematization, symptoms, diagnosis and philosophical bases for real world knowledge. Real World Economics, 57:90–105, 2011.

[26] Barb Darrow. Data science is still white hot, but nothing lasts forever. http://fortune.com/2015/05/21/data-science-white-hot/, 2015.

[27] Gregory Piatetsky. Data scientists automated and unemployed by 2025? http://www.kdnuggets.com/2015/05/data-scientists-automated-2025.html, 2015.

[28] Sandro Saitta and Nestle Nespresso. Data science automation: Debunking misconceptions. http://www.kdnuggets.com/2016/08/data-science-automation-debunking-misconceptions.html, 2016.

[29] Sunile Manjee. Pluralism and secularity in a big data ecosystem. http://blogs.teradata.com/data-points/pluralism-secularity-big-data-ecosystem/, 2015.

[30] DJ Patil. Data Jujitsu: the art of turning data into product. O’Reilly Media, Inc., http://radar.oreilly.com/2012/07/data-jujitsu.html, 2012.