As a B-list celebrity data scientist, and skeptic of the underspecified, overhyped “Data Science” movement, I was so glad to find David Donoho’s critical take in 50 Years of Data Science, which has made its way around the Internet. Read it now. I suppose it should really be called 53 years of Data Science, but 50 is a popular number of things to have something of.
This paper narrates a strong Statistics-based take on Data Science, one which rightly punctures much of the puffery around this term and “Big Data.” Ultimately, it proposes its own better Statistics-based take. The smack-down is rewarding, but reading it given my Engineering-based take on Data Science, it looks like an attack on a partial straw-man. Along the way to arguing that Data Science can’t be much more than Statistics, it fails to contemplate Data Engineering, which I’d argue is most of what Data Science is and Statistics is not.
Through section 2.1, the paper might be summarized thus:
For better or worse, something called “Data Science” is a big deal. Whatever it is seems to be co-opting Statistics as a mere piece, maybe even not so important, of a new and larger discipline. The purported reason that this is more than just Statistics is the size of data, but that’s nonsense because sometimes Statistics has grappled with large data in the past.
It offers a few anecdotes in support, and the premise is sound. Statistics and numerical computing have been intertwined for decades, and as in all areas of computing, we always crave ways to analyze a little more data, a little more quickly.
However, it doesn’t follow that scale never qualitatively changes a field. Scale has changed the nature of Statistics itself over time. The field’s origins lie in an era of too little data; it evolved techniques to compensate for too-small sample sizes, and optimized for ease of manual calculation. Compare that to the very example cited in the paper: punch cards to make tallying a census feasible. That is hardly what William Sealy Gosset would have imagined as his field.
Sections 2.2–2.4 may contain the single central blind spot in this world-view. I might paraphrase them as:
We’re told “Data Science” and “Big Data” are different because they require different technical skills and new jobs. But all of these new skills are just so much running to stand still. Just look at how much software and work it takes to find a maximum of things in Hadoop’s MapReduce! It’s hopelessly primitive compared to existing tools.
It may go without saying that virtually nobody would now use MapReduce directly for this. Frameworks like Apache Crunch, Cascading, or especially Apache Spark have long since made this nearly a one-liner within the Hadoop ecosystem alone. The MapReduce example is chosen exactly to maximize exposition of the boilerplate of Mappers and Reducers while requiring minimal exposition of content: it just takes a max. But this is a minor quibble. It’s true that the cart has been firmly in front of the horse in much of the Big Data movement. Too many have assumed that all problems must be solved as huge-data problems first, as a matter of future-proofing. Yes, somewhere, someone has used MapReduce to count a 1MB data set.
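To make the "one-liner" point concrete, here is a minimal sketch in plain Python of the shape of the computation those frameworks distribute: a local max per chunk of data, then a max over the local results. The helper names (`partitions`, `distributed_max`) and the sample `readings` are hypothetical, invented for illustration; the chunks merely stand in for blocks of a large file spread across machines. In Spark's Python API, the equivalent on an RDD is roughly `rdd.max()`.

```python
from functools import reduce
from typing import Iterable, Iterator, List


def partitions(values: List[float], size: int) -> Iterator[List[float]]:
    """Split values into chunks, standing in for blocks of a distributed file."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def distributed_max(chunks: Iterable[List[float]]) -> float:
    # "Map" step: each worker computes the max of its own non-empty chunk.
    local_maxima = (max(chunk) for chunk in chunks if chunk)
    # "Reduce" step: max is associative and commutative, so the per-chunk
    # results can be combined in any order.
    return reduce(max, local_maxima)


readings = [3.2, 9.7, 1.5, 8.8, 4.1, 9.9, 0.3]  # made-up sample data
print(distributed_max(partitions(readings, size=3)))  # prints 9.9
```

The logic really is this small. What MapReduce added was ceremony around it; what the later frameworks did was take the ceremony away again.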
Matlab and R are invoked to show just how mature and elegant their max() functions are in comparison. Ironically, this is the very example many would invoke to explain the problem that Big Data technologies were built to solve: scaling beyond what Matlab or R can accomplish on one big machine. That is, good luck with max() over 10 terabytes of data. Good luck with max() over even 1 gigabyte of data, if that data isn’t already on your workstation.
Really, this argues that if basic things are hard to build in distributed computing, nothing of any sophistication can be implemented there. What does it profit us to scale up primitive techniques? You’d only have to look to projects like Spark’s MLlib, which provides perfectly suitable distributed implementations of, say, Latent Dirichlet Allocation, to see what a few lines of code can accomplish at scale. (Heck, LDA was implemented directly on MapReduce by Apache Mahout back in the day.) MLlib is no R, but it’s already entering the same ballpark.
The section attempts to conclude that more data is no big deal, but Peter Norvig and Google might disagree, given their oft-cited The Unreasonable Effectiveness of Data. It’s an argument from practical experience at web scale that, in a nutshell, more data beats better algorithms. If this is even sort of, sometimes true, then there might just be something of value that’s different from Statistics at scale, and that something might be what people are calling Data Science.
Sections 3–6 survey key shifts in thinking within Statistics as it has already evolved towards “Data Analysis.” Paraphrases follow:
- John Tukey’s The Future of Data Analysis, the 50+-year-old paper behind the title, asserts that Statistics must become concerned with the handling and processing of data, its size, and visualization.
- John Chambers’s S language, the predecessor of R, is the forerunner of the “notebook” concept, where an academic paper can be made runnable, scripted, shareable.
- Leo Breiman’s Two Cultures notes that concern strictly with prediction accuracy is different from inference about models, and that the former is under-represented in academia but prevalent in industry, where it has turned into “machine learning.”
- The “common task framework” inherently values the quality of outputs of a black-box model, not the subjective explanatory power of the model, and this is immensely powerful in practice in driving better outputs.
Section 7 surveys how Data Science is taught today, and Section 8 proposes how it should be presented. The curriculum here is well-thought-out and no less valid than the many worse, less-specific notions of Data Science out there. It is something bigger than pure Statistics, and acknowledges that this something is most certainly to do with data and computing.
Yet, how can this view embrace data and computing here, but decry the scale of data as irrelevant at the beginning? By Section 8, scale is merely dismissed as a detail. Data integration and wrangling are given a paragraph as well, with a recommendation to adopt the excellent so-called Hadleyverse (after its primary mover, Hadley Wickham) of R packages for data manipulation. This itself embeds an assumption that something like R is a sufficient platform for Data Science in 2015.
From where I sit at Cloudera, a Data Scientist with only the skills listed in the proposed curriculum would be a formidably useful addition to any team. However, a team with only these skills would struggle to create a real live production anything. It’s one thing to create an excellent fraud detection model in R, and quite another to build:
- Fault-tolerant ingest of live data at scale that could represent fraudulent actions
- Real-time computation of features based on the data stream
- Serialization, versioning and management of a fraud detection model
- Real-time prediction of fraud based on computed features at scale
- Learning over all historical data
- Incremental update of the production model in near-real-time
- Monitoring, testing, productionization of all of the above
And in the end, in industry, any predictive model of actual value is going to be productionized, and will need to accomplish one or more of the above. Far from a mere detail, it’s 80+% of the work I see people whose job title contains the string “Data Scientist” actually doing.
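To give a feel for how different this work is from fitting a model once in R, here is a deliberately tiny sketch of just two of the items above: computing features from a stream one event at a time, and incrementally updating a model as labels arrive. Everything in it is an assumption made for illustration: the class names (`AccountStats`, `FeatureStore`, `OnlineFraudModel`), the three features, the toy logistic regression trained by stochastic gradient descent, the version counter, and the made-up events with positive amounts. It is not a real fraud detection design, and it ignores fault tolerance, scale, and monitoring entirely, which is rather the point.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AccountStats:
    """Running per-account state, updated one event at a time."""
    count: int = 0
    mean_amount: float = 0.0

    def update(self, amount: float) -> None:
        self.count += 1
        # Incremental mean: no need to keep past events in memory.
        self.mean_amount += (amount - self.mean_amount) / self.count


class FeatureStore:
    """Hypothetical in-memory stand-in for real-time feature computation."""

    def __init__(self) -> None:
        self.stats: Dict[str, AccountStats] = defaultdict(AccountStats)

    def featurize(self, account: str, amount: float) -> List[float]:
        s = self.stats[account]
        # Compare this amount with the account's history *before* this event.
        ratio = amount / s.mean_amount if s.count and s.mean_amount > 0 else 1.0
        is_new = 0.0 if s.count else 1.0
        s.update(amount)
        # Features: bias term, log-scaled amount ratio, first-seen flag.
        return [1.0, math.log1p(ratio), is_new]


@dataclass
class OnlineFraudModel:
    """Toy logistic regression updated by one SGD step per labeled event."""
    weights: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])
    learning_rate: float = 0.1
    version: int = 0  # bumped on every update; a crude stand-in for versioning

    def predict(self, features: List[float]) -> float:
        z = sum(w * x for w, x in zip(self.weights, features))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, features: List[float], is_fraud: bool) -> None:
        error = self.predict(features) - (1.0 if is_fraud else 0.0)
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        self.version += 1


store, model = FeatureStore(), OnlineFraudModel()
# Made-up events: (account, amount, label that arrives later).
events = [("acct-1", 20.0, False), ("acct-1", 25.0, False), ("acct-1", 900.0, True)]
for account, amount, is_fraud in events:
    features = store.featurize(account, amount)
    score = model.predict(features)   # score the event first...
    model.learn(features, is_fraud)   # ...then update once the label is known
    print(f"v{model.version} {account} amount={amount} score={score:.3f}")
```

Even this toy has to decide what state to keep per account, when a feature is computed relative to the event it describes, and which version of the model produced which score. None of those questions appear when the model lives in a notebook.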
The most common argument is that the concerns above are simply something else, and not Data Science; they’re Data Engineering, or Software Engineering. Yes. Then again, much of the curriculum presented in the paper is something else: pure Statistics.
The paper presents an excellent picture of most of what the broad umbrella of Data Science should cover. I would argue that Software Engineering remains as big a part, and can’t be written out of the Data Science story.