Full Stack Data Science @ Cube26
Let us start with some facts. Being a data scientist is very much the sexiest job of the 21st century (especially since the infernal HBR article said so). We’re definitely collecting more data than we can analyze. However, neither big data nor data science as a discipline is news. Nate Silver, in his book The Signal and the Noise, implies very clearly that the Big Data problem has existed since Gutenberg invented the printing press. The cost of producing a book came crashing down once the printing press became popular, and suddenly people had more books than they could hope to read in a lifetime. Without exception, all big data problems are just that: an inability to consume data insightfully and at a rate comparable to the rate at which it is produced. It is no secret that organizations and individuals both suffer from this problem, and as countless experts have opined, it is the solution to this problem that ultimately makes or breaks a company. There is no dearth of literature arguing that developing a data-driven culture is the best, and possibly the only, way to avoid being overwhelmed by data and to turn it instead into a marketable, saleable resource. (For a short read on the subject, refer to http://www.oreilly.com/data/free/data-driven.csp)
As engineers or scientists, we ought to take data seriously as a matter of course. (Some of us also think that data science is just science. Indeed, there can be no science without data.) We are trained to trust only evidence backed by data. Surprisingly, this trait, which is supposed to be hardwired into our minds and which we should have by default, has come to be looked upon as a qualification or a vocational skill. This has happened because of the disparity between the cultural projection of data science and the actual underlying discipline.
At Cube26, we work every day to bridge this gap. Our approach to data science is what I like to call “Full Stack Data Science”. I’ve always maintained that real data scientists are not spreadsheet junkies. Their deliverables are data-driven products, not plots, graphs and charts. Data scientists are people who deliver end-to-end pipelines that leverage an organization’s data. (I don’t need to emphasize how much value an organization’s data holds; generations of data scientists will sustain themselves on it.) At worst, these pipelines deliver tangible business insights. At best, they deliver “smart” data-driven products: products that sift through volumes of information to equip their users with everything relevant they need to make a decision. (The latter is the quintessential Big Data problem; there’s nothing more to it, really.) Most professionals would agree with this idea, but very few fully embrace it. That is not because they don’t want to or cannot. It is primarily because thinking of data scientists as real software developers, complete with the same restrictions and the same freedoms, is a cultural novelty, the kind of thing argued only by a hipster or an idealist. Only the best organizations have bridged this gap. Thus, my primary contention in this post is that data scientists are just people who write software (albeit very specific kinds of software) and therefore need to embrace the first principles and best practices of software engineering at large. Anthony Goldbloom, CEO and founder of Kaggle, agrees that most winning models come from data scientists who follow good software engineering practices (https://youtu.be/8KzjARKIgTo?t=1419).
At this point, it might seem that I am unduly emphasizing the software engineering aspect of data science over the mathematical and statistical aspect. Let me clarify: there is absolutely no doubt that a great deal of knowledge of statistics and machine learning is essential to good data science, and this cannot be said often enough. However, it is equally important to understand that without the best practices of software development, you quickly reach the limit of what you can do with your data. At Cube26, we expect our data scientists to work on all aspects of a data-driven product, from its inception to its maintenance. The typical development cycle consists of:
- R&D involved in building the machine learning models needed by our products
- building a prototype of the machine learning model
- porting the model to the required platform (web, mobile or desktop)
- maintaining the infrastructure and pipelines required to drive the application — this often involves knowledge of QA systems and automation
All this can’t be achieved without exploiting the full extent of an engineer’s ability as a data scientist and as a software developer.
It is understandable that individual data scientists might not fully agree with this. Developing best practices, I believe, is a matter of culture and environment — it’s not a matter of skill. We have been working towards creating this very attitude at Cube26.
In the widely cited essay The Cathedral and the Bazaar, Eric S. Raymond compares two software development models. In the Cathedral model, a highly exclusive group of developers works on projects to which access is very restricted. In the Bazaar model, by contrast, developers work in a much more inclusive fashion and in full public view, which leaves the work far more open to inspection and criticism, and therefore far more conducive to improvement. It is easy to see a parallel to these models in the data science community. Just as in software development, it is high time data scientists stepped out of the cathedral and walked into the bazaar.