We live in an era of uncertainty. It is uncertain how the economy will fare in the aftermath of COVID-19 (fast or slow recovery, which regions and industry sectors will be most affected, etc.). It is uncertain how technical people will be working from now on (physical vs. remote sites). And, of course, it is still uncertain what exactly a Data Scientist (DS) is, and what a DS should (and should not) be doing in the industry.
Data Science? Not my favorite nomenclature…
The well-known academic researcher Peter Flach (for 10 years the Editor-in-Chief of the Machine Learning journal) has recently published an article in which he says that Data Science is not a very good name for the field. The main reason behind this statement is that “Data Science” is prone to misguided interpretations, such as assuming that Physicians, Biochemists or Civil Engineers are Data Scientists just because they work intensively with data (i.e. they are data-driven). Thus, Prof. Flach prefers the term “Science of Data”, defining it as follows: “(…) subject that studies data in all its manifestations, together with methods and algorithms to manipulate, analyse, visualize and enrich data. It is methodologically close to computer science and statistics, combining theoretical, algorithmic and empirical work (…)”. And it is important to stress this: I fully agree with him.
Nevertheless, there is a trend in the industry of pushing for “full stack data scientists”. There are numerous articles out there that support this trend…but I leave you with an interesting shortlist for your reference:
“Full Stack data scientists” are just yet another facet of the AI hype.
According to this trend, these mythological individuals should be capable of understanding the business problem; performing root cause analysis and deriving hypotheses (as a generic Big3 strategy consultant would do); preparing all the data they will need, plus the data pipelines needed to put something into production in the cloud; creating, validating and deploying model(s); and monitoring the model(s) in production from a DevOps perspective (is the service working/scaling properly?), from a business perspective (is it delivering the expected target KPIs?), from a scientist's perspective (is it generalizing well? is there any concept drift?) and from an engineer's perspective (does the input data have the expected format?). And, of course, they should be able to present the expected/obtained results to a heterogeneous audience of stakeholders in a concise yet understandable fashion. Finally, and this one is the most important skill, a data scientist must be able to fly! :)
Naturally, I do not share this generalist view of the DS (or, at least, not fully, as you will understand in later parts of this post), because these people tend to be very rare (and, if they do exist, they should be leaders rather than staff/team-member-level data scientists). This new full-stack DS hype raises the expectations of what AI Experts/Data Scientists (terms which I use interchangeably here as a writer's convenience, although they are not quite the same thing) can and should deliver to unrealistic levels. In a short sentence, “full stack data scientists” are just yet another facet of the AI hype. And, as other sectors of our society have shown us, history tends to repeat itself: in this case, the risk is that we will soon be facing yet another AI Winter.
Mayday, mayday…we need Data Scientists to do Data Science!
Data Scientists must be good at doing Data Science. And Data Science problems are already difficult enough to solve per se…imagine if a) you are not a specialist and b) you still need to take care of everything around production data science all by yourself. It looks pretty tough…doesn’t it?
“If I had an hour to solve a problem, I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.” Albert Einstein
Usually, I do not devote time to blog writing. When I see other Data Science Leaders writing one LinkedIn post per day, multiple blog posts per month and even several books every year…I wonder whether they do not sleep or, alternatively, whether they are simply not working on Data Science at all(!!!). Whenever I feel I have something to contribute to the DS community, I prefer to do it on the technical side of things, by participating (as an author, committee member or track chair) in the top peer-reviewed venues in the area. However, this hype issue affects the industry so much that I believe a contribution like this will help other colleagues (both at staff level and at leadership level) to better organize their career and/or learning paths, workload, teamwork and, ultimately, throughput and business impact.
The purpose of this post is to explain why yet another fancy term to define the role of a data scientist is not such a great idea. Moreover, I also point out where the pitfalls of such hype lie and which real problems need to be tackled in order to push the widespread industrial adoption of data science. The ultimate goal (of all of us, I believe) is to turn ML-driven business success cases (predictive and prescriptive analytics) into a business-as-usual standard. At least, that is my main motivation.
Besides the present part, entitled Beginnings, this post has two more parts. In part II, The Fall, I will deconstruct the four key arguments (I-IV) of those who argue that the DS generalist is the way to go (vs. the specialists). Finally, in part III, The Rise, I will present three key ideas to address the real issues behind those problems, including a definition of what a modern data scientist should be doing. And yes, although I prefer Batman to Superman, the blog post’s title has little to do with the Dark Knight movie trilogy.
Curious for more? Wait for the next two parts…in a blog near you.