“Zero-Stack Data Scientist” — Part II, The Fall

Luis Moreira-Matias
9 min read · Sep 16, 2020

--

In Hollywood, it is well known that sequels are rarely better than the original movie. Batman: The Dark Knight Trilogy is a notable exception. I believe this post is another one.

After explaining what many claim to be a “Full-Stack Data Scientist” in Part I of this post, I am going to discuss the main claims that support “the generalist” as the way to go for Data Scientists (DS) in industry. In this post, I present, discuss and deconstruct the four key arguments that fans of “the generalist” use to support their point of view.

Ancient Rome burning.

The four false pillars of the “full-stack data scientist”

(I) Root Cause Analysis is end-to-end — so, a DS also should be!

Root Cause Analysis (RCA) is something strategy consultants have been doing for decades. One of the main pro-generalist arguments is that the causes of a Machine Learning model's low performance lie outside the modelling stage. And, from experience, I can confirm that statement is often true. Now, two questions arise: (1) should RCA be done by data scientists? And, if yes, (2) would owning the entire process help them be more effective at finding the root causes? So far, I am still with the generalists: yes and yes to both questions. But quickly another question arises: what would happen if data scientists owned the project end-to-end? Well, the most likely answer involves delays in project completion, lower user acceptance, scalability troubles… i.e., more root causes to find and a bigger problem. Why? Because they are not experts (e.g., in UX design, DevOps or data engineering)! When you try to do something difficult that you are not an expert in, you will take more time and you will be more likely to make a mistake… or several. Ab uno disce omnes.

What usually goes wrong when specialized DS try to run RCA? They stumble into other teams' walls. Why? Three classics: lack of ownership (aka Social Loafing), lack of communication, and lack of data-driven awareness. If other teams are not willing to step up for their technical/communication mistakes, and/or are not aware of what the data science modules do (input/output) and of the consequences that their (bad) work may have there… your organization has a problem of Culture, Values and, ultimately, Leadership. And if that happens, you can have whatever type of Data Scientist you want… you will always be closer to failure than to… well, pretty much anything else.

Multiple roles in Batman: The Dark Knight Rises.

(II) Multiple Roles bring Communication Overhead

I have to admit this is the argument that makes the least sense to me. If you develop a project in a multidisciplinary team, you need to communicate. If you work alone, you do not need to communicate… but labelling communication as overhead is something that goes beyond my comprehension. Some (well-intentioned individuals, I am sure) argue that Data Science is an immature field and that, consequently, it is difficult to create a blueprint of the final product beforehand, as we typically have in Software Engineering. That is something I can even agree with. Consequently, rapid (re)build-try-fail iteration is key for success. Naturally, that speed boost would come from full project ownership, aka taking the generalist way. And, oops… I found a fallacy here.

In software engineering, you generally already have multiple teams in place, organized by tribes, technologies or functions. They work together to accomplish a single goal… but, in that process, they also have competing agendas… consequently, alignment and communication are key to their success. Naturally, imposing such an amount of meetings, stand-ups and other rituals on all staff-level technical team members may be tough… and thus, also naturally, the role of the Agile Product Owner was born.

Data Science is no different. And, although this is a recent trend, it is a popular one: to have a real impact in an organization, every Data Science team needs an (Agile) Product Owner in place (check its description below). And please… don't tell me that your team does not have one and/or that your organization is too small for that… because then, perhaps, that (the lack of a Product Owner for Data Science and/or of a Team Leader/Head) is where the real problem started in the first place.

(III) Learning by doing is the key!

Nowadays, there is a plethora of online courses on Machine Learning and Data Science, following the early success of Coursera. These courses often come with a promise like “Zero-to-hero: how to become a data scientist in 6 months” (I even found a blog post with that expression). In these courses, it is often said that Data Science is an empirical discipline where learning by doing is key. As in most other jobs, I agree that experience actually doing stuff (vs. only theorizing about it) helps. A lot. The problem is that this expression implies that anybody can become a Data Scientist if they just try hard enough. Without diminishing the merit of people who studied other STEM disciplines (like Physics or Production Engineering) and then, with a lot of effort and persistence, made the difficult transition to a Data Science career… it is a fact that they will struggle to become Professional Data Scientists, as they lack foundations in the topic. In other words, they know when and how to use the available off-the-shelf tools/libraries/methods up to a certain point, but they do not know exactly how those work. Please read more about my thoughts on Professional vs. Citizen Data Scientists in the keynote that I gave last year at IT Arena 2019 (Lviv, Ukraine), where I shared the stage with the likes of Microsoft, Spotify and Uber, among others.

What are the consequences of such ignorance? Firstly, it prevents them from really tailoring their methodology to the business application (a classical step in the CRISP-DM methodology). This means the result will not have as much impact on your business. That makes leaders wonder how they can extract more value from these Data Scientists who are delivering below expectations… and the answer invariably comes down to “getting them more stuff to do”. And that stuff would be DevOps, data engineering or analytics dashboards… does this ring a bell? Of course… it is a Full-Stack Data Scientist! :)

“Any fool can know. The point is to understand.” Albert Einstein

I have been hiring for positions in the all-data-things space (data engineers, data scientists of different flavours, and machine learning researchers) for the last 4 years. For the latter two roles, my interviews always contain a technical Q&A on the candidates' past data science projects, which covers, among other things, the foundations of the methods they used. Let me share with you some of the pearls that I have heard during these interviews over the last few years:

Models trained with Random Forests get better as the number of trees grows;

The difference between Logistic and Linear Regression is that one provides linear models…and the other doesn’t;

(Answer to: why are you using Huber loss when your evaluation metric is RMSE?) In Data Science, we need to try everything first to see what works best. (He had literally tried every linear regression method available in scikit-learn.)

AUROC stands for…Accuracy.

(On comparing SVMs against other learners…) The model hyperparameters used are the default ones. This is a fair comparison, as the performance uplift after tuning them would always be small.

(When asked what to do to build a model from a training set of 500 examples) Use a CNN in Keras. Always. Preferably, a pre-trained one.

If you did not find anything wrong with any of the sentences above, let me tell you right away: data science may not be your best career path.
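To make this concrete, here is a minimal sketch (my own toy setup on synthetic scikit-learn data, nothing from the actual interviews) of why the first and the fifth pearls are wrong: the test AUROC of a Random Forest plateaus once there are enough trees, and an SVM left at its default hyperparameters can sit well below its tuned performance.

# Toy illustration with synthetic data -- exact numbers will vary with the
# dataset, but the qualitative pattern is the point here.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=4000, n_features=30, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Pearl 1: more trees only reduce variance; the test AUROC flattens out.
for n_trees in (5, 50, 500):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
    print(f"Random Forest with {n_trees:3d} trees: test AUROC = {auc:.3f}")

# Pearl 5: the uplift from tuning C and gamma of an RBF-SVM is not "always small".
svm_default = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
print("SVM, default hyperparameters: test accuracy =", round(svm_default.score(X_te, y_te), 3))

svm_tuned = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1, 1]},
    cv=3,
).fit(X_tr, y_tr)
print("SVM, tuned hyperparameters:   test accuracy =", round(svm_tuned.score(X_te, y_te), 3))

The exact figures do not matter; what matters is the shape of the curve in the first case and the size of the tuning gap in the second.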

Cold, Hard, Foundational Technical Expertise: The Kryptonite of the “Full-Stack Data Scientist”.

Working hard is mandatory… but working hard & smart is even better!

Following this line of thought, another argument is that owning the whole development process brings a higher sense of satisfaction. I can already picture the joy on people's faces as they say “I did it, I did it… it took me a month without weekends, but I did it!”. Then, probably, the same full-stack data scientist will spend another month without weekends fixing the bugs discovered after going live… but that is OK, as he/she made an impact :) On a more serious note… there is nothing wrong with working hard… that is key to any career, especially in Data Science! But working smart is equally important. Investing time in the things you are strong at just makes you… stronger! If you doubt it, check how seriously Cristiano Ronaldo trains scoring goals… but he certainly does not practice much goalkeeping :)

Moving the needle…

(IV) ±1% of accuracy does not impact my business…

It is often said that +1% is not a performance uplift worth investing time into. It does not move the needle. Consequently, what you actually need is some folks (generalists) to use some data to build any model that works. That can indeed be achieved in a quick-and-dirty way with some build-try-fail iterations and a lot of copy-paste from online help boards such as StackOverflow. Build an MVP to run a POC. Have a spike… a proof point to show your investors that “yes, we can!”. All good fellas, aren't they?

Do you have a data strategy & governance roadmap? No? Well, that’s bad…

Today, one of the problems of organizations is the lack of a data strategy/roadmap and/or data governance policies. Often, senior leadership (in corporates and start-ups alike) is not aware of what creating and scaling a data-driven business actually means in practice. They have a business plan, sure, and naturally they like the scalability and reduced OPEX that automation brings to it… and, of course, the competitive advantage that doing it fast can bring to their business. But they often forget to look at its darker side. I like to call it becoming a data hostage.

Copy-paste code: $1. To know which code to copy-paste: $10,000.

There is a series of implications to automating your business in a data-driven way. You become dependent on the data you use, which means that if a data provider goes bankrupt and/or a regulatory change forbids using that data in the future… well, you are screwed. You also become dependent on the pipeline's dependencies (packages, development language, versions, etc.), even if they run in some kind of virtual container. Finally, you become dependent on your model: not rarely, a model's performance changes with scale (accept 20% of customers instead of 5%, recommend 10 products instead of 3, and so on). You expect to maintain your business performance (driven by your model's performance) after scaling up… or even improve it. But let me tell you a secret: not every model is capable of that. By then, however, it is too late: you already have a team of citizen data scientists, you already promoted them for their great work (after all, they did the POC, you got the new funding round, you convinced the investors) and you already have a series of processes (dependencies) built by a group of people that is far from specialized in the tasks they need to do… now, at scale.
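As a small illustration of that last point (my own synthetic example, not data from any real project): the very same model, with the very same AUROC, delivers a lower hit rate per accepted customer as the acceptance rate widens from 5% to 20%.

# Synthetic sketch: per-decision quality drops as you scale up the acceptance
# rate, even though the model itself has not changed at all.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=20, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]      # "how good does this customer look"
ranked = np.argsort(scores)[::-1]             # best-scored customers first

for accept_rate in (0.05, 0.10, 0.20):
    k = int(accept_rate * len(ranked))
    precision_at_k = y_te[ranked[:k]].mean()  # share of truly "good" customers among the accepted
    print(f"accepting the top {accept_rate:.0%}: precision = {precision_at_k:.3f}")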

Do you want to scale your business one day? If yes, you need professional DS now.

Let me tell you another secret: if your business has scale (or wants to have it one day) and has a predictive or prescriptive analytics engine at its core… you need professional DSs now. Probably a series of other roles as well… but those, at least, for sure. And if not… well, perhaps your company is not a place where a DS should aim to work.

If you have such an automated data-driven business in place, the impact of +1% in your model's performance is tremendous… regardless of the industry. I can easily point to two examples, from start-up/credit risk (where +0.01 in AUROC translates into +1M USD) and corporate/e-commerce (where -1% in MSE can easily translate into +3M USD in a single quarter), which illustrate that impact well.
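To show how such a mapping can come about, here is a back-of-the-envelope sketch for the credit-risk case. Every number in it (portfolio size, ticket, loss given default, bad rates) is hypothetical and purely illustrative; these are not the figures behind the +1M USD example above.

# Hypothetical numbers only -- the point is the mechanism, not the magnitude.
n_applications = 100_000        # loan applications per year (hypothetical)
approval_rate = 0.30            # we fund the top 30% of scored applicants
avg_loan = 5_000                # USD, average ticket (hypothetical)
loss_given_default = 0.60       # share of the loan lost on each default

# Hypothetical bad rates among the approved population: a slightly better
# ranking (e.g., a small AUROC lift) lets in slightly fewer eventual defaulters.
bad_rate_old_model = 0.050
bad_rate_new_model = 0.045

approved = n_applications * approval_rate
defaults_avoided = approved * (bad_rate_old_model - bad_rate_new_model)
savings = defaults_avoided * avg_loan * loss_given_default
print(f"Defaults avoided per year: {defaults_avoided:.0f}")       # 150
print(f"Expected credit-loss savings: ~{savings:,.0f} USD/year")  # ~450,000

With a larger portfolio or ticket size, the same small shift in the bad rate quickly reaches seven figures.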

Tomorrow…is weekend.

In the next, third and final part of this post, I will go through the reasons why this discussion exists and what the true obstacles to a large-scale adoption of data science in industry really are. Stay tuned for more!

P.S.: I would like to personally thank Fernando Costa, Sven Thies and Jihed Khiari for the time they devoted to reviewing this post. Kudos to the three of them.

<< Zero-Stack Data Scientist — Part I, Beginnings

Zero-Stack Data Scientist — Part III, The Rise >>

--

Luis Moreira-Matias

Ph.D. in Machine Learning. M.Sc. in Software Engineering. Data Science Leader. Data Strategist. Keynote Speaker. Award Winner and Scientific Author in ML/DS.