“Zero-Stack Data Scientist” — Part III, The Rise

Luis Moreira-Matias
10 min read · Oct 6, 2020

The true AI agent — Smith.

“Everything that has a beginning, has an end…Neo.” I know…this is from another trilogy…but even Agent Smith, the movie character who brought me closest to really believing in true artificial general intelligence, would be interested to know how this blog post trilogy unfolds in the end.

In part I, you saw that full-stack data scientists are just another facet of the AI hype and why we don’t need yet another fancy title for the data scientist. In part II, I explained why what we really need is for data scientists to just do data science (which is something very difficult to do, by the way). In this final part, I explain what is really supporting that full-stack hype: (1) faulty organizational strategy and (2) critical but chronically unfilled data science related roles. To end my message with some charm, I leave my humble opinion on what a Modern Data Scientist should actually care about. Curious? Let’s jump right in!

(1) The automation staircase in Industry

As I stated in a recent masterclass in the course “AI for Executives” at U. Halmstad (Sweden), there are four steps that you need to take to turn your company into an AI-driven company:

  • Digital-Driven: Your processes are digital. Most or all of your employees use electronic devices. The data about your business is fully digital and can be accessed/queried autonomously by specialized employees.
  • Data-Driven: Your decisions are data-driven. Every time you make a major decision (hire/fire an employee, invest in a new product, etc.) you base it on cold hard data that you possess and/or collect for the decision-making process. You rely less on the intuition or experience of leaders and, when you do, you complement it with data.
  • Model-Driven: You have Machine Learning/mathematical models in production. They produce forecasts, scenarios and/or suggestions (i.e. Predictive Analytics) that can improve your business. You take the output of these models into consideration for decision making.
  • AI-Driven: You have AI models in production. These models make forecasts and take decisions autonomously. You monitor their performance and their impact on the business closely…but ultimately decisions are driven by machines (i.e. Prescriptive Analytics).

Business leaders want results now. That usually means taking the safe path…and staying away from automation. It also means that they will never be bold enough to “take off towards the moon”…

Most organizations get stuck between steps 1 and 2 of this process. The main reason is that their leaders are not data-aware people. They do not understand the data science development process, its strengths and weaknesses, or how to make a reliable bet on them. The sense of urgency and the pressure to get short-term results push them to be more down to Earth…which usually means never actually being bold enough to take off. That is even more true in Europe, where venture capital is scarce (when compared to the US or China). This pushes them towards the seemingly safer bet of hiring jacks-of-all-trades instead of professional Data Scientists. However, it also prevents them from going far and betting realistically on creating an AI-driven business. This is also why you once worked in that company where you felt that your data science work was making no difference.

(2) The missing roles

It is relatively straightforward to understand the difference between a Data Scientist and a Data Engineer. However, there are 50 shades of gray in the data world (or, in the version more famous among data geeks, 41 shades of blue). Nowadays, a series of new roles is required to both build and successfully execute data-driven and AI-driven business strategies that rely on Machine Learning. However, most organizations ignore this. I would like to shed light on three of those roles:

Schematic of an ML pipeline for Continuous Testing by Google Cloud.

2.1 Machine Learning Engineer/MLOps Experts: Many would classify MLOps as a set of practices to enable CI/CD of data science products. While data engineers have to be able to provide API endpoints and/or other ways to feed/collect data from different company services and practices…ML Engineers should be able to support large-scale experimentation, service deployment and monitoring (i.e., collecting ML metadata). Ideally, these people would be fully fledged software engineers with a lot of knowledge about the latest cloud technologies, but who would also be experienced in putting ML models in production.

2.2 Data Science Product Owner: An Agile enthusiast who communicates well, understands both the big picture and the DS development process, and can align and prioritize what needs to be done with other teams/stakeholders. In many teams (like my current one) this role is played by the manager and/or team leader.

2.3 Chief Data Officer/Chief Data Strategist/Chief Science Officer: Having a stakeholder at Senior Leadership level who understands the data science development process is more and more a key success factor for adoption. This person should ultimately decide (together with his peers) on the roadmap for AI in his organization…and he has to be knowledgeable, but not necessarily hands-on, about the technology and the company’s data to understand what can be done. At the same time, he needs a strong grasp of the business to understand how to maximize short-term impact, maximize adoption of the produced artifacts, control the necessary investment and shape the consequent talent strategy (i.e. how to find/hire/groom/retain exceptional data talent).

(3) What should a (modern) Data Scientist be?

If not end-to-end, the question is: what exactly should a (modern) data scientist be? I have summed up my answer in eight bullets:

3.1 Hands-On: A data scientist may not be end-to-end…but he/she still produces software. Good coding practices (such as documentation, collaborative tools and testing) have to be there. Experience with production systems is nice-to-have…but not mandatory. However, let’s be careful not to go to the other extreme: data science is not pure research work. Arguably, you will not monetize anything that lives only on a whiteboard. Of course, if you make a breakthrough in your area, that will give your company a competitive edge. Still, the bottom line is as follows: blue-sky research has its place in academia and we are not all Einsteins. Doing something tangible that can solve the problem at hand in the near future is the way to go.

3.2 Expertise: He/She must know his/her stuff, meaning that he/she understands when to apply the existing methods and how they actually work from a mathematical/statistical/optimization point of view. Today, there is a plethora of online resources for acquiring this knowledge…but good formal training in computer science basics and STEM is a great starting point.

3.3 Technology Aware: If you are a data scientist working in industry in 2020, you are most likely using either R or Python. You must be aware of the advantages/limitations of these two languages and when to use one and not the other. You have to be aware of how cloud systems work (basics) and how ML models run in production (What is Docker?). Similarly, a Machine Learning Engineer is expected to know the basic differences between a Random Forest and an Artificial Neural Network. Awareness reduces friction.

If even Batman can team up with Superman…why can’t you do the same?

3.4 Teamwork: You are a data scientist…but you will work in a multidisciplinary team. Different people, different skill sets, different career paths…you have to be ready to get the best from these people. And that usually happens when you work as a team. Superman only exists in DC comics…and I always preferred Batman anyway.

3.5 Problem Solving Skills: Most data science problems are true puzzles…almost research problems that could lead to a PhD thesis (if time and funding allowed). There is no free-lunch formula to solve them, nor a single optimal solution. You need to be able to both raise and exclude hypotheses, prioritize solution paths and be as exhaustive as time allows in your solution search process.

3.6 Specialization: Typically, if you want a Data Science team to have impact, you need to have (at least) three work streams inside it: Data Science R&D, Data Science Product Owner and Data Engineering/ML Engineering/MLOps. As a Data Scientist, you may want to specialize in one of the first two work streams. If you are looking to be a great Data Scientist for R&D, you need to pick one or more DS fields to be really good at out of the following (suggestions): NLP, CV, Reinforcement Learning, Predictive Analytics, Recommendation or Search. If you are looking more to be a great Product Owner, I recommend that you specialize in an application domain. Some of the most mainstream ones (my suggestion) are Finance, Mobility, Retail/E-commerce, Healthcare and Automotive.

3.7 Rigorous Evaluation Protocols: “This is for yesterday. I’ve applied function X from library Z…fit, predict, bang! It seems to be working. Let’s deploy it!” Well, this is a movie that I have seen many times…and let me say “Run, Forrest, run!”. One essential thing is that you know what you are doing. Another equally important thing is that you know how to demonstrate…that you know what you are doing. Evaluation protocols determine much of the success or failure of a data science project in the real world. And note…I am not talking only about the evaluation metric or the loss function that you use. I am talking about simulation, large-scale experimentation (e.g. A/B testing), unit testing, baseline setting, single vs. batch tests, global and local interpretability, post-processing and model threshold selection, in-docker tests, business awareness (with business-related metrics), p-values and statistical tests, etc. When we put all these procedures together, following some application logic to validate a hypothesis, we get an evaluation protocol. It is indispensable that, on receiving requirements for a project, you align upfront with your business stakeholders on what that protocol will be:

What does “success” mean? And how will it be measured?

Typically, business stakeholders will give you a qualitative and ambiguous answer to those questions. Translating that into an evaluation framework with quantitative outputs and targets for those numbers is a task for you, data scientist. Selling that framework back to them as something similar to the answer that you got initially…well, that’s also on you. And that is nice: basically, you translate business to tech (Data Science) and then back again…that carries a great sudo-like power…if you can do it well. But…“with great power comes great responsibility”. Which brings me to my next point…
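To make the idea of an evaluation protocol a bit more concrete, here is a minimal sketch in Python of how the answer to “what does success mean?” could be written down before any modelling starts. Everything in it is an assumption for illustration purposes: the `Protocol` class, the function names, the thresholds and the per-fold error numbers are all made up, and the paired sign-flip permutation test is just one of many statistical tests that you could agree on with your stakeholders.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Protocol:
    """Hypothetical success criteria, agreed upfront with the stakeholders."""
    metric_name: str        # error metric, lower is better
    min_improvement: float  # smallest gain over the baseline worth deploying
    alpha: float = 0.05     # significance level for the statistical test


def paired_sign_flip_pvalue(diffs, n_resamples=10_000, seed=0):
    """One-sided paired permutation test on the mean of the differences."""
    rng = random.Random(seed)
    observed = mean(diffs)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if mean(flipped) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


def evaluate(protocol, baseline_errors, candidate_errors):
    """Compare per-fold errors of a candidate model against a baseline."""
    diffs = [b - c for b, c in zip(baseline_errors, candidate_errors)]
    improvement = mean(diffs)
    p_value = paired_sign_flip_pvalue(diffs)
    success = improvement >= protocol.min_improvement and p_value < protocol.alpha
    return {"improvement": improvement, "p_value": p_value, "success": success}


# Made-up per-fold errors (same folds for both models), for illustration only.
protocol = Protocol(metric_name="MAE", min_improvement=0.5)
baseline = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3]
candidate = [9.1, 9.0, 9.6, 9.2, 9.3, 9.5, 9.0, 9.4]
result = evaluate(protocol, baseline, candidate)
print(result)
```

The point of the sketch is the order of events rather than the specific test: the baseline, the metric, the minimum improvement and the significance level are fixed first, and the candidate model is only declared a “success” if it clears all of them.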

3.8 Storytelling and Communication: Tell me something that I can understand. Sometimes I say in my keynotes that Data Scientists have to be smart enough to connect the Business Requirements, the Existing Data and the Available Mathematical Tools/Software Libraries in order to create value. What I often forget to say next is that they need to be able to explain it as if their audience were 5-year-old kids; otherwise, they may be developing solutions that will not be understood, valued, trusted or used in practice (example quoted here). And let me tell you two things about that: to explain something well, you need to know it better…and to explain complex stuff in simple terms requires a lot of expertise :)

Dixi. Finis.

Data Scientists need to be doing…data science. Learning about data science, maths, optimization, statistics, methods, applications…solutions, advantages and shortcomings. It requires attention to detail. An ability to learn something fast and to adapt to new problems. Data Science is not a list of keywords (and even less a tech stack). Data Science is about connecting the dots. Trying, learning and failing…but trying something that is worth trying out. Something that makes sense…something sound from a statistical/mathematical point of view. Data Science is about having the infinite curiosity of asking “what if?” and, at the same time, of trying to answer that question. It is about doing it better and not simply getting it done. It is about showing by A+B that our solution can add up to X% in revenue to our company. The best way for a data scientist to add more value to their organization…is by doing data science. At least until AutoML completely takes over the Data Scientist job… :) Dixi.

In case you are still not convinced at this point (well, thank you for reading anyway), let me share a small spoiler with you: soon, I will be publishing a walk-through of a data science solution for an application case (i.e. Credit Risk Assessment), highlighting how important it is to know your stuff before approaching real-world data (yes, data geeks…I will be providing free real-world data) to solve a problem. Stay tuned on my LinkedIn for updates.

P.S.: Please feel free to contact me to provide feedback, to ask for career guidance on how to become a better “zero-stack data scientist” (data scientist folks) or to request consultancy services on how to create and implement a successful “zero-stack data strategy” in your business (founders and business owners).

P.P.S.: I would like to personally thank Fernando Costa, Sven Thies, Jihed Khiari and Marcia Oliveira for the time they devoted to reviewing this post. Kudos to the four of them.

Luis Moreira-Matias

Ph.D. in Machine Learning. M.Sc. in Software Engineering. Data Science Leader. Data Strategist. Keynote Speaker. Award Winner and Scientific Author on ML/DS.