Scaling the Wall Between Data Scientist and Data Engineer

Byron Allen
May 22, 2019 · 6 min read
Source: Chris Gonzalez

This is the first article in a three-part series on production ML and the intersection between data science and engineering. The other two are Trawling Twitter for Trollish Tweets and Deploying an ML Model to Production using GCP and MLflow.

One of the most exciting things in machine learning (ML) today, for me at least, is not at the bleeding edge of deep learning or reinforcement learning. Rather, it is how models are managed and how data scientists and data engineers collaborate effectively as teams. Navigating those waters will lead organisations towards a more effective and sustainable application of ML.

Sadly, there is a divide between "scientist" and "engineer": a wall, so to speak. Andy Konwinski, Co-founder and VP of Product at Databricks, along with others, points to some key hurdles in a recent blog post about MLflow: "Building production machine learning applications is challenging because there is no standard way to record experiments, ensure reproducible runs, and manage and deploy models."

The genesis of many major challenges in applying ML today, whether technical, commercial, or societal, is the shifting of data over time, coupled with how ML artifacts are managed and used. A model can perform exceptionally well at first, but if the underlying data drifts and artifacts are not used to assess performance, the model will neither generalise well nor be updated appropriately. This problem falls into a gray area inhabited by both data scientists and engineers.
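To make "data drift" concrete, here is a minimal, hypothetical check: it flags a feature whose live mean has moved away from the training mean by more than a chosen fraction of the training standard deviation. The function names and the 0.5-sigma threshold are illustrative assumptions; real systems would use richer tests (e.g. a Kolmogorov-Smirnov test or the population stability index).

```python
import random
import statistics

def mean_shift(train, live):
    """How far the live mean has moved from the training mean,
    measured in training standard deviations."""
    mu = statistics.mean(train)
    sigma = statistics.stdev(train)
    return abs(statistics.mean(live) - mu) / sigma

def drifted(train, live, threshold=0.5):
    """Flag drift when the shift exceeds a use-case-specific threshold."""
    return mean_shift(train, live) > threshold

random.seed(7)
train = [random.gauss(0.0, 1.0) for _ in range(2000)]  # data at training time
live = [random.gauss(0.8, 1.0) for _ in range(2000)]   # production data, mean shifted

print(drifted(train, train))  # False: identical data, no drift
print(drifted(train, live))   # True: mean moved by roughly 0.8 sigma
```

Without a check like this running continuously against production data, the drift described above goes unnoticed until the model's predictions have already lost their value.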

Source: burak kostak

In other words, the crux of the problem is that the principles of CI/CD are missing in ML. It doesn't matter how good a 'black box' model you can create: if your environment changes (for example, the input data) and the model isn't regularly assessed against what it was built to do, it loses its relevance and value over time. This is a hard issue to tackle because the people feeding the data in (engineers) and the people who designed the model (scientists) don't have the happiest of marriages.

There are tangible examples of this challenge. Think about all those predictions saying Hillary Clinton was going to win, among several other ML goofs. From self-driving cars killing an innocent pedestrian to prejudiced AIs, there have been some large missteps, which I would argue generally have their origins in the gray area between data science and engineering.

Source: Kayla Velasquez

That said, ML impacts our society, negatively and positively alike. More positive, and slightly less commercial, examples include electricityMap, which uses ML to map the environmental impact of electricity all over the world; ML in cancer research, which is helping us detect several cancer types earlier and more accurately; and AI-driven sensors helping agriculture meet skyrocketing global demand for food.

The Wall

With that in mind, it’s critical to get production ML and more specifically model management right. However, coming back to the point, data scientists and data engineers don’t always speak the same language.

It is not uncommon for a data scientist to lack an understanding of how their models should live in an environment that continuously ingests new data, integrates new code, is called by end-users, and can fail in a variety of ways from time to time (i.e. a production environment). On the other side of the divide, many data engineers do not understand enough about machine learning to understand what they are putting into production and the ramifications for the organisation.

Far too often have these two roles operated without enough consideration for one another despite the fact that they occupy the same space. “That’s not my job” is not the right approach. To produce something that is reliable, sustainable, and adaptable, both roles must work together more effectively.

Scaling the Wall

The first step to speaking each other’s language is to build a common vocabulary — to have some kind of standardisation of the semantics, and therefore how the challenge is, or tangential challenges are, discussed. Naturally, this is fraught with challenges — just ask several different people what a data lake is and you’re likely to get at least two different answers, if not more.

I’ve developed common reference points that I call the ProductionML Value Chain and ProductionML Framework.

ProductionML Value Chain

We’ve broken the process of productionising ML into five overlapping concepts that are too often considered separately. While it may seem that introducing a holistic framework like this would increase complexity and interdependency, in practice those complexities and interdependencies already exist; ignoring them is just kicking the problem down the line.

By allowing for consideration of neighbouring concepts in the design of your production ML pipeline, you begin to introduce that elusive reliability, sustainability, and adaptability.

ProductionML Framework

The ProductionML Value Chain is a high-level description of what is required to operate a data science and engineering team for the purpose of deploying models to end users. Naturally, there is also a more technical and detailed view; I call that the ProductionML Framework (some might call it Continuous Intelligence).

ProductionML Framework

This framework was developed after several rounds of experimentation with commercial MLOps tools, open source options, and the development of an internal PoC. It is meant to guide the future development of ProductionML projects, particularly the aspects of production ML that require input from both data scientists and engineers.

Data Science in orange and Data Engineering / Devops in blue

If you’re not familiar with those aspects, the colour coding above distinguishes the two roles: data science in orange, data engineering/DevOps in blue.

As you can see, the “Training Performance Tracking” mechanism (e.g. MLflow) and the Govern mechanism are centrally situated in this architecture. That is because every artifact, including metrics, parameters, and graphs, must be archived during the training and testing stages. Moreover, what is called Model Management is fundamentally tied to how the model is governed, which leverages those model artifacts.
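As an illustration of what a tracking mechanism such as MLflow records, here is a deliberately simplified, hypothetical in-memory tracker (not MLflow's actual API): each training run archives its parameters and metrics, and the archive can later be queried when deciding which model to promote.

```python
import time

class RunTracker:
    """Toy stand-in for an MLflow-style tracking server: archives the
    parameters and metrics of every training/testing run."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        """Archive one run's artifacts and return its id."""
        run_id = len(self.runs)
        self.runs.append({
            "run_id": run_id,
            "logged_at": time.time(),
            "params": params,
            "metrics": metrics,
        })
        return run_id

    def best_run(self, metric, higher_is_better=True):
        """Query the archive for the strongest run on a given metric."""
        ordered = sorted(self.runs, key=lambda r: r["metrics"][metric],
                         reverse=higher_is_better)
        return ordered[0]

tracker = RunTracker()
tracker.log_run({"max_depth": 3}, {"auc": 0.81})
tracker.log_run({"max_depth": 7}, {"auc": 0.86})
print(tracker.best_run("auc")["params"])  # {'max_depth': 7}
```

The point is that promotion decisions downstream are only possible because every run, not just the winning one, was archived with its context.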

The Govern mechanism combines artifacts and business rules to promote the appropriate model, or estimator to be more specific, to production while labeling others according to rules specific to the use case. This is also called model versioning, but the term ‘govern’ is used to avoid confusion with version control and emphasise the central role that the mechanism plays in overseeing model management.
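A sketch of what such a govern step might look like in practice: it combines made-up business rules (a minimum AUC and a maximum serving latency, both illustrative assumptions, not part of the framework itself) with a comparison against the incumbent production model, then labels the candidate accordingly.

```python
def govern(candidate, incumbent=None, min_auc=0.75, max_latency_ms=200):
    """Rule-based promotion: label the candidate estimator 'production'
    only if it satisfies the business rules and beats the incumbent;
    otherwise label it 'archived'."""
    meets_rules = (candidate["auc"] >= min_auc
                   and candidate["latency_ms"] <= max_latency_ms)
    beats_incumbent = incumbent is None or candidate["auc"] > incumbent["auc"]
    return "production" if meets_rules and beats_incumbent else "archived"

current = {"auc": 0.82, "latency_ms": 120}
print(govern({"auc": 0.86, "latency_ms": 150}, current))  # production
print(govern({"auc": 0.86, "latency_ms": 400}, current))  # archived: too slow
print(govern({"auc": 0.80, "latency_ms": 100}, current))  # archived: no improvement
```

Note that the labels drive behaviour, not deletion: runs that lose the promotion decision stay in the archive, which is exactly what distinguishes governance from simple version control.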

A Golden Gun?

We’re all on this journey together. We’re all trying to scale the wall. There are a lot of great tools entering the market, but to date, no one has a golden gun…

Source: mrgarethm — Golden Gun — International Spy Museum

MLflow makes great strides from my perspective: it answers certain questions around model management and artifact archiving. Other products similarly address relatively specific issues, though their strengths may lie in other parts of the ProductionML Value Chain; Google Cloud ML Engine and AWS SageMaker are examples. Recently, GCP made AutoML Tables available in beta, but even that does not deliver everything required out of the box, although it comes much closer.

With that continued disparity in mind, it is absolutely critical to have a common vocabulary and framework as a foundation between scientist and engineer.

Is the wall too tall? From my experience, the answer is no, but that’s not to say ProductionML is not complex.

This article is the first in a three-part series related to ProductionML. Stay tuned for the next two.

Obligatory James Bond Quotes

M: So if I heard correctly, Scaramanga got away — in a car that sprouted wings!

Q: Oh, that’s perfectly feasible, sir. As a matter of fact, we’re working on one now.

Perhaps that’s how you should get over that wall…

Byron Allen
Texan transplant to Australia turned Australian transplant to the UK | ML Engineer | Senior Consultant at Servian

Servian, The Cloud & Data Professionals: we design, deliver and manage innovative data & analytics, digital, customer engagement and cloud solutions that help you sustain competitive advantage.
