Clean Up Your Data Science Mess

Fabien Durand
Published in Deepflow AI
6 min read · Jul 10, 2020

In Data Science, moving from experimentation to production is a challenge. Here’s how you can clean up the mess by implementing DataOps practices. This post was originally published on the Deepflow blog.

From experimentation to production, the road ahead is not paved. While trial and error is the essence of the data scientist’s life, critical AI systems need clarity. For example, if you built a credit-scoring AI that automatically grants loans to your customers, you need to understand how it decides who should or shouldn’t receive a loan. Ensuring that transparency is the data scientist’s job, but it doesn’t happen by itself.

For a data scientist, a typical workflow looks like this: you find a dataset, explore it in a notebook, try modeling techniques, fail, repeat… until it works. After hours of adding new elements to a project, it’s not uncommon to find yourself with a sprawling monster of data sources, notebooks and CSVs. That’s when you start asking yourself: how do I put that in production? The “mess” that is an inherent part of the data exploration and modeling process now prevents a smooth transition of the project to production. Progress starts to stall and time to production keeps getting longer. To add to the pressure, your boss wants your solution live before the end of the week. But what took months of iterations and back-and-forth won’t magically turn into production code. It’s a problem that engineers know all too well: avoiding technical debt by refactoring legacy code. Messy and hard-to-understand systems are not only a technical challenge; they routinely impact the bottom line of companies big and small (read our article on Explainable AI). Needless to say, you need to clean up the mess.
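A concrete first step out of the mess is to pull logic out of notebook cells into small, named, testable functions. Here is a hypothetical sketch of such a refactor; the record fields and the cleaning rule are invented for illustration:

```python
# Hypothetical refactor: exploratory cleaning logic that once lived in a
# notebook cell becomes a named, documented, testable function.

def drop_invalid_loans(rows: list[dict]) -> list[dict]:
    """Keep only loan records with a positive amount and a known customer id."""
    return [
        row for row in rows
        if row.get("amount", 0) > 0 and row.get("customer_id") is not None
    ]

# Usage: the same call now works in the notebook, in a pipeline, and in a test.
sample = [
    {"customer_id": 1, "amount": 1200},
    {"customer_id": 2, "amount": -50},    # refund rows are filtered out
    {"customer_id": None, "amount": 300}  # unknown customers are filtered out
]
print(drop_invalid_loans(sample))  # -> [{'customer_id': 1, 'amount': 1200}]
```

The point is not the cleaning rule itself but the shape: once the logic has a name and a signature, it can be versioned, reviewed and tested like any other production code.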

From Jupyter notebook to production, an allegory.

The hand-off of projects from “research” scientists to “production” engineers is more standardized in large companies, where R&D is its own entity and processes are in place to facilitate communication. Data teams in small organizations bear the brunt of limited resources and short development cycles, which adds even more pressure to avoid “messy” projects.

(In Enterprise AI) The most exciting problems yet to be solved are in the deployment and serving space. One reason for the lack of serving solutions is the lack of communication between researchers and production engineers. At companies that can afford to pursue AI research (e.g. big companies), the research team is separated from the deployment team, and the two teams only communicate via the p-managers: product managers, program managers, project managers. Small companies, whose employees can see the entire stack, are constrained by their immediate product needs. Only a few startups, usually those founded by accomplished researchers with enough funding to hire accomplished engineers, have managed to bridge the gap. These startups are poised to take a big chunk of the AI tooling market. — Chip Huyen, “What I learned from looking at 200 machine learning tools”

Putting your AI models in production

Moving AI in production

In the software industry, the practice of managing the communication and hand-offs between development and IT operations is called DevOps. In practical terms, DevOps refers to the collection of industry practices that standardize how code is created, delivered, deployed and maintained in a consistent way from start to finish within the same team. It introduces two foundational concepts: Continuous Integration (CI) and Continuous Delivery (CD). While CI automates the building, packaging and testing of applications, CD automates the delivery of the applications to the different environments (e.g. development, testing, production).
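To make CI concrete in a data science setting, here is a minimal, hypothetical sketch of the kind of automated checks a CI server could run on every commit. The `score_customer` function, its formula and the gates are all invented for illustration:

```python
# ci_checks.py — a hypothetical test module a CI pipeline could run on each
# commit. The toy scoring model and its acceptance gates are illustrative only.

def score_customer(income: float, debt: float) -> float:
    """Toy credit score: higher income and lower debt yield a higher score."""
    if income <= 0:
        return 0.0
    return max(0.0, min(1.0, 1.0 - debt / income))

def test_scores_are_probabilities():
    # Gate 1: the model must always output a value in [0, 1].
    for income, debt in [(50_000, 10_000), (0, 5_000), (30_000, 60_000)]:
        assert 0.0 <= score_customer(income, debt) <= 1.0

def test_frozen_reference_cases():
    # Gate 2: behavior on a few frozen reference cases must not drift.
    assert score_customer(100_000, 0) == 1.0
    assert score_customer(0, 1_000) == 0.0

if __name__ == "__main__":
    test_scores_are_probabilities()
    test_frozen_reference_cases()
    print("all CI checks passed")
```

In a real pipeline these checks would run automatically on every push (e.g. via a CI service), so a model that breaks its own contract never reaches the delivery stage.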

Data Science works in a very similar way. After all, it’s another flavor of software engineering. But because of some of the differences discussed in the previous part, DevOps principles cannot be transposed exactly as they are. That’s why DataOps is gaining interest in the industry: it goes beyond DevOps to offer a tailored approach to managing Data Science projects and avoiding the “mess” from start to finish.

DevOps vs. DataOps — The DataOps Cookbook 2nd Ed.

The main reason for DataOps as a separate approach is that, contrary to traditional engineering teams, Data Science teams are often composed of many non-engineer profiles, like analysts or research-focused data scientists. Their skill set is different from engineers’. Their goal is not to push production-ready code but to find how data can be leveraged to answer specific problems, explore datasets to find answers, and build predictive models. Their focus is on improving the accuracy of their models and producing actionable visualizations.

Another differentiating aspect is the range of tools used. The typical mindset of an engineering team is to create a standardized “sandbox” (an isolated development environment to qualify code before pushing it to production) for every engineer in the team. When working on the same project, every engineer’s sandbox looks exactly the same, so that individual changes can be merged easily, speeding up development and improving the quality of the overall code base. In the context of data teams, creating a standardized environment is much more challenging. Data scientists and analysts tend to use their own tools; the exploratory nature of their work makes it almost a necessity. In this context, we prefer to talk about “orchestration” rather than “building” (see the previous graph).
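A toy illustration of what “orchestration” means in practice: each contributor’s step stays a black box (their own tool, their own environment), and the orchestrator only defines the order and hands data from one step to the next. All names and steps here are hypothetical:

```python
# Minimal orchestration sketch: the pipeline coordinates heterogeneous steps
# without standardizing how each step is built internally.

from typing import Any, Callable

def run_pipeline(steps: list[tuple[str, Callable[[Any], Any]]], data: Any) -> Any:
    """Run named steps in order, feeding each step's output to the next."""
    for name, step in steps:
        data = step(data)
        print(f"step '{name}' done")
    return data

# Each step could wrap anything: a notebook export, an R script, a SQL job...
pipeline = [
    ("ingest",  lambda _: [3, 1, 2]),          # e.g. an analyst's extraction
    ("clean",   lambda xs: sorted(xs)),        # e.g. a data scientist's notebook
    ("feature", lambda xs: [x * 10 for x in xs]),  # e.g. an engineer's job
]
result = run_pipeline(pipeline, None)
print(result)  # -> [10, 20, 30]
```

Real orchestrators add scheduling, retries and lineage tracking on top, but the core idea is the same: coordinate heterogeneous work rather than force a single build environment.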

In conclusion, while DevOps deals with managing the complexity of the engineering life cycle through centralization (i.e. one DevOps team per company), DataOps aims to manage the freedom of Data Science through orchestration (i.e. data teams are local and distributed across the organization).

Continuous Delivery for Machine Learning end-to-end process — CD4ML

The future of Enterprise AI: business leaders with AI skills

As highlighted in the graph below, the biggest change in AI is happening behind the curtain. AI researchers, while still expensive and scarce, are no longer the coveted unicorns they once were.

Deloitte Insights — AI in the Enterprise

After years of educating themselves, an increasing number of companies are realizing that AI research is overkill for their business. The already available algorithms and techniques are more than sufficient to tackle their existing problems; there is no need to reinvent the wheel. What is missing is the ability for organizations to turn the state of the art into working solutions. Business leaders and managers need to be trained to understand and utilize existing AI tools.

Many companies rush into the AI race without clear objectives, hope a brilliant AI researcher and a technology team can create something great without guidance, and end up with little to show for it. Recruiting an AI quarterback to provide the business input, and ensuring success with well-defined metrics, is the most important job that most companies miss. — Beck, Davenport, and Libert, “The AI roles some companies forget to fill”

Track your data team progress and reduce time to production.

With workflows that are production-ready by design, Deepflow makes it easy to manage the versioning, CI/CD, testing and monitoring of your models from conception to execution.

>>Join the next batch of early adopters<<
