MLOps Series, Part 2 — How we understand MLOps at DataSentics

Berkamilanml · DataSentics · Mar 4, 2021

In the first part of the series, we went over the problems (and their causes) that companies face when authoring a data science (machine learning) solution and productionalizing it. This “ability to productionalize” is absolutely crucial: without it, even the greatest data science solution will end up in some dark vault (recycle bin) and will never affect anything. When we talk about this ability, we often use the term “MLOps”, and in this article we will unravel how we think about MLOps at DataSentics.

What is MLOps

First, let us reiterate what we exactly mean by MLOps (or AIOps), as this term is slowly becoming a new buzzword. In DataSentics, we say:

MLOps is a set of practices for managing and streamlining the lifecycle of machine learning (ML) models, from development all the way to production.

MLOps is not a platform, it is not tooling, it is not a single process; it is an entire system, or culture, of how to productionalize machine learning models / data science solutions. And the better this system is, the faster and more robust the productionalization of machine learning solutions becomes.

Why DevOps != MLOps?

People who are familiar with software development are probably asking: why yet another Ops? Can’t we just go with the existing and already mature DevOps methodology, with all its tools and processes? Why MLOps? The problem is that ML-powered application development != software development. Current DevOps practices work (very well) for standard software development. However, developing ML-powered applications is a different beast. I am a big fan of Andrej Karpathy’s article which popularized the terms “Software 1.0” for standard software applications and “Software 2.0” for machine learning applications, differentiating the two approaches. Now let’s go over what the two actually represent:

Software 1.0 vs Software 2.0

In Software 1.0, the logic of the application is captured in code written by a software developer. When data comes in and the logic/code is applied, we get the desired outcome. In Software 2.0, the logic of the application is captured by a machine learning model, which is trained by a data scientist on top of real data. The word “training” is important: it means we use statistical methods (whether a neural network or linear regression) to which we feed data, and the statistical methods output a “model” which encapsulates the logic we desire. Basically, the statistical method is writing the code for us :) I see I have used the word “statistics” a lot, but let’s just say that “statistics = machine learning”; I admit machine learning has a nicer ring to it.
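To make the contrast concrete, here is a minimal sketch (the fraud-detection scenario, thresholds, and toy data are purely illustrative and not from this article): in Software 1.0 a developer writes the decision logic by hand, while in Software 2.0 a statistical method learns it from labeled data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Software 1.0: a developer writes the logic explicitly.
def is_fraud_v1(amount: float, hour: int) -> bool:
    # A human chose these thresholds by hand.
    return amount > 1000 and (hour < 6 or hour > 22)

# Software 2.0: the logic is learned ("trained") from data.
X = np.array([[1200, 3], [50, 14], [900, 23], [20, 10], [1500, 2], [30, 12]])  # [amount, hour]
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = fraud, labeled from real outcomes

model = LogisticRegression().fit(X, y)  # the statistical method "writes the code" for us

def is_fraud_v2(amount: float, hour: int) -> bool:
    return bool(model.predict([[amount, hour]])[0])

print(is_fraud_v1(1300, 1), is_fraud_v2(1300, 1))
```

When the world changes, the fix in Software 1.0 is to edit `is_fraud_v1` by hand; in Software 2.0, it is to retrain `model` on fresher data, which is exactly the difference discussed next.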

Of course, as time goes by, the underlying data or the behavior desired from a system may change, so the logic of a model trained some time ago will no longer reflect the current world and will become outdated. Similarly to Software 1.0, where software developers have to rewrite the code to fix the logic, in Software 2.0 we have to retrain the model on the latest data, which will (hopefully) get the logic back on track.

Looking at it this way, MLOps can be seen as DevOps for Software 2.0.

Components of the machine learning model lifecycle

Now, let’s dive deeper into how models are developed, turned into an application, and maintained; we call this the machine learning model lifecycle.

[Figure: Machine learning model lifecycle components and processes]

As can be seen from the picture above, the ML model lifecycle can be very colorful. We can break it down into several steps/components:

  1. Business comes up with a problem to solve
  2. Data scientists think about how to solve this problem using data and machine learning
  3. Then they go out and try to find the right data
  4. They gather the right data
  5. They experiment and design new features for the model
  6. They train and fine-tune the model
  7. Once they have the model validated and are happy with the result, they register it (a minimal tracking-and-registration sketch follows this list)
  8. Then the process of deploying the model (and the model features!!) to production is initiated. The goal is to make the model available to other business processes which can start leveraging it. The deployment strategy can vary depending on the type of deployment (as part of a batch prediction pipeline, as a standalone API, baking the model into an existing application, etc.). This is usually where the machine learning engineers take over.
  9. When the model runs in production, we need to monitor its performance, which includes the health of the service running the model, the statistical quality of the predictions, and also the statistical quality of the input data.
  10. When the model starts to deteriorate, it should be revisited: for instance, it can be retrained automatically, or data scientists should take a look and perhaps replace it altogether with a completely new model.
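As an illustration of steps 6–7 (and the hand-over to step 8), the sketch below uses MLflow for experiment tracking and model registration. This is only one possible tooling choice, not something prescribed here; the dataset, the parameters, and the model name `demo_classifier` are made up, and an MLflow tracking server with a model-registry backend is assumed to be configured.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 100, "max_depth": 5}

with mlflow.start_run():
    # Step 6: train and fine-tune the model.
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    # Record how this particular model version was produced
    # (parameters and metrics; code and data versions would be logged too).
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

    # Step 7: once validated, register the model so the deployment
    # process (step 8) has a single, versioned place to pick it up from.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="demo_classifier",  # hypothetical name
    )
```

The important part is not the specific library but the discipline: every registered model version can be traced back to the parameters, metrics, code, and data that produced it.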

This entire ML lifecycle should be supported by a strong MLOps system, which must ensure that we:

  • have a reliable source of data and the ability to turn the data into inputs for the models: easy access to data, means to turn raw data into model input data, versioning of the data, legal aspects, GDPR, …
  • have means to do the training efficiently: enough computational power, support for necessary libraries, a unified environment, tests, data availability, …
  • know how we trained the model: which version of the training code, which version of data/features, which parameters, etc. produced the particular version of the model, …
  • have a solid (re)deployment process: turning the model into an application, testing, turning the training code into an application, …
  • have solid operations on top of the model application: we know which version of the model is running in production, monitoring of the application, monitoring of the input/output data, …
  • can react to a changing world and the subsequent deterioration of the model: model performance monitoring and alerting process, retraining process, … (a simple drift-check sketch follows this list)
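To make the last two points more tangible, here is one simple way to monitor input data: compare the distribution a feature had at training time with what the model sees in production, and alert when they differ significantly. The two-sample Kolmogorov–Smirnov test, the threshold, and the synthetic data below are illustrative choices, not a recommendation of a specific method.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_has_drifted(train_values, live_values, p_threshold=0.01) -> bool:
    """True if live data differs significantly from the training data
    according to a two-sample Kolmogorov-Smirnov test."""
    result = ks_2samp(train_values, live_values)
    return result.pvalue < p_threshold

# Synthetic example: transaction amounts at training time vs. in production.
rng = np.random.default_rng(seed=0)
train_amounts = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)
live_amounts = rng.lognormal(mean=3.5, sigma=1.2, size=1_000)  # distribution has shifted

if feature_has_drifted(train_amounts, live_amounts):
    print("ALERT: input drift detected -> trigger retraining or a data science review")
```

In practice, the alert would feed into the retraining or review process rather than just printing a message, but the principle is the same: the MLOps system, not a person, notices that the world has moved on.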

Multiple people have to talk to each other during this process: business guys with the data scientists, data scientists with the data engineers and ML engineers, ML engineers with the platform engineers and ops guys, and so on. And they don’t always understand each other very well. It is absolutely essential to have the right interfaces between all those. And this is what MLOps means to us.

MLOps subtopics at DataSentics

The problem is indeed very broad. Internally, we split it further into several subtopics:

  • Feature Store (a central place for storing and managing features, i.e. the data inputs to machine learning models, across the company)
  • Experiment tracking (place where to track data scientific experiment runs)
  • Model registry (a place to store the “ML model bundles”: model artifacts and other necessary files/metadata about the model)
  • Model deployment (the process of building, testing, and deploying both the model training application and the model serving application; a simplified serving sketch follows this list)
  • Model operations (monitoring / logging / retraining / optimizing the model application)
  • Model reproducibility & portability & interpretability (explainability)
  • Standardization & reusability (template / API for data scientists to ease the development and deployment)
  • ExplainOps (management of model “explainers” and getting the explanation along with prediction)
  • MLaaS (Machine Learning as a Service = utilizing specialized ML services for common ML tasks, such as Azure Cognitive Services for face recognition or AWS Polly for text-to-speech)
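To give one concrete (and heavily simplified) picture of the “Model deployment” subtopic, the sketch below wraps a previously trained model as a standalone prediction API. FastAPI, joblib, and the `model.joblib` artifact are assumptions made for this illustration; they are not the tooling referred to later in this article.

```python
import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# In a real setup this artifact would be pulled from the model registry;
# here we simply assume a serialized model file exists next to the app.
model = joblib.load("model.joblib")

class PredictionRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictionRequest):
    prediction = model.predict(np.array([request.features]))
    return {"prediction": prediction.tolist()[0]}
```

Served this way (e.g. with `uvicorn main:app`, assuming the file is named `main.py`), the model becomes available to other business processes, and keeping this service healthy is exactly what the “Model operations” subtopic above is about.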

Getting all these aspects right ensures a great and seamless model development and productionalization experience for all people involved.

Conclusion

Our view on “MLOps” is probably a little broader than how others understand it. It is not only about deploying the model artifact as an API: for us, it is really important to think about the model lifecycle holistically and cover all its steps and components.

This view stems from our experience running AI-powered products (both our own and our clients’). As we saw the same problems and patterns emerging over and over again, our engineering team decided to build an entire array of tools, frameworks, and best-practice standards (called the “AI Suite”) which addresses the aforementioned complexities and which we now use for the development of every new ML-powered product.

We hope you enjoyed the article. We are very interested in your take on MLOps, so definitely don’t hesitate to share your opinions with us :) Also, let us know if you find particular topics compelling and we can zoom in on them in the following articles; or contact us directly and we can have a chat about it!
