A Quick Intro to MLOps

ML Code vs ML System, MLOps vs DevOps, and MLOps Maturity Stages

Fabio Chiusano
NLPlanet
9 min read · Jun 6, 2022


Hello fellow NLP enthusiasts! Today we take a break from talking about new state-of-the-art models and discuss the challenges that our ML models face when released in production. Indeed, the ML model is just a single part of the whole ML system that allows our projects to be successful in the real world. In this article, we talk about MLOps practices and how to incrementally adopt them in our projects. Enjoy! 😄

Differences Between ML Code and ML System

Today all the necessary requirements for building great Machine Learning (ML) models are within reach, such as:

  • The availability of huge amounts of data;
  • Cheap, fast, and on-demand computing resources with specialized accelerators;
  • Continuous advancements in the ML field.

However, in a typical machine learning project, the challenge is not building the ML model, but building a whole ML system that allows the ML model to operate correctly and efficiently in production.

Indeed, a complete ML system deals with several other aspects such as data collection, data verification, testing, model analysis, monitoring, the serving infrastructure, and so on. All these tasks contribute to the success of the project, but they require different competencies that are not always covered by data scientists, who specialize more in modeling and less in software engineering and automation.

The difference between ML code and ML system. Image from https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning.

This problem resembles what the software development industry faced in its early stages, which brought about the evolution of a set of practices that are today referred to by the term DevOps (Development + Operations). By operations, we mean the people and processes responsible for activities such as operating production software applications, monitoring system performance, testing the application after any changes are made, and tuning and releasing updated software systems.

As we’ll see later in this article, many of these principles apply to ML projects as well, and they constitute a set of practices referred to as MLOps (Machine Learning + Operations) or ModelOps (Model + Operations).

Differences Between MLOps and DevOps

Let’s start with a brief overview of what DevOps is.

DevOps is a set of practices that standardize and streamline how we release software into a production environment. These practices advocate for testing and automation and, among other things, have evolved into the CI/CD (i.e. Continuous Integration and Continuous Delivery) pipelines that we see today.

  • By Continuous Integration, we mean the automatic building and testing of newly written code and its merging into a shared code base, which team members update continuously every day. In this way, the shared code base is always in a production-ready, deployable state.
  • If new changes are successfully integrated into the shared code base, Continuous Delivery allows for the automatic deployment of the updated software.

Using fully automated CI/CD pipelines allows for short release cycles and thus quick feedback on new product updates, which is essential when working with Agile methodologies.

Now that our knowledge of DevOps is refreshed, we may ask ourselves: can we use the DevOps practices, developed for software systems, for ML systems as well?

Since an ML system is inherently a software system, we can definitely apply DevOps practices to ML systems too. However, there are some differences between typical software projects and ML projects:

  • Team skills: Data scientists may not have the software engineering competencies (e.g. design patterns, DRY principle, testing, linting) to write production-level code.
  • Development process: Data scientists need to quickly experiment with different solutions to the business problem to find the best one. For this reason, they often work in environments that allow fast experimentation and data visualization (e.g. Jupyter Notebook or Jupyter Lab), which often leads to low-quality code that needs refactoring before the model is released into production.
  • Testing: In an ML system we need to (1) test the data fed to our model during training and prediction (e.g. check that the data always has the same schema), (2) test the quality of the model outputs (i.e. check that the new model predicts good results on a set of test data of interest), and (3) run all the unit/integration/end-to-end tests that we would run in a software project (see the sketch after this list).
  • Deployment: A new ML model may be trained and deployed automatically as a consequence of a data drift detected on new data.
  • Production: Even if the deployed ML system passes all the tests, its performance may deteriorate over time due to drift in the incoming data. For this reason, the model’s performance should be monitored.
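To make the ML-specific tests more concrete, here is a minimal sketch in Python. The schema, the quality threshold, and the pandas/scikit-learn stack are illustrative assumptions, not a prescription; in a real project these checks would live in a test suite run by your CI pipeline.

```python
# Minimal sketch of ML-specific tests. Assumptions: pandas DataFrames
# as model input, a scikit-learn-like model, and illustrative names
# (EXPECTED_SCHEMA, QUALITY_THRESHOLD).
import pandas as pd
from sklearn.metrics import accuracy_score

# (1) Data test: prediction inputs must keep the training-time schema.
EXPECTED_SCHEMA = {"age": "int64", "income": "float64"}

def check_input_schema(batch: pd.DataFrame) -> None:
    assert list(batch.columns) == list(EXPECTED_SCHEMA), "unexpected columns"
    for col, dtype in EXPECTED_SCHEMA.items():
        assert str(batch[col].dtype) == dtype, f"wrong dtype for {col}"

# (2) Model quality test: a new model must clear a minimum bar on a
# curated test set before it can be deployed.
QUALITY_THRESHOLD = 0.85

def check_model_quality(model, X_test, y_test) -> None:
    accuracy = accuracy_score(y_test, model.predict(X_test))
    assert accuracy >= QUALITY_THRESHOLD, f"accuracy too low: {accuracy:.3f}"
```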

Considering these differences and integrating them with DevOps principles, we get to a new set of practices nowadays called MLOps.

MLOps as the intersection of Machine Learning, DevOps, and Data Engineering. Image from https://en.wikipedia.org/wiki/MLOps.

As with DevOps, MLOps practices should be implemented step by step, thinking about what makes sense for your specific project needs. Let’s see how to incrementally adopt MLOps practices by studying different MLOps maturity stages.

Incrementally Adopting MLOps Practices with Maturity Stages

Let’s see three different MLOps maturity stages.

Stage 1: Manual Process

A person passing an ML model to another person, with love. Modified photo obtained from a photo by Kelly Sikkema on Unsplash.

This is the starting stage of most ML projects. It’s a manual process: the data scientists create the models first, and then the operations team releases them into production.

In this stage, data scientists train models within Jupyter notebooks, locally on their machines or in the cloud. Once a trained model has been exported, the data scientists and the operations team work together to write the code for the prediction module, which loads the trained model and exposes an API for prediction. This code is then released into production by the operations team. Both the training and prediction code are versioned with Git, even though it’s not easy to collaborate on Jupyter Notebook files, since Git cannot compute meaningful diffs on them. Training data is usually stored in unstructured storage (e.g. Google Drive or AWS S3) and tracked with Git LFS.
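To make the prediction module more concrete, here is a minimal sketch. It assumes a scikit-learn model serialized with joblib and FastAPI as the web framework; both the file name and the framework choice are illustrative.

```python
# Minimal sketch of the prediction module. Assumptions: a scikit-learn
# model serialized to "model.joblib", and FastAPI as the web framework;
# both the file name and the framework are illustrative choices.
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # the trained model exported by the data scientist

class PredictionRequest(BaseModel):
    features: List[float]  # one flat feature vector per request

@app.post("/predict")
def predict(request: PredictionRequest):
    # scikit-learn expects a 2D array: one row per sample.
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}
```

The service could then be started with a command like `uvicorn prediction_service:app` (assuming the file is named prediction_service.py).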

Here are some additional improvements that you can make to your MLOps process in this stage:

  • Code testing: Write unit tests for the code used inside the training notebooks and the prediction services. You may want to write a continuous integration pipeline that executes tests from Jupyter notebooks, or just execute the tests locally.
  • Train models on dedicated machines: Data scientists may experiment locally on their machines, and then train the final version of the model on bigger, more optimized, on-demand machines in the cloud. Nowadays it’s very simple to get a Jupyter notebook-like environment on on-demand machines from several cloud providers (e.g. Amazon SageMaker Notebook Instances, Google Cloud Platform Vertex AI Workbench).
  • Make experiments reproducible: By keeping track of the model lineage (e.g. the code used for training, the runtime environment, the dataset, the hyperparameters), it’s possible to reproduce experiments by re-training models and checking their metrics (see the sketch below).
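As an example of tracking model lineage, here is a minimal sketch using MLflow, one popular option for this job. The function names and the logged parameters are illustrative assumptions.

```python
# Minimal sketch of experiment lineage tracking with MLflow (one popular
# option; train_fn, the dataset path, and the logged names are
# illustrative assumptions).
import hashlib

import mlflow

def tracked_training_run(train_fn, dataset_path: str, hyperparams: dict):
    with mlflow.start_run():
        # Log the dataset identity and the hyperparameters, so the run
        # can be reproduced later with the same inputs.
        with open(dataset_path, "rb") as f:
            mlflow.log_param("dataset_md5", hashlib.md5(f.read()).hexdigest())
        mlflow.log_params(hyperparams)

        model, metrics = train_fn(dataset_path, **hyperparams)

        # Log the resulting metrics next to the parameters that produced them.
        for name, value in metrics.items():
            mlflow.log_metric(name, value)
        return model
```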

This MLOps stage is usually good enough for small to medium projects where the models need to be updated at most two or three times per year. More specifically, this stage works well when (1) models don’t need to be continuously re-trained due to a very dynamic environment, and (2) you only need a good-enough solution to your business problem, as it makes little economic sense to test many other models just to squeeze out marginal performance improvements.

Stage 2: Manual Process + Model Quality Monitoring

Photo by Luke Chesser on Unsplash

The next MLOps stage is similar to the previous one, as model training and release into production are still manual processes. However, we now tackle the very important task of monitoring model quality, which makes sure that the whole system is still solving the problem it was designed for. There are multiple ways of checking that the model is still working as intended:

  • One way is to check for data drift in the input data used at prediction time. This means that each prediction input must be saved and, once enough data has accumulated, the distribution of the latest batch of inputs must be compared to the distribution of the data used to train the model. If the distributions differ too much, we may raise an alert, since the model is predicting on data different from the data it was trained on (see the sketch after this list).
  • A better check can be done by monitoring the quality of the model outputs. In this case, we must collect both the prediction inputs and outputs, which are then used, together with ground-truth labels, to compute the quality metrics that we need to monitor. Since ground truth is required, the metrics can be computed by an automatic evaluation job once the ground-truth labels have been collected. If the quality metrics decrease too much, we may want to retrain the model on new data (e.g. the previous training data plus the new labeled samples used during evaluation). When comparing the performance of different models, pay attention to evaluating them on data that was not used to train them.
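As a concrete example of the first check, here is a minimal data-drift sketch using a two-sample Kolmogorov-Smirnov test from SciPy on a single numeric feature. The p-value threshold is an illustrative assumption, and a production system would check every feature (or use a dedicated drift-detection library).

```python
# Minimal data-drift check sketch: compare the distribution of a numeric
# feature in recent prediction inputs against the training data using a
# two-sample Kolmogorov-Smirnov test. The p-value threshold is an
# illustrative assumption; production systems check every feature.
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01

def feature_has_drifted(train_values: np.ndarray, recent_values: np.ndarray) -> bool:
    statistic, p_value = ks_2samp(train_values, recent_values)
    # A small p-value means the two samples are unlikely to come from
    # the same distribution, i.e. a possible drift worth alerting on.
    return p_value < P_VALUE_THRESHOLD
```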

Similar to how we make experiments reproducible by keeping track of the model lineage, it’s preferable to do the same with the metrics lineage, i.e. keeping track of which model and which datasets were used to compute each metric.

This MLOps stage is suitable for projects similar to those of the first MLOps stage, but where we are not sure about the stationarity of the prediction data and may therefore need to re-train the model up to around ten times per year. Nonetheless, it’s always advisable to monitor the quality of your models’ predictions, especially when they have a high business impact.

Stage 3: Full CI/CD/CT

Photo by Denys Nevozhai on Unsplash

The third MLOps stage is actually a big step forward: in this stage, data scientists provide packaged training code to the operations team, who can then train models on their own with different datasets or hyperparameters. This means that operations now have the full power to automate all the possible MLOps steps and build CI/CD/CT pipelines (where CT stands for Continuous Training, a new step in the continuous paradigm).

In this stage, the data scientists still experiment in Jupyter notebooks, but then refactor the training code of the final model into Python packages (help from operations may be needed, depending on individual competencies) and push it to a Git repository. The training code is parameterized and can be run with different datasets and hyperparameters, as in the sketch below.
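Here is a minimal sketch of what such a parameterized training entry point could look like. The CLI flags, the CSV format, and the logistic regression model are illustrative assumptions.

```python
# Minimal sketch of a parameterized training entry point, e.g.
# my_package/train.py. The flags and the model choice are illustrative.
import argparse

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-data", required=True)   # CSV with a "label" column
    parser.add_argument("--model-out", required=True)    # where to save the model
    parser.add_argument("--C", type=float, default=1.0)  # regularization strength
    args = parser.parse_args()

    df = pd.read_csv(args.train_data)
    X, y = df.drop(columns=["label"]), df["label"]

    model = LogisticRegression(C=args.C)
    model.fit(X, y)
    joblib.dump(model, args.model_out)

if __name__ == "__main__":
    main()
```

Operations can then launch runs like `python -m my_package.train --train-data data/train.csv --model-out model.joblib --C 0.5` with different datasets and hyperparameters (the paths and package name are hypothetical).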

Using the training code, the operations team can write pipelines that automate steps such as:

  • Re-training the model when its performance deteriorates (i.e. Continuous Training).
  • Hyperparameter tuning in parallel in the cloud.
  • Deploying a new model into production, replacing the previous one, if the model quality tests succeed and its metrics are better (see the sketch after this list).
  • The reproduction of experiments.
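As an example of the deployment step, here is a minimal sketch of the promotion gate logic: the newly trained (challenger) model is promoted only if it passes the quality bar and beats the current (champion) model on held-out data that neither saw during training. The evaluate and deploy helpers are hypothetical placeholders for your own evaluation and serving code.

```python
# Minimal sketch of the deployment gate in a CI/CD/CT pipeline: promote
# the newly trained (challenger) model only if it passes the quality
# tests and beats the current (champion) model on held-out data.
# evaluate() and deploy() are hypothetical helpers standing in for your
# own evaluation and serving code.
def maybe_promote(champion, challenger, X_holdout, y_holdout,
                  evaluate, deploy, min_quality=0.85):
    challenger_score = evaluate(challenger, X_holdout, y_holdout)
    champion_score = evaluate(champion, X_holdout, y_holdout)

    if challenger_score >= min_quality and challenger_score > champion_score:
        deploy(challenger)
        return True  # the challenger becomes the new champion
    return False     # keep serving the current champion
```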

Moreover, as the training code is now inside a Python package, it becomes easier to create a Continuous Integration pipeline that runs tests on it.

This last MLOps stage is currently the standard way of operating in the biggest and most ML-savvy companies. It’s recommended for large projects where lots of data scientists must collaborate and there’s huge business value involved that justifies building such an ML system.

Conclusions and next steps

In this article, we learned about the differences between building an ML model and releasing a reliable ML system. Then, we talked about DevOps practices and discussed the differences between typical software projects and ML projects. Last, we saw how to incrementally adopt MLOps practices and when it makes sense for your project to adopt more of them.


Thank you for reading! If you are interested in learning more about NLP, remember to follow NLPlanet on Medium, LinkedIn, Twitter, and join our new Discord server!
