The MLOps Playbook: Best Practices for Ensuring Reliability of ML Systems

Author: Mo Messidi, a seasoned DataOps leader at Headspace with a decade of experience in both startups and enterprises. His mission is to help organizations operationalize their data.

This article presents a simple, yet comprehensive, set of MLOps best practices for organizations to assess the production readiness of machine learning systems. It has also proved beneficial for assessing off-the-shelf MLOps platforms for feature and functionality completeness.

Software engineering has a mature body of best practices for producing trustworthy software, but comparable practices for operating machine learning systems are still in their infancy.

Stage A: Model Development

  • Model code, data, parameters, and metrics are version controlled

It is important to know the code, data, and artifacts that produced a model. You can do this by having good version control for the model specification, including hyper-parameters and experiment artifacts. This ensures reproducibility, enables rollbacks, and de-risks system changes.
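
As a rough illustration, the sketch below records what produced a model in a single manifest. The function names (`fingerprint`, `build_manifest`) and the manifest fields are hypothetical, chosen only for this example; a real setup would more likely rely on a dedicated experiment-tracking or data-versioning tool.

```python
import hashlib
import json

def fingerprint(obj) -> str:
    """Return a stable SHA-256 hex digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def build_manifest(code_version, data_rows, params, metrics):
    """Bundle everything needed to identify a training run (hypothetical layout)."""
    manifest = {
        "code_version": code_version,         # e.g. a git commit hash
        "data_hash": fingerprint(data_rows),  # hash of the training data
        "params": params,                     # hyper-parameters
        "metrics": metrics,                   # evaluation results
    }
    # The run id is derived from the content, so identical inputs give the same id.
    manifest["run_id"] = fingerprint(manifest)[:12]
    return manifest
```

Storing such a manifest next to each model artifact makes it possible to tell which code, data, and parameters to restore when rolling back.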

  • A simpler model is confirmed not to be better

The more complex a model, the higher its cost to operate, so added complexity should be justified by a measurable gain over a simpler baseline. Adding a complexity tax to model assessment equations can help reveal the true incremental value of a given model.
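
One possible form of such a tax is a linear penalty on operating cost. The formula, the cost units, and the numbers below are assumptions made purely for illustration, not a prescribed method.

```python
def adjusted_score(quality, cost, tax_rate):
    """Quality metric minus a penalty proportional to operating cost.

    `cost` is in arbitrary relative units and `tax_rate` is a hypothetical
    knob an organization would have to choose for itself.
    """
    return quality - tax_rate * cost

# A slightly more accurate model that is five times as costly to run.
simple = adjusted_score(quality=0.90, cost=1.0, tax_rate=0.01)
complex_ = adjusted_score(quality=0.91, cost=5.0, tax_rate=0.01)
# Under this tax rate, the simpler model comes out ahead.
```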

  • Model quality is sufficient for all important data slices

ML model quality metrics can easily get lost in the averages when benchmarked against full datasets. It is important to examine quality independently across slices such as time periods and locations. Models commonly exhibit large drops in quality for specific data slices, e.g. users in Denmark compared with users across Europe as a whole.
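
A minimal sketch of slice-level evaluation follows. The record layout (`label`, `prediction`, and a `country` field) and the sample data are made up for the example.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Return {slice value: accuracy} for a list of dict records.

    Each record is assumed to carry 'label' and 'prediction' fields.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        s = r[slice_key]
        total[s] += 1
        correct[s] += int(r["label"] == r["prediction"])
    return {s: correct[s] / total[s] for s in total}

records = [
    {"country": "DK", "label": 1, "prediction": 0},
    {"country": "DK", "label": 0, "prediction": 1},
    {"country": "DE", "label": 1, "prediction": 1},
    {"country": "DE", "label": 0, "prediction": 0},
    {"country": "FR", "label": 1, "prediction": 1},
    {"country": "FR", "label": 0, "prediction": 0},
]
overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
per_country = accuracy_by_slice(records, "country")
```

Here the overall accuracy looks moderate, while the per-country breakdown shows that every error falls on the "DK" slice.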

  • The model is tested for considerations of inclusion

ML unfairness can arise from the human choices that shape training data, for example which text corpus is used to train a word embedding.

Biased choices made while building the training set can then carry through into biased system behavior.

Measuring fairness is important for building systems that work for everyone. For example, input features that correlate with protected user groups may signal a problem. You can also check whether predictions differ when conditioned on different groups.
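
One simple check, among many possible fairness measures, is the gap in positive-prediction rates between groups. The function names and the binary-prediction setup below are assumptions for this sketch.

```python
def positive_rate_by_group(predictions, groups):
    """Return {group: share of positive (== 1) predictions}."""
    counts = {}
    for pred, group in zip(predictions, groups):
        pos, n = counts.get(group, (0, 0))
        counts[group] = (pos + int(pred == 1), n + 1)
    return {g: pos / n for g, (pos, n) in counts.items()}

def parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())
```

A large gap does not prove unfairness on its own, but it is a cheap signal that the model deserves a closer look.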

Stage B: Model Deployment

  • Training is reproducible

Ideally, a model trained twice on the same data produces identical results; determinism simplifies reasoning about ML systems and aids auditability and debugging. However, model training isn’t always reproducible in practice, especially when working with non-convex methods such as deep learning or even random forests. Several known sources of nondeterminism can be addressed, such as random number generation and initialization order. Even when initialization is fully deterministic, multiple threads of execution on a single machine or across a distributed system may process training data in an unpredictable order, which is another source of nondeterminism.
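
The sketch below shows the single-threaded part of the story: routing all randomness through one seeded generator. The toy model (fitting y = w * x with stochastic gradient descent) and the `train` function are invented for illustration, and the example says nothing about multi-threaded or distributed ordering effects.

```python
import random

def train(data, seed, epochs=5, lr=0.01):
    """Fit y = w * x with SGD; all randomness flows from one seeded RNG."""
    rng = random.Random(seed)       # local RNG, no hidden global state
    w = rng.uniform(-1.0, 1.0)      # seeded initialization
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)          # seeded example ordering
        for i in order:
            x, y = data[i]
            w -= lr * 2 * (w * x - y) * x  # gradient step on squared error
    return w

data = [(x, 3.0 * x) for x in range(1, 6)]
run_a = train(data, seed=42)
run_b = train(data, seed=42)  # same seed, same data: same weight
```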

  • Models are unit tested

A simple unit test that generates random input data and trains the model for a single step of gradient descent is quite powerful for detecting a host of common code errors. It is not enough for the code to produce a model with high-quality predictions; it must do so for the expected reasons.
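
A minimal version of such a test might look as follows, using a hand-written linear model so the example stays self-contained. The model, learning rate, and function names are assumptions for this sketch; a real test would call the project's own training step.

```python
import math
import random

def loss_and_grad(w, b, xs, ys):
    """Mean squared error of y = w*x + b and its gradient."""
    n = len(xs)
    errs = [w * x + b - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errs) / n
    grad_w = sum(2 * e * x for e, x in zip(errs, xs)) / n
    grad_b = sum(2 * e for e in errs) / n
    return loss, grad_w, grad_b

def test_single_step_reduces_loss():
    rng = random.Random(0)
    xs = [rng.uniform(-1, 1) for _ in range(32)]  # random inputs
    ys = [rng.uniform(-1, 1) for _ in range(32)]  # random targets
    w, b, lr = 0.5, 0.0, 0.1
    loss0, gw, gb = loss_and_grad(w, b, xs, ys)
    w, b = w - lr * gw, b - lr * gb               # one gradient descent step
    loss1, _, _ = loss_and_grad(w, b, xs, ys)
    assert math.isfinite(loss1)                   # no NaN or overflow
    assert loss1 < loss0                          # the step actually helps
```

A sign error in the gradient or a wrongly wired parameter update would typically make the second assertion fail.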

  • ML pipelines are integration tested

A complete ML pipeline typically consists of assembling training data, feature generation, model training, model verification, and deployment to a serving system. Integration tests should run both continuously and with each new model release to catch problems well before they reach the end-user.

  • Models are deployed via a canary process

There is always a risk when pushing changes to production. Canarying can help catch mismatches between model artifacts and serving infrastructure.
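
One way to send only a small, stable share of traffic to a new model is to hash a request identifier into buckets. The `routes_to_canary` helper and the idea of keying on a request id are assumptions for this sketch; serving platforms usually provide their own traffic-splitting mechanism.

```python
import hashlib

def routes_to_canary(request_id: str, canary_percent: float) -> bool:
    """Deterministically assign a request to the canary model.

    The request id is hashed into one of 10,000 buckets, so the same id
    always gets the same answer and roughly `canary_percent` percent of
    ids land on the canary.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < canary_percent * 100
```

Errors or prediction anomalies seen only on the canary share can then stop the rollout before all traffic is affected.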

  • A rollback procedure is in place and regularly practiced

Rolling back to a previously known-good state is as important with ML models as it is with any other aspect of a serving system. Because rolling back is an emergency procedure, operators should practice performing it regularly.

Stage C: Model Monitoring

  • Dependency changes are tracked

ML systems take data from many other systems. If a source system changes, the ML system's behavior will inevitably change with it. A robust data observability and dependency tracking framework is essential for identifying the root causes of drifts in model performance or quality.
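
A very small piece of such tracking is comparing the schema an upstream source delivers against the schema the model was built on. The `schema_diff` helper and the column-to-type-name mapping are hypothetical conventions for this sketch.

```python
def schema_diff(expected, observed):
    """Compare {column: type name} mappings from an upstream source."""
    shared = set(expected) & set(observed)
    return {
        "missing": sorted(set(expected) - set(observed)),
        "added": sorted(set(observed) - set(expected)),
        "type_changed": sorted(c for c in shared if expected[c] != observed[c]),
    }
```

An empty result in all three lists means the upstream schema still matches; anything else is worth logging alongside model quality metrics.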

  • Training and serving features are tracked and compared

A phenomenon called “training/serving skew” occurs when code paths that generate the model inputs differ at training and inference time.

To measure this, it is important to maintain a log of serving feature values. Feature values should be identical for the same feature at training and serving time. Important metrics to monitor are the number of features that exhibit skew, and the number of examples exhibiting skew for each skewed feature.
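
The two metrics above can be computed from logged features roughly as follows. The data layout (example id mapped to a dict of numeric features) and the `skew_report` name are assumptions made for this sketch.

```python
def skew_report(training_rows, serving_rows, tol=1e-9):
    """Compare feature values for the same examples, keyed by example id.

    Both arguments map example id -> {feature name: numeric value}.
    Returns {feature name: number of examples whose values differ}.
    """
    skewed = {}
    for example_id, train_feats in training_rows.items():
        serve_feats = serving_rows.get(example_id)
        if serve_feats is None:
            continue  # example was never served; nothing to compare
        for name, train_value in train_feats.items():
            if name not in serve_feats or abs(serve_feats[name] - train_value) > tol:
                skewed[name] = skewed.get(name, 0) + 1
    return skewed
```

The number of keys in the result is the count of skewed features, and each value is the count of examples exhibiting skew for that feature.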

  • Numerical stability is tracked and tested

It is important to watch out for numerical edge cases such as division by zero or infinite values. These can be covered by a dedicated set of unit tests that force the ML system to handle such cases appropriately.
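
As an illustration, the sketch below guards a ratio feature and tests it against the edge cases mentioned above. The `safe_ratio` helper and its choice of fallback value are assumptions; the right fallback depends on the feature.

```python
import math

def safe_ratio(numerator, denominator, default=0.0):
    """Divide, returning `default` for a zero denominator or non-finite inputs."""
    if denominator == 0 or not (math.isfinite(numerator) and math.isfinite(denominator)):
        return default
    return numerator / denominator

def test_safe_ratio_edge_cases():
    assert safe_ratio(1.0, 0.0) == 0.0        # division by zero
    assert safe_ratio(math.inf, 2.0) == 0.0   # infinite numerator
    assert safe_ratio(1.0, math.inf) == 0.0   # infinite denominator
    assert safe_ratio(math.nan, 2.0) == 0.0   # NaN input
    assert safe_ratio(6.0, 3.0) == 2.0        # ordinary case is untouched
```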

  • ML system computational performance is tracked

Measuring computational performance metrics such as serving latency, throughput, or memory footprint is a standard part of monitoring. It is useful to measure these metrics across different versions of the code and data, as well as across different models. Changes in them can be early indicators of data edge-case issues or model staleness.
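
A bare-bones way to collect serving latency is sketched below. The `predict` callable stands in for whatever serving function is being measured, and the nearest-rank percentile is just one of several common definitions.

```python
import math
import time

def measure_latency_ms(predict, inputs):
    """Time each call to `predict` and return latencies in milliseconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

Tracking a high percentile such as `percentile(latencies, 95)` per model and code version tends to surface regressions that an average would hide.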

  • Online proxies for offline model quality metrics are tracked

Measuring model quality metrics in production is often not possible because ground-truth labels arrive late. For example, true labels for churn models may take months to generate. To work around this, online proxies such as prediction distributions may be measured instead to provide an early indicator of model quality drift.

Tracking model inputs together with the model output distribution is usually the best available indicator of model drift in online, real-world settings.
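
One common way to compare a live prediction distribution against a reference one is the population stability index (PSI). The article does not prescribe a specific measure, so PSI, the ten equal-width bins, and the assumption that scores lie in [0, 1] are choices made only for this sketch.

```python
import math

def histogram(scores, n_bins=10):
    """Share of scores in each equal-width bin over [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # clamp s == 1.0 into the last bin
        counts[idx] += 1
    return [c / len(scores) for c in counts]

def population_stability_index(reference, live, n_bins=10, eps=1e-6):
    """PSI between two score samples; 0 means identical binned distributions."""
    ref = histogram(reference, n_bins)
    cur = histogram(live, n_bins)
    # eps keeps the logarithm defined when a bin is empty in one sample.
    return sum(
        (c - r) * math.log((c + eps) / (r + eps))
        for r, c in zip(ref, cur)
    )
```

The reference sample would typically be predictions from the validation period, and the live sample a recent window of production predictions; the alerting threshold is something each team has to calibrate.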


As industrial machine-learning systems play an increasingly central role in real-world production settings, the question of ML reliability will only become more critical. Following the best practices highlighted in this article can prevent unforeseen reliability issues that cannot be found when assessing small toy models or even large offline experiments.


