MLOps — Is your Machine Learning model production ready?

Anuj Karn
Towards Data Engineering
6 min read · Mar 26, 2023
Photo by CDC on Unsplash

Once upon a time, there was a large e-commerce company that relied heavily on machine learning models to personalize their customer experience. However, they were facing several challenges in deploying and managing these models, which were causing delays in releasing new features and products. They often experienced long deployment times and had trouble maintaining the performance of their models, leading to a suboptimal customer experience. To solve this problem, they needed to implement an MLOps strategy to streamline their machine learning operations and ensure their models were performing at their best.

What is MLOps?

MLOps, short for Machine Learning Operations, is a set of best practices and processes used to streamline the development, deployment, and monitoring of machine learning models in production environments. MLOps combines the principles of DevOps with the unique requirements of machine learning to provide a more reliable and efficient way of managing machine learning models.

Examples of Uber and Netflix Benefiting from MLOps

1. Uber

Uber is a ride-sharing company that connects drivers and riders through its app. The company uses machine learning to make data-driven decisions that enable services like dynamic pricing, driver-rider pairing, ETA prediction, and other core business needs. Uber also implements machine learning solutions in its other businesses like UberEATS, uberPool, and its self-driving car division.

As Uber's engineers describe, to operationalize their machine learning models, they built an internal ML-as-a-service platform called Michelangelo. The platform covers the end-to-end ML workflow: managing data, training models, evaluating them, deploying them, making predictions, and monitoring those predictions. Michelangelo enables Uber's teams to build, deploy, and operate machine learning solutions seamlessly at scale.

Through Michelangelo, Uber moves its models from development to production in three modes. For models that need to serve real-time predictions, Uber uses an online prediction mode: trained models are packaged into containers and run as prediction services in an online cluster. The prediction service accepts individual or batch prediction requests from clients for real-time inference.
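Michelangelo's internals are not public, so the class and field names below are hypothetical, but the shape of such an online prediction service, one model behind an endpoint that serves both single and batch requests, can be sketched in a few lines of Python:

```python
from typing import List


class PredictionService:
    """Minimal sketch of an online prediction service. Hypothetical
    interface; Michelangelo's actual API is internal to Uber."""

    def __init__(self, model):
        self.model = model  # any object exposing predict(features)

    def predict_one(self, features: dict) -> float:
        # Single real-time inference request.
        return self.model.predict(features)

    def predict_batch(self, batch: List[dict]) -> List[float]:
        # Batch requests map over the same containerized model.
        return [self.model.predict(f) for f in batch]


class FareModel:
    """Toy stand-in for a trained, packaged model."""

    def predict(self, features: dict) -> float:
        return 2.5 + 1.2 * features["distance_km"]


service = PredictionService(FareModel())
print(service.predict_one({"distance_km": 10}))                  # single request
print(service.predict_batch([{"distance_km": 2}, {"distance_km": 5}]))  # batch request
```

In a real deployment the service would sit behind an RPC or HTTP endpoint and the model would be loaded from a model store rather than constructed inline.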

Models that have been trained offline are packaged into a container and run in a Spark job in offline prediction mode. The deployed models can generate offline/batch predictions whenever there’s a client request or on a repeating schedule. Models that are deployed this way are useful for internal business needs that do not require live or real-time results.
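The offline mode can be illustrated with a simple batch-scoring function. This is a sketch only: in production the body would run inside a scheduled Spark job rather than plain Python, and the record fields and scoring rule are made up:

```python
import json
from datetime import date


def run_batch_predictions(model, rows):
    """Score a batch of offline records. In production this loop would
    run as a Spark job, triggered by a client request or a schedule."""
    return [
        {"id": row["id"], "score": model(row), "scored_on": date.today().isoformat()}
        for row in rows
    ]


# Hypothetical model: flag customers whose spend exceeds a threshold.
model = lambda row: 1.0 if row["spend"] > 100 else 0.0

rows = [{"id": 1, "spend": 250}, {"id": 2, "spend": 40}]
preds = run_batch_predictions(model, rows)
print(json.dumps(preds, indent=2))
```

The results of such a job would typically be written back to a data warehouse for internal business use rather than returned to a live caller.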

For embedded deployment, models are shipped to mobile phones through Uber's applications for edge inference. Uber uses PyML for flexible development and deployment of trained models, and the platform's backend uses the Cassandra database as a model store.

Uber monitors thousands of models at scale through Michelangelo. The platform publishes metric features and prediction distribution over time so teams or systems can spot anomalies. Uber also logs model predictions and joins them to the observations generated by their data pipeline to observe whether the model is getting its predictions right or wrong. The company uses the Data Quality Monitor (DQM), an internal data monitoring system, to automatically find anomalies across datasets and trigger an alert on the data quality platform.
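A minimal sketch of this prediction/observation join and accuracy alerting follows. The record formats and threshold are hypothetical, and Uber's DQM is internal and certainly far more sophisticated, but the core idea of joining logged predictions to later ground truth is simple:

```python
def join_predictions_to_outcomes(predictions, outcomes):
    """Join logged predictions to ground-truth observations by key,
    mimicking the prediction/observation join described above."""
    truth = {o["id"]: o["label"] for o in outcomes}
    return [(p["pred"], truth[p["id"]]) for p in predictions if p["id"] in truth]


def accuracy_alert(pairs, threshold=0.8):
    """Return True if live accuracy drops below the threshold,
    a stand-in for an automated data-quality alert."""
    correct = sum(1 for pred, label in pairs if pred == label)
    return correct / len(pairs) < threshold


# Toy data: the model is right only half the time.
preds = [{"id": i, "pred": i % 2} for i in range(10)]
obs = [{"id": i, "label": 1} for i in range(10)]
pairs = join_predictions_to_outcomes(preds, obs)
print(accuracy_alert(pairs))  # True: accuracy 0.5 is below 0.8
```

In practice the join runs in a data pipeline over logged events, and alerts feed a monitoring dashboard rather than a return value.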

Michelangelo

Uber uses Manifold, a model-agnostic visual debugging tool for machine learning, to debug model performance during development and in production. The company uses Michelangelo to audit data and model lineage and ensure traceability. This includes understanding the path a model takes from experimentation, which dataset it was trained on, and which models have been deployed to production for which specific business use case.

More detail here

Meet Michelangelo here

2. Netflix

Netflix is a popular TV show and movie streaming platform that has revolutionized the way we watch shows and movies online. It uses machine learning to personalize its customers' experience and surface the right content for its users. Netflix's use cases for machine learning range from catalog composition to optimizing streaming quality, recommending shows to produce, and detecting anomalies in a user's sign-up process.

Netflix deploys models both online and offline and also performs "nearline" deployment, where models are deployed to an online prediction service but don’t need to perform real-time inference. Models are trained, validated, and deployed offline as a prediction service through an internal publication and subscription (or pub/sub) system. The models are trained on historical viewing data, tested offline for performance, and then deployed to live A/B testing to see how they perform in production.
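The A/B testing step can be illustrated with deterministic hash-based bucketing, a common pattern for splitting traffic between a control model and a newly deployed one. Netflix's actual allocation logic is not public, and the experiment name below is made up:

```python
import hashlib


def ab_bucket(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to the control or treatment model
    for a live A/B test (illustrative only; not Netflix's real logic)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex digits to a fraction in [0, 1].
    fraction = int(digest[:8], 16) / 0xFFFFFFFF
    return "treatment" if fraction < treatment_share else "control"


assignments = {u: ab_bucket(u, "ranker-v2") for u in ["alice", "bob", "carol"]}
print(assignments)
```

Hashing on user and experiment keeps each user's assignment stable across requests while keeping assignments independent between experiments.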

Netflix uses Metaflow, an open-source, framework-agnostic machine learning library that helps data scientists experiment rapidly by training models and managing data effectively. Netflix also uses Meson, an internal orchestration engine, for workflow orchestration, and Runway for model lifecycle management.

Source

Netflix uses internal automated monitoring and alerting tools to monitor data quality and detect data drift. The platform also uses Runway to monitor models that have gone stale in production and alert the ML teams. It visualizes the application clusters that consume a model's predictions down to the model instance, so system metrics and model-loading failures can be monitored effectively.
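Runway is internal to Netflix, so purely as an illustration, a staleness check of the kind described might look like the following sketch, with a hypothetical in-memory model registry:

```python
from datetime import datetime, timedelta


def stale_models(registry, max_age_days=30, now=None):
    """Flag models whose last (re)training is older than max_age_days.
    Simplified version of staleness alerting; the registry format
    (name -> last training timestamp) is made up for this sketch."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [name for name, trained_at in registry.items() if trained_at < cutoff]


now = datetime(2023, 3, 1)
registry = {
    "ranker-v1": datetime(2023, 1, 1),   # trained ~2 months ago: stale
    "ranker-v2": datetime(2023, 2, 20),  # trained ~1 week ago: fresh
}
print(stale_models(registry, max_age_days=30, now=now))  # ['ranker-v1']
```

A real system would read training timestamps from a model registry and push the resulting alerts to the owning team.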

In summary, Netflix uses MLOps to personalize the experience of its customers, optimize content, and provide recommendations for shows to produce, among other things. It deploys models both online and offline, uses Metaflow for training and managing models, Meson for workflow orchestration, and Runway for model lifecycle management and monitoring.

Learn more about Netflix architecture

Meet Metaflow here

Implementing MLOps in AWS, Azure, and GCP

Implementing MLOps involves several steps, including setting up a development environment, building a model, deploying the model in a production environment, and monitoring its performance.

MLOps can be implemented with various cloud providers, including AWS, Azure, and GCP. Each cloud provider has its own unique set of tools and services to enable MLOps, but the general pipeline is as follows:

  1. Data Preparation: Collecting and cleaning data to train and validate the model.
  2. Model Development: Building the machine learning model, testing, and validating it.
  3. Model Deployment: Deploying the model to a production environment, automating the process.
  4. Model Monitoring: Monitoring the model’s performance and generating alerts if there is any deviation.
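The four steps above can be sketched end to end as one toy pipeline. Everything here is illustrative: a trivial threshold "model" stands in for real training, and "deployment" just returns a callable, where a cloud pipeline would package and ship an artifact:

```python
def prepare_data(raw):
    """Step 1: clean rows and split into train/validation sets."""
    clean = [r for r in raw if r["label"] is not None]
    split = int(0.8 * len(clean))
    return clean[:split], clean[split:]


def train_model(train):
    """Step 2: fit a trivial mean-threshold model (illustration only)."""
    mean = sum(r["x"] for r in train) / len(train)
    return lambda r: 1 if r["x"] > mean else 0


def deploy(model):
    """Step 3: a real pipeline would package and ship the model;
    here we just hand back a callable 'endpoint'."""
    return model


def monitor(endpoint, val, min_acc=0.7):
    """Step 4: compare predictions to labels and alert on deviation."""
    acc = sum(endpoint(r) == r["label"] for r in val) / len(val)
    return {"accuracy": acc, "alert": acc < min_acc}


raw = [{"x": i, "label": int(i > 5)} for i in range(10)]
train, val = prepare_data(raw)
endpoint = deploy(train_model(train))
print(monitor(endpoint, val))
```

On AWS, Azure, or GCP, each step would map onto a managed service (e.g., a data pipeline, a training job, a model endpoint, and a monitoring dashboard), but the control flow is the same.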

Advantages of MLOps

Implementing MLOps provides several advantages to businesses, such as:

  1. Streamlining the development and deployment process of machine learning models.
  2. Reducing the time and effort required to deploy models in production environments.
  3. Providing a more reliable and efficient way of managing machine learning models.
  4. Improving the performance of machine learning models over time through continuous monitoring and optimization.
  5. Enabling businesses to scale their machine learning operations more easily.

Disadvantages of MLOps

Despite its advantages, there are some potential disadvantages to implementing MLOps, such as:

  1. Increased complexity and cost due to the need for specialized tools and expertise.
  2. Potential security and privacy risks associated with storing and processing large amounts of data.
  3. The need to constantly monitor and optimize models to ensure they are performing at their best.

Concluding Words

In conclusion, implementing MLOps is essential for businesses that rely on machine learning models. By streamlining the development, deployment, and monitoring processes of machine learning models, businesses can improve their performance, reduce the time and effort required to deploy them in production environments, and scale their machine learning operations more easily. While there are potential disadvantages to implementing MLOps, its advantages outweigh them, making it a worthwhile investment for any business that wants to stay ahead in the competitive world of machine learning.



Anuj Karn is a Data Engineer who likes to tinker with new technologies, and a robotics and AI enthusiast. Connect on LinkedIn: www.linkedin.com/in/anuj-karn