Overview of MLOps

Luis Bermudez · Published in machinevision · Mar 12, 2024

As Machine Learning (ML) models are increasingly incorporated into software, a nascent subfield called MLOps (short for ML Operations) has emerged. MLOps is a set of practices that aim to deploy and maintain ML models in production reliably and efficiently. It’s widely agreed that MLOps is hard. At the same time, it’s unclear why MLOps is hard. In this article, we aim to bring clarity to MLOps, specifically in identifying what MLOps typically involves — across organizations and ML applications.

The most common MLOps tasks (Left). The core tech stack for MLOps (Right).

MLOps Stack

First, let's look at the MLOps tech stack that Machine Learning Engineers (MLEs) work with to develop and deploy ML models. The stack has four layers.

  1. Run Layer: This layer manages hyper-parameters, data, and experiments. A run is a record of an execution of an ML or data pipeline (and its components). Run-level data is often managed by data catalogs, model registries, and training dashboards. Examples include Weights & Biases, MLflow, Hive metastores, and AWS Glue.
  2. Pipeline Layer: This layer is designed to facilitate the development, training, deployment, and management of ML models. Finer-grained than a run, a pipeline further specifies the dependencies between artifacts and the details of the corresponding computations. Pipelines can run ad hoc or on a schedule. Pipelines change less frequently than runs, but more frequently than components. Examples include SageMaker, Papermill, dbt, Airflow, and TensorFlow Extended.
  3. Component Layer: This layer allows you to create the ML model, including model training, feature generation and selection. A component is an individual node of computation in a pipeline, often a script inside a managed environment. Some ML teams have an organization-wide “library of common components” for pipelines to use, such as feature generation and model training. Examples include PyTorch, TensorFlow, Python, Spark.
  4. Infrastructure Layer: This layer provides computing resources to store and use the ML model. It provides the underlying infrastructure for running ML pipelines. It also includes cloud storage (e.g., S3) and GPU-backed cloud computing (AWS, GCP). It changes less frequently than other layers but with more significant consequences. Examples include AWS, Google Cloud Platform (GCP), Docker.

Each layer serves a specific purpose in managing the ML lifecycle, from execution and dependency management to computation and infrastructure provision. As the stack gets deeper, changes become less frequent: training jobs run daily but Dockerfiles are modified only occasionally.
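As a concrete illustration of the run layer, here is a minimal sketch of recording one training run with MLflow, one of the tools named above. The experiment name, parameters, and metric values are hypothetical placeholders, not a prescribed setup.

```python
# Minimal run-layer sketch: record the parameters, metrics, and artifacts
# of one pipeline execution with MLflow. All names and values are
# hypothetical placeholders.
import mlflow

mlflow.set_experiment("churn-model")  # hypothetical experiment name

with mlflow.start_run(run_name="baseline-logreg"):
    # Hyper-parameters used for this run
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("num_epochs", 20)

    # ... train the model here ...

    # Metrics computed on held-out data
    mlflow.log_metric("val_accuracy", 0.87)

    # Artifacts such as the serialized model or evaluation plots
    mlflow.log_artifact("model.pkl")  # assumes this file was written above
```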

MLOps Tasks

Machine Learning Engineers (MLEs) work across this tech stack to operationalize machine learning. Operationalizing machine learning is a continual loop of four routine tasks; a minimal skeleton of this loop is sketched after the list.

  1. Data Collection: This involves gathering data from various sources, organizing it into a centralized repository, and cleaning it. Labeling data involves assigning correct labels or annotations to the data points, which can be done either in-house or outsourced to annotators.
  2. Model Experimentation: ML engineers focus on improving the performance of machine learning models by experimenting with different features and model architectures. This can include creating new features, modifying existing ones, or changing the model architecture itself. The performance of these experiments is typically measured using metrics such as accuracy or mean-squared-error.
  3. Model Evaluation: Once a model is trained, it needs to be evaluated to ensure its performance meets the desired standards. This involves computing metrics such as accuracy on labeled data points that were not seen during training. If the model performs satisfactorily, it can then be deployed into production. Deployment involves careful review of the proposed changes, possibly staging them to a portion of the user base, conducting A/B testing, and maintaining records of changes for potential rollbacks.
  4. Model Maintenance: After deployment, it’s crucial to monitor the ML pipelines for any issues or anomalies. This involves tracking live metrics through queries or dashboards, investigating the quality of predictions for different sub-populations, and addressing any failures or bugs that arise. This might include patching the model with non-ML heuristics for known failure modes and updating the evaluation set with real-world failures.
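To make the loop concrete, here is a hypothetical Python skeleton of the four tasks chained together. Every function name and body is a placeholder standing in for whatever tooling a team actually uses.

```python
# Hypothetical skeleton of the continual MLOps loop described above.
# Each function is a placeholder for a team's actual tooling.

def collect_data():
    """Gather, clean, and label data; return training and held-out sets."""
    return [], []  # placeholder

def experiment(train_data):
    """Try new features or architectures; return the best candidate model."""
    return None  # placeholder

def evaluate(model, holdout):
    """Compute metrics (e.g., accuracy) on data unseen during training."""
    return False  # placeholder: True if the model meets the bar

def deploy(model):
    """Staged rollout with review, A/B tests, and records for rollback."""

def maintain(model):
    """Monitor live metrics, investigate failures, patch known failure modes."""

def one_iteration():
    train_data, holdout = collect_data()
    candidate = experiment(train_data)
    if evaluate(candidate, holdout):
        deploy(candidate)
    maintain(candidate)
```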

The next three sections discuss common strategies that ML engineers use for successful model experimentation, evaluation, and maintenance.

Model Experimentation

ML Engineering is very experimental and iterative in nature, especially compared to software engineering. Many experiments don’t ever make it into production. It’s important that experiment ideas can be prototyped and validated quickly. These are some common strategies that ML engineers use to generate successful experiment ideas.

  1. Collaboration and Idea Validation: Good project ideas often stem from collaboration with domain experts, data scientists, and analysts who have conducted exploratory data analysis. Collaborative discussions, asynchronous communication, and cross-team cooperation help validate and refine project ideas.
  2. Iterating on Data: ML engineers emphasize the importance of iterating on the data rather than solely focusing on the model. Experimentation often involves adding new features, improving feature engineering pipelines, and iterating on the dataset to enhance model performance.
  3. Accounting for Diminishing Returns: ML projects typically undergo staged deployment, where ideas are validated offline and gradually deployed to production traffic. Engineers prioritize experiment ideas with the largest performance gains in the earliest stages of deployment, considering the diminishing returns of later stages.
  4. Preferential Treatment for Small Changes: ML engineers follow software best practices by making small, incremental changes to the codebase. Changes are kept small to facilitate faster code review, easier validation, and fewer merge conflicts. Config-driven development is emphasized to minimize bugs and ensure reproducibility.

These strategies collectively contribute to the success of ML experiments by fostering collaboration, optimizing data utilization, prioritizing high-impact ideas, and maintaining code quality and reproducibility.
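As an illustration of the config-driven development mentioned in point 4, here is a minimal sketch in which an experiment is fully described by a small, version-controlled config object. The field names and defaults are hypothetical.

```python
# Config-driven experimentation sketch: every run is fully described by a
# small config, so changes stay small, reviewable, and reproducible.
# Field names and defaults are hypothetical.
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ExperimentConfig:
    features: tuple = ("age", "tenure", "num_logins")
    model_type: str = "logistic_regression"
    learning_rate: float = 0.01
    random_seed: int = 42

def run_experiment(cfg: ExperimentConfig):
    # ... build features, train, and evaluate according to cfg ...
    print("running with:", json.dumps(asdict(cfg), indent=2))

# A new experiment idea becomes a one-line change to the config, not the code.
run_experiment(ExperimentConfig(learning_rate=0.001))
```

Because the config is stored alongside the run, any past experiment can be reproduced by replaying its config.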

Model Evaluation

Model evaluation efforts need to keep up with changes in data and business requirements. It’s important to keep bad models from making it to production while maintaining velocity. These are some common organizational efforts to effectively evaluate models.

  1. Dynamic Validation Datasets: Engineers analyze live failure modes and update validation datasets to prevent similar failures from occurring. This process involves addressing shifts in data distribution and subpopulation performance, often through systematic categorization of data points based on error patterns.
  2. Standardized Validation Systems: Organizations aim to standardize validation processes to maintain consistency and reduce errors stemming from inconsistent evaluation criteria. This standardization helps ensure that models are evaluated comprehensively and consistently before deployment.
  3. Multi-stage Deployment and Evaluation: Many organizations employ multi-stage deployment processes, where models are progressively evaluated at each stage before full deployment. Stages such as testing, shadow deployment, and limited user rollout allow for early detection of problems and validation of model performance.
  4. Tying Evaluation Metrics to Product Metrics: It’s crucial to evaluate models based on metrics that are critical to the product’s success, such as click-through rate or user churn rate, rather than solely relying on ML-specific metrics. This alignment with product metrics ensures that models are evaluated based on their real-world impact and value to the organization.

These efforts collectively aim to ensure that models are thoroughly evaluated, validated, and aligned with the organization’s objectives and requirements before deployment to production.
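To make the first two points more concrete, here is a hypothetical sketch of a validation step that reports accuracy per subpopulation, blocks deployment if any group falls below a bar, and folds live failures back into the validation set. The data layout and the 0.9 threshold are made up for illustration.

```python
# Hypothetical validation sketch: per-subpopulation accuracy, a deployment
# gate, and a validation set that grows with categorized live failures.
from collections import defaultdict

def accuracy_by_group(examples):
    """examples: list of dicts with 'group', 'label', and 'prediction' keys."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        totals[ex["group"]] += 1
        hits[ex["group"]] += int(ex["prediction"] == ex["label"])
    return {group: hits[group] / totals[group] for group in totals}

def passes_validation(examples, threshold=0.9):
    """Block deployment if any subpopulation falls below the bar."""
    return all(acc >= threshold for acc in accuracy_by_group(examples).values())

def update_validation_set(validation_set, live_failures):
    """Fold categorized live failures back into the validation data."""
    validation_set.extend(live_failures)
    return validation_set
```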

Model Maintenance

Sustaining high performance models requires deliberate software engineering and organizational practices. These are some common strategies that ML engineers use during monitoring and debugging phases to sustain model performance after deployment.

  1. Creating New Versions: ML engineers frequently retrain models on live data to ensure that model performance does not suffer from data staleness. This involves automatically retraining models on a regular cadence, ranging from hourly to every few months, and frequently labeling live data to support retraining.
  2. Maintaining Old Versions as Fallback Models: To minimize downtime when a model is known to be broken, engineers maintain fallback models, either old versions or simpler versions, to revert to until the issue with the production model is resolved.
  3. Maintaining Layers of Heuristics: Engineers augment models with rule-based layers to stabilize live predictions. These heuristics help filter out inaccurate predictions based on domain knowledge, ensuring that only reliable predictions are served to users.
  4. Validating Data Going In and Out of Pipelines: Engineers continuously monitor features and predictions for production models, implementing various data validation checks to ensure data quality and consistency. These checks include enforcing constraints on feature values, monitoring data completeness, and verifying schema adherence.
  5. Keeping it Simple: ML engineers prefer simplicity, relying on simple models and algorithms whenever possible to streamline post-deployment maintenance. While some teams still reach for deep learning models, many engineers opt for simpler, more interpretable models to avoid overfitting and facilitate debugging.
  6. Organizational Support: Organizations implement various processes to support ML engineers in sustaining models, including on-call rotations for supervising production models, central queues for tracking and prioritizing production ML bugs, and Service Level Objectives (SLOs) to ensure minimum performance standards for pipelines.

These strategies collectively help ML engineers maintain the performance and reliability of production ML pipelines, ensuring that models continue to deliver accurate predictions and meet the organization’s requirements and standards.
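As one concrete example of the checks described in point 4, here is a sketch of simple input validation on a batch of feature rows: schema adherence, value constraints, and a completeness bound. The schema, ranges, and threshold are hypothetical.

```python
# Hypothetical data-validation sketch at the pipeline boundary: schema
# adherence, domain-knowledge constraints, and completeness checks on a
# batch of feature rows before they reach the model.
EXPECTED_SCHEMA = {"age": float, "tenure_days": float, "country": str}  # made up

def validate_batch(rows, max_null_fraction=0.05):
    errors = []
    null_count = 0
    for i, row in enumerate(rows):
        # Schema adherence: each expected feature is present with the right type
        for name, expected_type in EXPECTED_SCHEMA.items():
            if row.get(name) is None:
                null_count += 1
            elif not isinstance(row[name], expected_type):
                errors.append(f"row {i}: {name} is {type(row[name]).__name__}")
        # Value constraints from domain knowledge (a hard error if violated)
        if row.get("age") is not None and not 0 <= row["age"] <= 120:
            errors.append(f"row {i}: age out of range: {row['age']}")
    # Completeness: too many missing values is a soft error worth alerting on
    total_cells = len(rows) * len(EXPECTED_SCHEMA)
    if total_cells and null_count / total_cells > max_null_fraction:
        errors.append(f"null fraction {null_count / total_cells:.2%} exceeds bound")
    return errors
```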

Conclusions

Across the four layers of the MLOps tech stack (run, pipeline, component, infrastructure) and across the four routine tasks for MLOps (data collection, experimentation, evaluation, maintenance), there are three common variables that dictate how successful the model deployment will be when developing and pushing ML models to production.

  1. Velocity: This refers to the speed at which ML engineers can prototype, iterate, and develop new ideas into trained models. High experimentation velocity allows for rapid testing of hypotheses and quicker development cycles.
  2. Validation: It involves testing changes, identifying and pruning bad ideas, and proactively monitoring pipelines for bugs as early as possible. Early validation helps to catch errors before they become more expensive to handle, particularly when users encounter them.
  3. Versioning: This involves storing and managing multiple versions of production models and datasets. By maintaining version control, ML engineers can query, debug, and minimize production pipeline downtime. It also enables the quick switch to alternative versions in response to bugs or issues in the current production model.

ML Tools at each layer of the MLOps tech stack should aim to enhance user experiences across the Three Vs: Velocity, Validation, and Versioning. For example, experiment tracking tools increase the velocity of iterating on feature or modeling ideas. In another example, feature stores (i.e., tables of derived features for ML models) help debug models because they offer reproducibility of historical versions of features used in training such models.
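As a small illustration of the versioning theme, here is a hypothetical sketch of an in-process model registry that keeps older versions around so serving can roll back quickly when the current model misbehaves. A real system would persist this state in a model registry service rather than in memory.

```python
# Hypothetical versioning sketch: keep prior model versions registered so
# serving can switch back without retraining or redeploying from scratch.
class ModelRegistry:
    def __init__(self):
        self._versions = {}   # version string -> model object
        self._current = None

    def register(self, version, model):
        """Add a new version and make it the serving default."""
        self._versions[version] = model
        self._current = version

    def rollback(self, version):
        """Point serving at a previously registered version."""
        if version not in self._versions:
            raise KeyError(f"unknown model version: {version}")
        self._current = version

    def current(self):
        """Return the model currently used for serving."""
        return self._versions[self._current]
```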

Improvements

Despite all of these best practices, many areas of MLOps still need improvement, and additional tooling is needed to address its pain points. These are some of the most common.

1. Mismatch Between Development and Production Environments: This includes issues such as data leakage, differing philosophies on the use of Jupyter notebooks, and non-standardized code quality practices.

  • Data leakage occurs when assumptions made during model training do not hold at deployment time, leading to incorrect predictions. It ranges from using the same data for training and downstream tasks to ignoring feedback delays during training, and it degrades model performance in production; a time-based split that guards against one such case is sketched after these bullets.
  • Opinions on Jupyter notebooks diverge among practitioners: some favor them for quick prototyping, while concerns about reproducibility and code quality lead others to avoid notebooks in production. This tension highlights the importance of balancing velocity with environment consistency.
  • Code quality standards and review practices in ML development vary widely, with some organizations lacking ML-specific coding guidelines. While code review can enhance deployment reliability, it may be perceived as a hindrance to agility in experimentation, leading to inconsistencies in maintaining code quality across different stages of ML projects.
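As one concrete guard against the leakage described in the first bullet above, here is a sketch of a time-aware split: training rows must have labels that were already known before the cutoff, and evaluation rows start only after a feedback-delay buffer. The column names and the seven-day delay are hypothetical.

```python
# Hypothetical time-aware split: training only uses labels that existed
# before the cutoff, and evaluation starts after a feedback-delay buffer,
# so training-time assumptions still hold at deployment time.
from datetime import timedelta

def time_based_split(rows, cutoff, feedback_delay=timedelta(days=7)):
    """rows: list of dicts with 'event_time' and 'label_time' datetimes."""
    train, evaluation = [], []
    for row in rows:
        if row["event_time"] <= cutoff and row["label_time"] <= cutoff:
            train.append(row)        # label was genuinely available in time
        elif row["event_time"] > cutoff + feedback_delay:
            evaluation.append(row)   # buffer avoids overlap with delayed labels
    return train, evaluation
```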

2. Handling a Spectrum of Data Errors: ML engineers struggle with hard errors, soft errors, and drift errors in data, leading to false-positive alerts, alert fatigue, and challenges in creating meaningful data alerts.

  • Hard errors are obvious and result in clearly “bad predictions”, such as when mixing or swapping columns or when violating constraints (e.g., a negative age).
  • Soft errors, such as a few null-valued features in a data point, are less pernicious and can still yield reasonable predictions, making them hard to catch and quantify.
  • Drift errors occur when the live data is from a seemingly different distribution than the training set; these happen relatively slowly over time.

False-positive alerts are the most common: they fire even when ML performance is acceptable, which leads to alert fatigue among engineers. Fatigued engineers start to ignore or silence alerts and can miss real performance issues. Managing alert fatigue becomes crucial during on-call rotations, with initiatives aimed at reducing noise and improving the precision of alerting criteria. Creating meaningful data alerts is itself challenging: one method might generate too many false positives, while another might not catch enough errors, and finding a 'Goldilocks' method is difficult.
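To illustrate why tuning these alerts is hard, here is a sketch that flags drift by comparing a live feature window against its training-time reference with a two-sample Kolmogorov-Smirnov test. The significance threshold is exactly the knob that trades false-positive alerts against missed drift, and there is no universally right setting.

```python
# Hypothetical drift check: compare a live feature window against its
# training-time reference distribution with a two-sample KS test.
# The alpha threshold trades false-positive alerts against missed drift.
from scipy.stats import ks_2samp

def drift_alert(reference_values, live_values, alpha=0.01):
    """Return True if the live distribution looks different from the reference."""
    statistic, p_value = ks_2samp(reference_values, live_values)
    return p_value < alpha  # smaller alpha -> fewer, but later, alerts
```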

3. ML Bugs Are Unique: Debugging ML is different from debugging in standard software engineering, where one can write test cases to cover the space of potential bugs. In ML, it is difficult to categorize bugs effectively because every bug feels unique.

4. Multi-Staged Deployments Seemingly Take Forever: End-to-end experimentation in ML deployments takes too long, leading to uncertainties and the potential abandonment of promising ideas.

Even More Improvements

As you can imagine, there are even more areas that need further improvement. These are some common anti-patterns within MLOps.

  1. Industry-Classroom Mismatch: There is a gap between the skills learned in academic settings versus those required in industry. This leads to a lack of preparation for real-world ML challenges.
  2. Keeping GPUs Warm: There is a compulsive need to utilize all computational resources and run random experiments in parallel, instead of prioritizing the most impactful experiments and running those in sequence as insights continue to grow.
  3. Retrofitting an Explanation: There is pressure to explain successful ML experiments to support collaboration and customer satisfaction. As a result, explanations are often created after the experiment is run, rather than the experiment being motivated by principled reasons in the first place.
  4. Undocumented Tribal Knowledge: Pain points arise due to undocumented knowledge about ML experiments and pipelines. Team members are learning faster than documentation can be updated. Automated documentation assistance for ML pipelines would be helpful.

Overall, addressing these anti-patterns requires a combination of educational initiatives, tool development, and process improvements to enhance the effectiveness and efficiency of MLOps practices.


I would like to thank Shreya Shankar for conducting the interviews on which this introductory article is based. For more details of these interviews, see the original paper: Operationalizing Machine Learning: An Interview Study.
