Mastering MLOps with a Proven Solution Blueprint


Authors: Ashanti Terraneo, Daniel Spicar

Machine Learning (ML) has revolutionized the way we approach problem-solving and decision-making. With the ability to learn from data and improve performance over time, the adoption of ML algorithms to support and automate complex tasks and processes is steadily increasing.

However, building and deploying ML models is not a one-time effort. It requires continuous management and monitoring of the entire ML application lifecycle to ensure the models perform as expected and deliver value.

This article describes the need for a dedicated Machine Learning Operations (MLOps) lifecycle. We present a case study and ELCA’s proven solution blueprint. Our design optimizes the efficiency of ML application operations for our customers, enables DevOps engineers to perform routine maintenance, and accelerates new Data Science initiatives.

Photo by Jason Leung on Unsplash

The Challenge

Training and delivering well-performing ML models is just the tip of the MLOps lifecycle iceberg. ML applications are embedded in services or products and require continuous monitoring and maintenance to identify issues and avoid costly interventions. However, setting up a robust infrastructure for ML application operations presents many (often overlooked) challenges. Among them are:

  • ML model decay: degradation of model performance over time.
  • Little control over incoming data.
  • Missing support for model performance monitoring.
  • Specialist skill requirements for ML model development and maintenance. These skills may be in short supply.
  • The “changing anything changes everything” principle (figure below): isolated changes to data, model, or code can trigger a cascade of changes across the entire ML lifecycle.
Illustration of the “changing anything, changes everything” principle (image published under CC 4.0 by INNOQ; https://ml-ops.org/content/motivation)

MLOps Overview

MLOps, short for Machine Learning Operations, is a set of practices and tools that streamline the ML lifecycle and automate the deployment, management, and monitoring of machine learning models and applications in production environments. It supports individuals and organizations in bringing ML applications from experimentation (development) into production and tackling the challenges associated with maintaining performance over time.

With new technologies and tools becoming available and the landscape of commercial solutions rapidly evolving, MLOps practice adoption has become increasingly prevalent.

Most major cloud providers offer solutions as part of their platform services. For example, Amazon SageMaker, Microsoft Azure Machine Learning, Google Vertex AI, and Databricks provide automated model tuning, version control, and monitoring features to help organizations reduce costs and improve reliability when deploying machine learning models. Dedicated ML platform services such as Weights & Biases, DataRobot, Dataiku, and Neptune.ai also offer MLOps features. Integrations of these MLOps solutions are also appearing in managed data platforms such as Snowflake.

Open-source frameworks are attractive because they can be deployed on-premises without licensing or usage costs or when cloud services are not an option because of cost, legal, or policy restrictions. Kubeflow and Seldon Core are frameworks suitable for ML workloads and model deployment on Kubernetes. Another popular solution (also used in our architecture) is MLflow, a platform dedicated to ML experiment and model tracking and management.

These are just a few of the solutions available right now. It is worth remembering that some tools, especially open-source frameworks, only support part of the MLOps lifecycle. Choosing the right mix of tools for integration into existing infrastructure can be challenging.

Image source: https://www.aporia.com/learn/machine-learning-for-business/introducing-mlops-toys/

Case Study

At ELCA, we had the opportunity to support one of our clients on its journey to implement an on-premises MLOps solution. Cloud offerings were not an option in this project due to company restrictions. Nevertheless, our architecture is not bound to specific tools, and any of its parts can be migrated to cloud environments.

Our client’s need for an MLOps infrastructure arose from the struggle of operating many ML applications that had been in production for several years. Each ML application had been developed on an ad-hoc project basis, meaning it was built to solve a particular problem at a particular point in time. Since the first implementations, the deployment environment had also undergone significant changes.

As a result, the client realized that the various ML applications had become increasingly hard to manage. Their performance had decayed over time, and resources for anything beyond simple operational maintenance were not available.

After looking into their infrastructure and applications, we identified the following main problems:

  • Operations were not standardized: maintenance required specialized ML skillsets and application know-how.
  • No model monitoring: blindness regarding the actual performance of deployed models.
  • No reproducibility: ML models could not be reproduced reliably because the parameters of the deployed model were not recorded or were simply outdated.

A lack of standardization is one of the principal reasons maintenance is expensive. High maintenance costs, in turn, lead to less frequent maintenance, which aggravates the decay of application performance. To escape this cycle, we standardized and automated model pipelines. A pipeline executes tasks such as data processing, model (re-)training, and service deployment, and the automation enables DevOps engineers or automated systems to trigger these tasks. Specialized ML skills can be reserved for non-routine tasks and new initiatives.
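To make this concrete, the sketch below shows what such a standardized pipeline entry point can look like in Python: plain, versioned steps (data preparation, training, a validation gate) that a DevOps engineer or a CI/CD job can trigger without ML-specific knowledge. It is a minimal illustration, not the client’s actual pipeline; the data path, label column, model type, and quality threshold are hypothetical placeholders.

```python
"""Minimal sketch of a standardized retraining pipeline entry point.

Assumptions (not taken from the case study): a CSV of curated training data
with a binary "label" column, a scikit-learn model, and an F1-score gate.
"""
import argparse

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def prepare_data(path: str):
    """Load the curated training data and split it for validation."""
    df = pd.read_csv(path)
    X, y = df.drop(columns=["label"]), df["label"]
    return train_test_split(X, y, test_size=0.2, random_state=42)


def train(X_train, y_train):
    """Train the model with fixed, versioned hyperparameters."""
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)


def validate(model, X_test, y_test, threshold: float = 0.8) -> bool:
    """Gate deployment on a minimum quality metric."""
    score = f1_score(y_test, model.predict(X_test))
    print(f"f1={score:.3f} (threshold {threshold})")
    return score >= threshold


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", default="data/training.csv")  # hypothetical path
    args = parser.parse_args()

    X_train, X_test, y_train, y_test = prepare_data(args.data)
    model = train(X_train, y_train)
    if validate(model, X_test, y_test):
        # In the real pipeline, packaging and deployment would follow here,
        # handed over to the CD system.
        print("Model accepted for deployment.")
    else:
        print("Model rejected; keeping the currently deployed version.")
```

Because every application exposes the same steps, a single CI/CD template can retrain and redeploy any of them on demand.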

The absence of monitoring delays maintenance to the point where problems become obvious to end users. By then, the model is no longer delivering optimal value and users begin to lose trust. We standardized model deployment, which now includes model monitoring for all ML applications by default. Model performance can be tracked, and interventions can be scheduled early or even automated.

An inability to reproduce ML models leads to situations where a Data Scientist is required even for simple ML model re-training. To address this problem and to manage model versions, we now use MLflow to record all inputs (parameters) and outputs (models, reports, metrics) of ML training runs. This also supports Data Scientists by enabling them to easily track and compare their experiments.
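The snippet below is a minimal sketch of what this looks like with MLflow Tracking and the Model Registry: the run’s parameters, metrics, and the trained model are recorded in one place. The tracking URI, experiment name, registered model name, and the toy dataset are illustrative assumptions, not details of the client’s setup.

```python
"""Minimal sketch of experiment tracking with MLflow (illustrative values only)."""
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical on-premises tracking server and experiment name.
mlflow.set_tracking_uri("http://mlflow.internal.example:5000")
mlflow.set_experiment("demo-classifier")

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 200, "max_depth": 5}

with mlflow.start_run():
    # Inputs: every hyperparameter of the run is recorded.
    mlflow.log_params(params)

    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    # Outputs: metrics plus the trained model itself, which is also
    # versioned in the Model Registry in the same call.
    mlflow.log_metric("f1", f1_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model", registered_model_name="demo-classifier")
```

Each run becomes a versioned, comparable record in the MLflow UI, so re-training simply means re-running the pipeline with the recorded parameters.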

We delivered a monitoring service and CI/CD (Continuous Integration / Continuous Delivery) pipelines that orchestrate all processes and deploy all services. These components were integrated into the client’s pre-existing CI/CD infrastructure. Additionally, all existing ML applications were migrated to this MLOps framework.

Solution Blueprint

As explained in the case study, our architecture was built on top of an existing, on-premises DevOps infrastructure. The schema below depicts the workflow and interactions between various components of the solution in an abstract manner.

MLOps Lifecycle workflow between architecture components (schema by author)

Our architecture builds on top of four typical DevOps components (1–4) and four MLOps components (5–8). This infrastructure can be deployed completely on-premises, in a hybrid mode where some components are provided by cloud services, or fully cloud-hosted. Additionally, the infrastructure can be migrated between these states over time.

  1. Source code repositories for storing and versioning application source code and configurations. Git is the standard, with services such as GitHub or Bitbucket providing additional features and free repositories. Azure Repos and AWS CodeCommit are two cloud platform options.
  2. A continuous integration (CI) system such as GitLab CI, Jenkins, or cloud offerings such as Azure Pipelines or AWS CodePipeline. This system builds deployable software artifacts. Builds can be triggered manually or act on code repository events such as commits.
  3. A continuous delivery (CD) system that reacts to CI events and can deploy software artifacts into a runtime environment. This is provided by the CI systems listed above or by more specialized systems such as ArgoCD for Kubernetes.
  4. A monitoring/observability system that collects, aggregates, and displays metrics and logs of running applications. Common solutions are the ELK Stack or a Prometheus/Grafana deployment. In the cloud, these functionalities are provided by services such as Azure Monitor or AWS CloudWatch.
  5. Experiment tracking: Tools for easy or automatic tracking of metadata (parameters and metrics) and artifacts (trained models, reports, plots) produced by data science code execution. We use MLflow Tracking in the case study. An overview of tools can be found here.
  6. Model store/registry: A set of tools to manage the ML model lifecycle. It versions and annotates models and connects the model with the training metadata. We use the MLflow Model Registry. More solutions are listed here.
  7. Model serving: A standardized service or module that loads ML models and exposes inference via a standardized API. In the case study, models are deployed as web services. We use the scoring server provided by MLflow Models, extended to integrate with our monitoring service. MLflow also supports Seldon MLServer, Apache Spark, AWS SageMaker, and Azure ML as alternative deployment targets.
  8. Model monitoring: A service that consumes samples of the inputs or outputs of deployed ML models and compares them against the data seen at training time to determine whether there is a significant difference (drift). Metrics such as the probability of drift are published to the monitoring infrastructure components. We developed a custom web service that uses Alibi Detect to calculate these metrics (a minimal sketch follows below).
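As a rough illustration of component 8, the sketch below uses Alibi Detect’s KSDrift detector to compare a batch of live model inputs against a reference sample from training time. The synthetic data, feature count, and significance level are assumptions for the example; our actual service wraps this logic in a web API and forwards the resulting metrics to the monitoring stack.

```python
"""Minimal sketch of input drift detection with Alibi Detect (synthetic data)."""
import numpy as np
from alibi_detect.cd import KSDrift

rng = np.random.default_rng(0)

# Reference data: a sample of the feature values seen at training time.
x_ref = rng.normal(loc=0.0, scale=1.0, size=(500, 8))

# Feature-wise Kolmogorov-Smirnov tests at a 5% significance level.
detector = KSDrift(x_ref, p_val=0.05)

# Live data: a batch of inputs collected from the deployed model,
# deliberately shifted here to simulate drift.
x_live = rng.normal(loc=0.7, scale=1.0, size=(200, 8))

preds = detector.predict(x_live, return_p_val=True, return_distance=True)
print("drift detected:", bool(preds["data"]["is_drift"]))
print("p-values per feature:", preds["data"]["p_val"])
# In our setup, metrics like these are published to the monitoring
# infrastructure (component 4) for dashboards and alerting.
```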

Conclusion

Machine Learning has revolutionized problem-solving, but building and deploying ML models requires continuous management. That’s where MLOps comes in. It streamlines the ML lifecycle by reducing the complexity of operations and accelerating Data Science initiatives. By addressing challenges such as model decay, data control, monitoring, and skill requirements, MLOps is essential to (re-)gain control and scale ML applications.

A real-life case study demonstrated the challenges of managing ML applications. The presented MLOps solution blueprint is proven in production and addresses all the principal problems identified in the case study:

  • Automated model pipelines: Data preparation, model training and validation, packaging, and deployment are automated and standardized for each model. When updated training data is made available (manually curated or automatically collected from a feedback system), DevOps engineers can retrain and deploy models without specific ML skills or deep application know-how.
  • Automated deployment: All ML applications use the same standard model serving component. Together with continuous deployment, operations and maintenance are greatly simplified.
  • Monitoring: A monitoring service, integrated into the model serving component, enables all deployed models to be monitored for decay.
  • Reproducibility: Metadata and artifacts of all model training runs are recorded, ensuring reproducibility and supporting Data Scientists when developing new models.

The transition to an MLOps framework is an investment, but it lets organizations ensure sustained performance, efficiency, and innovation throughout the ML lifecycle, bringing clarity and control to their ML applications. Since the transition, our client has been more inclined to adopt new ML applications, and the added transparency generates confidence and trust in them.
