ML Systems Industrialization and MLOps

Sunil Kumar
12 min read · Feb 23, 2023


1. Ideal Goal of Data & ML Systems

The ideal is industrialization of data consumption by ML systems during both experimentation (historical batch) and production (batch & stream): growing above and beyond toy-ML-with-CSV and single-threaded pickle-behind-Flask deployment.

Thinking about end-to-end Data & ML systems, it would be ideal to have a clean separation of concerns between Data Engineering, Feature Engineering, Model Experimentation and Model Deployment, bounded by the Data Lake, the Analytics Data Store (Feature Store), and the Model Store (Model Repository).

These data, feature, and model stores should be decoupled from the producer/consumer perspective, and their corresponding meta-stores should have a governance model for mature lineage and tracking [ref 1].

Modularizing the Data & AI landscape by decoupling concerns along clear artifact-store boundaries helps achieve business agility and growth.

https://www.slideshare.net/databricks/scaling-ridehailing-with-machine-learning-on-mlflow

2. Challenges and Motivations of MLOps

Data scientists can train an ML model with good predictive performance on an offline holdout dataset, given relevant training data for their use case. However, the real challenge lies in building an integrated ML system and continuously operating it in production.

MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops). Practicing MLOps means that you advocate for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management. MLOps enables engineering teams to track, version, audit, certify and re-use every asset of the ML lifecycle, and provides orchestration services to streamline managing this lifecycle.

Some grounded motivations for MLOps could be the following:

  • Should be able to take an ML model created earlier and re-train it in the same environment with the same data and get the same accuracy. The real motivation behind this is to be able to take someone else’s ML model and try improving it without talking to them.
  • Should be able to trace back from any ML model running in production to its provenance, i.e., to exactly the parameters and training data used to create it, and even trace the lineage/pre-processing back from training data to raw data.
  • Should be able to take an ML model trained anywhere (platform, framework, or language) and deploy it to different target environments with zero manual steps.
  • ML models can degrade in production without conventional monitoring techniques detecting it, so statistical monitoring is needed.
  • In the event of a drifted ML system (due to so-called data/model/concept drift), automated re-training and re-deployment of the ML model is yet another motivation, provided the risk and governance of new model evaluation and deployment can also be automated or gated (see the sketch after this list).
  • Scalability is an implicit and inherent expectation of any MLOps system.
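For instance, here is a minimal sketch of the statistical-monitoring idea above, assuming NumPy and SciPy are available: compare a live feature sample against its training-time baseline with a two-sample Kolmogorov-Smirnov test and emit a re-training trigger when the distributions diverge. The feature, threshold, and data are purely illustrative, not part of any specific MLOps product.

```python
import numpy as np
from scipy import stats

def detect_drift(baseline: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the live feature distribution differs significantly from the baseline."""
    _statistic, p_value = stats.ks_2samp(baseline, live)
    return p_value < alpha

rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.0, scale=1.0, size=5_000)  # feature values seen at training time
live = rng.normal(loc=0.4, scale=1.0, size=5_000)      # shifted values observed in production

if detect_drift(baseline, live):
    print("Drift detected: raise an alert or trigger the re-training pipeline")
```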

3. DevOps of ML System vs Software System

An ML system is after all a software system, so similar practices apply to help guarantee that you can reliably build and operate ML systems at scale. ML and other software systems are similar in CI (continuous integration) of source control, unit testing, integration testing, and CD (continuous delivery) of the software module or the package. However, there are a few notable differences in ML:

  • CI is no longer only about testing and validating code and components, but also testing and validating data, data schemas, and models.
  • CD is no longer about a single software package or a service, but a system (an ML training pipeline) that should automatically deploy another service (model prediction service).
  • CT is a new property, unique to ML systems, that’s concerned with automatically retraining and serving the models.

4. MLOps Maturity Levels

The MLOps system at the highest level of maturity, “MLOps level 2: CI/CD pipeline automation” as per the GCP MLOps guidelines [ref 2], should have its CI, CD, and CT automated (refer to the diagram below) through the MLOps pipeline. The pipeline consists of the following stages:

1. Development and experimentation: You iteratively try out new ML algorithms and new modelling where the experiment steps are orchestrated. The output of this stage is the source code of the ML pipeline steps that are then pushed to a source repository.

2. Pipeline continuous integration: You build source code and run various tests. The outputs of this stage are pipeline components (packages, executables, and artifacts) to be deployed in a later stage.

3. Pipeline continuous delivery: You deploy the artifacts produced by the CI stage to the target environment. The output of this stage is a deployed pipeline with the new implementation of the model.

4. Automated triggering: The pipeline is automatically executed in production based on a schedule or in response to a trigger. The output of this stage is a trained model that is pushed to the model registry.

5. Model continuous delivery: You serve the trained model as a prediction service. The output of this stage is a deployed model prediction service.

6. Monitoring: You collect statistics on the model performance based on live data. The output of this stage is a trigger to execute the pipeline or to execute a new experiment cycle.

MLOps Maturity — as per GCP MLOps Guidelines

Other MLOps maturity levels are “MLOps level 1: ML pipeline automation” and “MLOps level 0: Manual process”.

5. MLOps Pipeline & Orchestration

The ML pipeline is the foundation for effective industrialization of any ML-based product, as per the Rules of ML by GCP [ref 3]. Its importance is evident from the maturity levels elaborated above; in fact, the suggestion is to first get your ML pipelines ready, and then the whole ML system builds, integrates, and delivers iteratively and successfully through these pipelines.

To make great products: “do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.”

Most of the problems you will face are, in fact, engineering problems. Even with all the resources of a great machine learning expert, most of the gains come from great features, not great machine learning algorithms. So, the basic approach is:

  • Make sure your pipeline is solid end to end.
  • Start with a reasonable objective.
  • Add common­-sense features in a simple way.
  • Make sure that your pipeline stays solid.

Any data-driven workflow pipeline, whether DataOps or MLOps, is essentially some kind of DAG (directed acyclic graph), as its constituent tasks are naturally directed. The difference between DataOps and MLOps pipelines lies only in the nature of the data processing: ETL, ELT, or ML.

Next, Orchestration is the process of coordinating the execution and monitoring of these workflow pipelines — coordinating dependencies, executing tasks in desired order, detecting potential errors, and solving them or generating alerts and logs [ref 4].

https://www.oreilly.com/library/view/building-machine-learning/9781492053187/ch11.html

Workflow orchestration tools allow us to define DAGs by specifying all tasks and how they depend on each other. The tool then executes these tasks on schedule, in the correct order, retrying any that fail before running the next ones. It also monitors progress and notifies the team when failures happen.
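As a concrete illustration of the DAG idea, here is a minimal sketch of such a workflow definition, assuming Apache Airflow 2.x; the dag_id, task names, and callables are hypothetical placeholders, not part of any referenced pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    print("pull raw data and compute features")

def train_model():
    print("train the model and register it")

with DAG(
    dag_id="ml_training_pipeline",       # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",          # the orchestrator runs the DAG on this schedule
    catchup=False,
) as dag:
    features = PythonOperator(task_id="extract_features", python_callable=extract_features)
    training = PythonOperator(task_id="train_model", python_callable=train_model)

    features >> training  # declare the dependency; retries and alerting are handled by Airflow
```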

A few popular examples of data workflow pipeline orchestrator tools/frameworks are:

· OSS tools (Apache Airflow, Apache Beam, Apache NiFi, Luigi, Apache Oozie, etc.)

· GCP Cloud native tools (GCP Cloud Composer based on Apache Airflow, GCP Dataflow based on Apache Beam, etc.)

· AWS Cloud native tools (AWS Data Pipeline, AWS Glue, etc.)

· Azure Cloud native tools (Azure Data Factory, Azure SSIS, etc.)

Notice that there is no such thing as an MLOps-specific workflow pipeline orchestrator. See the list of workflow and pipeline orchestrators at https://neptune.ai/blog/best-workflow-and-pipeline-orchestration-tools [ref 5], and notice that neither Kubeflow nor MLFlow is listed, as both are MLOps tools and frameworks, not ML orchestrators.

6. MLOps Tools & Frameworks

MLOps tools and frameworks utilize generic data workflow pipeline orchestrators (refer to the table below). Here we take a brief rundown of key MLOps tools and frameworks.

6.1. Kubeflow

Kubeflow is an OSS ML platform designed to enable using ML pipelines to orchestrate complicated workflows running on Kubernetes. Argo is the orchestration engine underneath Kubeflow Pipelines.

Kubeflow Pipelines can help compose, deploy, and manage end-to-end (optionally hybrid) machine learning workflows, and supports seamless advancing from prototyping to production anywhere, on-premises or on any major cloud.

The constructs of a Kubeflow pipeline are open-ended and fully customizable, i.e., its task types are not prototypically fixed. Users can define any kind of steps and control flow in a Kubeflow pipeline.
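As a rough illustration of how open-ended these constructs are, here is a minimal sketch of a pipeline definition, assuming the KFP v2 SDK (`kfp`); the component, parameter, and file names are purely illustrative.

```python
from kfp import dsl, compiler

@dsl.component(base_image="python:3.10")
def train(learning_rate: float) -> float:
    # placeholder training step; any Python logic (or container) could go here
    accuracy = 0.95
    return accuracy

@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    train(learning_rate=learning_rate)  # steps and control flow are user-defined

# compile to a pipeline spec that the Kubeflow Pipelines (Argo) backend can run
compiler.Compiler().compile(training_pipeline, package_path="pipeline.yaml")
```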

Kubeflow ranks very high as an MLOps toolset among OSS options, and it also integrates quite well with the major Cloud ML platforms (Azure ML, AWS SageMaker, and GCP Vertex AI). For example, AWS SageMaker supports hybrid ML with Kubeflow through SageMaker Operators for Kubeflow [ref 6]. SageMaker Operators for Kubernetes make it easier for developers and data scientists using Kubernetes to train, tune, and deploy machine learning (ML) models in SageMaker. In essence, this enables seamless advancing from prototyping-on-Kubeflow to production-on-Kubeflow-and-fully-managed-SageMaker.

Sagemaker Operator for Kubeflow

This hybrid approach of Kubeflow plus a Cloud ML platform can be a big enabler: compute- and scale-intensive operations like train/tune/deploy run on the Cloud ML platform while control and orchestration stay within Kubeflow, i.e., the best of both worlds! The most desirable benefits of this hybrid approach are portability and scalability.

Quick bytes on Kubeflow -vs- Airflow comparison [ref 7]: Kubeflow was created by Google to organize their internal machine learning exploration and productization, while Airflow was built by Airbnb to automate any software workflows. Airflow is purely a pipeline orchestration platform but Kubeflow can do much more than orchestration.

https://valohai.com/blog/kubeflow-vs-airflow/

Quick bytes on Kubeflow -vs- MLFlow comparison: Kubeflow solves infrastructure orchestration and experiment tracking, while MLFlow is primarily for experiment tracking (and model versioning & deployment too). Kubeflow is, at its core, a container orchestration system, and MLFlow is a Python program for tracking experiments and versioning models. Think of it this way: When you train a model in Kubeflow, everything happens within the system, i.e., within the Kubernetes infrastructure it orchestrates — while with MLFlow, the actual training happens wherever you choose to run it, and the MLFlow service centrally listens in on parameters and metrics for experiment tracking. Thinking constructively, both are in fact complementary to each other.

Note that both Kubeflow and MLFlow support “experimentation first” approach unlike other generic data workflow orchestrators (Airflow or Luigi) which rather support “dependency and scheduling first” approach.

6.2. MLFlow

MLFlow is an open-source platform that helps manage the whole machine learning lifecycle, though all processing is expected to be delegated to external compute platforms. Essentially, MLFlow can be thought of as a platform-agnostic ML workflow control plane.

The MLFlow platform includes key components (Tracking, Projects, Models, and Registry) that take care of the core MLOps requirements: experimentation, reproducibility, deployment, and storage.
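For example, here is a minimal sketch of the Tracking component in use, assuming scikit-learn is available; the experiment name, parameter, and metric are illustrative only.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)                        # tracked parameter
    mlflow.log_metric("train_accuracy", model.score(X, y))   # tracked metric
    mlflow.sklearn.log_model(model, artifact_path="model")   # versioned model artifact
```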

Note that a Databricks-managed MLFlow is also available as a commercial option. If using open-source MLFlow, one must have the IT experience required for a production-grade setup covering security, user management, high availability, fail-over, and all aspects of maintenance. Besides, the MLFlow server setup must establish secure integration with backends for the core ML processing tasks.

An ML system based on open-source MLFlow can be set up on any Cloud IaaS and can delegate ML processing onto any Cloud ML platform using the corresponding Cloud SDK. Explicit Cloud ML platform support is available primarily in the form of AWS SageMaker- and Azure ML-compatible deployments, and uploading of models & containers onto Cloud repositories.

Though MLFlow has an elaborate set of components (artifacts, model-registry, models, projects, scoring, tracking, server-infra, etc.), the “models” component deserves a closer look. MLFlow “models” supports packaging ML models in multiple flavors [ref 8], along with a variety of tools to help deploy them. Each model is saved as a directory with arbitrary files and a descriptor listing the supported packaging “flavors”. Any model supporting the “Python function” flavor can be deployed to a Docker-based REST server, to cloud platforms such as Azure ML and AWS SageMaker, or as a user-defined function (UDF) in Apache Spark for batch and streaming inference.

MLFlow Models https://mlflow.org/docs/latest/models.html
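A minimal sketch of what the “Python function” flavor enables, assuming PySpark is available; the run ID, registry URI, and column names are hypothetical placeholders.

```python
import mlflow.pyfunc
import pandas as pd
from pyspark.sql import SparkSession

# load any model saved with the python_function flavor and score a pandas DataFrame
model = mlflow.pyfunc.load_model("runs:/<run_id>/model")  # <run_id> is a placeholder
batch = pd.DataFrame({"feature_a": [1.0, 2.0], "feature_b": [3.0, 4.0]})
predictions = model.predict(batch)

# or wrap the same model as a Spark UDF for large batch / streaming inference
spark = SparkSession.builder.getOrCreate()
predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/demo_model/Production")  # hypothetical registry URI
scored = spark.createDataFrame(batch).withColumn("prediction", predict_udf("feature_a", "feature_b"))
```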

6.3. Google TensorFlow Extended (TFX)

TFX [ref 9] is a configuration framework to express ML pipelines consisting of TFX components but it is specific to TensorFlow.

TFX pipelines can be orchestrated using Apache Airflow, Apache Beam, or as Kubeflow pipelines.

TFX pipelines, too, support seamless advancing from prototyping to production anywhere, on-premises or on GCP.
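A minimal sketch of a TFX pipeline under these assumptions (the tfx v1 public API, local orchestration, and hypothetical paths and module file):

```python
from tfx import v1 as tfx

# standard TFX components; input_base and module_file are hypothetical paths
example_gen = tfx.components.CsvExampleGen(input_base="/data/raw")
trainer = tfx.components.Trainer(
    module_file="trainer_module.py",            # user-provided TensorFlow training code
    examples=example_gen.outputs["examples"],
    train_args=tfx.proto.TrainArgs(num_steps=100),
    eval_args=tfx.proto.EvalArgs(num_steps=50),
)

pipeline = tfx.dsl.Pipeline(
    pipeline_name="demo_tfx_pipeline",
    pipeline_root="/pipelines/demo",
    components=[example_gen, trainer],
    metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
        "/pipelines/demo/metadata.db"
    ),
)

# run locally; the same definition can be handed to Airflow, Beam, or Kubeflow runners
tfx.orchestration.LocalDagRunner().run(pipeline)
```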

6.4. Google Vertex AI Pipeline

Vertex AI [ref 10] brings together the Google Cloud services for building ML under one, unified UI and API. Vertex AI is fully managed and serverless.

Vertex AI Pipeline https://codelabs.developers.google.com/vertex-pipelines-intro#1

Vertex AI Pipeline helps to automate, monitor, and govern ML systems by orchestrating ML workflow and storing workflow’s artifacts using Vertex ML Metadata. Vertex AI Pipeline supports running both TFX Pipeline as well as Kubeflow Pipeline.
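A minimal sketch of submitting a compiled pipeline to Vertex AI Pipelines, assuming the google-cloud-aiplatform SDK; the project, region, bucket, and parameter names are hypothetical.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical project/region

job = aiplatform.PipelineJob(
    display_name="demo-pipeline",
    template_path="pipeline.yaml",                 # KFP/TFX pipeline spec compiled earlier
    pipeline_root="gs://my-bucket/pipeline-root",  # hypothetical GCS location for artifacts
    parameter_values={"learning_rate": 0.01},
)
job.run()  # executes the pipeline; run metadata lands in Vertex ML Metadata
```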

Though users could suspect a risk of vendor lock-in, Vertex AI Pipelines is expected to support portability due to its compatibility with Kubeflow.

6.5. AWS Sagemaker Pipeline

The AWS SageMaker Pipelines service [ref 11] supports a SageMaker Pipelines domain-specific language (DSL), which is a declarative JSON specification. This DSL defines a directed acyclic graph (DAG) of pipeline parameters and SageMaker job steps. The SageMaker Python Software Development Kit (SDK) streamlines the generation of the pipeline DSL using constructs that engineers and scientists are already familiar with.
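A minimal sketch of that DSL being generated with the SageMaker Python SDK, assuming a training container image and execution role already exist; the image URI, role ARN, bucket, and step names are hypothetical.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

session = sagemaker.Session()
role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # hypothetical role ARN

input_data = ParameterString(name="InputData", default_value="s3://my-bucket/train/")

estimator = Estimator(
    image_uri="<training-image-uri>",   # hypothetical training container
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data=input_data)},
)

pipeline = Pipeline(name="DemoPipeline", parameters=[input_data], steps=[train_step])
pipeline.upsert(role_arn=role)  # materializes the declarative JSON DSL in SageMaker
pipeline.start()
```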

AWS SageMaker Pipelines is fully managed: its control plane, model repository, and all compute instances & clusters are managed for you.

Sagemaker Pipeline https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.html

AWS SageMaker even extends orchestration of SageMaker Train, Tune, Deploy, and Batch Transform from within an Apache Airflow-based workflow pipeline, using the Airflow SageMaker Operators or the Airflow Python Operator with the SageMaker SDK [ref 12].

Sagemaker Operator for Airflow https://aws.amazon.com/blogs/machine-learning/build-end-to-end-machine-learning-workflows-with-amazon-sagemaker-and-apache-airflow/
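A minimal sketch of that pattern, assuming the SageMaker Python SDK’s airflow helpers and the Amazon provider package for Airflow; the estimator settings, S3 path, and DAG name are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.sagemaker import SageMakerTrainingOperator
from sagemaker.estimator import Estimator
from sagemaker.workflow.airflow import training_config

# build a SageMaker training-job config with the SageMaker SDK (values are placeholders)
estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
train_cfg = training_config(estimator=estimator, inputs="s3://my-bucket/train/")

with DAG(dag_id="sagemaker_from_airflow", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    # Airflow orchestrates; SageMaker does the heavy lifting for training
    sagemaker_train = SageMakerTrainingOperator(
        task_id="sagemaker_train",
        config=train_cfg,
        wait_for_completion=True,
    )
```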

AWS SageMaker can also co-exist with MLFlow (refer to this AWS ML blog [ref 13]), where all backend ML processing (training experiments & runs, deployment, etc.) is performed on the SageMaker platform and all metadata for tracking, models, deployment, etc. is hosted in MLFlow repositories.

Sagemaker with MLFlow https://aws.amazon.com/blogs/machine-learning/managing-your-machine-learning-lifecycle-with-mlflow-and-amazon-sagemaker/

7. MLOps Pipeline Automation

Manually triggering the MLOps pipeline may be enough during development and experimentation. This manual approach works if your team manages only a few ML models, if the models do not require re-training, or if the whole ML system does not require much iteration.

Observe in the “MLOps Maturity Levels” section that, except for the first step of development & experimentation, all other steps demand CI/CD and CT automation.

Continuous Delivery for Machine Learning (CD4ML) is a software engineering approach in which a cross-functional team produces machine learning applications based on code, data, and models in small and safe increments that can be reproduced and reliably released at any time, in short adaptation cycles. https://databricks.com/session_na20/productionalizing-models-through-ci-cd-design-with-mlflow [ref 14]

We briefly touched upon the high-level differences between a software system and an ML system from the CI/CD/CT perspective. The table below shows how many varieties of input elements are involved in an ML system.

ref: AIM406 — AWS re:Invent 2020: Implementing MLOps practices with Amazon SageMaker — https://www.youtube.com/watch?v=8ZpE-9LnaJk

8. Industrialized Data and AI Engineering Acceleration (IDEA) by Capgemini

IDEA by Capgemini [ref 15] is a suite of capabilities, accelerators, frameworks and methodologies that help organizations modernize their data estate as it relates to people, processes and technology. IDEA by Capgemini leverages cloud technology, DevOps principles and a modular microservices architecture to reduce the burden of change on the workforce and help organizations simplify and streamline every aspect of their Data and AI modernization journey.

IDEA by Capgemini

IDEA has an Insights Foundation module for AI acceleration on Cloud ML platforms across all phases of the ML lifecycle, including MLOps over Cloud ML with Kubeflow.

The author leads the AI/ML workstream of the IDEA accelerators platform at Capgemini.

References

  1. https://www.slideshare.net/databricks/scaling-ridehailing-with-machine-learning-on-mlflow
  2. GCP MLOps Guidelines https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
  3. Rules of ML by GCP https://developers.google.com/machine-learning/guides/rules-of-ml
  4. Building Machine Learning Pipelines https://www.oreilly.com/library/view/building-machine-learning/9781492053187/ch11.html
  5. Best Workflow and Pipeline Orchestrator Tools https://neptune.ai/blog/best-workflow-and-pipeline-orchestration-tools
  6. AWS SageMaker Components for Kubeflow Pipelines https://aws.amazon.com/blogs/machine-learning/introducing-amazon-sagemaker-components-for-kubeflow-pipelines/
  7. Kubeflow vs Airflow https://valohai.com/blog/kubeflow-vs-airflow/
  8. MLFlow Models https://mlflow.org/docs/latest/models.html
  9. Google TensorFlow Extended TFX https://www.tensorflow.org/tfx
  10. GCP Vertex AI Pipeline https://cloud.google.com/vertex-ai/docs/pipelines/introduction
  11. AWS Sagemaker Pipeline https://aws.amazon.com/sagemaker/pipelines/
  12. Sagemaker Operators for Airflow https://sagemaker.readthedocs.io/en/stable/workflows/airflow/using_workflow.html
  13. Managing your machine learning lifecycle with MLflow and Amazon SageMaker https://aws.amazon.com/blogs/machine-learning/managing-your-machine-learning-lifecycle-with-mlflow-and-amazon-sagemaker/
  14. CD4Ml with Databricks MLFlow https://databricks.com/session_na20/productionalizing-models-through-ci-cd-design-with-mlflow
  15. IDEA — by Capgemini https://www.capgemini.com/solutions/industrialized-data-ai-engineering-acceleration-idea-by-capgemini/
