An End-to-End MLOps Platform Implementation using Open-source Tooling

May 3, 2022

By Omolade Saliu, IBM Executive Data Scientist and Head of Data Science

For the last nine years at IBM, I have had the privilege of driving several first-of-a-kind AI solutions for our clients across multiple industries. Recently, we have been helping many of our clients transition from a strategy where data science ends at sharing insights with the business to a product-oriented one where deliverables are embedded in enterprise solutions. Though the former approach will continue to bring business value, the full return on investment (ROI) is only realized when data science deliverables go all the way from experimentation to production and are put in the hands of end users. This transition, as you might have guessed, has a lot of engineering and operationalization aspects to it, giving rise to the Machine Learning Operations (MLOps) movement we see in the AI industry today.

MLOps seeks to apply proven DevOps practices to automate the building, deployment, and monitoring of Machine Learning (ML) pipelines in production. As the field of MLOps continues to emerge and garner attention across the enterprise, there is a growing number of disparate, largely open-source tools/frameworks being developed to support specific MLOps capabilities, just as there are various vendor platforms.

A major challenge that our clients continue to struggle with is the lack of a common pattern that brings the best of these tools together. In this article, I will share an MLOps architecture that addresses this challenge in a platform-agnostic manner, an approach that could serve as a template to help accelerate the process of taking data science experimentation to production in an easy, reusable, and consistent way. We start with a quick background on the motivating factors and our journey. We then spend the rest of the article describing the open-source tools we have leveraged in the context of the MLOps capabilities they enable, the MLOps life cycle processes that need to be supported, the resulting platform implementation, and some lessons learned from implementing this platform at scale.

Some Background

In 2019, I led an initiative with a small technical working group within our AI & Analytics Practice to explore tools that would enable our data scientists to standardize the way we implement data science solutions. The goal was to ensure we can share knowledge easily, accelerate our delivery, and consistently reuse assets and accelerators, while ultimately staying ahead of our clients' needs. So, we started by experimenting with the following data science tooling:

  • Kedro — an open-source Python library for creating reproducible, maintainable, and modular data science pipelines by adopting software engineering concepts.
  • IBM MLApp — a Python library for building scalable data science solutions that meet modern software engineering standards.
  • Metaflow – a Python library to help manage data science projects.
  • Cookiecutter data science — a scaffolding tool for building consistent project structure and templates for data science.

However, the experimentation quickly morphed into an exploration of workflow/pipeline orchestration frameworks like Airflow, Kubeflow, and Argo. Eventually, we felt we had learned enough to take on the challenge of building an end-to-end platform for MLOps. But before diving into the platform itself, what are the MLOps processes we are trying to enable?

Building MLOps Platform with Open-Source Tools

The Google MLOps whitepaper by Khalid Salama and his team [2] provides a good overview of the integrated and iterative MLOps lifecycle processes (see Figure 1).

Figure 1: The MLOps Process Lifecycle [2]

MLOps platforms need to provide core technical capabilities and tooling that enable organizations to effectively implement these MLOps lifecycle processes. These technical capabilities become easier to navigate if we break them down into categories. Here we have suggested the categories of ML Pipeline Development, Experiment & Metadata Tracking, Model Deployment, Automation, Model Serving, and Monitoring.

  • ML Pipeline Development — orchestrating ML training and prediction workflows, coordinating the different stages (e.g. data preparation, model training, evaluation, etc.) — and making it easy to automate deployment of repeatable pipelines.
  • Experiment & Metadata Tracking — track ML experiments to collect information about what data went into which model that produced what performance result, including model parameters and artifacts.
  • Model Deployment — enable packaging and deploying trained model to a target serving environment.
  • Automation — extends traditional CI/CD to support Continuous Training (CT) in addition to automating the build, test, and release of ML pipelines.
  • Model Serving — allows the deployed model to accept prediction requests for inferencing by taking input data and providing responses with predicted results.
  • Monitoring — enables the tracking and reporting of deployed models in production to identify performance degradation and inform further actions.

While most open-source tools specialize in one or more core capabilities, there exist vendor solutions attempting to consolidate some of these capabilities in a unified, managed ML platform. Taking a more holistic approach and integrating some leading open-source tools in a single cloud-agnostic MLOps platform is exactly what we did in our implementation. There are obviously many open-source MLOps tools (see “The Best MLOps Tools and How to Evaluate Them” [1]). Table 1 below highlights the main tools used in our implementation, the technical capability they enable, and the MLOps processes they help organizations implement:

Table 1: MLOps Technical Capability, Processes, and Open-Source Tooling

Architecture Overview

The end-to-end architecture shown in Figure 2 underpins our MLOps platform and can be deployed on any Kubernetes-compliant cluster, whether locally, on-premises, or in the cloud. For our purposes, we deployed on an Azure Kubernetes Service (AKS) cluster. The architecture follows a microservices pattern, so components can be deployed individually and integrated via API calls. Kubeflow and MLflow should be the first two services to deploy; the two are connected in the Kubeflow workspace through the MLflow tracking URI. Once deployed, GoCD monitors the GitHub repository and has linkages to both the MLflow and Seldon Core hosts to facilitate deployment. Prometheus is deployed on the cluster, set up with the endpoints of the Kubernetes cluster to be monitored, and integrated with Grafana to expose the collected metrics. Streamlit resides in the consumption layer of the architecture to interface with the endpoints of any deployed models.

Figure 2: Open-source Driven MLOps Architecture Implemented in the Platform

Platform Capabilities

The MLOps platform is an implementation of the architecture in Figure 2. Below is a summary description of how the platform supports end-to-end ML model operationalization.

ML Pipeline Development:
The architecture workflow starts with ML pipeline development, which supports data science exploration and development within Kubeflow's notebook server. The notebook server provides collaborative, yet isolated, workspaces for data scientists to work and scale experimentation. Data scientists author ML pipelines using the Kubeflow Pipelines (Kfp) SDK, a domain-specific language (DSL) for specifying the complete modeling workflow: the steps or components and how they interact with each other. Behind the scenes, the SDK packages each step (e.g. data preprocessing, model training, model validation) as an individual Docker image that runs in its own container on the underlying Kubernetes infrastructure. The components of the model workflow are expressed as jobs submitted to Kubeflow Pipelines. The data scientist doesn't necessarily need to know much about container orchestration and management, which is handled automatically by Kubeflow. What Kubeflow does is take the data science code and convert it into a container DAG (directed acyclic graph), like you would see in Airflow, and it then allows Argo to automatically orchestrate the entire workflow. One of the advantages of Kubeflow is that it enables a data scientist to run multiple training modules in parallel; if we have a main model training step and an outlier detector training step, we don't have to run them sequentially, as the sketch below illustrates.
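To make this concrete, here is a minimal sketch of what a pipeline definition with the Kfp SDK can look like (v1-style API). The component functions, base image, and paths are illustrative placeholders, not the platform's actual code.

```python
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def preprocess(raw_path: str) -> str:
    # Placeholder preprocessing step; in practice this reads, cleans, and writes data.
    return raw_path + "/processed"

def train_model(processed_path: str) -> str:
    # Placeholder training step for the main model.
    return processed_path + "/model"

def train_outlier_detector(processed_path: str) -> str:
    # Placeholder training step for the outlier detector.
    return processed_path + "/outlier_detector"

# Each function is packaged as its own container image behind the scenes.
preprocess_op = create_component_from_func(preprocess, base_image="python:3.9")
train_op = create_component_from_func(train_model, base_image="python:3.9")
outlier_op = create_component_from_func(train_outlier_detector, base_image="python:3.9")

@dsl.pipeline(name="training-pipeline", description="Preprocess, then train two models in parallel")
def training_pipeline(raw_path: str = "/data/raw"):
    prep = preprocess_op(raw_path)
    # Both training steps depend only on preprocessing, so Argo schedules them in parallel.
    train_op(prep.output)
    outlier_op(prep.output)

if __name__ == "__main__":
    # Compile to an Argo workflow spec; kfp.Client() could instead submit it directly.
    kfp.compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```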

Experiment and Metadata Tracking:
Although Kubeflow has a UI that shows experiments and training runs, it lacks the sophistication of MLflow and the ability to easily persist past experiment data, which can make monitoring and reproducibility difficult. MLflow does experiment tracking a whole lot better, which is the main motivation for integrating MLflow with Kubeflow.

An experiment or training run generally kicks off experiment tracking activities in the form of simple MLflow API calls within the pipeline code. This way, the platform tracks and stores in the MLflow server the raw and cleansed/processed data, schemas, trained models, hyperparameters, evaluation results, and other metadata and artifacts, including the Python version used in training the models. All tracked artifacts are then exposed through the MLflow UI, where they are easily accessible for inspecting the experiment runs.
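As an illustration, the tracking calls inside a training step look roughly like the sketch below. The tracking URI, experiment name, and toy model are assumptions made for the example, not the platform's actual configuration.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data and model so the sketch is self-contained; the real pipeline step
# would receive data from the preprocessing component.
X, y = make_classification(n_samples=200, random_state=0)
model = RandomForestClassifier(n_estimators=200, max_depth=8).fit(X, y)

mlflow.set_tracking_uri("http://mlflow-server:5000")  # assumed in-cluster MLflow service address
mlflow.set_experiment("demo-experiment")

with mlflow.start_run(run_name="training-run"):
    mlflow.log_params({"n_estimators": 200, "max_depth": 8})   # hyperparameters
    mlflow.log_metric("train_accuracy", model.score(X, y))     # evaluation result
    mlflow.sklearn.log_model(model, artifact_path="model")     # trained model artifact
```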

Model Deployment:
We leverage Seldon Core to deploy models developed on the platform. MLflow comes with a model registry capability that makes it easy for us to govern the lifecycle of ML models by storing, versioning, and organizing the models into staging or production deployments. MLflow also offers built-in integration with several deployment services, including Seldon Core. Thus, our Seldon Core deployment service is set up to pick up the appropriate models from the MLflow model registry, deploy them to the target serving environment, and generate the endpoints for model serving in production.
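Below is a minimal sketch of what such a deployment can look like, assuming Seldon Core's pre-packaged MLFLOW_SERVER and the Kubernetes Python client. The model URI, resource names, and namespace are placeholders, not the platform's actual values.

```python
from kubernetes import client, config

# In the platform, the model artifacts would come from the MLflow model registry,
# e.g. a registered model promoted to "Production" and stored in object storage.
MODEL_URI = "s3://mlflow-artifacts/churn-model/production"  # placeholder URI

seldon_deployment = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {"name": "churn-model"},
    "spec": {
        "predictors": [
            {
                "name": "default",
                "replicas": 1,
                "graph": {
                    "name": "classifier",
                    "implementation": "MLFLOW_SERVER",  # Seldon's pre-packaged MLflow server
                    "modelUri": MODEL_URI,
                },
            }
        ]
    },
}

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="machinelearning.seldon.io",
    version="v1",
    namespace="models",  # assumed namespace
    plural="seldondeployments",
    body=seldon_deployment,
)
```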

Seldon Core enables us to deploy our models to any Kubernetes cluster, either on-premises or on any cloud. To demonstrate the cloud-agnostic capability of the platform, we swapped out Seldon Core and implemented alternative deployments to the Azure cloud through MLflow. We created and attached an Azure Kubernetes Service (AKS) inference cluster to enable deployment of models to an Azure Machine Learning workspace without changing any of the upstream tasks on the platform. This is an important functionality for an organization running multiple platforms: we can always deploy models to different target environments without changing code, except for the deployment layer.

Model Serving:
Streamlit is a Python framework that allows us to quickly build web apps as the consuming interface; these apps call the deployed model endpoints in the serving engine to enable interactive online inferencing. There is also the option of batch inferencing for bulk data scoring on the platform.
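A consuming app can be as small as the sketch below. The endpoint URL and feature names are illustrative placeholders; the payload follows Seldon Core's ndarray prediction convention, which is an assumption about how the serving layer is exposed.

```python
import requests
import streamlit as st

# Placeholder Seldon Core prediction endpoint for a deployed model.
SELDON_URL = "http://seldon-gateway/seldon/models/churn-model/api/v1.0/predictions"

st.title("Churn Prediction")
tenure = st.number_input("Tenure (months)", min_value=0, value=12)
monthly_charges = st.number_input("Monthly charges", min_value=0.0, value=70.0)

if st.button("Predict"):
    payload = {"data": {"ndarray": [[tenure, monthly_charges]]}}
    response = requests.post(SELDON_URL, json=payload, timeout=10)
    st.json(response.json())  # display the raw prediction response
```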

Automation:
To ensure the data scientists don't have to manually repeat these steps for every experiment, we implemented automation for both the ML training and deployment pipelines using GoCD. We picked GoCD for our CI/CD implementation out of convenience; any other CI/CD tool would work just fine. The CI/CD implementation ensures we can build, test, release, and run the entire ML pipeline. Now, when a data scientist needs to run the training and deployment pipelines, they push changes to the GitHub repository where the code is stored, which automatically triggers the GoCD pipeline run. It is worth noting that the automated pipeline training and deployment also enables the continuous training (CT) process that differentiates machine learning solutions from traditional software applications, which do not require retraining.

Monitoring:
For the monitoring layer, we leveraged both Prometheus and Grafana to track real-time metrics for visualization on a dashboard, supporting continuous health monitoring of running models as well as the underlying infrastructure. We collect and monitor operational metrics (e.g., request rate and latency), model quality metrics (e.g., data drift, model drift, outlier detection), and custom metrics relevant to specific use cases implemented on the platform. In addition to tracking and visualization, we are also able to set up alerting and escalation mechanisms that notify on-call support when drift is detected, either in the data or the model. Since we also have a CI/CD server running, we can trigger automatic retraining and deployment for specific models based on alert thresholds.
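As a generic illustration of the custom-metrics path (not the platform's or Seldon Core's built-in instrumentation), the sketch below publishes a hypothetical drift score with prometheus_client so that Prometheus can scrape it and Grafana can chart and alert on it.

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# Custom model-quality metric; the name and semantics are assumptions for the example.
drift_score = Gauge("model_drift_score", "Data drift score for the deployed model")

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics
    while True:
        # In a real deployment this value would come from a drift detector;
        # here it is simulated so the sketch runs on its own.
        drift_score.set(random.random())
        time.sleep(60)
```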

Lessons Learned

Here are some lessons learned that we would like to share from implementing this MLOps platform at scale:

  • MLOps requires a culture shift — it's not going to happen magically once a model is developed. It requires a lot of buy-in and alignment across teams on things like code versioning, deployment frequency, end-to-end testing, and governance.
  • The use of open-source rather than cloud-native or managed services requires breadth of knowledge across tools/frameworks and the ability to maintain and debug at scale, which is not trivial.
  • Use the right tool for the right job — many tools overlap in functionality. We can deploy models and track experiments using Kubeflow, but MLflow is more streamlined and not as brittle.
  • Creating is not the hardest part, maintenance is — maintaining a platform like this with disparate tools requires diverse skills and experience, which some organizations will lack. Besides, every time underlying tools and services are updated or new versions are released, it is a lot of work to keep the platform updated and running. For example, if one of the images from Kubeflow gets deprecated, it breaks the automated build pipeline, and it can take hours of effort to diagnose and fix.

Summary

To conclude, the end-to-end MLOps platform that we have built is a flexible accelerator that can facilitate the operationalization of a wide range of machine learning use cases. While it requires a lot of diverse skills to implement, the opportunity for an organization to build truly custom workflows that are both vendor-agnostic and cloud-agnostic will always be appealing, especially given the opportunities to benefit from innovations coming out of the large open-source community. There are other capabilities in our backlog to extend the platform, including, for example, the implementation of a feature store. A feature store will help reduce the need for teams or lines of business to reinvent the wheel on data preparation by leveraging features already created from raw data by others. We will also continue to build boilerplates, including a library of pre-built model templates, on top of this platform.

MLOps continues its rapid growth, and it's no surprise that on January 25, 2022, the Canonical team released Charmed Kubeflow 1.4, which supports MLflow integration [3]. Although we successfully implemented and deployed our platform in Q2 2021, we were intrigued by the chance to experiment with Charmed Kubeflow so we would not have to install MLflow separately. In late March 2022, we implemented a version of our platform using Charmed Kubeflow instead of standard Kubeflow, and I hope to share our experiences at some point in the future.

We hope you find this article helpful. We are always learning, so if you have any suggestions or experiences with similar approaches to integrating open-source tools in a single platform, please share them in the comments section. We would love to hear and learn from others in this space.

Acknowledgment

Thank you, Emmanuel Ibidunmoye, Judah Okwuobi, Mohammed Taboun, Joshua Orfin, Linde Chen, and the other IBMers who collaborated with me on this work, as well as our pool of talented interns who have been a part of this journey. Raheem Rufai and Shahir Daya, thanks for the incisive feedback.

References

[1] Jakub Czakon, The Best MLOps Tools and How to Evaluate Them, https://neptune.ai/blog/best-mlops-tools, January 17, 2022, [Accessed: 28-Apr-2022]

[2] Khalid Salama, Jarek Kazmierczak, Donna Schut, Practitioners guide to MLOps: A framework for continuous delivery and automation of machine learning, https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf, White Paper, May 2021, [Accessed: 28-Apr-2022]

[3] Canonical Charmed Kubeflow, “Canonical Reveals Charmed Kubeflow 1.4”, https://www.dbta.com/Editorial/News-Flashes/Canonical-Reveals-Charmed-Kubeflow-14-151100.aspx, January 25, 2022. [Accessed: 28-Apr-2022]

[4] Kedro, “Welcome to Kedro’s documentation!”, https://kedro.readthedocs.io/en/stable/, [Accessed: 28-Apr-2022]

[5] MLApp, https://github.com/IBM/mlapp, [Accessed: 28-Apr-2022]

[6] “Welcome to Metaflow for Python”, https://docs.metaflow.org/, [Accessed: 28-Apr-2022]

[7] “Cookiecutter Data Science”, https://drivendata.github.io/cookiecutter-data-science/, [Accessed: 28-Apr-2022]


Omolade Saliu is an IBM Distinguished Engineer & Chief Data Scientist, AI & Analytics at IBM Consulting Canada