Build a Production-Ready MLOps Framework Using VertexAI

Your first step in establishing an MLOps practice and operationalizing ML systems.

Siddharth Suri
Slalom Data & AI
8 min read · Nov 14, 2022


A machine learning operations (MLOps) framework enables organizations to build and deploy machine learning (ML) at scale by simplifying ML design, training, testing, deployment, and performance monitoring. MLOps systems combine proven DevOps concepts such as continuous integration and continuous delivery (CI/CD) with data- and ML-specific concepts such as the feature store and model monitoring.

Photo by Hitesh Choudhary on Unsplash

In this article, we’ll cover the following steps and services involved in building an MLOps framework using Google Cloud’s Vertex AI offering:

  • Code-driven MLOps framework for data ingest, data prep, model training, model retraining, model deployment, model monitoring, and much more
  • ML pipelines automation
  • CI/CD framework for deploying the pipelines to production
End-to-end MLOps flow

Getting started

To develop a code-driven MLOps framework, it is important to separate the model code from the operations (“ops”) code and to develop the system outside a notebook environment. We want to establish a system in which the availability of new data triggers retraining of the ML model, and in which a new implementation of the ML pipeline (for example, a new model architecture or new feature engineering) triggers a deployment and re-execution of the new pipeline using CI/CD.

To develop an MLOps system on Google Cloud, we’ll use the following Google Cloud resources:

  1. Source repo
  2. Cloud Build
  3. Vertex AI Workbench managed notebooks
  4. Vertex AI Pipelines (managed Kubeflow Pipelines)
  5. Artifact Registry
  6. Google Cloud Storage
  7. BigQuery
  8. Vertex AI Model Registry
  9. Vertex AI Metadata
  10. Cloud Functions
  11. Cloud Scheduler
  12. Vertex AI Feature Store

Building your MLOps system

Step 1: Prep your data.

This step includes performing all of the data extraction, preprocessing, and feature engineering needed for your model and ingesting the resulting features into the Vertex AI Feature Store.

Tip: Separate out each of these tasks as its own Kubeflow component (i.e., as separate Python functions).
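
As a concrete illustration, here is a minimal sketch of ingesting engineered features into an existing Vertex AI Feature Store with the google-cloud-aiplatform SDK. The feature store, entity type, feature IDs, and DataFrame columns are placeholder assumptions, not names from this article.

```python
# Hedged sketch: ingest engineered features into an existing Vertex AI Feature Store.
# All resource names below (my-project, my_featurestore, users) are placeholders.
import pandas as pd
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Look up a feature store and entity type created earlier (e.g., via Terraform).
fs = aiplatform.Featurestore(featurestore_name="my_featurestore")
users = fs.get_entity_type(entity_type_id="users")

# Engineered features; column names must match the feature IDs in the entity type.
features_df = pd.DataFrame(
    {
        "user_id": ["u1", "u2"],
        "avg_purchase_value": [42.5, 13.0],
        "days_since_last_login": [3, 11],
    }
)

users.ingest_from_df(
    feature_ids=["avg_purchase_value", "days_since_last_login"],
    feature_time=pd.Timestamp.utcnow(),  # or a column name with per-row timestamps
    df_source=features_df,
    entity_id_field="user_id",
)
```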

Step 2: Develop the model.

Choose the appropriate ML algorithm, then build and train the model. Register it in the Vertex AI Model Registry and deploy it to a Vertex AI endpoint to serve online use cases, or run batch predictions for offline use cases.
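
The snippet below is a hedged sketch of that registration and deployment flow using the google-cloud-aiplatform SDK; the artifact URI, display names, and serving image are placeholders for your own.

```python
# Hedged sketch: register a trained model and deploy it for online predictions.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Upload the trained model artifacts to the Vertex AI Model Registry.
model = aiplatform.Model.upload(
    display_name="churn-model",                      # placeholder name
    artifact_uri="gs://my-bucket/models/churn/",     # where training saved the model
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)

# Deploy to an endpoint for online use cases.
endpoint = model.deploy(machine_type="n1-standard-2")
print(endpoint.predict(instances=[[0.3, 1.2, 5.0]]))

# For offline use cases, run a batch prediction job instead of deploying an endpoint:
# model.batch_predict(job_display_name="churn-batch",
#                     gcs_source="gs://my-bucket/input.jsonl",
#                     gcs_destination_prefix="gs://my-bucket/predictions/")
```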

Data and model training pipelines

Step 3: Develop your pipeline.

We recommend using Vertex AI Pipelines to orchestrate the steps of your ML workflow. This feature is heavily based on Kubeflow and uses the Kubeflow Pipelines (KFP) Python SDK to define the pipelines. Running a pipeline consists of three tasks:

A) Define the steps. A pipeline is made up of various steps called ‘components’. A pipeline component is reusable, self-contained code that is packaged into a Docker container image. There are a few different options for defining components:

  • Docker images
  • Decorators
  • Converting functions

We will define all of the steps of our ML workflow in separate Python functions and convert each of our functions into a component that will be used by Vertex AI Pipelines.
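Here is a hedged sketch of what two such function-based components might look like with the KFP v2 DSL; the function names, BigQuery table, and feature engineering logic are illustrative assumptions rather than the article’s actual code.

```python
# Hedged sketch: convert plain Python functions into KFP components.
from kfp.v2.dsl import component, Dataset, Input, Output


@component(
    base_image="python:3.9",
    packages_to_install=["pandas", "google-cloud-bigquery", "db-dtypes"],
)
def extract_data(bq_table: str, raw_data: Output[Dataset]):
    """Pull a BigQuery table and write it out as a CSV artifact."""
    from google.cloud import bigquery

    client = bigquery.Client()
    df = client.query(f"SELECT * FROM `{bq_table}`").to_dataframe()
    df.to_csv(raw_data.path, index=False)


@component(base_image="python:3.9", packages_to_install=["pandas"])
def preprocess(raw_data: Input[Dataset], features: Output[Dataset]):
    """Basic feature engineering on the extracted data."""
    import pandas as pd

    df = pd.read_csv(raw_data.path)
    df = df.dropna()
    df.to_csv(features.path, index=False)
```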

B) Arrange the steps into a pipeline. Take the steps (components) defined in task A) and wrap them into a function with a pipeline decorator.

We will specify dependencies between steps so that Vertex AI Pipelines orchestrates them in the correct order. We will also set memory and CPU requirements for individual steps, so that if a step requires a large amount of memory or many CPUs, Vertex AI Pipelines provisions a compute instance sufficient to perform that step.
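
A minimal sketch, assuming the two components above, of how the pipeline function and per-step resources might look; the pipeline name and bucket path are placeholders.

```python
# Hedged sketch: arrange components into a pipeline and set per-step resources.
from kfp.v2 import dsl

PIPELINE_ROOT = "gs://my-bucket/pipeline-root"  # placeholder bucket


@dsl.pipeline(name="model-train-pipeline", pipeline_root=PIPELINE_ROOT)
def model_train_pipeline(bq_table: str):
    extract_task = extract_data(bq_table=bq_table)

    # Passing one step's output as the next step's input defines the execution order.
    preprocess_task = preprocess(raw_data=extract_task.outputs["raw_data"])

    # Request more resources for the heavier step; Vertex AI Pipelines provisions
    # a machine that satisfies these limits.
    preprocess_task.set_cpu_limit("4").set_memory_limit("16G")
```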

C) Compile and run the pipeline. The compile function packages your pipeline. When you invoke the pipeline run, you can pass in various arguments that are used by your pipeline. You can also specify configurations, such as whether to enable caching to accelerate pipeline runs and which service account to use when running the pipeline.
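
The following is a hedged sketch of compiling the pipeline above and submitting it to Vertex AI Pipelines; the project, region, bucket, parameter values, and service account are placeholders.

```python
# Hedged sketch: compile the pipeline to JSON and run it on Vertex AI Pipelines.
from google.cloud import aiplatform
from kfp.v2 import compiler

# Package the pipeline function into a JSON definition.
compiler.Compiler().compile(
    pipeline_func=model_train_pipeline,
    package_path="model_train_pipeline.json",
)

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.PipelineJob(
    display_name="model-train-pipeline",
    template_path="model_train_pipeline.json",
    pipeline_root="gs://my-bucket/pipeline-root",
    parameter_values={"bq_table": "my-project.my_dataset.training_data"},
    enable_caching=True,  # reuse results of unchanged steps to speed up runs
)
job.submit(service_account="mlops-pipeline-sa@my-project.iam.gserviceaccount.com")
```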

Reference Kubeflow pipeline

Step 4: Create a Cloud Build flow (automated deployments of Vertex AI Pipelines CI/CD).

Cloud Build is a service that executes CI/CD routines on Google Cloud infrastructure. In Vertex AI, Cloud Build is used to build, test, and deploy machine learning pipelines. Each build step is run in a Docker container. The following tasks will be executed as part of your Cloud Build flow:

A) Clone the source repo

B) Run unit tests

C) Build the component images (Docker container images)

D) Push the images to Artifact Registry

E) Compile your pipeline

F) Upload compiled JSON files to Google Cloud Storage

G) Deploy and run your pipeline

Cloud Build steps

We will have build routines (Cloud Build configuration files) that will be executed in response to different triggers.

To create a Docker image, your component file should have the following:

  • Dockerfile — the Docker daemon runs the instructions in the Dockerfile one by one, committing the result of each instruction to a new intermediate image if necessary, before finally outputting the ID of your new image. The daemon automatically cleans up the build context you sent.
  • component.yaml — a component specification takes the form of a YAML file and describes the container component data model for Kubeflow Pipelines. The data model is serialized to a file in YAML format for sharing.
  • Python function — the actual component code (.py file).
  • requirements.txt — the packages to install for the Python function.
Docker build flow

Create the Cloud Build triggers to push the component images to the Artifact Registry. There will be a trigger created for each environment, including:

A) trigger to deploy the terraform code

B) trigger to deploy the components

C) trigger to deploy the pipelines

Each Cloud Build routine is triggered when the corresponding YAML configuration file is pushed to the source repository. The environment variables are defined in the triggers using substitution variables.

Cloud Build substitution variables
CI/CD flow

Step 5: Build your infrastructure.

You can use Terraform to create all the infrastructure and services required for MLOps and store the scripts in your Cloud Source Repository.

In this example, we will create Google Cloud Storage buckets, Artifact Registry, Cloud Functions, Feature Store, Notebook, Metadata Store, and Cloud Scheduler jobs via Terraform scripts.

Organize your source repository using the following structure:

  • cloud-build — all Cloud Build YAML files
  • components — the ML code (built as a Python function and stored as a component), the Dockerfiles required to create your Docker images (one for each Kubeflow Pipelines component)
  • notebook — any notebooks used for exploratory data analysis, model analysis, and interactive experimentation on models
  • pipelines — your Python module (for example, the model_train.py module) where the Kubeflow Pipelines workflows for data extraction, model training, model inference, and model retraining are defined
  • src — the main.py and requirements.txt code for your Cloud Functions
  • terraform — Terraform code for Artifact Registry, Cloud Functions, Cloud Scheduler, and Vertex AI Workbench notebooks, etc.
  • tests — Python unit tests for the methods implemented in the component

Step 6: Define your IAM roles, permissions, and service accounts.

A service account has to be created for running machine learning pipelines. This account will have access to all the underlying Google Cloud services used for MLOps.

As with building your infrastructure, it is a leading practice to provision all of the roles and permissions using Terraform.

Step 7: Create a production-ready run.

In a production MLOps solution, ML pipelines need to be repeatable. We will create a Cloud Function to trigger the execution of ML pipelines on Vertex AI. This can be done either using a schedule (via Cloud Scheduler) or using event processing.

In this example, we use Cloud Build to compile the pipelines using the Kubeflow Pipelines software development package and publish them to a Google Cloud Storage bucket. A Cloud Function retrieves the pipeline definition from the bucket and triggers an execution of the pipeline in Vertex AI.
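
Below is a minimal sketch of such a Cloud Function; the function name, bucket paths, and deployment details are assumptions rather than the article’s exact code, and requirements.txt would only need google-cloud-aiplatform.

```python
# Hedged sketch: Cloud Function (main.py) that triggers a Vertex AI pipeline run
# from a compiled pipeline definition stored in Cloud Storage. All names are placeholders.
from google.cloud import aiplatform

PROJECT = "my-project"
REGION = "us-central1"
PIPELINE_ROOT = "gs://my-bucket/pipeline-root"
TEMPLATE_PATH = "gs://my-bucket/pipelines/model_train_pipeline.json"  # published by Cloud Build


def trigger_pipeline(request):
    """HTTP entry point, invoked by Cloud Scheduler or an event-driven caller."""
    aiplatform.init(project=PROJECT, location=REGION)

    job = aiplatform.PipelineJob(
        display_name="scheduled-model-train",
        template_path=TEMPLATE_PATH,   # compiled pipeline JSON in Cloud Storage
        pipeline_root=PIPELINE_ROOT,
        enable_caching=False,          # always retrain on fresh data
    )
    job.submit()
    return "Pipeline run submitted", 200
```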

Continuous training pipeline

Step 8: Set up logging and failure notifications.

Set up alerts in Cloud Logging so that you and your support team are notified of any pipeline failures.

Step 9: Establish continuous monitoring.

Once the ML model has been deployed, it should be monitored to ensure it performs as expected. If a model suddenly starts returning errors or running unexpectedly slowly, you’ll want to know (and be able to fix it) immediately, ideally before your end users notice.

Model monitoring is a continuous process; therefore, it is important to identify the critical elements to monitor for performance and reliability.

You should create a model monitoring strategy before deploying the model to production. Define the skew and drift thresholds for notifications and alerts, and deploy the monitoring job on Vertex AI. Examine your model monitoring data from the Cloud Console.
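
As a rough sketch, a monitoring job with skew and drift thresholds might be deployed against an existing endpoint as shown below; the endpoint, training data source, feature names, thresholds, and email address are all placeholder assumptions.

```python
# Hedged sketch: deploy a Vertex AI model monitoring job with skew/drift thresholds.
from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint("projects/123/locations/us-central1/endpoints/456")

skew_config = model_monitoring.SkewDetectionConfig(
    data_source="bq://my-project.my_dataset.training_data",  # training baseline
    target_field="churned",
    skew_thresholds={"avg_purchase_value": 0.3, "days_since_last_login": 0.3},
)
drift_config = model_monitoring.DriftDetectionConfig(
    drift_thresholds={"avg_purchase_value": 0.3, "days_since_last_login": 0.3},
)
objective_config = model_monitoring.ObjectiveConfig(skew_config, drift_config)

monitoring_job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="churn-model-monitoring",
    endpoint=endpoint,
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.8),
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=1),  # hours
    alert_config=model_monitoring.EmailAlertConfig(user_emails=["team@example.com"]),
    objective_configs=objective_config,
)
```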

When your model needs to be re-trained on new data, you can run the re-training pipeline. If the model needs changes to architecture, features, etc., simply deploy and re-execute the new ML pipelines using CI/CD.

Step 10: Track your metadata.

The metadata store, Vertex ML Metadata, keeps track of all the metadata associated with Vertex Pipeline runs and allows you to query the parameters and inputs/outputs of steps of past pipeline runs.

You can record the metadata and artifacts produced by your ML system and query that metadata to help analyze, debug, and audit the performance of your ML system or the artifacts that it produces.
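
For example, a hedged sketch of pulling past pipeline run parameters and metrics into a DataFrame with the aiplatform SDK (the pipeline name is a placeholder):

```python
# Hedged sketch: query metadata of past pipeline runs from Vertex ML Metadata.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# One row per run, with its parameters and metrics recorded in the metadata store.
runs_df = aiplatform.get_pipeline_df(pipeline="model-train-pipeline")
print(runs_df.head())
```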

Summary

In this article, we covered the steps to build a production-ready MLOps framework using Vertex AI.

Slalom is a global consulting firm that helps people and organizations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.
