Vertex Pipelines (Kubeflow) for Machine Learning Model Training Pipeline

河本直起 (Naoki Komoto)
AnyMind Group
Jul 7, 2022

Hi, I’m Naoki Komoto (河本 直起), working as a Machine Learning Engineer at AnyMind.

At AnyMind, we are developing our MLOps environment, including the model training pipeline, from scratch.

In a previous article, I introduced our data pipeline using Cloud Composer (Airflow), including its current setup and future plans.

In this article, I’d like to introduce our ML model training pipeline using Vertex Pipelines (Kubeflow).

Background / Problems

First, let me explain the background and the problems we faced before introducing Vertex Pipelines (Kubeflow).

Individualization of Model Training and Model Serving Code

Previously at AnyMind, each project had its own model training and serving method, depending on the developer. For example, in one project a model was trained in a notebook environment and released to a repository, while in another a single batch job handled the entire training. In addition, there was no functionality to schedule model training, or to consistently train models, deploy them to the API, and perform batch prediction. Each of these steps was performed in its own way, including manual execution.

Therefore, we needed to prepare a common model training infrastructure with the necessary functionality, to reduce this project-by-project individualization and lower the cost of operation and improvement.

Need for Workflow Processing

AnyMind’s service is offered across languages and countries. At the same time, for example, one of our platforms, AnyTag, needs to provide the same functions across multiple SNSs, such as Instagram and YouTube. Different SNSs have different features, the same features need to be processed differently for each language, and models need to be created for each country.

That is why the model training process for a single machine learning function requires workflow processing such as branching and parallel execution. However, the previous environment could not support such processing.

Time Required for Experimentation and for Rewriting Experimentation Code

Given the above background, model experimentation required multiple patterns of processing, which made the experimentation process time-consuming to execute.

Another problem that commonly arises in development involving machine learning models is the need to rewrite the code used for experimentation for production use once the experimentation is complete. In addition to the execution time of the experimentation process, the time required for this rewriting was also a bottleneck.

Why Vertex Pipelines (Kubeflow) ?

Vertex Pipelines (Kubeflow) was well suited to the above problems because of the following features:

  • The ability to develop the model training and serving process as a workflow
  • Reusability of code in the form of components
  • Reuse of cached results from a previous run when a component is executed with the same inputs

Additionally, at AnyMind we develop and manage multiple ML functions with a small team. Because Vertex Pipelines (Kubeflow) is a fully managed service, it requires less specialized knowledge and fewer resources to maintain. That is another reason we chose Vertex Pipelines (Kubeflow).

On the other hand, some requirements are not satisfied by Vertex Pipelines (Kubeflow). For example, Kubeflow’s current Python SDK (kfp v2) does not provide an interface for scheduled execution. However, because it turned out to be easy to build this ourselves using Google Cloud Platform’s Python APIs, we decided to develop it on our own.

Implementation Details

In the following, I will describe how we actually implemented the system.

Deployment Unit Isolation

Currently, the model training pipeline has a separate repository for each project, and components are also created on a per-project basis. The reason is that there are not yet many processes common to all projects, so the advantages of sharing them are small, while best practices for implementation have not yet been established, so the risk of premature commonization is high.

By storing the model training process and the prediction process in the same repository, common processes such as pre-processing can be shared between them. This has allowed us to resolve processing discrepancies between the model training process and the prediction process.

Building Component

In Kubeflow, there are two ways to write components:

  • Defining the interface in a YAML file and the process in a Python file
  • Defining both the interface and the process as a Python function

All project-specific components, such as model creation, are written in the former way. That is because it keeps the separation between images and directories clear and makes it easy to reuse components between projects.

On the other hand, components that depend on the pipeline itself, such as a component that waits for a previous job’s completion or one that gets the execution time, are written in the latter way. Those components are placed in the same directory as the pipeline and executed on the same image as the pipeline.
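As an illustration of the two styles, the sketch below shows a hypothetical project-specific component loaded from its YAML definition and a hypothetical pipeline-dependent component defined as a Python function; the names and paths are made up and are not taken from our actual code base.

import kfp
from kfp.v2.dsl import component

# Former style: the interface is defined in a YAML file and the process in a
# Python file packaged into the component's own container image.
# (The path below is hypothetical.)
train_model_op = kfp.components.load_component_from_file(
    "components/train_model/component.yaml"
)

# Latter style: a pipeline-dependent component defined as a Python function,
# executed on the same image as the pipeline.
@component(base_image="python:3.9")
def get_execution_time() -> str:
    from datetime import datetime
    return datetime.utcnow().strftime("%Y%m%d%H%M%S")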

Component images are built by Cloud Build, and the built images are stored in Container Registry.
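As a rough, hypothetical illustration of this step only (not our actual build configuration; the project, source, and image names are made up), an equivalent Docker build-and-push could be submitted through the Cloud Build Python client like this:

from google.cloud.devtools import cloudbuild_v1

def build_component_image(project_id, source_bucket, source_object, image_uri):
    # Submit a Docker build-and-push for a single component image.
    # The component's source is assumed to be uploaded as a tarball to GCS.
    client = cloudbuild_v1.CloudBuildClient()
    build = cloudbuild_v1.Build(
        source=cloudbuild_v1.Source(
            storage_source=cloudbuild_v1.StorageSource(
                bucket=source_bucket, object_=source_object
            )
        ),
        steps=[
            cloudbuild_v1.BuildStep(
                name="gcr.io/cloud-builders/docker",
                args=["build", "-t", image_uri, "."],
            )
        ],
        images=[image_uri],  # push the built image to Container Registry
    )
    return client.create_build(project_id=project_id, build=build)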

Building Pipeline

The pipeline building process is written in a Python file together with the pipeline’s workflow definition and parameters.

The building process consists of the following steps:

  • Overwrite the component definition files
  • Compile the pipeline
  • Upload the compiled pipeline and its parameters to Cloud Storage
  • For one-time execution, publish a message to the Cloud Pub/Sub topic for pipeline job creation
  • For scheduled execution, publish a message to the Cloud Pub/Sub topic for Cloud Scheduler creation

The details of these steps are described later.

The following is a simplified sketch of the code for building the pipeline:
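In this sketch, the pipeline module, project, bucket, topic, and message field names are hypothetical, and the actual implementation differs in detail; it only illustrates the compile, upload, and publish flow described above.

import json

from google.cloud import pubsub_v1, storage
from kfp.v2 import compiler

from pipeline import pipeline  # hypothetical module containing the @dsl.pipeline function

PROJECT_ID = "my-project"              # hypothetical
BUCKET = "my-pipeline-bucket"          # hypothetical
JOB_TOPIC = "create-pipeline-job"      # hypothetical
SCHEDULER_TOPIC = "create-scheduler"   # hypothetical


def build(version: str, parameters: dict, scheduled: bool = False):
    # 1. (Component definition files are overwritten here; see the next section.)

    # 2. Compile the pipeline into a job spec file.
    package_path = f"pipeline_{version}.json"
    compiler.Compiler().compile(pipeline_func=pipeline, package_path=package_path)

    # 3. Upload the compiled pipeline and its parameters to Cloud Storage.
    bucket = storage.Client(project=PROJECT_ID).bucket(BUCKET)
    pipeline_blob = f"pipelines/{version}/pipeline.json"
    params_blob = f"pipelines/{version}/parameters.json"
    bucket.blob(pipeline_blob).upload_from_filename(package_path)
    bucket.blob(params_blob).upload_from_string(json.dumps(parameters))

    # 4. Publish a message that triggers pipeline job creation, or Cloud
    #    Scheduler creation for scheduled execution.
    publisher = pubsub_v1.PublisherClient()
    topic = SCHEDULER_TOPIC if scheduled else JOB_TOPIC
    message = {
        "pipeline_path": f"gs://{BUCKET}/{pipeline_blob}",
        "parameters_path": f"gs://{BUCKET}/{params_blob}",
        "version": version,
    }
    publisher.publish(
        publisher.topic_path(PROJECT_ID, topic),
        data=json.dumps(message).encode("utf-8"),
    ).result()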

The above process is executed from Cloud Build.

Overwriting Component Definition Files

The component definition file contains the Container Registry path of the execution image, but components are deployed to separate projects for each environment, such as staging and production.

Therefore, the project ID is overwritten during the pipeline building process. The release version is also reflected at this time.

Specifically, the following step is added to the pipeline building process described above.
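As a minimal sketch, assuming the image field of each component.yaml contains placeholders for the project ID and release version (the placeholder scheme and directory layout below are hypothetical), the overwrite could look like this:

import glob
import yaml

def overwrite_component_definitions(project_id: str, version: str):
    # Rewrite the execution image of every component definition so that it
    # points at the Container Registry of the target environment's project
    # and at the image tagged with the current release version.
    for path in glob.glob("components/*/component.yaml"):
        with open(path) as f:
            definition = yaml.safe_load(f)
        image = definition["implementation"]["container"]["image"]
        # e.g. "gcr.io/{project_id}/train-model:{version}" -> concrete values
        definition["implementation"]["container"]["image"] = image.format(
            project_id=project_id, version=version
        )
        with open(path, "w") as f:
            yaml.safe_dump(definition, f)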

Creating Pipeline Jobs

The process of creating pipeline jobs is separated out into Cloud Functions and is triggered by publishing a message to a Cloud Pub/Sub topic.

The message contains the following information:

  • The Cloud Storage path where the compiled pipeline file is saved
  • The Cloud Storage path where the pipeline parameters are saved
  • Other information such as the release version

When a message with this information is published to the Cloud Pub/Sub topic, Cloud Functions creates a pipeline job based on it.

The process inside Cloud Functions is as follows:
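A minimal sketch of such a function, assuming a Pub/Sub-triggered Cloud Function and the hypothetical message fields used in the earlier sketch (the project, location, and display name below are also made up), might look like this:

import base64
import json

from google.cloud import aiplatform, storage

def create_pipeline_job(event, context):
    # Pub/Sub-triggered Cloud Function: decode the message published by the
    # build process and create a Vertex AI pipeline job from it.
    message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Load the pipeline parameters that were uploaded to Cloud Storage.
    bucket_name, blob_name = message["parameters_path"][len("gs://"):].split("/", 1)
    parameters = json.loads(
        storage.Client().bucket(bucket_name).blob(blob_name).download_as_text()
    )

    aiplatform.init(project="my-project", location="asia-northeast1")  # hypothetical
    job = aiplatform.PipelineJob(
        display_name=f"training-pipeline-{message['version']}",
        template_path=message["pipeline_path"],   # compiled pipeline on Cloud Storage
        parameter_values=parameters,
        enable_caching=True,
    )
    job.submit()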

Creating a Scheduler for Periodic Execution

Pipeline jobs can be created periodically using Cloud Scheduler. The Cloud Scheduler job holds the information needed to create the pipeline job described above as its message, and publishes it to the same Cloud Pub/Sub topic on a set schedule.

The Cloud Scheduler job itself is created by Cloud Functions, and this is also triggered through Cloud Pub/Sub, in the same way as pipeline job creation. In other words, a Cloud Function first creates the scheduler, and then, on each scheduled tick, the scheduler publishes a message that triggers pipeline job creation through the same Pub/Sub topic and Cloud Function.

The Cloud Scheduler job is created roughly as follows:
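This sketch uses the Cloud Scheduler client library; the job name, cron schedule, time zone, and topic are hypothetical.

import json
from google.cloud import scheduler_v1

def create_scheduler(project_id: str, location: str, job_id: str,
                     schedule: str, topic: str, message: dict):
    # Create a Cloud Scheduler job that publishes the pipeline job creation
    # message to the Pub/Sub topic on the given cron schedule.
    client = scheduler_v1.CloudSchedulerClient()
    parent = f"projects/{project_id}/locations/{location}"
    job = scheduler_v1.Job(
        name=f"{parent}/jobs/{job_id}",
        schedule=schedule,                  # e.g. "0 0 * * *"
        time_zone="Asia/Tokyo",             # hypothetical
        pubsub_target=scheduler_v1.PubsubTarget(
            topic_name=f"projects/{project_id}/topics/{topic}",
            data=json.dumps(message).encode("utf-8"),
        ),
    )
    return client.create_job(parent=parent, job=job)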

Overall Build Process

To summarize what has been described so far, the overall build process for the periodic execution of a pipeline is as follows:

Cloud Build is submitted from CircleCI, where information on the target environment (staging, production, etc.) is passed as a substitution depending on the branch. The subsequent build process uses the information passed to it to separate values by environment, for example by obtaining the config file for the corresponding environment.

Connection with Other Workflows

Below I describe how the model training pipeline connects to the data pipeline and batch prediction.

Data Creation Workflow

At AnyMind, the data creation process is isolated in Cloud Composer (Airflow). For this reason, the data creation pipeline creates a flag file with time information after it completes processing, and the model training pipeline has a component that waits until the flag is created and completes once the creation is confirmed. After this component completes, the following components are executed.
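As a simplified sketch, assuming the flag is an object on Cloud Storage (the bucket, path, and timeout below are hypothetical), such a waiting component could be written as a Python-function-based component like this:

from kfp.v2.dsl import component

@component(base_image="python:3.9", packages_to_install=["google-cloud-storage"])
def wait_for_data_creation(bucket: str, flag_path: str, timeout_minutes: int = 120):
    # Poll Cloud Storage until the data creation pipeline has written the
    # flag file for the target execution time, then let downstream
    # components start.
    import time
    from google.cloud import storage

    blob = storage.Client().bucket(bucket).blob(flag_path)
    deadline = time.time() + timeout_minutes * 60
    while time.time() < deadline:
        if blob.exists():
            return
        time.sleep(60)
    raise TimeoutError(f"flag gs://{bucket}/{flag_path} was not created in time")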

Batch Inference Creation Workflow

Batch prediction with trained models and registration of the results to the data store are handled by a separate pipeline job on Vertex Pipelines (Kubeflow). This is because the frequency of model training and the frequency of batch prediction may differ; for example, a model may be updated once a month while batch prediction is performed once a day.

As in the data creation process, the model training pipeline creates flags with time information after its overall process is complete, and the batch prediction pipeline has a component that waits until those flags have been created. Batch prediction is therefore executed only after all models have been trained.

Summary

In this article, I have introduced our implementation of Vertex Pipelines (Kubeflow) as part of the machine learning system at AnyMind. I hope you find it useful.
