Part IV: Operationalize and Accelerate the ML Process with Google Cloud AI Pipelines

Introduction

For a data scientist, building ML models is largely about tasks such as feature engineering and model algorithms, mostly programming work in Python or a similar language. In reality, operationalizing ML and AI requires end-to-end components, from data sources all the way through to serving and monitoring models.

Google Cloud AI Pipelines combines Kubeflow Pipelines and the TensorFlow Extended (TFX) framework to enable robust deployment of ML pipelines along with auditing and monitoring. As part of Google AI Platform, AI Pipelines lets developers rapidly deploy multiple models and pipelines by leveraging reusable pipeline components.

Key benefits of GCP AI Pipelines:

  • Push-button installation of a Kubernetes cluster for ML pipelines
  • Enterprise ML deployment with logging and monitoring
  • Automatic tracking of metadata and artifacts
  • Reusable pipeline templates and pipeline components
  • Integration with GCP services like BigQuery, Dataflow, AI Platform, and many others

For more background, see the AI Pipelines documentation and the introduction blog post.

Architecture

AI Platform Pipelines makes it easier to get started with MLOps by saving you the difficulty of setting up Kubeflow Pipelines with TensorFlow Extended (TFX). Kubeflow Pipelines is an open source platform for running, monitoring, auditing, and managing ML pipelines on Kubernetes. TFX is an open source project for building ML pipelines that orchestrate end-to-end ML workflows.

AI Pipeline Template

With AI Platform Pipelines, you can set up a Kubeflow Pipelines cluster in about 15 minutes, so you can quickly get started with ML pipelines. AI Platform Pipelines also creates a Cloud Storage bucket, which makes it easier to run the pipeline tutorials and get started with the TFX pipeline templates.

Let's deploy the default AI pipeline template that comes with GCP AI Platform by setting up our Kubernetes cluster and Jupyter notebook environment.

From the GCP Console, go to AI Platform → Pipelines to open the AI Platform Pipelines page.

If you do not have a cluster created yet, click New Instance to create a Kubernetes cluster.

Click on Configure to continue.

Important: For a new cluster, select the zone us-central1-a and make sure to check the ‘Allow access to Cloud APIs’ box. Click Create Cluster, wait for the cluster to be created, and then deploy the app instance.

You can see your pipeline cluster under AI Platform → Pipelines. See the documentation for more detail on setting up AI Platform Pipelines. The cluster's Settings page shows the cluster endpoint URL, which you will need for pipeline development in the next section.

Click OPEN PIPELINE DASHBOARD on the pipeline cluster information page.

Open a TF 2.1 notebook to see the pipeline template.

Continue by opening the notebook instance created for you, or create a different notebook environment of your choice as needed. Then open JupyterLab.

Using the AI Pipeline Template Example

An example notebook for creating a TFX-based ML pipeline is available from AI Hub to get you started quickly. You can follow the template instructions to deploy an example ML model pipeline for taxi tip classification.

The TFX template is organized as a set of reusable components, which makes deploying an ML model pipeline nearly plug-and-play: you can swap out the data source, the model algorithm, or the feature set. Let's review the template components. The screenshot below shows an illustrative workflow run on Kubeflow Pipelines; it is just one example, and users can author and run many different workflow topologies with different code and tools in the various steps.

Using the TensorFlow Extended (TFX) template, you can quickly update or modify components to fit your use case and create a new pipeline.

The template pipeline provides the following code structure, which you can quickly modify for your own ML application. We will use the same template structure to deploy the Covid-19 ML pipeline.

  1. template.ipynb — a notebook describing the steps to set up the environment and build a pipeline.
  2. kubeflow_dag_runner.py — defines the runner for the Kubeflow orchestration engine.
  3. pipeline.py — defines the TFX components and the pipeline itself (see the sketch after the utility list below).
  4. configs.py — defines common constants for the pipeline runners.
  5. features.py — defines constants that are common across models, including feature names, the label, and vocabulary sizes.
  6. hparams.py — defines hyperparameters for model training.
  7. model.py — defines the model (a tf.estimator) using the features and hyperparameters.
  8. preprocessing.py — defines the preprocessing of features.

In addition to the modularized pipeline structure above, the template also provides a few test utilities:

  1. features_test.py — write and run tests for your features
  2. model_test.py — test and evaluate your model
  3. preprocessing_test.py — write and test your preprocessing functions
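
To give a sense of how these files fit together, here is a minimal sketch of a pipeline definition in the spirit of the template's pipeline.py. It is illustrative only: it loosely follows the TFX 0.2x-era API (the TF 2.1 generation), the argument values are placeholders, and the real template file wires in more components and configuration.

# Minimal, illustrative sketch of a TFX pipeline definition (not the actual template file).
# Import paths and component arguments vary between TFX versions.
from tfx.components import CsvExampleGen, Pusher, SchemaGen, StatisticsGen, Trainer
from tfx.orchestration import pipeline
from tfx.proto import pusher_pb2, trainer_pb2
from tfx.utils.dsl_utils import external_input


def create_pipeline(pipeline_name, pipeline_root, data_path, serving_model_dir):
    # Ingest the raw data (Section 6 below shows swapping this for BigQueryExampleGen).
    example_gen = CsvExampleGen(input=external_input(data_path))
    # Compute dataset statistics and infer a schema from them.
    statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
    # Train the model defined in model.py, using the constants from features.py and hparams.py.
    trainer = Trainer(
        module_file='model.py',
        examples=example_gen.outputs['examples'],
        schema=schema_gen.outputs['schema'],
        train_args=trainer_pb2.TrainArgs(num_steps=1000),
        eval_args=trainer_pb2.EvalArgs(num_steps=100))
    # Export the trained model to the serving directory for deployment.
    pusher = Pusher(
        model=trainer.outputs['model'],
        push_destination=pusher_pb2.PushDestination(
            filesystem=pusher_pb2.PushDestination.Filesystem(
                base_directory=serving_model_dir)))
    return pipeline.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen, trainer, pusher])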

You can test the default template by following the notes within the notebook. For additional information on working with templates and sample pipelines, you can use these examples.

Deploying the Covid-19 Pipeline Template

Make sure you have created the pipeline cluster as described in the earlier sections and have launched the JupyterLab environment.

Create a new notebook under the /AIHub/ folder and run the following command to bring in the demo/template notebook for peptide prediction.

!git clone https://github.com/testpilot0/covid.git

Once the command has cloned the git repository into your notebook environment, you will see the covid folder and the lab folder under /covid/lab.

Select and open the lab4 notebook to follow the detailed instructions for working through the peptide prediction pipeline. For more detailed steps, please see this workbook. Let's examine the key sections of the notebook. Open the demo pipeline you just copied and read through the instructions for each step as you progress:

Step 1: Set up your environment

Execute the commands to set the TensorFlow Extended (TFX) version and validate the setup. Since the notebook environment already launched with Python 3, you can ignore the errors described in this section. Key checkpoints (sketched after the list below):

  1. Validate TFX version
  2. Validate your GCP project name
  3. Update your Kubeflow cluster ENDPOINT variable (required).
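
As a rough sketch, the setup cells look something like the following. The install line, project ID, and endpoint URL are placeholders and will differ in your environment; copy the real endpoint from the cluster's Settings page.

# Illustrative notebook cells for Step 1 (all values are placeholders).
!pip install --user --upgrade tfx kfp      # install TFX and the Kubeflow Pipelines SDK

import tfx
print('TFX version: {}'.format(tfx.__version__))   # checkpoint 1: validate the TFX version

GOOGLE_CLOUD_PROJECT = 'your-gcp-project-id'       # checkpoint 2: validate your GCP project name
ENDPOINT = 'https://your-cluster-id.pipelines.googleusercontent.com'   # checkpoint 3: Kubeflow cluster endpoint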

Step 2: Copy the predefined pipeline template for peptide prediction

This step creates a working directory and copies the required Python files for the pipeline. Make sure to create a folder under /AIHub/ and run the required commands. Key checkpoints (sketched after the list below):

  1. Provide a pipeline name; you can keep the default name for the demo.
  2. Create a folder with the same name as your pipeline name.
  3. Run the command to copy the template files into your demo folder.
  4. Validate your current working directory.
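
The notebook's commands for this step are along the following lines. The pipeline name and paths are illustrative, and the exact copy command in the lab notebook may differ.

# Illustrative notebook cells for Step 2 (names and paths are placeholders).
import os

PIPELINE_NAME = 'covid_peptide_pipeline'    # checkpoint 1: the default name is fine for the demo
PROJECT_DIR = os.path.join(os.path.expanduser('~'), 'AIHub', PIPELINE_NAME)   # checkpoint 2: folder named after the pipeline

!mkdir -p {PROJECT_DIR}
!cp -r ~/covid/lab/* {PROJECT_DIR}/         # checkpoint 3: bring the template files into your demo folder
%cd {PROJECT_DIR}
!pwd                                        # checkpoint 4: validate your current working directory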

Step 3: Validate your template files

This step gives a brief introduction to each file, lists the files to validate, and runs a small test file. Key checkpoints (sketched after the list below):

  1. List the template files
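
For example, a couple of cells like these list the copied files and run one of the small test files; the exact test invocation in the lab notebook may differ.

!ls                         # list the template files copied in Step 2
!python features_test.py    # run a small test file to validate the setup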

Step 4: Run your Peptide Prediction Pipeline Demo

Update the configuration file to reference your GCS bucket, where the pipeline output and exported model will be stored. Key checkpoints (sketched after the list below):

  1. Update the GCS bucket name in the config file (required).
  2. Validate the other variables in the config file.
  3. Run and test your pipeline.
  4. Check the pipeline on the AI Platform pipeline dashboard.
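
A rough sketch of this step is shown below. The bucket and image names are placeholders, ENDPOINT and PIPELINE_NAME come from the earlier steps, and the exact TFX CLI flags can vary slightly between versions.

# In configs.py: point the pipeline output at your GCS bucket (checkpoint 1; placeholder name).
GCS_BUCKET_NAME = 'your-gcp-project-id-kubeflowpipelines-default'

# Back in the notebook: build the pipeline and start a run on your Kubeflow cluster (checkpoints 3 and 4).
CUSTOM_TFX_IMAGE = 'gcr.io/your-gcp-project-id/covid_peptide_pipeline'   # placeholder container image
!tfx pipeline create --pipeline_path=kubeflow_dag_runner.py --endpoint={ENDPOINT} --build_target_image={CUSTOM_TFX_IMAGE}
!tfx run create --pipeline_name={PIPELINE_NAME} --endpoint={ENDPOINT}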

Run the pipeline experiment from the notebook and review its execution in the pipeline dashboard's Experiments section. While the run is executing, you can monitor the experiment's progress on the dashboard as well.

Click on the experiment to see the details of each step of the pipeline. The following chart displays a full execution of the pipeline. Your template/demo pipeline has many components, which you can add or remove by editing the pipeline.py file.

Click on components such as StatisticsGen and Evaluator to see details of the data distribution and model performance.

Section 5 [optional]: Manage components of your pipeline

You can add or remove data validation components, including StatisticsGen, SchemaGen, and ExampleValidator. If you are interested in data validation, please see Get Started with TensorFlow Data Validation.
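
As an illustration, these data validation components are chained together in pipeline.py roughly as follows, using the same TFX-era API as the earlier sketch; example_gen refers to the ExampleGen component already defined in the pipeline.

# Illustrative wiring of the data validation components inside pipeline.py.
from tfx.components import ExampleValidator, SchemaGen, StatisticsGen

statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
# Add or remove these from the components list passed to pipeline.Pipeline(...).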

Section 6 [optional]: BigQueryExampleGen

Your demo pipeline is set up with sample data provided as a CSV file. In practice, you will want to bring in data from Cloud Storage or, more likely, from BigQuery, where you may have large datasets. Use this section to learn how to replace the CsvExampleGen component with BigQueryExampleGen. Make sure to validate the configuration requirements as described in this section of the notebook. You can leverage the full epitope datasets from BigQuery's public datasets or from your own project. Key checkpoints (the first two are sketched after the list below):

  1. Update the pipeline.py file to use BigQueryExampleGen instead of CsvExampleGen.
  2. Set the project variables in the config file to the project that holds your BigQuery dataset.
  3. Validate the BigQuery query arguments in the config file.
  4. Update the Kubeflow runner file to enable the query parameters.
  5. Update the features if you plan to use more or fewer of the dataset's attributes.
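
The first two checkpoints look roughly like this. The import path varies by TFX version (some versions expose the component under tfx.extensions.google_cloud_big_query instead of tfx.components), and the query, project, and bucket names are placeholders.

# pipeline.py: ingest examples from BigQuery instead of the CSV file.
from tfx.components import BigQueryExampleGen

query = 'SELECT * FROM `your-project.your_dataset.epitope_table`'   # placeholder query
example_gen = BigQueryExampleGen(query=query)

# kubeflow_dag_runner.py / configs.py: BigQuery reads run on Beam, so the runner
# needs the project and a temp location passed as Beam pipeline arguments (checkpoint 4).
beam_pipeline_args = [
    '--project=your-gcp-project-id',
    '--temp_location=gs://your-bucket/tmp',
]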

Deploy Model to Serve Prediction

Section 7 [optional]: Deploy model for prediction

This section shows how to use AI Platform for training and then deploy the model for serving. Optionally, you can also deploy a model directly through the AI Platform Models dashboard. (Note: at the time of writing, a fix is in progress for a bug in the API call that builds the model from the notebook.)

The image below shows a running AI Platform job that trains and deploys a model. This is particularly useful when you want customized compute and a large dataset for training. Once the job is completed, you can see the deployed model on the Models tab.

By default, your model will be saved in the tfx_pipeline_output folder of your GCS bucket. Let's deploy a version of the model to serve predictions.

Click Create Model on the Models page of the AI Platform dashboard.

Click Create Version for the model; here we can use a saved model from GCS and deploy it as one of the versions of this model. You can deploy multiple versions as needed.

Make sure to select TensorFlow version 2.1 for both the framework and the runtime.

To select a saved model, choose the parent directory in GCS from your model output directory → serving_dir → the folder that contains the saved_model.pb file.

Deploy the model; it is now ready to serve. Click on the deployed version and you can test it with online or batch prediction. More details on the deployed model can be found here.
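
If you prefer the command line to the console, the same deployment can be sketched with gcloud as follows. The model name, bucket, path, and input file are placeholders; the --origin value must point at the folder that contains the saved_model.pb file.

%%bash
# Create a model resource and deploy the SavedModel from GCS as version v1 (placeholders throughout).
gcloud ai-platform models create covid_peptide_model --regions=us-central1
gcloud ai-platform versions create v1 \
  --model=covid_peptide_model \
  --origin=gs://your-bucket/path/to/serving_dir/model_folder \
  --runtime-version=2.1 \
  --framework=tensorflow \
  --python-version=3.7
# Test the deployed version with an online prediction request.
gcloud ai-platform predict --model=covid_peptide_model --version=v1 --json-instances=sample_input.json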

Section 8 [optional]: Bring your own model and data to the pipeline

We built a pipeline for a model using the sample epitope dataset. Now it's time to put your own data into the pipeline. Your data can be stored anywhere your pipeline can access, including GCS or BigQuery. You will need to modify the pipeline definition to access your data. Key checkpoints (sketched after the list below):

  1. If your data is stored in files, modify the DATA_PATH in kubeflow_dag_runner.py or beam_dag_runner.py and set it to the location of your files. If your data is stored in BigQuery, modify BIG_QUERY_QUERY in configs.py to correctly query for your data.
  2. Add features in features.py.
  3. Modify preprocessing.py to transform input data for training.
  4. Modify model.py and hparams.py to describe your ML model.
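
For instance, the edits might look like the following. The constant names DATA_PATH and BIG_QUERY_QUERY come from the template files listed above, while the values and feature names are placeholders for your own data (the actual constant names in features.py may differ).

# kubeflow_dag_runner.py or beam_dag_runner.py: point at your own data files (placeholder path).
DATA_PATH = 'gs://your-bucket/your-dataset/'

# configs.py: if your data lives in BigQuery, adjust the query instead (placeholder query).
BIG_QUERY_QUERY = 'SELECT * FROM `your-project.your_dataset.your_table`'

# features.py: declare the feature and label columns your model will use (placeholder names).
FEATURE_KEYS = ['feature_a', 'feature_b']
LABEL_KEY = 'label'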

Step 9: Cleaning Up Resources

To clean up all Google Cloud resources used in this demo project, you can delete the Google Cloud project you used for the tutorial.
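
If you created a dedicated project for this demo, the cleanup is a single command. Note that deleting a project is irreversible, and the project ID below is a placeholder.

!gcloud projects delete your-demo-project-id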

Conclusion and Future Work

We are working on adding new components to complement this pipeline. We will add modules to model virus mutations and predict new potential peptide vaccine candidates if the virus mutates into a new strain. We are also working on a drug design pipeline that will perform ligand screening based on the 3D structure of the target virus protein.

I don’t speak for my employer. This is not official Google work. Any errors that remain are mine, of course.
