VertexAI custom training deployment
Vertex AI is the unified ML platform in Google Cloud, similar to SageMaker in AWS.
When I first looked at VertexAI, I missed a quick visual representation with an overview of how custom training works. For instance, it was not trivial to understand that most of the code should run in a different project managed by Google Cloud, or how to report the Lineage/Metadata to VertexAI.
The aim is to show it here, augmented with a short explanation.
Overview
First day — how we see custom training in VertexAI
A VertexAI VM is launched in Compute Engine each time we open a new notebook in the Workbench UI. We can train inside the Jupyter notebook running on this VM, or use Google Cloud resources to train the model. This walkthrough assumes the latter scenario.
Steps:
- The code in the notebook creates a VertexAI dataset, which is a wrapper over a Cloud Storage location or a BigQuery relation.
- We use the `%%writefile` Jupyter magic to generate a `train.py` training script. That’s what most online examples show, but you’d be better off using Git in the VM or uploading your training code to Cloud Storage. Better still, package the code as a Python source distribution.
- Create the training pipeline.
- Run it with the specific `dataset_id` from step 1. What happens here is shown next.
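The steps above can be sketched with the `google-cloud-aiplatform` SDK. This is a minimal, hypothetical example: the project, bucket, dataset path, display names, and machine type are illustrative assumptions, and the prebuilt container URI should be checked against the current list of Vertex AI training images.

```python
def run_custom_training(project, location, bucket, script="train.py"):
    """Sketch: create a VertexAI tabular dataset and launch a custom
    training job on it. Requires GCP credentials, so the import is lazy."""
    from google.cloud import aiplatform

    # Point the SDK at our project; staging_bucket holds packaged code.
    aiplatform.init(project=project, location=location,
                    staging_bucket=f"gs://{bucket}")

    # Step 1: a VertexAI dataset wrapping a Cloud Storage CSV location.
    ds = aiplatform.TabularDataset.create(
        display_name="demo-dataset",
        gcs_source=f"gs://{bucket}/data/train.csv",
    )

    # Steps 2-3: wrap train.py into a training job; the SDK packages the
    # script and stages it for the GCP-managed training project.
    job = aiplatform.CustomTrainingJob(
        display_name="demo-training",
        script_path=script,
        container_uri=(
            "us-docker.pkg.dev/vertex-ai/training/"
            "sklearn-cpu.1-0:latest"  # assumed prebuilt image, verify
        ),
        requirements=["scikit-learn"],
    )

    # Step 4: run the job against the dataset from step 1.
    model = job.run(
        dataset=ds,
        args=["--epochs", "5"],  # forwarded to train.py on each worker
        replica_count=1,
        machine_type="n1-standard-4",
    )
    return model
```

The `run` call blocks until the managed job finishes in the GCP-side project, which is exactly the flow the diagram below describes.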
How ML architects see custom training in VertexAI
User-managed notebook deployment example. No networking specifics are shown in the diagram:
VertexAI offers managed training: the job runs in a separate project managed by GCP. Once we run the training job in step 4, it enters a job queue in that GCP-managed project.
GCP kicks off the job with the `train.py` script or Python source distribution provided in step 2. Command-line arguments are passed to each worker, as described in the Job object. The data locations that were wrapped by the VertexAI dataset object are passed to the training script on each worker via environment variables.
Once the job completes, it typically writes the results onto GCS in your project.
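Inside `train.py`, those injected locations arrive as `AIP_*` environment variables. A small helper like the following (the function name is mine, but the variable names are the ones Vertex AI sets) makes the contract explicit:

```python
import os


def get_data_locations(env=None):
    """Collect the dataset locations and output path that Vertex AI
    injects into each training worker's environment."""
    if env is None:
        env = os.environ
    return {
        "format": env.get("AIP_DATA_FORMAT"),            # e.g. "csv" or "bigquery"
        "train": env.get("AIP_TRAINING_DATA_URI"),       # training split location
        "validation": env.get("AIP_VALIDATION_DATA_URI"),
        "test": env.get("AIP_TEST_DATA_URI"),
        "model_dir": env.get("AIP_MODEL_DIR"),           # GCS dir for the trained model
    }
```

Writing the trained model under `AIP_MODEL_DIR` is what lands the results on GCS in your project once the job completes.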
The job can specify several worker pools, with different machines and arguments for each one. The `CLUSTER_SPEC` environment variable includes the worker's pool and index. We won't go further into data-parallel training in this introduction.
Why train in VertexAI
If my model is trained on small data and does not require a GPU, should I just use a local Jupyter notebook?
Once we follow the VertexAI way, we get:
- Metadata, especially for hyperparameter tuning with Vizier
- Lineage information in Metadata
- MLOps with Kubeflow/TFX pipelines
- Consolidated logging/monitoring
- Model registry
- Explainable AI
- Data labeling jobs
- Docker images for training, serving, and Jupyter notebooks
Moreover, separating actual training code from the notebook itself gives better structure to your code.
The downside is the price of the VertexAI Workbench VM.
You might not need all of the above for small one-off research projects. For bigger, data-heavy, complex, or resource-hungry projects, you should probably use VertexAI or a similar managed solution for training on GCP.
This concludes the VertexAI custom training introduction.