VertexAI custom training deployment

Boris Litvak
Google Cloud - Community
3 min read · Aug 15, 2022

Vertex AI is the unified ML platform in Google Cloud, similar to SageMaker in AWS.

When I first looked at VertexAI, I missed a quick visual overview of how custom training works. For instance, it was not trivial to understand that most of the code runs in a different project managed by Google Cloud, or how to report Lineage/Metadata to VertexAI.

The aim is to show it here, augmented with a short explanation.

Overview

First day — how we see custom training in VertexAI

VertexAI Workbench for training

A VertexAI VM is launched in Compute Engine each time we open a new notebook in the Workbench UI. We can train inside the Jupyter notebook running on this VM, or use Google Cloud resources to train the model. This walkthrough assumes the latter scenario.

Steps:

  1. The code in the notebook creates a VertexAI dataset, which is a wrapper over a Cloud Storage location or a BigQuery table.
  2. We use the %%writefile Jupyter magic to generate the train.py training script. That’s what most online examples show, but you’d be better off using Git in the VM or uploading your training code to Cloud Storage. Better still, package the training code as a Python source distribution.
  3. Create the training pipeline, AND
  4. Run it with the specific dataset_id from step 1. What happens here is shown next.
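The four steps above can be sketched with the Vertex AI Python SDK (google-cloud-aiplatform). This is a minimal, hedged sketch: the project, bucket, dataset source, and container URI below are placeholders, and the training-args helper is a hypothetical convenience, not part of the SDK.

```python
def build_training_args(epochs: int, learning_rate: float) -> list[str]:
    """Hypothetical helper: command-line args passed to every worker's train.py."""
    return [f"--epochs={epochs}", f"--lr={learning_rate}"]


def run_custom_training():
    # Imported here so the sketch is readable even without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(
        project="my-project",                     # placeholder
        location="us-central1",
        staging_bucket="gs://my-staging-bucket",  # placeholder
    )

    # Step 1: a VertexAI dataset wrapping files on Cloud Storage.
    dataset = aiplatform.TabularDataset.create(
        display_name="my-dataset",
        gcs_source=["gs://my-bucket/data/train.csv"],  # placeholder
    )

    # Steps 2-3: a training job around train.py (written via %%writefile,
    # pulled from Git, or shipped as a source distribution).
    job = aiplatform.CustomTrainingJob(
        display_name="my-custom-training",
        script_path="train.py",
        # Pick one of the prebuilt training images; this URI is illustrative.
        container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-8:latest",
    )

    # Step 4: run it against the dataset; this enqueues the job in the
    # Google-managed training project.
    job.run(
        dataset=dataset,
        args=build_training_args(epochs=10, learning_rate=0.01),
        replica_count=1,
        machine_type="n1-standard-4",
    )
```

Running this requires GCP credentials and a real project; the point is to show where each of the four steps lands in code.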

How ML architects see custom training in VertexAI

User managed notebook deployment example.
No networking specifics are shown in the diagram:

VertexAI custom training

VertexAI offers managed training, which runs in a different project managed by GCP. Once we run the training job in step 4, the job enters a job queue in GCP’s project.
GCP kicks off the job with the train.py script or Python source distribution provided in step 2. Command-line arguments are passed to each worker, as described in the Job object. The data locations wrapped by the VertexAI dataset object are passed to the training script on each worker via environment variables.

Once the job completes, it typically writes the results onto GCS in your project.
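Inside train.py, each worker can pick these locations up from the AIP_* environment variables that VertexAI injects; a minimal sketch (the helper name is my own, the variable names are the documented ones):

```python
import os


def read_vertex_env() -> dict:
    """Collect the dataset and output locations VertexAI injects
    into each training worker's environment."""
    return {
        # URIs of the wrapped dataset splits (Cloud Storage or BigQuery).
        "train": os.environ.get("AIP_TRAINING_DATA_URI", ""),
        "validation": os.environ.get("AIP_VALIDATION_DATA_URI", ""),
        "test": os.environ.get("AIP_TEST_DATA_URI", ""),
        # Where the job should write its artifacts -- typically a GCS
        # path under a bucket in your project.
        "model_dir": os.environ.get("AIP_MODEL_DIR", ""),
    }
```

In train.py you would load data from these URIs and save the trained model under model_dir, which is how the results end up on GCS in your project.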

The job can specify several worker pools, with different machine types and arguments for each. The CLUSTER_SPEC environment variable includes the index of the worker. We won’t go further into data-parallel training in this introduction.

Why train in VertexAI

If my model trains on small data and does not need a GPU, should I just use a local Jupyter notebook?

Once we follow the VertexAI way, we get managed, repeatable training jobs, plus the Lineage/Metadata tracking mentioned above.

Moreover, separating the actual training code from the notebook itself gives your code better structure.

The downside is the price of the VertexAI Workbench VM.

You might not need the above for small, one-off research projects. For bigger, data-heavy, or resource-hungry projects you should probably use VertexAI or a similar managed solution for training on GCP.

This concludes the VertexAI custom training introduction.


Boris Litvak

Data & Cloud Architect, Certified Architect & Data Engineer in both AWS and GCP