VertexAI custom training deployment
Vertex AI is the unified ML platform in Google Cloud, similar to SageMaker in AWS.
When I first looked at VertexAI, I missed a quick visual representation with an overview of how custom training works. For instance, it was not trivial to understand that most of the code should run in a different project managed by Google Cloud, or how to report the Lineage/Metadata to VertexAI.
The aim is to show it here, augmented with a short explanation.
Overview
First day — how we see custom training in VertexAI
A VertexAI VM is launched in Compute Engine each time we open a new notebook in the Workbench UI. We can train inside the Jupyter notebook running on this VM, or use Google Cloud resources to train the model. This walkthrough assumes the latter scenario.
Steps:
- The code in the notebook creates a VertexAI dataset, which is a wrapper over a Cloud Storage location or a BigQuery relation.
- We use the `%%writefile` Jupyter magic to generate a `train.py` training script. That’s what most online examples show, but you’d be better off using Git in the VM or uploading your training code to Cloud Storage. Better still, package the code as a Python source distribution.
- Create the training pipeline.
- Run it with the specific `dataset_id` from step 1. What happens here is shown next.
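The steps above can be sketched with the `google-cloud-aiplatform` SDK. This is a minimal, hypothetical example: the project, bucket, dataset path, display names, and machine type are illustrative assumptions, and the prebuilt container URI should be checked against the current list of Vertex AI training images.

```python
def run_custom_training(project, location, bucket, script="train.py"):
    """Sketch: create a VertexAI tabular dataset and launch a custom
    training job on it. Requires GCP credentials, so the import is lazy."""
    from google.cloud import aiplatform

    # Point the SDK at our project; staging_bucket holds packaged code.
    aiplatform.init(project=project, location=location,
                    staging_bucket=f"gs://{bucket}")

    # Step 1: a VertexAI dataset wrapping a Cloud Storage CSV location.
    ds = aiplatform.TabularDataset.create(
        display_name="demo-dataset",
        gcs_source=f"gs://{bucket}/data/train.csv",
    )

    # Steps 2-3: wrap train.py into a training job; the SDK packages the
    # script and stages it for the GCP-managed training project.
    job = aiplatform.CustomTrainingJob(
        display_name="demo-training",
        script_path=script,
        container_uri=(
            "us-docker.pkg.dev/vertex-ai/training/"
            "sklearn-cpu.1-0:latest"  # assumed prebuilt image, verify
        ),
        requirements=["scikit-learn"],
    )

    # Step 4: run the job against the dataset from step 1.
    model = job.run(
        dataset=ds,
        args=["--epochs", "5"],  # forwarded to train.py on each worker
        replica_count=1,
        machine_type="n1-standard-4",
    )
    return model
```

The `run` call blocks until the managed job finishes in the GCP-side project, which is exactly the flow the diagram below describes.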
How ML architects see custom training in VertexAI
User-managed notebook deployment example. No networking specifics are shown in the diagram:
VertexAI offers managed training: the job runs in a separate project managed by GCP. Once we run the training job in step 4, it enters a job queue in that GCP-managed project.
GCP kicks off the job with the `train.py` script or Python source distribution provided in step 2. Command-line arguments are passed to each worker, as described in the Job object. The data locations that were wrapped by the VertexAI dataset object are passed to the training script on each worker via environment variables.
Once the job completes, it typically writes the results onto GCS in your project.
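Inside `train.py`, those injected locations arrive as `AIP_*` environment variables. A small helper like the following (the function name is mine, but the variable names are the ones Vertex AI sets) makes the contract explicit:

```python
import os


def get_data_locations(env=None):
    """Collect the dataset locations and output path that Vertex AI
    injects into each training worker's environment."""
    if env is None:
        env = os.environ
    return {
        "format": env.get("AIP_DATA_FORMAT"),            # e.g. "csv" or "bigquery"
        "train": env.get("AIP_TRAINING_DATA_URI"),       # training split location
        "validation": env.get("AIP_VALIDATION_DATA_URI"),
        "test": env.get("AIP_TEST_DATA_URI"),
        "model_dir": env.get("AIP_MODEL_DIR"),           # GCS dir for the trained model
    }
```

Writing the trained model under `AIP_MODEL_DIR` is what lands the results on GCS in your project once the job completes.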
The job can specify several worker pools, with different machines and arguments for each one. The `CLUSTER_SPEC` environment variable includes the worker's pool and index. We won't go further into data-parallel training in this introduction.
Why train in VertexAI
If my model is trained on small data and does not require a GPU, should I just use a local Jupyter notebook?
Once we follow the VertexAI way, we get:
- Metadata, especially for hyperparameter tuning with Vizier
- Lineage information in Metadata
- MLOps with Kubeflow/TFX pipelines
- Consolidated logging/monitoring
- Model registry
- Explainable AI
- Data labeling jobs
- Docker images for training, serving, and Jupyter notebooks
Moreover, separating actual training code from the notebook itself gives better structure to your code.
The downside is the price of the VertexAI Workbench VM.
You might not need all of the above for small one-off research projects. For bigger, data-heavy, complex, or resource-hungry projects, you should probably use VertexAI or a similar managed solution for training on GCP.
This concludes the VertexAI custom training introduction.