Enterprise MLOps with Google Cloud Vertex AI (part 1)

Javier Garcia Puga
Google Cloud - Community
4 min read · Jan 19, 2024


In regulated industries like finance, healthcare, and telecommunications, the deployment of applications and services, especially those involving AI/ML and customer data, presents unique challenges. Stringent requirements for security, reliability, scalability, compliance, and traceability must be met.

For example, a bank using a machine learning model to predict loan affordability must guarantee the model’s accuracy and fairness. Furthermore, it must adhere to fair lending regulations while providing a transparent audit trail to justify its decisions.

This complexity demands robust MLOps practices that incorporate data governance, continuous model monitoring, and meticulous documentation throughout the model’s lifecycle.

This article is the first in a series where we will demonstrate how to establish a comprehensive MLOps framework on Google Cloud, catering to the typical needs of such organizations, including:

  • Support for hybrid networks (Shared VPC)
  • Security constraints: restricted network access, filtered internet access & organization policies
  • Private packages/artifacts repository
  • Cost control & billing
  • Key management compliance (CMEK)
  • Integration with source code management (SCM) and CI/CD tooling (GitHub, GitLab)
  • Model documentation (model cards)
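As an example of the organization-policy requirement, external IP addresses on Compute Engine VMs can be denied at the project level. The following is a minimal, illustrative Terraform sketch (the project ID is a placeholder, and the constraints actually enforced in your landing zone may differ):

```hcl
# Deny external IP addresses on Compute Engine VMs for one project.
resource "google_org_policy_policy" "no_external_ip" {
  name   = "projects/my-mlops-project/policies/compute.vmExternalIpAccess"
  parent = "projects/my-mlops-project"

  spec {
    rules {
      deny_all = "TRUE"
    }
  }
}
```

The same pattern applies to other common constraints in regulated environments, such as restricting resource locations or disabling service account key creation.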

For the MLOps process, we will follow the Google Cloud Practitioners Guide to Machine Learning Operations (MLOps) [1][2] and will use the following code example: https://github.com/GoogleCloudPlatform/professional-services/tree/main/examples/vertex_mlops_enterprise

The Enterprise MLOps with Google Cloud Vertex AI series contains the following articles:

Google Cloud Infrastructure setup

Before enabling the organization to experiment, train, and deploy ML models, it is critical to have a secure, enterprise-ready Google Cloud environment.

While covering a complete landing zone deployment is beyond the scope of this post, it can be easily achieved with two open-source accelerators:

Once the landing zone (or shielded folder) is set up, you need to create the Google Cloud environments (projects) required by the CI/CD process. Implementing each environment as a separate project is a recommended best practice. In this example, we will deploy three environments:

  • Experimentation (or development) environment
  • Staging
  • Production
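The three environments above map to separate Google Cloud projects. As a purely illustrative sketch (the variable and naming choices here are assumptions; the example repo's Terraform modules implement this for you), the per-environment projects could be declared as:

```hcl
# Illustrative sketch: one Google Cloud project per environment.
variable "environments" {
  type    = list(string)
  default = ["dev", "staging", "prod"]
}

variable "billing_account_id" {
  type = string
}

variable "folder_id" {
  type = string
}

resource "google_project" "mlops" {
  for_each        = toset(var.environments)
  name            = "mlops-${each.key}"
  project_id      = "myprefix-creditcards-${each.key}"
  billing_account = var.billing_account_id
  folder_id       = var.folder_id
}
```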

The following diagram shows the different resources deployed in each environment. As you can see, all the resources deployed are similar, except for the experimentation project, which includes Jupyter Notebooks.

MLOps environments

A Terraform script is provided to set up all the required resources:

  • A GCP Project (per environment) to host all the resources
  • Isolated VPC network and a subnet to be used by Vertex AI and Dataflow (using a Shared VPC is also possible)
  • Firewall rule to allow the internal subnet communication required by Dataflow
  • Cloud NAT (Network Address Translation), which may be required to enable internet access for computing resources like Vertex AI and Dataflow. Its necessity depends on your organization's security policies regarding internet connectivity; if internet access is restricted, Cloud NAT can provide a controlled gateway for outbound traffic
  • GCS buckets to host Vertex AI and Cloud Build artifacts
  • BigQuery dataset where the training data will be stored
  • Service account mlops-[env]@ with the minimum permissions required by Vertex AI and Dataflow
  • Service account github to be used with Workload Identity Federation to federate GitHub identities
  • Secret to store the GitHub SSH key used to access the CI/CD code repo (you will set the secret value later, so it can be used)
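To illustrate the github service account and Workload Identity Federation items above, here is a hedged Terraform sketch of the usual GitHub OIDC pattern (resource names, the attribute condition, and the my_org/mlops-example repository are illustrative; the example repo's Terraform creates the real resources):

```hcl
# Service account that GitHub Actions workflows will impersonate.
resource "google_service_account" "github" {
  account_id   = "github"
  display_name = "GitHub Actions (Workload Identity Federation)"
}

# Workload Identity Pool and OIDC provider trusting GitHub's token issuer.
resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "github-pool"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-provider"
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  attribute_condition = "assertion.repository_owner == \"my_org\""
  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Allow workflows from the repo to impersonate the service account.
resource "google_service_account_iam_member" "github_wif" {
  service_account_id = google_service_account.github.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/my_org/mlops-example"
}
```

With this pattern, GitHub Actions obtains short-lived credentials via OIDC, so no long-lived service account keys need to be exported.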

Before launching the Terraform scripts, you must create a dedicated private Git repository in your GitHub user account or organization. In this example, we will use a private GitHub repository, such as: https://github.com/<my_org>/mlops-example/

You can follow the instructions for setting up the Git environment in this link.

Environments deployment guide:

  1. Clone the following repo into a temporary folder: https://github.com/GoogleCloudPlatform/professional-services
  2. Follow the deployment instructions: https://github.com/GoogleCloudPlatform/professional-services/blob/main/examples/vertex_mlops_enterprise/doc/01-ENVIRONMENTS.md

In one of the steps you will be asked to create a terraform.tfvars file. If you have an existing GCP project, you can use the following example:

bucket_name  = "creditcards" # "-env" will be added as a suffix
dataset_name = "creditcards"
environment  = "dev"
groups = {
  gcp-ml-ds     = null
  gcp-ml-eng    = null
  gcp-ml-viewer = null
}

# env will be used as the branch name
github = {
  organization = "ORGANIZATION"
  repo         = "mlops-example"
}

# Additional labels. The env label will be added automatically
labels = {
  "team" : "ml"
}

prefix = "myprefix"
project_config = {
  billing_account_id = null # Set the billing account only if the project needs to be created
  parent             = null
  project_id         = "creditcards"
}
region = "europe-west4"

After this process, the GitHub repo content should be similar to this one:

With the following GitHub Actions workflows:

After running the Terraform commands, make sure you proceed with the rest of the guide: Git integration with Cloud Build.

Now that you have the environment ready, you can continue with the next article: ML Pipelines with Vertex AI

[1] https://cloud.google.com/resources/mlops-whitepaper

[2] https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
