Enterprise MLOps with Google Cloud Vertex AI (part 1)

Javier Garcia Puga
Google Cloud - Community
4 min read · Jan 19, 2024


In regulated industries like finance, healthcare, and telecommunications, the deployment of applications and services, especially those involving AI/ML and customer data, presents unique challenges. Stringent requirements for security, reliability, scalability, compliance, and traceability must be met.

For example, a bank using a machine learning model to predict loan affordability must guarantee the model’s accuracy and fairness. Furthermore, it must adhere to fair lending regulations while providing a transparent audit trail to justify its decisions.

This complexity demands robust MLOps practices that incorporate data governance, continuous model monitoring, and meticulous documentation throughout the model’s lifecycle.

This article is the first in a series where we will demonstrate how to establish a comprehensive MLOps framework on Google Cloud, catering to the typical needs of such organizations, including:

  • Support for hybrid networks (Shared VPC)
  • Security constraints: restricted network access, filtered internet access & organization policies
  • Private packages/artifacts repository
  • Cost control & billing
  • Key management compliance (CMEK)
  • Integration with source code management (SCM) and CI/CD tooling (GitHub, GitLab)
  • Model documentation (model cards)
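As an example of the organization-policy requirement, external IP addresses on Compute Engine VMs can be denied at the project level. The following is a minimal, illustrative Terraform sketch (the project ID is a placeholder, and the constraints actually enforced in your landing zone may differ):

```hcl
# Deny external IP addresses on Compute Engine VMs for one project.
resource "google_org_policy_policy" "no_external_ip" {
  name   = "projects/my-mlops-project/policies/compute.vmExternalIpAccess"
  parent = "projects/my-mlops-project"

  spec {
    rules {
      deny_all = "TRUE"
    }
  }
}
```

The same pattern applies to other common constraints in regulated environments, such as restricting resource locations or disabling service account key creation.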

For the MLOps process, we will follow the Google Cloud Practitioners Guide to Machine Learning Operations (MLOps) [1][2] and will use the following code example: https://github.com/GoogleCloudPlatform/professional-services/tree/main/examples/vertex_mlops_enterprise

The Enterprise MLOps with Google Cloud Vertex AI series contains the following articles:

Google Cloud Infrastructure setup

Before enabling the organization to experiment, train, and deploy ML models, it is critical to have a secure, enterprise-ready Google Cloud environment.

While covering a complete landing zone deployment is beyond the scope of this post, it can be easily achieved with two open-source accelerators:

Once the landing zone (or shielded folder) is set up, you need to create the Google Cloud environments (projects) required by the CI/CD process. Implementing each environment as a separate project is a recommended best practice. In this example, we will deploy three environments:

  • Experimentation (or development) environment
  • Staging
  • Production
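The three environments above map to separate Google Cloud projects. As a purely illustrative sketch (the variable and naming choices here are assumptions; the example repo's Terraform modules implement this for you), the per-environment projects could be declared as:

```hcl
# Illustrative sketch: one Google Cloud project per environment.
variable "environments" {
  type    = list(string)
  default = ["dev", "staging", "prod"]
}

variable "billing_account_id" {
  type = string
}

variable "folder_id" {
  type = string
}

resource "google_project" "mlops" {
  for_each        = toset(var.environments)
  name            = "mlops-${each.key}"
  project_id      = "myprefix-creditcards-${each.key}"
  billing_account = var.billing_account_id
  folder_id       = var.folder_id
}
```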

The following diagram shows the different resources deployed in each environment. As you can see, all the resources deployed are similar, except for the experimentation project, which includes Jupyter Notebooks.

MLOps environments

A Terraform script is provided to set up all the required resources:

  • A GCP Project (per environment) to host all the resources
  • Isolated VPC network and a subnet to be used by Vertex AI and Dataflow (using a Shared VPC is also possible)
  • Firewall rule to allow the internal subnet communication required by Dataflow
  • Cloud NAT (Network Address Translation), which may be required to enable internet access for computing resources like Vertex AI and Dataflow. Its necessity depends on your organization's security policies regarding internet connectivity; if internet access is restricted, Cloud NAT can provide a controlled gateway for outbound traffic
  • GCS buckets to host Vertex AI and Cloud Build artifacts
  • BigQuery dataset where the training data will be stored
  • Service account mlops-[env]@ with the minimum permissions required by Vertex AI and Dataflow
  • Service account github to be used with Workload Identity Federation to federate GitHub identities
  • Secret to store the GitHub SSH key used to access the CI/CD code repo (you will set the secret value later, so it can be used)
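To illustrate the github service account and Workload Identity Federation items above, here is a hedged Terraform sketch of the usual GitHub OIDC pattern (resource names, the attribute condition, and the my_org/mlops-example repository are illustrative; the example repo's Terraform creates the real resources):

```hcl
# Service account that GitHub Actions workflows will impersonate.
resource "google_service_account" "github" {
  account_id   = "github"
  display_name = "GitHub Actions (Workload Identity Federation)"
}

# Workload Identity Pool and OIDC provider trusting GitHub's token issuer.
resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "github-pool"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-provider"
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  attribute_condition = "assertion.repository_owner == \"my_org\""
  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Allow workflows from the repo to impersonate the service account.
resource "google_service_account_iam_member" "github_wif" {
  service_account_id = google_service_account.github.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/my_org/mlops-example"
}
```

With this pattern, GitHub Actions obtains short-lived credentials via OIDC, so no long-lived service account keys need to be exported.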

Before launching the Terraform scripts, you must create a dedicated private Git repository in your GitHub user account or organization. In this example, we will use a private GitHub repository, such as: https://github.com/<my_org>/mlops-example/

You can follow the instructions for setting up the Git environment in this link.

Environments deployment guide:

  1. Clone the following repo into a temporary folder: https://github.com/GoogleCloudPlatform/professional-services
  2. Follow the deployment instructions: https://github.com/GoogleCloudPlatform/professional-services/blob/main/examples/vertex_mlops_enterprise/doc/01-ENVIRONMENTS.md

In one of the steps you will be asked to create a terraform.tfvars file. If you have an existing GCP project, you can use the following example:

bucket_name  = "creditcards" # "-env" will be added as a suffix
dataset_name = "creditcards"
environment  = "dev"
groups = {
  gcp-ml-ds     = null
  gcp-ml-eng    = null
  gcp-ml-viewer = null
}

# env will be used as the branch name
github = {
  organization = "ORGANIZATION"
  repo         = "mlops-example"
}

# Additional labels. The env label will be added automatically
labels = {
  "team" : "ml"
}

prefix = "myprefix"
project_config = {
  billing_account_id = null # Set the billing account only if the project needs to be created
  parent             = null
  project_id         = "creditcards"
}
region = "europe-west4"

After this process, the GitHub repo content should be similar to this one:

With the following GitHub Actions workflows:

After running the Terraform commands, make sure you proceed with the rest of the guide: Git integration with Cloud Build.

Now that you have the environment ready, you can continue with the next article: ML Pipelines with Vertex AI

[1] https://cloud.google.com/resources/mlops-whitepaper

[2] https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
