Monitor and Secure Vertex AI Pipeline

Meenakshi Upadhyay
Google Cloud - Community
Mar 24, 2023

Vertex AI Pipelines helps you to automate, monitor, and govern your ML systems by orchestrating your ML workflow in a serverless manner, and storing your workflow’s artifacts using Vertex ML Metadata. By storing the artifacts of your ML workflow in Vertex ML Metadata, you can analyze the lineage of your workflow’s artifacts — for example, an ML model’s lineage may include the training data, hyperparameters, and code that were used to create the model.

Pipelines in ML can be defined as sets of connected jobs that perform the complete ML workflow or specific parts of it (for example, a training pipeline).

Fig 1: Example of a simple training pipeline
Fig 2: Example of a training pipeline on Vertex AI Pipelines using Kubeflow

Designed properly, pipelines have the benefit of being reproducible and highly customizable. These two properties make experimenting with them and deploying them in production relatively easy. Using Vertex AI Pipelines along with Kubeflow helped us rapidly design and run custom pipelines with these properties.
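To make this concrete, here is a minimal sketch of a custom pipeline built with the Kubeflow Pipelines (KFP v2) SDK and submitted to Vertex AI Pipelines. The project, region, bucket, and component logic are illustrative placeholders.

```python
# A minimal sketch of a custom pipeline built with the KFP v2 SDK and run on
# Vertex AI Pipelines. Project, region, bucket, and training logic are placeholders.
from kfp import dsl, compiler
from google.cloud import aiplatform


@dsl.component(base_image="python:3.10")
def train_model(learning_rate: float) -> str:
    # Stand-in for real training logic.
    return f"trained with lr={learning_rate}"


@dsl.pipeline(name="simple-training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    train_model(learning_rate=learning_rate)


if __name__ == "__main__":
    # Compile the pipeline to a job spec, then submit it to Vertex AI Pipelines.
    compiler.Compiler().compile(training_pipeline, "pipeline.json")
    aiplatform.init(project="my-project", location="us-central1")
    job = aiplatform.PipelineJob(
        display_name="simple-training-pipeline",
        template_path="pipeline.json",
        pipeline_root="gs://my-bucket/pipeline-root",  # assumed staging bucket
    )
    job.run()  # blocks until the run completes; use submit() to return immediately
```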

This blog post focuses on setting up your Cloud foundations specifically for the Vertex AI platform, and on the configuration needed to put proper Vertex AI foundations in place for your future machine learning operations (MLOps) and ML/AI use cases.

Please refer to the sections below for monitoring and securing your Vertex AI environment:

1. Vertex AI access control with IAM

Vertex AI uses IAM to manage access to resources. You can manage access at the project level or resource level. To manage access to Vertex AI Workbench resources, see the access control pages for managed notebooks or user-managed notebooks.

2. IAM permissions

When an identity calls a Google Cloud API, Vertex AI requires that the identity has the appropriate permissions to use the resource. You can grant permissions by granting roles to a user, a group, or a service account.

To find out which permissions are required for a specific operation, see the Vertex AI access control documentation.
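If you want to check programmatically whether the caller holds the permissions an operation needs, one option is the Resource Manager testIamPermissions method. A hedged sketch, assuming the google-api-python-client library and example permission names:

```python
# A hedged sketch: ask Resource Manager which of the listed Vertex AI
# permissions the current caller actually holds on the project.
from googleapiclient import discovery
import google.auth

credentials, project_id = google.auth.default()
crm = discovery.build("cloudresourcemanager", "v1", credentials=credentials)

response = crm.projects().testIamPermissions(
    resource=project_id,
    body={"permissions": [
        "aiplatform.customJobs.create",  # example permission names
        "aiplatform.models.upload",
    ]},
).execute()

# Only the permissions the caller holds are echoed back.
print("Granted:", response.get("permissions", []))
```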

3. Use a custom service account

When Vertex AI runs, it generally acts with the permissions of one of several service accounts that Google creates and manages for your Google Cloud project. To grant Vertex AI increased access to other Google Cloud services in certain contexts, you can add specific roles to Vertex AI’s service agents.

However, customizing the permissions of service agents might not provide the fine-grained access control that you want. Some common use cases include:

  • Allowing fewer permissions to Vertex AI jobs and models. The default Vertex AI service agent has access to BigQuery and Cloud Storage.
  • Allowing different jobs access to different resources. You might want to allow many users to launch jobs in a single project, but grant each user’s jobs access only to a certain BigQuery table or Cloud Storage bucket.

For example, you might want to individually customize every custom training job that you run to have access to different Google Cloud resources outside of your project.

Moreover, customizing the permissions of service agents does not change the permissions available to a container that serves predictions from a custom-trained model.

To customize access each time you perform custom training or to customize the permissions of a custom-trained model's prediction container, you must use a custom service account.
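As a sketch of that pattern, the google-cloud-aiplatform SDK lets you pass a service account when you run a custom training job; the job then acts with only the roles granted to that account. The project, image, and service account names below are placeholders:

```python
# A minimal sketch of running a custom training job as a dedicated service
# account instead of the default Vertex AI service agent.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-bucket")

job = aiplatform.CustomContainerTrainingJob(
    display_name="train-with-custom-sa",
    container_uri="us-docker.pkg.dev/my-project/my-repo/trainer:latest",
)

job.run(
    replica_count=1,
    machine_type="n1-standard-4",
    # The job runs with only the roles granted to this account.
    service_account="vertex-trainer@my-project.iam.gserviceaccount.com",
)
```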

4. Customer-managed encryption keys (CMEK)

When you run an AutoML or custom training job, your code runs on one or more virtual machine (VM) instances managed by Vertex AI. When you enable CMEK for Vertex AI resources, the key that you designate, rather than a key managed by Google, is used to encrypt data on the boot disks of these VMs. The CMEK key encrypts the following kinds of data:

  • The copy of your code on the VMs.
  • Any data that gets loaded by your code.
  • Any temporary data that gets saved to the local disk by your code.
  • AutoML-trained models.
  • Media files (data) uploaded into media datasets.

In general, the CMEK key does not encrypt metadata associated with your operation, like the job’s name and region, or a dataset’s display name. Metadata associated with operations is always encrypted using Google’s default encryption mechanism.

For datasets, when a user imports data into a dataset, the data items and annotations are CMEK-encrypted. The dataset display name is not CMEK-encrypted.

For models, the models stored in the storage system (for example, disk) are CMEK-encrypted. All the model evaluation results are CMEK-encrypted.

For endpoints, all model files used for the model deployment under the endpoint are CMEK-encrypted. This does not include any in-memory data.

For batch prediction, any temporary files (such as model files, logs, and VM disks) used to execute the batch prediction job are CMEK-encrypted. Batch prediction results are stored in the destination that you provide, so Vertex AI respects that destination’s encryption configuration; otherwise, the results are also encrypted with your CMEK key.

For data labeling, any input files (image, text, video, tabular), temporary discussion data (for example, questions and feedback), and output (labeling results) are CMEK-encrypted. The annotation spec display names are not CMEK-encrypted.
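To put the above into practice, a minimal sketch using the google-cloud-aiplatform SDK follows; setting encryption_spec_key_name in init() makes the designated Cloud KMS key the default for resources created afterwards. The project and key names are placeholders, and the Vertex AI service agent is assumed to already hold the Cloud KMS CryptoKey Encrypter/Decrypter role on the key.

```python
# A minimal sketch of using a customer-managed key for Vertex AI resources.
# The key and project names are placeholders; the Vertex AI service agent must
# hold roles/cloudkms.cryptoKeyEncrypterDecrypter on the key.
from google.cloud import aiplatform

KMS_KEY = ("projects/my-project/locations/us-central1/"
           "keyRings/my-ring/cryptoKeys/my-key")

aiplatform.init(
    project="my-project",
    location="us-central1",
    encryption_spec_key_name=KMS_KEY,  # default CMEK key for new resources
)

# Resources created from here on are CMEK-encrypted, for example a dataset:
dataset = aiplatform.TabularDataset.create(
    display_name="cmek-dataset",  # the display name itself is not CMEK-encrypted
    gcs_source="gs://my-bucket/training-data.csv",
)
```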

5. Vertex AI audit logging

Google Cloud services write audit logs to help you answer the questions, “Who did what, where, and when?” within your Google Cloud resources.

The following types of audit logs are available for Vertex AI:

  • Admin Activity audit logs: these include “admin write” operations that write metadata or configuration information. You can’t disable Admin Activity audit logs.
  • Data Access audit logs: these include “admin read” operations that read metadata or configuration information, as well as “data read” and “data write” operations that read or write user-provided data. To receive Data Access audit logs, you must explicitly enable them.
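Once audit logs are flowing, you can query them with the Cloud Logging client library. A hedged sketch that lists recent Vertex AI audit log entries, assuming Data Access audit logs are already enabled for the project:

```python
# A hedged sketch of reading recent Vertex AI audit log entries with the
# Cloud Logging client library. The project ID is a placeholder.
from google.cloud import logging

client = logging.Client(project="my-project")

# Match audit log entries emitted for the Vertex AI API.
log_filter = (
    'protoPayload.serviceName="aiplatform.googleapis.com" '
    'AND logName:"cloudaudit.googleapis.com"'
)

for entry in client.list_entries(filter_=log_filter, max_results=10):
    print(entry.timestamp, entry.log_name)
```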

6. Access Transparency in Vertex AI

Access Transparency provides you with logs that capture the actions Google personnel take when accessing your content.

Cloud Audit Logs show when members of your organization access content in your Google Cloud projects. Similarly, Access Transparency provides logs of the actions taken by Google personnel.

You can enable Access Transparency for a Google Cloud project if the project resides in an organization.

Supported services:

Access Transparency supports a number of Vertex AI services; see the Access Transparency documentation for the current list.

Limitations of Access Transparency in Vertex AI:

All access to your data in Vertex AI by Google personnel is logged, except for a small set of scenarios described in the Access Transparency documentation.

7. VPC Service Controls with Vertex AI

VPC Service Controls can help you mitigate the risk of data exfiltration from Vertex AI. Use VPC Service Controls to create a service perimeter that protects the resources and data that you specify. For example, when you use VPC Service Controls to protect Vertex AI, the following artifacts can’t leave your service perimeter:

  • Training data for an AutoML model or custom model
  • Models that you created
  • Requests for online predictions
  • Results from a batch prediction request

8. Set up VPC Network Peering

You can configure Vertex AI to peer with Virtual Private Cloud (VPC) to connect directly with certain resources in Vertex AI, such as custom training jobs and private online prediction endpoints.

The VPC Network Peering setup guide shows how to peer your network with Vertex AI resources.
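Once the peering is in place, you can attach workloads to the peered network. As an illustrative sketch, the google-cloud-aiplatform SDK accepts a network parameter on custom training jobs; the network name and job details below are placeholders (note that the network path uses the numeric project number, not the project ID):

```python
# A minimal sketch of attaching a custom training job to a VPC network that is
# already peered with Vertex AI. Names and the job spec are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-bucket")

job = aiplatform.CustomContainerTrainingJob(
    display_name="train-over-peered-vpc",
    container_uri="us-docker.pkg.dev/my-project/my-repo/trainer:latest",
)

job.run(
    replica_count=1,
    machine_type="n1-standard-4",
    # Workers get IPs in the peered range and can reach private services.
    network="projects/PROJECT_NUMBER/global/networks/my-vpc",
)
```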

9. Monitor Metrics

Vertex AI exports metrics to Cloud Monitoring. Vertex AI also shows some of these metrics in the Vertex AI Google Cloud console. You can use Cloud Monitoring to create dashboards or configure alerts based on the metrics. For example, you can receive alerts if a model’s prediction latency in Vertex AI gets too high.

To view the list of most metrics that Vertex AI exports to Cloud Monitoring, see the “aiplatform” section of the Monitoring Google Cloud metrics page. For custom training metrics, see the metric types that start with “training” in the “ml” section of that page.
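As an example of reading one of these metrics programmatically, the following hedged sketch queries the online prediction latency metric for the last hour with the Cloud Monitoring client library; the project ID is a placeholder:

```python
# A hedged sketch of reading a Vertex AI metric from Cloud Monitoring:
# online prediction latencies over the last hour.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type='
            '"aiplatform.googleapis.com/prediction/online/prediction_latencies"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    print(series.resource.labels, len(series.points), "points")
```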

10. Use a shielded virtual machine with user-managed notebooks

Shielded VM offers verifiable integrity of your Compute Engine VM instances, so you can be confident that they have not been compromised by boot- or kernel-level malware or rootkits. Shielded VM’s verifiable integrity is achieved through the use of Secure Boot, virtual trusted platform module (vTPM)-enabled Measured Boot, and integrity monitoring.

For more information, see Shielded VM.

Requirements and limitations:

To use Shielded VM with user-managed notebooks, you must create a Deep Learning VM Images instance with a Debian 10 OS that is version M51 or higher.

While using Vertex AI Workbench, you can’t use Shielded VM user-managed notebooks instances that use GPU accelerators.

11. Authenticate to Vertex AI Workbench

Vertex AI Workbench supports programmatic access. How you authenticate to Vertex AI Workbench depends on how you access the API, for example through the client libraries, the Google Cloud CLI, or REST requests.

12. Configure email notifications

Vertex AI Pipelines can notify you of the success or failure of a pipeline run. When the pipeline exits, Google Cloud sends a final status notification email to the email addresses that you specify.

Configure email notifications from a pipeline by using the Email notification component in the Google Cloud Pipeline Components SDK.
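The documented pattern is to wrap your pipeline’s steps in an ExitHandler whose exit task is the notification component, so the email is sent whether the run succeeds or fails. A minimal sketch, with a placeholder recipient and a stand-in work component:

```python
# A minimal sketch of the email notification pattern for Vertex AI Pipelines:
# the ExitHandler fires the notification task when the wrapped steps finish,
# regardless of their final state. The recipient is a placeholder.
from kfp import dsl
from google_cloud_pipeline_components.v1.vertex_notification_email import (
    VertexNotificationEmailOp,
)


@dsl.component(base_image="python:3.10")
def do_work():
    # Stand-in for the pipeline's real steps.
    print("training step goes here")


@dsl.pipeline(name="pipeline-with-email-notification")
def notifying_pipeline():
    notify_task = VertexNotificationEmailOp(
        recipients=["ml-team@example.com"]
    )
    with dsl.ExitHandler(notify_task):
        do_work()
```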

13. View pipeline job logs

After you define, build, and run a pipeline, you can use Cloud Logging to view the log entries it creates and monitor events such as pipeline failures. With Cloud Logging, you can also create custom log-based metrics and alerts. For example, you might want to receive a notification when a pipeline’s failure rate exceeds a given threshold.

This feature has costs associated with it. For more information, see Cloud Logging pricing.
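As one illustration, you could define a log-based metric that counts error entries from pipeline jobs and then alert on it in Cloud Monitoring. A hedged sketch with the Cloud Logging client library; the metric name is hypothetical and the resource type in the filter is an assumption for illustration:

```python
# A hedged sketch of creating a log-based metric that counts error entries
# from Vertex AI pipeline jobs. Metric name and resource type are assumptions.
from google.cloud import logging

client = logging.Client(project="my-project")

metric = client.metric(
    "vertex_pipeline_failures",  # hypothetical metric name
    filter_=(
        'resource.type="aiplatform.googleapis.com/PipelineJob" '
        "AND severity>=ERROR"
    ),
    description="Counts error entries from Vertex AI pipeline jobs",
)

# Create the metric once; an alerting policy on it can then be configured
# in Cloud Monitoring.
if not metric.exists():
    metric.create()
```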

14. Visualize and analyze pipeline results

Vertex AI Pipelines lets you run machine learning (ML) pipelines that were built using the Kubeflow Pipelines SDK or TensorFlow Extended in a serverless manner.

You can compare and visualize pipeline runs using the Google Cloud console.
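For programmatic comparison, the google-cloud-aiplatform SDK also offers an experimental get_pipeline_df helper that loads the parameters and metrics of a pipeline’s runs into a pandas DataFrame; the pipeline name below is a placeholder:

```python
# A hedged sketch of pulling pipeline run parameters and metrics into a pandas
# DataFrame for comparison. get_pipeline_df is an experimental SDK helper.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# One row per run of the named pipeline, with its parameters and metrics.
df = aiplatform.get_pipeline_df(pipeline="simple-training-pipeline")
print(df.head())
```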

15. Monitor model quality

A model deployed in production performs best on prediction input data that is similar to the training data. When the input data deviates from the data used to train the model, the model’s performance can deteriorate, even if the model itself hasn’t changed.

Vertex AI Model Monitoring monitors models for training-serving skew and prediction drift and sends you alerts when the incoming prediction data skews too far from the training baseline. You can use the alerts and feature distributions to evaluate whether you need to retrain your model.
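As a sketch of setting this up with the google-cloud-aiplatform SDK, the following creates a monitoring job with skew detection on a deployed endpoint; the endpoint, training data source, threshold, and email address are placeholders:

```python
# A hedged sketch of enabling skew detection on a deployed endpoint with
# Vertex AI Model Monitoring. All names and thresholds are placeholders.
from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

aiplatform.init(project="my-project", location="us-central1")

skew_config = model_monitoring.SkewDetectionConfig(
    data_source="gs://my-bucket/training-data.csv",  # training baseline
    target_field="label",
    skew_thresholds={"feature_a": 0.3},  # alert when distance exceeds 0.3
)

job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="endpoint-quality-monitor",
    endpoint="projects/my-project/locations/us-central1/endpoints/1234567890",
    # Sample 80% of prediction requests and check for skew every hour.
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.8),
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=1),
    alert_config=model_monitoring.EmailAlertConfig(
        user_emails=["ml-team@example.com"]
    ),
    objective_configs=model_monitoring.ObjectiveConfig(
        skew_detection_config=skew_config
    ),
)
```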

Conclusion

An increasing number of enterprise customers are adopting ML/AI as a core transformational pillar in order to differentiate, increase revenue, reduce costs, and maximize efficiency. For many customers, ML/AI adoption can be a challenging endeavor: not only does the broad spectrum of applications that ML/AI can support make it hard to decide which one to prioritize, but moving these solutions into production requires a series of security, access, and data assessments and features that some ML/AI platforms might not have.

In this blog, we have covered some of the steps to set up proper Vertex AI foundations for your future machine learning operations (MLOps) and ML/AI use cases.

Thank you for reading the blog and have an amazing day!
