Cost savings in VertexAI Notebooks using Terraform

Rodrigo Agundez
Google Cloud - Community
Jan 23, 2024 · 4 min read

Implement 2 auto-shutdown cost control features as part of your IaC.

Photo by Andre Taissin on Unsplash

Imagine you are part of an MLOps team responsible for provisioning VertexAI Notebooks for users in your company. It is important to introduce features for cost control as part of your infrastructure. This blog post will show you how to add 2 cost-control features to a user-managed VertexAI Notebook.

  1. Auto-shutdown for idle notebooks.

If the VertexAI Notebook is idle for some time, shut it down (do not delete it). This sets the notebook’s state to stopped, ready to be started again later. You are only charged while the notebook instance is in the active state.

  2. Auto-shutdown after first boot.

When deploying updates, such as changing the machine type or updating the underlying image, the VertexAI notebooks are recreated and set to active. These updates should be planned for when users are not using the notebooks, to minimize disruptions; as a consequence, all notebooks will stay active at least until the idle timeout is reached.
Instead, we would like the machine to go into a stopped state once all the internal configuration is done and the JupyterLab service is up.

The full code snippet is in the “Putting it together” section below.

The Terraform Resource

If you take a look at the Google provider documentation, it is not clear which resource is the one for VertexAI user-managed notebooks. Under Cloud AI Notebooks you will notice several resources:

  • google_notebooks_environment
  • google_notebooks_instance
  • google_notebooks_instance_iam
  • google_notebooks_locations
  • google_notebooks_runtime
  • google_notebooks_runtime_iam

I won’t explain what all of these are for, but the one for user-managed notebooks is the google_notebooks_instance resource. In the documentation, the most basic example of the resource looks like this:

resource "google_notebooks_instance" "instance" {
  name         = "notebooks-instance"
  location     = "us-west1-a"
  machine_type = "e2-medium"

  vm_image {
    project      = "deeplearning-platform-release"
    image_family = "tf-latest-cpu"
  }
}

Let’s now see how to adapt this resource definition to achieve both of the cost-control features.

Auto-shutdown for idle notebooks

This cost-control feature is easy to set up and it’s managed by GCP itself 🙏.

Unfortunately, the documentation mentions neither an auto-shutdown setting nor an idle timeout, even though the setting is configurable when creating the notebook in the GCP console.

After a few deployments and a look at how the underlying GCE VM gets configured, it becomes clear that it is only necessary to set idle-timeout-seconds in the metadata of the resource.

resource "google_notebooks_instance" "instance" {
  name         = "notebooks-instance"
  location     = "us-west1-a"
  machine_type = "e2-medium"

  vm_image {
    project      = "deeplearning-platform-release"
    image_family = "tf-latest-cpu"
  }

  metadata = {
    idle-timeout-seconds = "3600" # 1 hour
  }
}
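If you want the timeout to be configurable per environment, the hard-coded value can be lifted into a variable. A minimal sketch, assuming a variable name of my own choosing (idle_timeout_hours):

```hcl
variable "idle_timeout_hours" {
  description = "Shut the notebook down after this many idle hours"
  type        = number
  default     = 1
}

resource "google_notebooks_instance" "instance" {
  name         = "notebooks-instance"
  location     = "us-west1-a"
  machine_type = "e2-medium"

  vm_image {
    project      = "deeplearning-platform-release"
    image_family = "tf-latest-cpu"
  }

  metadata = {
    # metadata values must be strings, hence the tostring()
    idle-timeout-seconds = tostring(var.idle_timeout_hours * 3600)
  }
}
```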

The metadata configuration is propagated to the underlying GCE VM, which then handles shutting the machine down.

Auto-shutdown after the first boot

This feature is a bit trickier to set up and requires writing some custom logic. Looking at the resource documentation, we find the post_startup_script parameter, which accepts a GCS path to a custom script that is executed after the JupyterLab service starts. Perfect! Just what we need: some logic that identifies whether the machine is booting for the first time and, if so, shuts it down.

Custom script

The script notebook_post_startup_script.sh checks whether a certain flag file exists. If it does not, this is the first boot: the script creates the file and shuts down the machine. If the file exists, it does nothing.

#!/bin/bash

FLAG="/var/log/firstboot.log"
# check if the flag file does not exist yet
if [[ ! -f "$FLAG" ]]; then
  echo "First boot. Powering it off"
  touch "$FLAG"
  sudo poweroff
else
  echo "Not first boot. Do nothing."
fi
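To sanity-check the flag logic without powering anything off, the same check can be exercised locally with a throwaway flag path, replacing poweroff with a comment. A small sketch:

```shell
#!/bin/bash
# Simulate the first-boot check twice with a temporary flag file.
FLAG="$(mktemp -d)/firstboot.log"

check_first_boot() {
  if [[ ! -f "$FLAG" ]]; then
    echo "First boot. Powering it off"
    touch "$FLAG"   # the real script would call poweroff here
  else
    echo "Not first boot. Do nothing."
  fi
}

check_first_boot   # first run: flag absent, prints the first-boot message
check_first_boot   # second run: flag present, prints the do-nothing message
```

Running it prints "First boot. Powering it off" followed by "Not first boot. Do nothing.", confirming the branch logic.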

Now that we have the script, let’s be specific on what we need as IaC:

  1. A bucket to store our custom script. I will create one here, but it’s better to reuse an existing bucket from your IaC.
  2. A GCS object resource that uploads the script to the bucket.
  3. A notebook instance resource whose post_startup_script parameter is set to the GCS URI of our custom script.
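If your IaC already manages a bucket, item 1 can become a data source lookup instead of a new resource. A sketch, with a placeholder bucket name:

```hcl
# Look up an existing bucket instead of creating one.
data "google_storage_bucket" "existing" {
  name = "my-existing-iac-bucket"
}

resource "google_storage_bucket_object" "script" {
  name   = "notebook_post_startup_script.sh"
  source = "notebook_post_startup_script.sh"
  bucket = data.google_storage_bucket.existing.name
}
```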

Terraform code

A minimal version of what you need to add to your Terraform code looks like this:

resource "google_storage_bucket" "bucket" {
  location = "us-west1"
  name     = "bucket"
}

resource "google_storage_bucket_object" "script" {
  name   = "script"
  source = "notebook_post_startup_script.sh"
  bucket = google_storage_bucket.bucket.name
}

resource "google_notebooks_instance" "notebook" {
  name         = "notebook"
  location     = "us-west1-a"
  machine_type = "e2-medium"

  post_startup_script = "gs://${google_storage_bucket.bucket.name}/${google_storage_bucket_object.script.output_name}"

  vm_image {
    project      = "deeplearning-platform-release"
    image_family = "tf-latest-cpu"
  }
}

Notice the string interpolation in the value of post_startup_script. It’s assumed that the custom script notebook_post_startup_script.sh sits in the same directory as the resource definitions, but you can of course change the directory structure as desired and adapt the local path in source.
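The resulting value follows the usual gs://&lt;bucket&gt;/&lt;object&gt; shape. As a quick illustration, the same interpolation in bash, using the placeholder names from the snippet above:

```shell
# Mirror Terraform's "gs://${bucket}/${object}" interpolation in bash.
BUCKET="bucket"
OBJECT="script"
POST_STARTUP_SCRIPT_URI="gs://${BUCKET}/${OBJECT}"
echo "$POST_STARTUP_SCRIPT_URI"   # prints gs://bucket/script
```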

Putting it together

Below is the Terraform code implementing both cost-saving features for VertexAI user-managed notebooks.

resource "google_storage_bucket" "bucket" {
  location = "us-west1"
  name     = "bucket"
}

resource "google_storage_bucket_object" "script" {
  name   = "script"
  source = "notebook_post_startup_script.sh"
  bucket = google_storage_bucket.bucket.name
}

resource "google_notebooks_instance" "notebook" {
  name         = "notebook"
  location     = "us-west1-a"
  machine_type = "e2-medium"

  post_startup_script = "gs://${google_storage_bucket.bucket.name}/${google_storage_bucket_object.script.output_name}"

  metadata = {
    idle-timeout-seconds = "3600" # 1 hour
  }

  vm_image {
    project      = "deeplearning-platform-release"
    image_family = "tf-latest-cpu"
  }
}

Conclusion

Even though there are other cost-saving advantages to centrally deploying IaC for VertexAI notebooks, I hope these 2 tips already help make your deployments more efficient.

Cheers!
