Cheap Vertex AI data syncing using GCS

Rodrigo Agundez
Google Cloud - Community
6 min read · Feb 5, 2024

Learn how to sync and persist Vertex AI notebook users’ data using GCS

Photo by Jeremy McKnight on Unsplash

Imagine your Vertex AI Workbench users have created notebooks, configuration files, etc., and the team responsible for maintaining the Vertex AI infrastructure updates the notebook instances over the weekend. The current notebook instances are deleted and new ones are created. On Monday, the users come back and, to their surprise, all their files are gone.

This blog post will teach you a simple and cost-effective solution to this problem that scales quite well to a large number of users.

Jump to the code snippet solution

The Problem

Currently, there is no off-the-shelf feature in Vertex AI user-managed notebooks to automatically back up the data and sync it back to newly created notebook instances.¹

If you are responsible for deploying and maintaining Vertex AI notebooks, this is a BIG problem. It means that on each deployment update you make to the notebooks’ configuration (changing the machine type or the underlying image), all your users’ files will be gone. Obviously, from a user’s perspective, this feature is a must, and it should work seamlessly.

You can of course push the problem to your users (which I often see happen), wash your hands like Pontius Pilate, and make them responsible for backing up their data. To me, this is a mediocre way to approach the problem and is not user-centric: it makes a lot of assumptions about what the users are doing and their level of expertise, and it therefore limits the impact you can have on your organization.

What alternatives did I consider?

  1. Mounting an external persistent disk
  2. Taking disk snapshots
  3. Syncing with a GCS bucket

Without going into too much detail, the first two options ended up adding too much complexity; in the end it is my team who has to maintain the solution, and I prefer Google to maintain it. If you would like to learn more about the first two options, the GCP documentation on persistent disks and disk snapshots is a good place to start.

Let me now explain how to implement option 3: using a GCS bucket to back up and sync your data.

Syncing and backing up data with a GCS bucket

I don’t remember how I found this solution, and it is surprising to me how poorly documented it is (in both the Terraform and the GCP documentation) for such a great option for syncing and backing up data from Vertex AI notebooks. I believe I noticed the metadata tag gcs-data-bucket while reading the SDK documentation, and that took me down the rabbit hole of understanding it and proposing its implementation to my team.

I won’t bore you with the details of my investigation; I’ll just cut to the chase. If you spin up a Vertex AI user-managed notebook and SSH into the underlying GCE VM, you will find a set of scripts in /opt/deeplearning/bin/ that handle much of the configuration of the Vertex AI notebooks. In that directory you will find the script enable_sync_gcs.sh, which contains well-thought-out logic to sync the data in /home/jupyter to the GCS location determined by the metadata tag gcs-data-bucket. The syncing is done using the gsutil -m rsync command, and it gets triggered by listeners on the /home/jupyter directory.
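To make the mechanism concrete, here is a minimal sketch of what that logic boils down to. This is an illustration, not Google’s actual script; the bucket name is a made-up example.

#!/bin/bash
# Illustrative sketch of the sync loop, not the actual enable_sync_gcs.sh:
# watch /home/jupyter for file events and mirror the directory to the bucket
# named in the gcs-data-bucket metadata tag.
bucket="my-notebook-data-bucket"  # assumption: value of the gcs-data-bucket tag
src_dir="/home/jupyter"

# inotifywait blocks until something changes under the watched directory
while inotifywait -r -e modify,create,delete,move "${src_dir}"; do
  # -m parallelizes the transfer, -r recurses into subdirectories,
  # -x skips the per-instance lock files the scripts manage
  gsutil -m rsync -r -x '\.nb_gcs_lock_.*' "${src_dir}" "gs://${bucket}"
done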

In addition, there is a piece of logic related to the GCS syncing in /opt/deeplearning/bin/shutdown_script.sh that executes just before the notebook instance goes into the stopped state.

# Stop file sync to GCS bucket
# Sync /home/jupyter/ folder to GCS one last time and clean up lock files
gcs_data_bucket=$(get_attribute_value gcs-data-bucket || true)
if [[ -n "${gcs_data_bucket}" ]]; then
  sudo systemctl stop gcs_sync.service
  sudo systemctl disable gcs_sync.service
  sync_to_gcs_bucket "${gcs_data_bucket}"
  instance_hostname=$(get_hostname)
  lock_filename=".nb_gcs_lock_${instance_hostname}"
  rm -f "/home/jupyter/${lock_filename}"
  gsutil rm "gs://${gcs_data_bucket}/${lock_filename}"
fi

That lock_filename is used as a simple way to avoid multiple writes, in case gcs-data-bucket is set to the same value on different notebook instances. Notice the reference to the service gcs_sync.service in the script above; yet when looking at the services in /lib/systemd/system/, there is no service with that name. Can it be that we just need to create that service and GCP will handle the syncing for us? As surprising as it might be, the answer is yes.
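If you want to verify this yourself, SSH into the VM and look for the unit (a quick sanity check; the exact systemd error message may vary):

# The scripts reference gcs_sync.service, but no such unit ships with the image
ls /lib/systemd/system/ | grep -i gcs   # prints nothing
systemctl status gcs_sync.service       # "Unit gcs_sync.service could not be found."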

When I contacted Google, they did mention that they couldn’t guarantee it would work. But after carefully reviewing the logic in the scripts and deploying test instances, everything worked as expected.

Now let me tell you how to create this service via Terraform.

Script to create the GCS syncing service

Creating a service is actually not that difficult: you just need to write the service file into /lib/systemd/system with the correct name, gcs_sync.service.

#!/bin/bash

set -e

# The unit file we need to create, and the two scripts that already ship
# with the Deep Learning VM image
service_file="/lib/systemd/system/gcs_sync.service"
service_script="/opt/deeplearning/bin/sync_gcs_service.sh"
enable_sync_script="/opt/deeplearning/bin/enable_sync_gcs.sh"

echo "Create service file ${service_file}"
chmod +x "${service_script}"
cat <<EOF > "${service_file}"
[Unit]
Description=Sync Jupyter home to GCS
StartLimitIntervalSec=600
StartLimitBurst=15

[Service]
Type=simple
PIDFile=/run/gcs_sync.pid
ExecStart=${service_script}
User=jupyter
Group=jupyter
WorkingDirectory=/home/jupyter
RemainAfterExit=no

[Install]
WantedBy=multi-user.target
EOF

echo "Running script ${enable_sync_script}"
chmod +x "${enable_sync_script}"
"${enable_sync_script}"

At the end of the script I run enable_sync_gcs.sh, which enables the sync service (sync_gcs_service.sh) and first downloads the files already in the GCS bucket when the notebook instance is just starting. Now that we have the script, we need a way to run it when the machine starts.

Startup-script-url vs post-startup-script

When reading the Terraform documentation about the post startup script in the Vertex AI notebook instance resource, and the GCE documentation about using a startup script in a VM, you will notice that it is a bit confusing. Let me try to help:

  • post-startup-script: Optional argument in the Terraform google_notebooks_instance resource definition which points to a script in a GCS location that runs after the JupyterLab service is up and running. I used this argument before to enable idle shutdown; I go into more detail in this blog post.
  • startup-script: Metadata tag which includes a path in the local VM to a script that will run when the VM starts.
  • startup-script-url: Metadata tag which includes a GCS path to a script that will run when the VM starts.

Perfect! We need startup-script-url.
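For reference, the same metadata can also be set outside of Terraform when creating the instance with gcloud. This is a sketch with placeholder instance, zone, and bucket names, and it assumes the standard gcloud notebooks flags:

# Placeholder names throughout; note that gcs-data-bucket takes the bucket
# name without the gs:// prefix (the sync scripts prepend it themselves)
gcloud notebooks instances create my-notebook \
  --location=us-west1-a \
  --machine-type=e2-medium \
  --vm-image-project=deeplearning-platform-release \
  --vm-image-family=tf-latest-cpu \
  --metadata=startup-script-url=gs://my-scripts-bucket/vm_startup_script.sh,gcs-data-bucket=my-data-bucket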

Putting it together

Below is the Terraform code for implementing the syncing of the data via a GCS bucket, where vm_startup_script.sh is the script above.

resource "google_storage_bucket" "bucket" {
location = "us-west1-a"
name = "bucket"
}

resource "google_storage_bucket_object" "script" {
name = "script"
source = "vm_startup_script.sh"
bucket = google_storage_bucket.bucket.name
}

resource "google_notebooks_instance" "notebook" {
name = "notebook-name"
location = "us-west1-a"
machine_type = "e2-medium"
vm_image {
project = "deeplearning-platform-release"
image_family = "tf-latest-cpu"
}
metadata = {
startup-script-url = "gs://${google_storage_bucket.bucket.name}/${google_storage_bucket_object.script.output_name}"
gcs-data-bucket = "gs://${google_storage_bucket.bucket.name}/__notebook_name"
}
}

Notice that in the code above I used the same bucket to store the script and to sync the data. I do not recommend this: have a separate bucket for your own scripts and dedicated buckets to store users’ data.
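For example, the split could look something like this (a sketch; the bucket names are made up):

# One bucket for the scripts you maintain...
resource "google_storage_bucket" "scripts" {
  location = "US-WEST1"
  name     = "my-team-notebook-scripts"
}

# ...and a dedicated bucket for the synced /home/jupyter data
resource "google_storage_bucket" "user_data" {
  location = "US-WEST1"
  name     = "my-team-notebook-user-data"
}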

Conclusion

I have no idea why this GCS backup service is not activated by default. When I asked our Google support team, they didn’t mention anything in particular, just that they wouldn’t provide support if it were to break.

I hope this helps simplify your backups and make them cheaper as well.

Cheers!

[1] The Vertex AI-managed upgrade of the environment only applies to the underlying VM image. To my knowledge, there is no managed upgrade via Terraform for many other types of updates, such as the machine type or the JupyterLab container. For more information about Vertex AI VM image upgrades, see the GCP documentation on upgrading notebook environments.
