How to start Jupyter in Google Cloud — the Python way
Did you know that you can easily start a Jupyter notebook in Google Cloud with the Python SDK? And automatically mount GCS buckets, add GPUs, or use your own containers?
This is the first article in a series on Manipulating GCP Vertex AI with Python SDK. Subscribe to get the next blog post.
Google launched its new machine learning platform, Vertex AI, in May 2021, succeeding the previous AI Platform. They also released SDKs for multiple languages. Let's focus on Python, as it is the first choice for many data scientists.
Install Python SDK
Google currently maintains two Python libraries for Vertex AI services:
- google-cloud-notebooks — for manipulating Vertex AI Workbench (aka Jupyter Notebook on GCP)
- google-cloud-aiplatform — for manipulating everything else
So for now, we just install the notebook lib:
pip install google-cloud-notebooks
Documentation on Python libraries for GCP can sometimes be hard to find; the notebooks SDK is documented at https://googleapis.dev/python/notebooks/latest/index.html.
Authorization with Google Cloud
I will assume you have already created a GCP project and enabled Vertex AI. If not, follow the official documentation.
There are multiple ways to authenticate to Google Cloud. For our purposes, we will leverage the gcloud CLI tool: we log in in the terminal, and the Notebook Service client in Python will then pick up the credentials automatically.
gcloud auth login
Alternatively, you could use a service account:
gcloud auth activate-service-account --key-file=$KEY_FILE
GCP permissions
To be able to start notebooks, you will need roles/notebooks.runner. For notebook deletion, however, roles/notebooks.admin is required.
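For example, granting the runner role can be done with gcloud. This is a sketch; the project ID and member below are placeholders, so replace both with your own values:

```shell
# Placeholder project ID and member -- substitute your own
gcloud projects add-iam-policy-binding my-project-id \
  --member="user:alice@example.com" \
  --role="roles/notebooks.runner"
```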
Python Notebook Service
Google offers two ways to manipulate notebooks — an Async Client and a Sync Client. It is entirely up to you which to choose; for the sake of simplicity, I will use the Sync Client for the rest of the tutorial. Both are documented here.
Now we can initiate the Notebook Service Client:
from google.cloud import notebooks_v1

client = notebooks_v1.NotebookServiceClient()
Define the notebook
Now the fun part. You have many options when defining your notebook environment, and you can add as much computing power as your budget allows.
Choosing environment and machine type
You can either base your notebook on a virtual machine image that Google offers or use your own Docker image. I will cover building images for notebooks in one of the next articles, so let's first focus on the VM images Google provides.
There are multiple Deep Learning VM images available in GCP, and conveniently, many of them come with Jupyter preinstalled and port settings configured. There are images for TensorFlow, PyTorch, R, and other frameworks. The full list can be found here. In one of the next articles, I will cover how to easily set up a Julia notebook as well.
Assuming we want to use TensorFlow 2.8 with GPU support, I will choose the VM image family tf-ent-2-8-cu113-notebooks.
Vertex AI Workbench is just a Compute Engine instance running Jupyter. You can find more information on available Compute Engine machine types in the documentation. For this tutorial, I will choose n1-standard-8 — a standard machine type with 8 vCPUs and 32 GB of memory. However, you can go as crazy as you wish.
GPUs and disks
There are many restrictions on GPU usage depending on machine type and location. I will choose an NVIDIA Tesla P100; read Google's page on choosing GPUs. It is also possible to attach multiple GPUs (typically 1, 2, 4, or 8), but I will go with just one.
When you create a Notebook in GCP, Google creates two disks for you.
- Boot disk — where the OS/libraries/initialization scripts live
- Data disk — which is mapped to /home/jupyter folder
For each, you can choose between a standard, balanced, or SSD Persistent Disk, and each can vary in size from 100 GB to 64 TB. For the sake of simplicity, we will choose balanced disks of 200 GB for both.
Creating the notebook request
With the information we already have, we can build our notebook request like this:
from google.cloud.notebooks_v1.types import Instance, VmImage

notebook_instance = Instance(
    vm_image=VmImage(
        project="deeplearning-platform-release",
        image_family="tf-ent-2-8-cu113-notebooks",
    ),
    machine_type="n1-standard-8",
    accelerator_config=Instance.AcceleratorConfig(
        type_=Instance.AcceleratorType.NVIDIA_TESLA_P100, core_count=1
    ),
    install_gpu_driver=True,
    boot_disk_type=Instance.DiskType.PD_BALANCED,
    boot_disk_size_gb=200,
    data_disk_type=Instance.DiskType.PD_BALANCED,
    data_disk_size_gb=200,
)
That is the basic stuff. You can add more parameters such as labels, tags, metadata, or instance owners. Go here to read all parameters.
Sending the request to GCP
The Python SDK often uses something called a parent. The parent is just a string containing your GCP project ID and the location you want to use.
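For illustration, building the parent is plain string formatting. The helper below is my own, not part of the SDK; it just makes the expected shape explicit:

```python
def build_parent(project_id: str, location: str) -> str:
    # The parent identifies the project and the zone that will host the notebook.
    return f"projects/{project_id}/locations/{location}"

print(build_parent("my-project", "europe-west1-a"))
# → projects/my-project/locations/europe-west1-a
```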
Now we are ready to send the request to the GCP endpoint.
project_id = "PROJECT_ID"  # Put your own project id here
location = "europe-west1-a"  # Put your own location here
parent = f"projects/{project_id}/locations/{location}"

request = notebooks_v1.CreateInstanceRequest(
    parent=parent,
    instance_id="my-first-notebook",
    instance=notebook_instance,
)
op = client.create_instance(request=request)
op.result()
The result of client.create_instance is one of Google's long-running operations — read the docs. The simplest way to wait for the operation to finish is the op.result() method, which also returns information about the created notebook or any errors that occurred during creation.
TIP #1: Mount GCS buckets using startup script
As we said before, Jupyter on GCP is just a configured Compute Engine (CE) virtual machine. Every CE virtual machine can have a startup script — a script executed during the machine startup process. This gives you endless possibilities to boost your Jupyter notebook. Here I will only show you how to automatically mount GCS buckets with gcsfuse.
Be aware that the startup script runs as the root user, but when you connect to Jupyter you will be the jupyter user.
gcsfuse is already preinstalled in Google's provided images. In other images, you will have to install it yourself.
#!/bin/bash
LOCAL_PATH=/home/jupyter/mounted/gcs
BUCKET_NAME=my-super-bucket # Change this to your bucket
BUCKET_DIR=notebook_data

sudo su -c "mkdir -p $LOCAL_PATH"
sudo su -c "gcsfuse --implicit-dirs --only-dir=$BUCKET_DIR $BUCKET_NAME $LOCAL_PATH"
I am using two tricks here:
- --implicit-dirs — treats directories as implicitly existing, which allows you to see all objects in the bucket; it comes with several drawbacks, mainly higher cost
- --only-dir — mounts only one "directory" of the bucket
This script must be saved either in GCS or at a publicly available URL (e.g., GitHub). I will show you how to upload the script string directly to GCS as a text file, without saving it locally.
pip install google-cloud-storage
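A sketch of the upload, using google-cloud-storage's upload_from_string. The bucket name and blob path below are placeholders of my own choosing — replace them with yours:

```python
# Sketch: upload the startup script string straight to GCS, no local file needed.
# The bucket name and blob path are placeholders -- replace with your own.

STARTUP_SCRIPT = """#!/bin/bash
LOCAL_PATH=/home/jupyter/mounted/gcs
BUCKET_NAME=my-super-bucket # Change this to your bucket
BUCKET_DIR=notebook_data
sudo su -c "mkdir -p $LOCAL_PATH"
sudo su -c "gcsfuse --implicit-dirs --only-dir=$BUCKET_DIR $BUCKET_NAME $LOCAL_PATH"
"""


def upload_startup_script(bucket_name: str, blob_name: str, script: str):
    """Upload the script string to GCS and return the resulting Blob."""
    from google.cloud import storage  # requires google-cloud-storage installed

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(script, content_type="text/plain")
    return blob


# blob = upload_startup_script("my-super-bucket", "scripts/notebook_startup.sh", STARTUP_SCRIPT)
```

The returned blob is the object referenced in the next snippet.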
Now just add one more argument to notebook_instance before sending the request to GCP:
notebook_instance.post_startup_script = f"gs://{blob.bucket.name}/{blob.name}"

request = notebooks_v1.CreateInstanceRequest(
    parent=parent,
    instance_id="my-first-notebook",
    instance=notebook_instance,
)
op = client.create_instance(request=request)
op.result()
TIP #2: Managed notebooks
Google offers two solutions for Jupyter notebooks. So far we have used the user-managed notebook, which allows a lot of customization. The second option is the managed notebook, where you can actually write code first and choose hardware later. The API is very similar, and you can find the documentation here.
TIP #3: Read the article by Lak Lakshmanan
How to use Jupyter on a Google Cloud VM is an excellent article by Lak Lakshmanan (ex-Google) on starting Jupyter on GCP with the gcloud CLI tool. It is a bit older but contains some great tips.
Please subscribe to my channel to get the next blog post on how to use the Vertex AI Python SDK. Apart from covering other Vertex AI services, I will soon publish a post also showing how to easily start Jupyter with Julia using Vertex AI Workbench and Cloud Build.