Single User Jupyter Notebooks at Google Cloud

Oscar Pulido
Google Cloud - Community
5 min read · Jul 18, 2023

Enterprises need analytical users and data scientists to use their own identities (rather than generic service accounts) when querying and processing data in their experiments, so that data usage monitoring and cost allocation are easier in a governed environment.

Automating the notebook environment lifecycle is also necessary to scale to an enterprise-level number of users.

Data Scientists working with large amounts of data may need to run jobs in Spark or other distributed processing engines on Dataproc (managed Hadoop). For others, a Python kernel or a single-node Spark environment is enough.

As a central data platform governance manager, you don’t need to give analytical users access to the GCP web console; what they need is an on-demand, self-provisioned Jupyter environment.

The Terraform modules introduced here are intended to help with provisioning individual analytical environments for each user.

Google provides two Jupyter notebook-based options for your data science workflow:

Managed Notebooks

Managed notebooks handle the provisioning, submission, and decommissioning of resources through notebook instances that run as Vertex AI-managed VMs in a tenant project.

  • Identity impersonation: For Managed Notebooks to impersonate the end-user identity when querying data across other GCP services (such as GCS and BigQuery), you can set Single User access mode to grant access to a specific user only, so they can log in to the Jupyter environment using their own credentials.
  • Kernels: Managed notebooks are instances that can run Python, single-node standalone Spark, R, and shell kernels.
Single-user Vertex AI Workbench Managed Notebook

The personal-managed-notebook module provides Terraform automation to create an individual managed notebook for each end user.
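
The module’s internals aren’t shown in this post, but as a rough sketch, this is the kind of resource such a module could wrap: a Vertex AI Workbench managed notebook runtime restricted to one user via the SINGLE_USER access type. The resource and field names follow the google Terraform provider’s google_notebooks_runtime; the user email, instance name, and machine sizing are placeholders.

```hcl
# Hypothetical sketch: one managed notebook runtime per end user,
# restricted to that user's identity via SINGLE_USER access mode.
resource "google_notebooks_runtime" "personal" {
  name     = "nb-alice"    # placeholder name
  location = "us-central1"

  access_config {
    access_type   = "SINGLE_USER"       # only the runtime owner can log in
    runtime_owner = "alice@example.com" # end-user identity used for data access
  }

  virtual_machine {
    virtual_machine_config {
      machine_type = "n1-standard-4"
      data_disk {
        initialize_params {
          disk_size_gb = 100
          disk_type    = "PD_STANDARD"
        }
      }
    }
  }
}
```

In practice, the module would presumably expose the owner email and machine sizing as input variables, so the same invocation pattern can be stamped out once per user.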

User-Managed Notebooks / Dataproc Hub

User-Managed Notebooks allow heavy customization and the use of custom images, running as VMs in the customer’s project.

  • Identity impersonation: For User Managed Notebooks / Dataproc Hub to impersonate end-user credentials when querying data, Dataproc clusters must have ‘Personal Cluster Authentication’ enabled.
  • Kernels: User-Managed Notebooks / Dataproc Hub are notebook VM instances that let users create Dataproc clusters able to run heavy Spark jobs, in addition to Python kernels.

A specific type of User-Managed Notebook is the Dataproc Hub instance, which doesn’t run JupyterLab but JupyterHub, serving as a bridge for users to create Dataproc clusters on demand and run JupyterLab there, using a cluster template predefined by an administrator.

Dataproc Hub notebooks are administrator-curated notebooks running on a Dataproc JupyterLab cluster that sits in the user’s project. Dataproc Hub helps provide templated Dataproc notebook environments to users.

User-managed notebook / Dataproc Hub high-level diagram

JupyterHub itself runs in a Notebooks instance that never hosts a notebook server. The Notebooks instance exists only to leverage the Inverting Proxy and provide users with a secure URL to JupyterHub. When a user selects a template and creates a cluster, JupyterHub redirects the user to the Dataproc notebooks through the Component Gateway.
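
As an illustration of how such a hub instance might be declared, here is a hedged Terraform sketch using the google provider’s google_notebooks_instance resource. The container image and metadata keys (dataproc-configs, dataproc-locations-list) are assumptions recalled from the Dataproc Hub setup guide, not taken from this post, and should be verified against current documentation.

```hcl
# Rough sketch of a Dataproc Hub notebook instance. The container image and
# metadata keys are assumptions and should be verified against the docs.
resource "google_notebooks_instance" "dataproc_hub" {
  name         = "dataproc-hub-alice" # placeholder name
  location     = "us-central1-a"
  machine_type = "e2-medium"

  container_image {
    repository = "gcr.io/cloud-dataproc/dataproc-spawner" # assumption: JupyterHub spawner image
    tag        = "prod"
  }

  metadata = {
    # assumption: GCS URIs of the admin-defined YAML cluster templates
    "dataproc-configs"        = "gs://my-templates-bucket/cluster-templates/alice.yaml"
    # assumption: zones the spawned clusters may use
    "dataproc-locations-list" = "b,c"
  }
}
```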

Main components:

  • JupyterHub: UI and web server where users pick a Jupyter notebook server template and start a notebook server.
  • Jupyter server: created by the end user as a Dataproc cluster, from a template defined by the admin.
  • JupyterLab: web-based user interface for Project Jupyter.

Dataproc Cluster Templates

Dataproc cluster creation is triggered from the Dataproc Hub notebook instance by the end users themselves, based on the YAML template that the administrator makes available to them in a GCS bucket.

Using the URL received from the admin, the end user accesses JupyterLab and triggers cluster creation without accessing the GCP console.

Here we have a challenge: because we want to use Dataproc Personal Cluster Authentication (so that data is accessed using end-user credentials), the user’s email needs to be referenced in the YAML template file. This means we cannot have a single template for all users; instead, we need a template per user.

For User-Managed Notebooks, the sample code includes automation to generate a cluster template file for each user, based on a given list of users or simply by adding new module invocations.
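
A minimal sketch of that per-user template generation, assuming the user list is a plain Terraform variable and the rendered YAML files are uploaded to a GCS bucket the Dataproc Hub instance reads from. The bucket name, template file, and variable names are illustrative; where the email lands inside the YAML depends on the exported cluster config used as the base template.

```hcl
variable "notebook_users" {
  description = "End users who get their own Dataproc cluster template"
  type        = set(string)
  default     = ["alice@example.com", "bob@example.com"] # placeholder users
}

# One rendered YAML cluster template per user, uploaded to the templates bucket.
resource "google_storage_bucket_object" "cluster_template" {
  for_each = var.notebook_users

  bucket = "my-templates-bucket" # placeholder bucket name
  name   = "cluster-templates/${split("@", each.value)[0]}.yaml"

  # cluster-template.yaml.tpl is a hypothetical template file; it injects the
  # user's email wherever the Personal Cluster Authentication config expects it.
  content = templatefile("${path.module}/cluster-template.yaml.tpl", {
    user_email = each.value
  })
}
```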

Self-service notebook provisioning flow

To make notebook environment lifecycle management a self-service process, you can develop a web UI that lets users request environment creation. This could include choosing between a single instance, which translates into a Managed Notebook, or a distributed environment, which translates into a User-Managed Notebook / Dataproc Hub instance.

Notebook lifecycle automation pipeline

Once the user request is captured by the Web UI, a backend could generate a module invocation file and place it in a Terraform code repository.
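
For illustration, such a generated file might contain a single module invocation like the sketch below. The module path and variable names are hypothetical, since they depend on how the sample code’s modules define their interfaces.

```hcl
# notebooks/alice.tf - hypothetical file generated by the self-service backend
module "managed_notebook_alice" {
  source = "../modules/personal-managed-notebook" # module path is an assumption

  # variable names below are illustrative, not the module's actual interface
  user_email   = "alice@example.com"
  machine_type = "n1-standard-4"
  region       = "us-central1"
}
```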

A new Terraform file in the repository can trigger an automated CI/CD pipeline in charge of applying the Terraform changes, in this case the creation of a new Managed Notebook or User-Managed Notebook instance.

Notebook instance deletion can be automated for Managed Notebooks using the idle shutdown parameter; for User-Managed Notebooks it is a bit more complex, as it implies deleting both the notebook instance and the Dataproc cluster.
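
For the managed-notebook side, idle shutdown can be expressed directly on the runtime resource. A minimal sketch, assuming the google provider’s google_notebooks_runtime software_config fields; the timeout value is illustrative (minutes of inactivity):

```hcl
# Extends the earlier managed notebook sketch: shut the runtime down
# automatically after a period of inactivity.
resource "google_notebooks_runtime" "personal_with_idle_shutdown" {
  name     = "nb-alice"    # placeholder name
  location = "us-central1"

  access_config {
    access_type   = "SINGLE_USER"
    runtime_owner = "alice@example.com" # placeholder end user
  }

  software_config {
    idle_shutdown         = true
    idle_shutdown_timeout = 180 # minutes of inactivity before shutdown
  }

  virtual_machine {
    virtual_machine_config {
      machine_type = "n1-standard-4"
      data_disk {
        initialize_params {
          disk_size_gb = 100
          disk_type    = "PD_STANDARD"
        }
      }
    }
  }
}
```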

Conclusion

For Data Scientists running single-node Python code, data platform administrators can rely on Managed Notebooks to guarantee end-user identity usage.

For Hadoop-related workflows, Managed Notebooks are not the optimal solution: using Dataproc as an external kernel makes cluster lifecycle management difficult, and Dataproc Serverless for Spark doesn’t support Personal Authentication yet.

Dataproc Hub, via User-Managed Notebooks, provides an alternative that lets Data Scientists create their own Dataproc clusters using admin-curated templates, which ensure end-user identity usage via Personal Cluster Authentication.

Thanks to Daryus Medora who collaborated on this story.
