Exploring the New Schedule Notebook Feature in BigQuery
Introduction
Note: This feature is currently in preview. Please exercise caution when using it in a production environment.
Google released the notebook scheduling feature on June 13, 2024. This exciting tool allows for streamlined data processing within BigQuery. Let’s explore it with a hands-on approach.
How to Schedule Notebooks
Enable Related APIs
Before you can use the BigQuery notebook, ensure the following APIs are enabled:
- Compute Engine API
- Dataform API
- Vertex AI API
Create Runtime Template
It’s advisable to create a Runtime Template beforehand for executing BigQuery notebooks. While you can launch a new runtime using the default template, it uses the default Virtual Private Cloud (VPC) and doesn’t allow you to choose the instance types.
To set up the template, configure it from “Colab Enterprise” in the Google Cloud console.
Navigate to “RUNTIME TEMPLATES” and click on “NEW TEMPLATE”.
You can also configure it through Terraform, if you’re interested.
Create Notebook
You can start a notebook through BigQuery Studio.
As shown in the examples within the notebook, you can execute both Python code and SQL queries.
# Running this code will query a table in BigQuery and download
# the results to a Pandas DataFrame named `results`.
# Learn more here: https://cloud.google.com/bigquery/docs/visualize-jupyter
%%bigquery results --project <YOUR_PROJECT>
SELECT * FROM `bigquery-public-data.ml_datasets.penguins`
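If you prefer plain Python over the notebook magic, the google-cloud-bigquery client library can run the same query. The sketch below is a minimal example, assuming your credentials are already configured; the `fetch_penguins` helper and the `YOUR_PROJECT` placeholder are illustrative names, not part of the feature.

```python
# Sketch: the same query as the %%bigquery cell, via the
# google-cloud-bigquery client library. The helper takes a client
# object as an argument so it is easy to test or swap out.

def fetch_penguins(client):
    """Query the public penguins table and return a pandas DataFrame."""
    query = "SELECT * FROM `bigquery-public-data.ml_datasets.penguins`"
    return client.query(query).to_dataframe()

if __name__ == "__main__":
    # Assumes application-default credentials; replace YOUR_PROJECT.
    from google.cloud import bigquery

    client = bigquery.Client(project="YOUR_PROJECT")
    results = fetch_penguins(client)
    print(results.head())
```

Unlike the magic, this version gives you a query job object you can inspect (for example, for bytes billed) before materializing results.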
# With BigQuery DataFrames, you can use many familiar Pandas methods, but the
# processing happens in BigQuery rather than in the runtime, allowing you to work with larger
# DataFrames that would otherwise not fit in the runtime memory.
# Learn more here: https://cloud.google.com/python/docs/reference/bigframes/latest
import bigframes.pandas as bf
bf.close_session()
bf.options.bigquery.location = "US"  # set based on the dataset you chose to query
bf.options.bigquery.project = "YOUR_PROJECT"  # set based on the dataset you chose to query
df = bf.read_gbq("bigquery-public-data.ml_datasets.penguins")
Notice that the Python code uses bigframes.pandas to extract and shape the dataset as a DataFrame. Here’s an explanation of the differences between bigframes.pandas and pandas.
bigframes.pandas
- The library converts your pandas-style code to BigQuery SQL and executes it on BigQuery computing resources.
- When loading large datasets, you don’t need to worry about out-of-memory issues in the notebook’s runtime.
- Be mindful of the computing fees associated with BigQuery.
- This library lacks some functions present in the original pandas.
As shown in the notebook, the code you write is executed on BigQuery:
And it’s converted to BigQuery SQL.
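To make that conversion concrete: a pandas-style aggregation such as a group-by mean over the penguins DataFrame would conceptually compile to an aggregate query like the one below. This is an illustrative hand-written query, not the literal SQL that BigQuery DataFrames emits.

```sql
SELECT
  species,
  AVG(body_mass_g) AS body_mass_g
FROM
  `bigquery-public-data.ml_datasets.penguins`
GROUP BY
  species
```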
pandas
- You have access to the full functionality of pandas.
- The code is executed in the runtime’s memory.
- The computing fee is only the runtime’s running fee, but you might encounter out-of-memory errors when loading large datasets.
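Because bigframes.pandas mirrors the pandas API, the same style of code runs under either library; only the execution location differs. Below is a small local pandas example on a made-up stand-in for the penguins table; under bigframes.pandas the identical chain of calls would instead be compiled to SQL and run on BigQuery.

```python
import pandas as pd

# A tiny in-memory stand-in for the penguins table (illustrative data,
# not the real public dataset).
df = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo"],
        "body_mass_g": [3700.0, 3800.0, 5000.0],
    }
)

# With pandas, this runs in the runtime's memory; with bigframes.pandas
# the same expression is pushed down to BigQuery.
mean_mass = df.groupby("species")["body_mass_g"].mean()
print(mean_mass)
```

This API symmetry means you can prototype on a small local sample with pandas and move to bigframes.pandas when the data outgrows the runtime.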
Schedule the Notebook
Let’s get back on track and set up the notebook schedule. Before configuring the schedule, ensure the following roles are granted to your service account:
- Notebook Executor User (roles/aiplatform.notebookExecutorUser)
- Storage Admin (roles/storage.admin)
- Service Account User (roles/iam.serviceAccountUser)
Additionally, you must grant the following role to the default Dataform service account:
The default Dataform service account is formatted like this: service-<YOUR_PROJECT_NUMBER>@gcp-sa-dataform.iam.gserviceaccount.com.
After granting the roles to both service accounts, you can configure the schedule by clicking the “SCHEDULE” button.
Note: You cannot select a runtime template when configuring the scheduled notebook. This means the scheduled notebook uses the default runtime template, which requires the default VPC network. If you have deleted the default VPC network, you must recreate it and enable automatic subnet creation mode again.
The configured scheduled notebook will appear like this:
And past executions will be displayed like this:
Conclusion
This article introduced the recently released Schedule Notebook function. While still in the preview stage and subject to some limitations, it is a powerful tool for processing data in BigQuery. The notebook allows us to write both SQL and Python, expanding its usability. Additionally, Schedule Notebook executes cells one by one like a pipeline, a flow that scheduled queries do not provide. Moreover, the extensive ecosystem of Python libraries enables scientific computing and machine learning workloads.
This release is quite exciting, and I can’t wait for it to reach General Availability (GA) stage!