Exploring the New Schedule Notebook Feature in BigQuery

Yusuke Enami (Kishishita)
4 min read · Jun 16, 2024


Image: Google Cloud official icons.

Introduction

Note: This feature is currently in preview. Please exercise caution when using it in a production environment.

Google released the notebook scheduling feature on June 13, 2024. This exciting addition allows for streamlined data processing within BigQuery. Let's explore the feature with a hands-on approach.

How to Schedule Notebooks

Enable Related APIs

Before you can use BigQuery notebooks, ensure the following APIs are enabled (a single gcloud command after the list covers all three):

  • Compute Engine API
  • Dataform API
  • Vertex AI API
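
You can enable them from the console, or in one shot with gcloud (these are the standard service identifiers for the three APIs above):

gcloud services enable \
  compute.googleapis.com \
  dataform.googleapis.com \
  aiplatform.googleapis.com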

Create Runtime Template

It's advisable to create a Runtime Template beforehand for executing BigQuery notebooks. While you can launch a new runtime from the default template, that template uses the default Virtual Private Cloud (VPC) and doesn't let you choose the instance type.

To set up the template, configure it from “Colab Enterprise” in the Google Cloud console.

Navigate to “RUNTIME TEMPLATES” and click on “NEW TEMPLATE”.

You can also configure it through Terraform, if you’re interested.
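
For reference, here's a minimal Terraform sketch. It assumes the google provider's google_colab_runtime_template resource (available in recent provider versions); the template name, machine type, and network paths are illustrative placeholders:

resource "google_colab_runtime_template" "notebook_template" {
  # Illustrative values; adjust to your project
  name         = "scheduled-notebook-template"
  display_name = "Scheduled notebook template"
  location     = "us-central1"

  # Pick the instance type instead of relying on the default template
  machine_spec {
    machine_type = "e2-standard-4"
  }

  # Attach your own VPC instead of the default network
  network_spec {
    network    = "projects/<YOUR_PROJECT>/global/networks/<YOUR_VPC>"
    subnetwork = "projects/<YOUR_PROJECT>/regions/us-central1/subnetworks/<YOUR_SUBNET>"
  }
}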

Create Notebook

You can start a notebook through BigQuery Studio.

As the examples within the notebook show, you can execute both Python code and SQL queries.

# Running this code will query a table in BigQuery and download
# the results to a Pandas DataFrame named `results`.
# Learn more here: https://cloud.google.com/bigquery/docs/visualize-jupyter

%%bigquery results --project <YOUR_PROJECT>
SELECT * FROM `bigquery-public-data.ml_datasets.penguins`
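
Once the cell finishes, results is a regular pandas DataFrame held in the runtime's memory, so the usual pandas methods apply. For example (the column names come from the public penguins table):

# `results` is an ordinary pandas DataFrame in runtime memory
results.head()

# e.g., average body mass per species
results.groupby("species")["body_mass_g"].mean()
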
# With BigQuery DataFrames, you can use many familiar pandas methods, but the
# processing happens in BigQuery rather than in the runtime, allowing you to work
# with larger DataFrames that would otherwise not fit in the runtime's memory.
# Learn more here: https://cloud.google.com/python/docs/reference/bigframes/latest

import bigframes.pandas as bf

bf.close_session()  # reset any existing session so the options below take effect
bf.options.bigquery.location = "US"  # set based on the dataset you query
bf.options.bigquery.project = "YOUR_PROJECT"  # set based on the dataset you query

df = bf.read_gbq("bigquery-public-data.ml_datasets.penguins")
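
Operations on df are compiled to SQL and executed by BigQuery, so only the final result has to fit in the runtime. As a small sketch (the methods mirror their pandas counterparts):

# This aggregation is translated to SQL and runs on BigQuery's engine
summary = df.groupby("species")["body_mass_g"].mean()

# to_pandas() materializes only the small aggregated result in the runtime
print(summary.to_pandas())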

Notice that the Python code uses bigframes.pandas to extract and shape the dataset as a DataFrame. Here's an explanation of the differences between bigframes.pandas and pandas.

bigframes.pandas

  • BigQuery DataFrames converts the pandas-style code to BigQuery SQL and executes it on BigQuery's computing resources.
  • When loading large datasets, you don't need to worry about out-of-memory issues in the notebook's runtime.
  • Be mindful of the computing fees associated with BigQuery.
  • The library lacks some functions present in the original pandas.

As shown in the notebook, the code you write is executed on BigQuery:

And it’s converted to BigQuery SQL.
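
You can inspect that generated SQL yourself; recent bigframes versions expose it through the DataFrame's sql property (treat this as an assumption if you're on an older release):

# Print the SQL that bigframes compiled for this DataFrame
print(df.sql)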

pandas

  • You have access to the full functionality of pandas.
  • The code is executed in the runtime's memory.
  • The only compute cost is the runtime itself, but you might encounter out-of-memory errors when loading large datasets.
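
If part of your analysis needs the full pandas feature set, a common pattern is to filter or aggregate in BigQuery first and then materialize the result locally:

# Everything from to_pandas() onward runs in the runtime
# and is bounded by the runtime's memory
local_df = df.to_pandas()
local_df.describe()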

Schedule the Notebook

Let’s get back on track and set up the notebook schedule. Before configuring the schedule, ensure the following roles are granted to your service account:

Additionally, you must grant the following role to the default Dataform service account:

The default Dataform service account is formatted like this: service-<YOUR_PROJECT_NUMBER>@gcp-sa-dataform.iam.gserviceaccount.com.
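
You can grant the roles in the console or with gcloud; the command shape looks like this (substitute the role IDs shown above; they aren't repeated here):

gcloud projects add-iam-policy-binding <YOUR_PROJECT> \
  --member="serviceAccount:service-<YOUR_PROJECT_NUMBER>@gcp-sa-dataform.iam.gserviceaccount.com" \
  --role="<REQUIRED_ROLE>"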

After granting the roles to both service accounts, you can configure the schedule by clicking the “SCHEDULE” button.

Note: You cannot select the runtime template to configure the scheduled notebook. This means the runtime template uses the default template, requiring the default VPC network. If you have deleted the default VPC network, you must recreate it and configure automatic subnet creation mode again.
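
If you do have to recreate the default network, a gcloud sketch (this creates an auto-mode network named default):

gcloud compute networks create default --subnet-mode=auto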

The configured scheduled notebook will appear like this:

And past executions will be displayed like this:

Conclusion

This article introduced the recently released Schedule Notebook feature. While still in preview and subject to some limitations, it is a powerful tool for processing data in BigQuery. Notebooks let us write both SQL and Python, which expands their usability. Additionally, a scheduled notebook executes its cells one by one like a pipeline, something scheduled queries don't provide. Moreover, the extensive ecosystem of Python libraries enables scientific computing and machine learning workloads.
This release is quite exciting, and I can't wait for it to reach the General Availability (GA) stage!
