Leverage the power of BigQuery Studio with Notebooks, Spark Procedures, and Natural Language Queries

Alex Ravaglia
Data Reply IT | DataTech
7 min read · Jan 29, 2024

A brief introduction to BigQuery

BigQuery is a fully managed, serverless data warehouse that enables businesses to analyze big data. It is one of the main services offered by Google Cloud Platform and the first choice when you need to analyze large datasets at scale.

It is more than a simple data warehouse: it has built-in features that go beyond the functionalities of a classic data warehouse service. It enables business intelligence, machine learning, real-time analytics with streaming data pipelines, and geospatial analysis. From now on, BigQuery Studio lets you create notebooks, create stored procedures for Apache Spark, and even query your data with natural language thanks to AI assistance.

What’s inside the new BigQuery Studio?

BigQuery Studio is a new workspace for data analytics. Its purpose is to unify the different services and ways of working in data analytics in one purpose-built platform. New services are provided to accelerate data ingestion/preparation, analysis, exploration, and visualization. Starting with the new notebook interface, we will look at the main new features for data exploration and visualization in BigQuery, along with its integration with Apache Spark for stored procedures. Additionally, there are new enhancements focused on deeper governance insights through data lineage, profiling, and quality, all seamlessly integrated within BigQuery.

Query your data

Within BigQuery, you can now query your data in new and different ways. In addition to SQL in the classic SQL editor, you can explore your data in Python notebooks through the interface powered by Colab Enterprise (still in preview) and write SQL or Apache Spark stored procedures. Finally, the new AI assistance helps you translate requests from natural language into SQL code for BigQuery.

Python Notebooks in BigQuery

Notebooks are BigQuery Studio code assets powered by Dataform. You can use notebooks to complete analysis and machine learning (ML) workflows by using SQL and Python. Notebooks offer improved collaboration and management with the following options:

  • Share notebooks with specific users.
  • Review the notebook version history.
  • Revert to or branch from previous versions of the notebook.

BigQuery utilizes Vertex AI runtimes to execute notebooks. A notebook runtime is a Compute Engine virtual machine assigned to a specific user to enable code execution within a notebook. While multiple notebooks can share the same runtime, each runtime is exclusive to a single user and cannot be accessed by others.

To query and explore BigQuery data at scale, you can use the Python BigQuery DataFrames APIs, which are based on the pandas DataFrame and scikit-learn APIs.

  • bigframes.pandas provides a pandas-compatible API for analytics.
  • bigframes.ml provides a scikit-learn-like API for ML.
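
As a quick illustration, here is a minimal sketch that combines both APIs, assuming the public bigquery-public-data.ml_datasets.penguins sample table (the column names come from that dataset) and a notebook runtime with BigQuery DataFrames installed:

import bigframes.pandas as bpd
from bigframes.ml.linear_model import LinearRegression

# Read a public table into a BigQuery DataFrames DataFrame;
# computation is pushed down to BigQuery, not run locally.
penguins = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

# bigframes.pandas: pandas-style column selection.
X = penguins[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
y = penguins[["body_mass_g"]]

# bigframes.ml: scikit-learn-style training, executed in BigQuery ML.
model = LinearRegression()
model.fit(X, y)
model.predict(X).head()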

Each BigQuery DataFrames object is associated with a specific BigQuery session, which eventually terminates automatically. When this happens, you can no longer use previously created DataFrame or Series objects; to continue working, re-create them in a new BigQuery DataFrames session.

In a Notebook, you can read your table's data in the following ways (a minimal sketch follows below):

  • with the magic command %%bigquery, specifying the name of the pandas DataFrame that will hold the query results.
  • with the bigframes.pandas API (e.g. read_gbq("project.dataset.table")).
BigQuery Notebook — read a BigQuery table
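
Here is a minimal sketch of both approaches as two separate notebook cells, assuming the public Shakespeare sample table (depending on the runtime, the magic may require running %load_ext google.cloud.bigquery first):

%%bigquery df_words
-- Cell 1: the %%bigquery cell magic runs this SQL in BigQuery and stores
-- the result in a pandas DataFrame named df_words.
SELECT word, word_count
FROM `bigquery-public-data.samples.shakespeare`
LIMIT 1000

And in a second cell, the BigQuery DataFrames equivalent:

# Cell 2: read the same table with the bigframes.pandas API; the data
# stays in BigQuery and only previews are pulled locally.
import bigframes.pandas as bpd

df_words_bf = bpd.read_gbq("bigquery-public-data.samples.shakespeare")
df_words_bf.head()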

Another option is to use the client library for Python (e.g. bigquery.Client()). This is the default approach used behind the scenes by the %%bigquery magic command to run the query in the notebook cell and convert the results to a pandas DataFrame. Notice, in the previous image, the difference in data types between the tables read through the two different approaches.
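
A minimal sketch of the client library approach, assuming the same public table and default credentials in the notebook runtime:

from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT word, word_count
    FROM `bigquery-public-data.samples.shakespeare`
    LIMIT 1000
"""

# Run the query and convert the result to a pandas DataFrame, which is
# essentially what the %%bigquery magic does behind the scenes.
df_client = client.query(sql).to_dataframe()
df_client.dtypes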

Notebooks in BigQuery enable you to explore and plot your data quickly, without the effort of dealing with multi-service interconnections.

BigQuery Notebook — plot your BigQuery tables
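
For example, here is a minimal sketch that aggregates in BigQuery with bigframes and plots the result with pandas (matplotlib is assumed to be available in the Colab Enterprise runtime; the table is the public Shakespeare sample):

import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.samples.shakespeare")

# Aggregate in BigQuery, then pull only the small result set locally.
top_words = (
    df.groupby("word")["word_count"]
    .sum()
    .sort_values(ascending=False)
    .head(20)
    .to_pandas()
)

# Plot locally with pandas/matplotlib.
top_words.plot.bar()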

Stored Procedures for Apache Spark

A stored procedure is a collection of statements that can be called from other queries or other stored procedures. A procedure can take input arguments and return values as output. You name and store a procedure in a BigQuery dataset. Using BigQuery, you can create Spark stored procedures written in Python, Java, or Scala. You can then run these stored procedures in BigQuery just like normal SQL stored procedures. To create a Spark stored procedure, you first need to create a Spark connection.

BigQuery offers two approaches for creating a stored procedure for Spark in Python:

  • With the “CREATE PROCEDURE” statement, in the query editor.
  • With the PySpark editor.

Here is an example of how to create a stored procedure for Spark in the classic query editor, using the connection `project-id.location.connection_name` and inline PySpark code.

CREATE OR REPLACE PROCEDURE bq_project.dataset.spark_procedure()
WITH CONNECTION `project-id.location.connection_name`
OPTIONS(engine="SPARK", runtime_version="1.1")
LANGUAGE PYTHON AS R"""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-bigquery-demo").getOrCreate()

# Load data from BigQuery.
words = spark.read.format("bigquery") \
.option("table", "bigquery-public-data:samples.shakespeare") \
.load()
words.createOrReplaceTempView("words")

# Perform word count.
word_count = words.select('word', 'word_count') \
    .groupBy('word') \
    .sum('word_count') \
    .withColumnRenamed("sum(word_count)", "sum_word_count")
word_count.show()
word_count.printSchema()

# Saving the data to BigQuery
word_count.write.format("bigquery") \
.option("writeMethod", "direct") \
.save("wordcount_dataset.wordcount_output")
"""

As an alternative to writing the PySpark code inline, you can set the main_file_uri field in the OPTIONS list to the path of a Python file, saved in a Google Cloud Storage bucket, that contains the body of the stored procedure. This approach is recommended when the body is larger than 1 MB. You can also change the LANGUAGE to Scala or Java; in that case, either the main_file_uri field points to the path of a JAR file that contains the main class, or you populate the main_class field with the fully qualified name of a class in a JAR set specified in the jar_uris option.

Once the stored procedure is created, you can call it as follows:

CALL `bq_project`.dataset.spark_procedure();
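
If you prefer to call it from a notebook instead of the query editor, a minimal sketch with the Python client library (using the same procedure name as above) could look like this:

from google.cloud import bigquery

client = bigquery.Client()

# CALL statements run as regular query jobs; result() waits for the
# Spark procedure to finish.
job = client.query("CALL `bq_project`.dataset.spark_procedure();")
job.result()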

Another option is to use the PySpark editor. Through the PySpark options, you need to set the stored procedure configuration and connection.

Set the stored procedure configuration and connection

… and finally write the PySpark code directly in the provided editor.

Define the body of the PySpark procedure in the PySpark editor
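
As an illustration, here is a minimal sketch of the kind of body you might write in the PySpark editor; the connection, engine, and runtime options are configured through the UI, so only the Spark logic lives in the editor (the output dataset and table names below are example placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-editor-demo").getOrCreate()

# Read a BigQuery table into a Spark DataFrame.
words = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data:samples.shakespeare") \
    .load()

# Compute the total count per word and keep the 100 most frequent ones.
top_words = words.groupBy("word") \
    .sum("word_count") \
    .withColumnRenamed("sum(word_count)", "total_count") \
    .orderBy("total_count", ascending=False) \
    .limit(100)

# Write the result back to a BigQuery table.
top_words.write.format("bigquery") \
    .option("writeMethod", "direct") \
    .save("wordcount_dataset.top_words_output")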

Natural Language queries with Duet AI assistance

Duet AI is Google Cloud's AI-powered code assistance that can complete your code and generate whole code blocks. In BigQuery you can use it to:

  • Generate a SQL query
  • Complete a SQL query
  • Explain a SQL query

You can provide a natural language statement to generate a SQL query. You begin SQL generation by typing the # character in the BigQuery query editor and then follow it with a natural language statement or question about the data you want. Duet AI then suggests a query to help you analyze your data. It also helps while you are writing a query, predicting the next SQL statements needed to complete it. If you already have a written query, you can also ask the Duet AI assistant in the chat to help you understand it and explain the SQL query.
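
As a purely illustrative example (the prompt wording and table are hypothetical), a prompt in the query editor could look like the line below, and Duet AI would then suggest a SELECT statement matching the request:

# Show the 10 most frequent words in bigquery-public-data.samples.shakespeare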

Explain a SQL query

Unified data types and sources

Thanks to the connection with BigLake and object tables, in BigQuery Studio it is possible to work with structured, semi-structured, and unstructured data of all formats. With BigQuery Omni and BigLake tables, it is even possible to query external data across cloud environments (Google Cloud, Amazon S3, Azure Blob Storage).

CI/CD and Versioning

BigQuery Studio enhances collaborative efforts by integrating established software development principles. It incorporates key practices such as continuous integration and delivery (CI/CD), version tracking, and source control for various analytical assets, including SQL scripts, Python scripts, notebooks, and SQL pipelines. Furthermore, the platform enables users to establish secure connections with external code repositories.

Data Lineage, Profiling and Quality

BigQuery Studio empowers organizations to extract reliable insights from data by helping users comprehend data, identify quality issues, and troubleshoot. You can monitor data lineage, profile data, and implement data-quality constraints to ensure the data is of high quality. BigQuery Studio will also introduce personalized metadata insights, such as dataset summaries and recommendations for conducting more in-depth analyses. Moreover, BigQuery Studio gives administrators the ability to consistently enforce security policies, minimizing the need to duplicate or share data beyond BigQuery. With unified credential management across BigQuery and Vertex AI, policies are enforced with fine-grained security without needing to manage additional external connections or service accounts. For instance, using SQL in BigQuery, data analysts can use Vertex AI's foundation models for tasks like sentiment analysis and entity detection over BigQuery data, without the need to share data with external services.

Conclusion

BigQuery stands out as a robust service within the Google Cloud Platform. With the introduction of these new features, it is set to become even more comprehensive, offering diverse approaches for empowering data analytics, exploration, and visualization, especially with the new Notebook interface. It is important to note that all the services mentioned in this article are currently in preview, which means they may not be functioning at full capacity, and users might encounter issues or partial service availability. The new feature that enables writing Spark stored procedures directly in a PySpark editor in BigQuery is very interesting; maybe in the future it will be possible (I hope) to run classic queries directly in PySpark, and not only stored procedures.

Thanks for reading and happy querying to everyone.
