Sitemap
Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Querying and Filtering Nested and Repeated Fields in BigQuery

--

Image of nested boxes stacked on top of each other.  There are multiple smaller boxes scattered throughout the image. The image represents nested fields in a database such as BigQuery.

BigQuery is Google’s serverless, highly scalable, and cost-effective data warehousing service. It excels at analyzing massive datasets with SQL and has seamless integration with Data and AI services in Google Cloud including the ability to query structured and semi-structured data stored outside of BigQuery through External Tables such as files in Google Cloud Storage (GCS) or data in Google BigTable, federated access to query data stored in Google-managed databases such as AlloyDB, Cloud SQL, and Spanner, as well as the ability to directly query data stored in other clouds.

To effectively model and store complex data relationships, BigQuery supports nested and repeated fields within its tables. One of the most common use cases leveraging nested and repeated fields in BigQuery is Google Analytics 4. Other common use cases for nested and repeated fields include ecommerce orders that often have a one-to-many relationship with order items, shipping addresses, and payment details; customer relationship management (CRM) systems that store complex customer data, including contact information, purchase history, and support interactions; and social media platforms that generate vast amounts of data with complex relationships for users, posts, comments, and likes.

In this article, we’ll define and identify nested and repeated fields in BigQuery; explore ways to query, join, and filter on nested and repeated field; address performance considerations; and establish best practices.

Understanding Nested and Repeated Fields

Nested Fields: These are fields within a struct, which is a collection of named fields with different data types. Structs can be nested within other structs, creating complex hierarchies.
Repeated Fields: These are fields that can contain multiple values of the same data type. They are essentially arrays.

Identifying and Differentiating between Nested and Repeated Fields

To better understand the difference between nested and repeated fields, let’s take a look at a table that is part of a Google Analytics 4 export to BigQuery. The following diagram shows a part of the Google Analytics 4 schema as a BigQuery table:

Screenshot of the Google Analytics 4 schema in Bigquery showing Nested and Repeated fields. Nested Fields: event_params: This field is of type RECORD and is REPEATED. This indicates that it’s a repeated field where each element is itself a record (a nested structure). privacy_info: This is a RECORD field, meaning it’s a nested structure containing other fields. user_properties: This is also a RECORD field, indicating a nested structure. user_ltv: This is a RECORD field, indicating a nested struc
Google Analytics 4 BigQuery Schema

Nested Fields:

  • event_params: This field is of type RECORD and is REPEATED. This indicates that it’s a repeated field where each element is itself a record (a nested structure).
  • privacy_info: This is a RECORD field, meaning it’s a nested structure containing other fields.
  • user_properties: This field is of type RECORD and is REPEATED. This indicates that it’s a repeated field where each element is itself a record (a nested structure).
  • user_ltv: This is a RECORD field, indicating a nested structure.
  • device: This is a RECORD field, indicating a nested structure.

Repeated Fields:

  • event_params: As mentioned earlier, this field is both nested and repeated.
  • user_properties: This field is REPEATED, meaning it can contain multiple records (user properties).

Querying Nested and Repeated Fields

There are a number of ways to query nested and repeated fields depending on the level of data that you are accessing as well as impacts on query performance. We’ll explore the most common ways for querying this data.

Entire nested field

Specify the top-level field name to return all fields within a struct. For example:

SELECT  
event_date,
event_timestamp,
event_name,
device
FROM `cloud-advize.analytics_360866743.events_*`

The resulting query returns all of the fields nested under the device field. Use this method when you simply need to return all data nested under a specific field.

BigQuery results from the query SELECT event_date, event_timestamp, event_name, device FROM `cloud-advize.analytics_360866743.events_*`

Dot Notation

Use dot notation to access individual fields within structs. For example:

SELECT  
event_date,
event_timestamp,
event_name,
device.category,
device.operating_system,
device.language
FROM `cloud-advize.analytics_360866743.events_*`

The resulting query returns only the 3 specified fields under the device field — category, operating_system, and language. Use this method when you simply need to return all data nested under a specific field. Unlike the previous query, only the individual field name is returned (category, operating_system, language) as opposed to the “full path” of the nested field (device.category, etc…).

BigQuery results from the query SELECT event_date, event_timestamp, event_name, device.category, device.operating_system, device.language FROM `cloud-advize.analytics_360866743.events_*`

This method also works with multiple layers of nested fields, such as device.web_info.browser:

SELECT  
event_date,
event_timestamp,
event_name,
device.web_info.browser
FROM `cloud-advize.analytics_360866743.events_*`

Use this method when you only need to return certain fields within a nested field.

UNNEST for Complex Queries

But what if we need to return a nested field with multiple levels that include repeated fields? While we can specify the parent field to return all child fields, we cannot specify individual fields within a nested and repeated field. Instead, we need to use the UNNEST command to effectively “join” the nested data into the query so we can select individual fields:

SELECT  
event_date,
event_timestamp,
event_name,
ep.key,
ep.value.string_value
FROM `cloud-advize.analytics_360866743.events_*`,
UNNEST(event_params) as ep

As you can see, we UNNEST the event_params repeated field and join it back into the query. Now we can select individual “child” fields of event_params, which we give the alias of ep.

Bigquery results from the query SELECT event_date, event_timestamp, event_name, ep.key, ep.value.string_value FROM `cloud-advize.analytics_360866743.events_*`, UNNEST(event_params) as ep

Use this approach when querying data from complex nested and repeated fields.

Filtering Nested Fields

Going back to our 2nd querying example, using dot notation, we can apply a similar approach to filtering on simply nested fields using the WHERE clause. Let’s retry the same query, adding a WHERE statement to filter for the device.category is equal to “desktop” (!=):

SELECT  
event_date,
event_timestamp,
event_name,
device.category,
device.operating_system,
device.language
FROM `cloud-advize.analytics_360866743.events_*`
WHERE device.category != "desktop"

We can see that the results returned from this query now remove the device.category of “desktop”, showing “mobile” devices:

Bigquery result from the query SELECT event_date, event_timestamp, event_name, device.category, device.operating_system, device.language FROM `cloud-advize.analytics_360866743.events_*` WHERE device.category != “desktop”

Use this method when you need to filter on a field in a simply nested field.

Filtering Repeated Fields

Similar to our approach to complex queries, repeated fields require an additional step to UNNEST the repeated field and join it back in to the query. But how do we do this in the WHERE clause? A common approach is to construct a subquery within the WHERE clause using the EXISTS statement.

SELECT  
event_date,
event_timestamp,
event_name,
event_params
FROM `cloud-advize.analytics_360866743.events_*`
WHERE EXISTS (
SELECT 1
FROM UNNEST(event_params) as event_params
WHERE event_params.key = "page_location"
AND event_params.value.string_value = "http://localhost/"
)

Although we are returning all event_params, we are filtering for only those records that have the key-value pair of event_params.key = “page_location” and event_params.value.string_value = “http://localhost/”.

Bigquery results from the query SELECT event_date, event_timestamp, event_name, event_params FROM `cloud-advize.analytics_360866743.events_*` WHERE EXISTS ( SELECT 1 FROM UNNEST(event_params) as event_params WHERE event_params.key = “page_location” AND event_params.value.string_value = “http://localhost/” )

Use this method when you need to filter on certain fields within a repeated nested field.

Filtering with Multiple Conditions

There may be times when you need to filter on multiple conditions across both nested and repeated fields. In this case, we can combine the 2 apporaches above by applying both the dotted notation and UNNEST approach:

SELECT  
event_date,
event_timestamp,
event_name,
device.language,
event_params
FROM `cloud-advize.analytics_360866743.events_*`
WHERE EXISTS (
SELECT 1
FROM UNNEST(event_params) as event_params
WHERE event_params.key = "page_location"
AND event_params.value.string_value = "http://localhost/"
)
AND device.language = "en-us"

We can see in the results, that the additional filter of device.language = “en-us” has been applied (we also added device.language to the SELECT statement just to show in the results).

Performance Considerations

Avoid Excessive Unnesting:

Unnesting large arrays can significantly impact query performance. If possible, filter the array before unnesting.

Best Practices

Clear and Concise Naming:

Use descriptive names for nested fields to improve code readability.

Document Your Schema:

Clearly document the structure of your nested data to make it easier for others to understand and query.

Conclusion

Effectively querying and filtering nested and repeated fields in BigQuery is crucial for extracting meaningful insights from complex datasets. By understanding the structure of these fields and employing appropriate techniques like dot notation and the UNNEST operator, you can efficiently access and analyze the information you need. Remember to consider performance implications, especially when dealing with large datasets, and always strive for clear and concise data modeling and documentation to ensure smooth and successful data analysis.

--

--

Google Cloud - Community
Google Cloud - Community

Published in Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

No responses yet