BigQuery Data Warehousekeeping: Nested, Repeated, Arrays, Structs…

Nonna Shakhova
6 min read · May 11, 2023

Cookbook with recipes

Cookbook: how to organize the messy space in your Data Warehouse

Like you, I have a lot of warm memories from my childhood. I used to spend time with my granny in her house. She was a brilliant housekeeper who never treated housekeeping as boring stuff. She was the kind of person who loves to cook, because for her food equals love. Although she was a perfect cook, she was always open to new ideas and followed the best practices. Back then, sharing knowledge was called sharing ‘recipes’. I suppose ages ago every housewife had her own cookbook with her best recipes, as well as recipes from others to try something new. Try, taste, share — this is how best practices spread.

I am exactly like her: if I see something that could improve the way I do things, I try it and share it afterwards. Even if this is not about cinnamon buns but about data. That is why I want to take a couple of minutes of your attention and show you how to design an efficient data warehouse schema.

Now, I will be using the commits table from the BigQuery public dataset github_repos. The dataset covers over 2 million open-source GitHub repositories; the commits table holds commit-level information, such as the commit hash, the author and committer, the commit message, and the names of the repositories each commit belongs to.

INGREDIENTS

Normalized vs Denormalized data

First, let’s take a look at the data we have. Is it fast and easy to fetch when we need it? I am sure everyone wants to store the data in their data warehouse in the most efficient and convenient way.

Access can be achieved by rows, columns, or by rows and then columns. The performance of each approach will vary depending on the query being used.

Transactional databases typically use normalized schemas, which organize data in a relational system. Normalization increases the orderliness of the data and is useful for saving space by avoiding data redundancy. This can make it easier to update information and avoid data inconsistencies. However, querying normalized data may require many table joins to retrieve the necessary information, which can lead to slower performance.

Denormalized data, on the other hand, combines related information into a single table or view. This can make querying faster and simpler, but it can also make updates more difficult and may lead to data inconsistencies if not properly managed. Data is repeated rather than being relational.

Flattened data takes up more storage, but the flattened non-relational organization makes queries more efficient because they can be processed in parallel using columnar processing.

Denormalizing data allows BigQuery to distribute processing more efficiently among slots, resulting in more parallel processing and better query performance. Generally, you would denormalize data before loading it into BigQuery.
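To make the difference concrete, here is a minimal sketch. The tables and columns (commits_normalized, authors, commits_flat and their fields) are hypothetical, for illustration only:

-- Normalized layout: author details live in a separate table,
-- so fetching them requires a join.
SELECT c.commit_hash, a.author_name, a.author_email
FROM mydataset.commits_normalized AS c
JOIN mydataset.authors AS a
  ON c.author_id = a.author_id;

-- Denormalized layout: author details are repeated on every commit row,
-- so the same question needs no join at all.
SELECT commit_hash, author_name, author_email
FROM mydataset.commits_flat;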

It is important to note that denormalizing data can have a negative impact on performance in certain cases, specifically if you have to group by a column with a one-to-many relationship, such as an ‘author’ column in a denormalized commits table. To group the data, it must be shuffled, which can be slow, especially when transferring the data over a network between servers or systems.

So, the good news is that BigQuery has your back: you can get the best of both worlds by using columns with nested and repeated data.

NESTED and REPEATED fields

The ‘author’ field has the ‘RECORD’ type. This declaration allows BigQuery to store and process the data in a way that respects some of the original organization of the data. Specifically, all details about each commit (the author, the date, the difference) are located together, making retrieval of the entire commit more efficient.

Nested and repeated fields are useful for working with data that comes from relational databases. They preserve the original data’s relational qualities and schema while enabling parallel processing of the repeated nested fields. This approach falls between the costly joins of normalized data and completely flat denormalization, which makes these fields the best option for data that already has a relational pattern, and it can noticeably improve BigQuery performance.
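To see what this buys you in practice: sub-fields of a nested column are addressed with simple dot notation, no join required. This query runs as-is against the public commits table:

-- 'author' is a RECORD (STRUCT): its sub-fields are read with dot notation.
SELECT
  commit,
  author.name,
  author.email,
  author.date.seconds
FROM `bigquery-public-data.github_repos.commits`
LIMIT 5;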

Let us have a look at the schema of the commits table:
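If you prefer SQL to screenshots, one way to inspect the schema is the INFORMATION_SCHEMA.COLUMNS view (assuming you have access to query the dataset’s metadata). Nested and repeated fields show up as STRUCT<...> and ARRAY<...> types:

-- List every column of the commits table together with its data type.
SELECT column_name, data_type, is_nullable
FROM `bigquery-public-data.github_repos.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'commits';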

You may see the ‘REPEATED’ mode for some fields. So, what is the difference between ‘STRING REPEATED’ and ‘RECORD REPEATED’?

ARRAY and STRUCT

In the commits schema:

  • trailer is an ARRAY of objects, which corresponds to a repeated record.
  • author is a STRUCT (structured data type) field

An ARRAY can consist of values of any single type, such as an array of floats or an array of strings, while a STRUCT can contain fields with different data types.

In BigQuery, an ARRAY is a data type that allows you to store an ordered list of values. You can think of an array as a single column that contains multiple values. Arrays can be either a regular top-level field or nested inside a struct.

In contrast, a STRUCT groups together related fields, similar to a record or struct in programming languages. A struct can contain other structs nested inside.
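Here is a minimal sketch of both types, built from literal values rather than any real table:

-- An ARRAY is an ordered list of values of a single type;
-- a STRUCT groups named fields that may have different types.
SELECT
  [1, 2, 3] AS numbers,                                 -- ARRAY<INT64>
  STRUCT('Ada' AS name, 1815 AS birth_year) AS person,  -- STRUCT
  [STRUCT('git-svn-id' AS key, '47' AS value)] AS trailers;  -- ARRAY of STRUCTs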

Using structs can be particularly useful when you have a set of columns that you frequently join together. Instead of joining the tables every time you need to use those columns, you can create a struct containing those columns and store it as a single field in a table. This can significantly improve query performance, especially when you have many tables that you need to join together.
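For instance, instead of keeping customer details in a separate table, you could bake them into the table schema itself. This DDL is a sketch with hypothetical dataset, table, and column names:

-- Hypothetical schema: customer details stored as a STRUCT, line items
-- as a repeated STRUCT (an ARRAY), so reading them never requires a join.
CREATE TABLE mydataset.orders (
  order_id STRING,
  customer STRUCT<id STRING, name STRING, email STRING>,
  items ARRAY<STRUCT<sku STRING, quantity INT64, price NUMERIC>>
);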

To see how the data is organized with these types, run the following query:

SELECT * FROM `bigquery-public-data.github_repos.commits` LIMIT 1

Then switch to the JSON view. You will see how well the data is structured in one place:

{
  "commit": "47b50a70ad2020c25c0a6c04fa99badf1b59e39d",
  "tree": "b3769d333429e596524dc78e3841008f7af2ae74",
  "parent": ["3b8bbd3b2b23d1ade0ff3d73834de813a4b20806"],
  "author": {
    "name": "markster",
    "email": "d94b2f759618182e6ebc91f606c23daf0a917827@f38db490-d61c-443f-a65b-d21fe96a405b",
    "time_sec": "942646144",
    "tz_offset": "0",
    "date": {
      "seconds": "942646144",
      "nanos": null
    }
  },
  "committer": {
    "name": "markster",
    "email": "d94b2f759618182e6ebc91f606c23daf0a917827@f38db490-d61c-443f-a65b-d21fe96a405b",
    "time_sec": "942646144",
    "tz_offset": "0",
    "date": {
      "seconds": "942646144",
      "nanos": null
    }
  },
  "subject": "Version 0.2.0 from FTP",
  "message": "Version 0.2.0 from FTP\n\n\ngit-svn-id: f00722e08ab9d6ebab9de8a5201f02b8dc0c8f22@47 f38db490-d61c-443f-a65b-d21fe96a405b\n",
  "trailer": [{
    "key": "git-svn-id",
    "value": "f00722e08ab9d6ebab9de8a5201f02b8dc0c8f22@47 f38db490-d61c-443f-a65b-d21fe96a405b",
    "email": "f00722e08ab9d6ebab9de8a5201f02b8dc0c8f22@47 f38db490-d61c-443f-a65b-d21fe96a405b"
  }],
  "difference": [],
  "difference_truncated": null,
  "repo_name": ["pruiz/asterisk"],
  "encoding": null
}

RECIPES

If you have never used these types and want to try them right now, follow the tutorial above.

In this tutorial you will meet the UNNEST() function. The UNNEST function in BigQuery is used to flatten arrays into individual rows. It can be particularly useful for analyzing data with nested or repeated fields: if you have a table with a column that contains an array of values, you can use UNNEST to create a new row for each value in the array, which lets you analyze the data at a more granular level.
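For example, repo_name in the commits table is a repeated STRING (an ARRAY<STRING>). With UNNEST you can flatten it and count commits per repository. This runs as-is against the public dataset, though be aware that it scans a large table:

-- Cross join each commit row with the elements of its own repo_name array,
-- producing one row per (commit, repository) pair.
SELECT repo, COUNT(*) AS commit_count
FROM `bigquery-public-data.github_repos.commits`,
  UNNEST(repo_name) AS repo
GROUP BY repo
ORDER BY commit_count DESC
LIMIT 10;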

If you enjoyed the tutorial, you may be interested in optimizing query performance and lowering query costs by designing the schema of your tables in some of these ways. It is much more efficient to define your schema using nested repeated fields than to rely on joins.

When dealing with dimension tables that are smaller than 10 gigabytes, it is usually best to keep them normalized. Normalization is especially beneficial when the tables undergo frequent update and delete operations, since each change only needs to be made in one place. However, if you are unable to define your schema in terms of nested repeated fields, you must decide whether to keep the data in two tables or denormalize them into one big flattened table.

Usually, when your tables are smaller than 10 gigabytes, it is recommended that you keep them separate and perform a join.

Learn to think in terms of arrays, but bear in mind that this is not a ‘silver bullet’ — be creative and build the best schema for your data warehouse!
