BigQuery Adventures in Structure Land

Learn how to create and change structures in BigQuery

Lee Doolan
Appsbroker CTS Google Cloud Tech Blog
8 min read · Nov 29, 2022


https://labs.openai.com — futuristic picture of structure

Introduction

In modern databases, data is no longer always stored in simple scalar columns. Often it’s stored in formats akin to JSON, where deep structures of arrays and mapped properties are nested within each other.

This functionality is a great advance in data management, and luckily a modern data platform like BigQuery contains all the functions we need to get at the data.

You can read more on how to work with arrays and structs here, and references to the array functions here, using the as-always excellent Google documentation.

When working with clients we often experience scenarios where incoming data needs to be restructured to a more usable format, and preferably within BigQuery itself.

In this post I will detail how we can first create a sample schema from a very raw data format, and then manipulate it to achieve different required outputs using several techniques and functions.

Sample Data

For the purposes of this blog we will use some very current (at the time of writing!) and public data, in this case the very recent first round matches of the group stage of the 2022 World Cup in Qatar. To summarise, this equates to 16 football matches across 5 match days.

Raw Data

The raw data we will use is as below:

2022-11-20|16:00:00|Qatar|0|Ecuador|2#2022-11-21|13:00:00|England|6|Iran|2#2022-11-21|16:00:00|Senegal|0|Netherlands|2#2022-11-21|19:00:00|USA|1|Wales|1#2022-11-22|10:00:00|Argentina|1|Saudi Arabia|2#2022-11-22|13:00:00|Mexico|0|Poland|0#2022-11-22|16:00:00|Denmark|0|Tunisia|0#2022-11-22|19:00:00|France|4|Australia|1#2022-11-23|10:00:00|Morocco|0|Croatia|0#2022-11-23|13:00:00|Germany|1|Japan|2#2022-11-23|16:00:00|Spain|7|Costa Rica|0#2022-11-23|19:00:00|Belgium|1|Canada|0#2022-11-24|10:00:00|Switzerland|1|Cameroon|0#2022-11-24|13:00:00|Uruguay|0|South Korea|0#2022-11-24|16:00:00|Portugal|3|Ghana|2#2022-11-24|19:00:00|Brazil|2|Serbia|0

This is a single-line string where each match is delimited by a #, and the fields within a match by a |.

# Example single match record, field delimiter = |, row delimiter = #
2022-11-20|16:00:00|Qatar|0|Ecuador|2#

# maps to these 6 fields:
Match Date|Match Time|Country_1|Country_1_Score|Country_2|Country_2_Score#
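To sanity-check the delimiters, BigQuery’s SPLIT function can break a single record into its fields. A minimal illustrative query:

```sql
-- Split one match record on the field delimiter.
SELECT SPLIT('2022-11-20|16:00:00|Qatar|0|Ecuador|2', '|') AS row_fields;
-- Returns a 6-element ARRAY<STRING>:
-- ['2022-11-20', '16:00:00', 'Qatar', '0', 'Ecuador', '2']
```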

Sample Data Structure

We will convert this raw data to our ‘base’ sample data structure which we will use for our follow-on scenario examples.

We will load our raw data into the deliberately rather cumbersome schema below, containing 3 repeated array fields that are ordered but unlinked: one for the match time, and one for each match participant country.

Base sample data structure

SQL Code
I include the SQL below which you can use to create the base sample data table for the later scenario examples.

CREATE OR REPLACE TABLE `your-project-id.your-dataset.sample_data`
AS
WITH raw_data AS (
  SELECT '2022-11-20|16:00:00|Qatar|0|Ecuador|2#2022-11-21|13:00:00|England|6|Iran|2#2022-11-21|16:00:00|Senegal|0|Netherlands|2#2022-11-21|19:00:00|USA|1|Wales|1#2022-11-22|10:00:00|Argentina|1|Saudi Arabia|2#2022-11-22|13:00:00|Mexico|0|Poland|0#2022-11-22|16:00:00|Denmark|0|Tunisia|0#2022-11-22|19:00:00|France|4|Australia|1#2022-11-23|10:00:00|Morocco|0|Croatia|0#2022-11-23|13:00:00|Germany|1|Japan|2#2022-11-23|16:00:00|Spain|7|Costa Rica|0#2022-11-23|19:00:00|Belgium|1|Canada|0#2022-11-24|10:00:00|Switzerland|1|Cameroon|0#2022-11-24|13:00:00|Uruguay|0|South Korea|0#2022-11-24|16:00:00|Portugal|3|Ghana|2#2022-11-24|19:00:00|Brazil|2|Serbia|0' AS txt
)
SELECT
  CAST(row_fields[ORDINAL(1)] AS DATE) AS Match_Date,
  ARRAY_AGG(CAST(row_fields[ORDINAL(2)] AS TIME)) AS Match_Time,
  ARRAY_AGG(
    STRUCT(
      row_fields[ORDINAL(3)] AS Name,
      CAST(row_fields[ORDINAL(4)] AS INT64) AS Score
    )) AS Country_1,
  ARRAY_AGG(
    STRUCT(
      row_fields[ORDINAL(5)] AS Name,
      CAST(row_fields[ORDINAL(6)] AS INT64) AS Score
    )) AS Country_2
FROM (
  SELECT SPLIT(TRIM(match_rows), '|') AS row_fields
  FROM raw_data,
    UNNEST(SPLIT(txt, '#')) AS match_rows
) x
GROUP BY Match_Date
ORDER BY Match_Date;

When you have created the structured sample data table using the code above, let’s pretend we never even had the data in its raw format, purely for learning purposes!

Explanation
Converting the raw data string into this structure involves many elements worth knowing. I’ve hyperlinked each concept to the official Google documentation for convenience.

  • A Common Table Expression (CTE) is used to hold the raw data string and reduce the overall complexity of the query.
  • Using UNNEST(SPLIT(txt, '#')) AS match_rows in the FROM clause, we split the single-line raw data text into the 16 rows we need, one per match. The inner SPLIT converts the string into an array, and the outer UNNEST converts the array to rows.
  • SPLIT(TRIM(match_rows), '|') AS row_fields then splits each match string into a 6-element array of fields.
  • We can then use row_fields[ORDINAL(x)] to access each field, where x denotes the (1-based) field position.
  • The ARRAY_AGG function rolls up our matches into a row per match day. This aggregation then requires a GROUP BY clause.
  • The STRUCT function defines a data type for each country involved in a match, containing a field for the country Name and its Score.
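A side note on array indexing, since it recurs throughout this post: ORDINAL is 1-based while OFFSET is 0-based, and the SAFE_ variants return NULL rather than raising an error on an out-of-range index. A minimal illustration:

```sql
-- ORDINAL is 1-based, OFFSET is 0-based; SAFE_ variants return NULL
-- instead of erroring when the index is out of range.
SELECT
  fields[ORDINAL(1)]       AS first_by_ordinal,  -- '2022-11-20'
  fields[OFFSET(0)]        AS first_by_offset,   -- '2022-11-20'
  fields[SAFE_ORDINAL(99)] AS out_of_range       -- NULL
FROM (SELECT SPLIT('2022-11-20|16:00:00|Qatar|0|Ecuador|2', '|') AS fields);
```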
Base sample data partial extract

Scenarios

Using this badly structured data table I am now going to describe several scenarios, and how to solve them using BigQuery built-in functions.

Scenario 1 / Flatten to 16 rows & reduce to 2 columns

In the first scenario we will flatten this data down to 16 rows, i.e. a row per match played. We will also reduce down to 2 columns, one representing the Match Date & Time, the other representing a textual description of the match result.

Scenario 1 expected data structure

SQL Code
Below is SQL we can use to convert the base sample table to the required scenario output:

WITH unnested_data AS (
  SELECT
    Match_Date,
    Match_Time_Unnested AS Match_Time,
    Country_1[OFFSET(idx)] AS C1,
    Country_2[OFFSET(idx)] AS C2
  FROM `your-project-id.your-dataset.sample_data`,
    UNNEST(Match_Time) AS Match_Time_Unnested WITH OFFSET AS idx
)
SELECT
  DATETIME(Match_Date, Match_Time) AS Match_DateTime,
  CONCAT(C1.Name, ' ', CAST(C1.Score AS STRING), ' ',
         C2.Name, ' ', CAST(C2.Score AS STRING)) AS Match_Descr
FROM unnested_data;

Explanation
The first thing we want to do is explode the existing 5 row match day dataset into the 16 rows representing each individual match. Usually we could do this with a simple unnest, but in this instance we have 3 arrays we need to zip together.

I detail each new step below, covering only what hasn’t been explained previously.

  • We explode the data into 16 rows by unnesting the Match_Time array with UNNEST(Match_Time) AS Match_Time_Unnested WITH OFFSET AS idx. Note the WITH OFFSET clause; we use it next for zipping the other arrays.
  • We then use the offset field, namely idx, to pull the matching element out of each of the other arrays, e.g. Country_1[OFFSET(idx)] for the Country_1 array.
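For contrast, here is the anti-pattern we are avoiding: unnesting two arrays side by side in the FROM clause cross-joins them rather than zipping them. An illustrative sketch against the sample table:

```sql
-- Anti-pattern: each UNNEST in the FROM clause joins against the row,
-- so two unlinked UNNESTs produce a Cartesian product per match day.
-- A day with 4 matches yields 4 x 4 = 16 rows instead of 4.
SELECT Match_Date, c1.Name AS Country_1, c2.Name AS Country_2
FROM `your-project-id.your-dataset.sample_data`,
  UNNEST(Country_1) AS c1,
  UNNEST(Country_2) AS c2;
```

This is exactly why the WITH OFFSET index is needed to line the arrays up element by element.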
Scenario 1 output partial extract

Scenario 2 / Flatten to 32 rows & reduce to 4 columns

In this scenario we will flatten this data down to 32 rows, i.e. a row per match played by each country and their score. We will also reduce down to 4 columns, representing the Match Date, Match Time, Country Name and Country Score.

This will be a very granular format, enabling us to quickly answer many questions such as: which match date and/or time slot delivers the most goals?

Scenario 2 expected data structure

SQL Code
Below is SQL we can use to convert the base sample table to the required scenario output:

WITH unnested_data AS (
  SELECT
    Match_Date,
    Match_Time_Unnested AS Match_Time,
    Country_1[OFFSET(idx)] AS C1,
    Country_2[OFFSET(idx)] AS C2
  FROM `your-project-id.your-dataset.sample_data`,
    UNNEST(Match_Time) AS Match_Time_Unnested WITH OFFSET AS idx
)
SELECT Match_Date, Match_Time,
  C1.Name AS Country, C1.Score AS Score
FROM unnested_data
UNION ALL
SELECT Match_Date, Match_Time,
  C2.Name AS Country, C2.Score AS Score
FROM unnested_data;

Explanation
No in-depth explanation is needed here, as it uses the same CTE as the first scenario to unnest and zip the arrays together.

The final data is then delivered with a UNION ALL statement to explode the match rows into 32, one per country with its score.
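As a taster of what this granular format enables, a simple aggregation answers the “which time slot delivers the most goals?” question. This sketch assumes the Scenario 2 result has been saved to a hypothetical table named `country_scores`:

```sql
-- Illustrative only: assumes the Scenario 2 output was materialised
-- as `your-project-id.your-dataset.country_scores`.
SELECT Match_Time, SUM(Score) AS Total_Goals
FROM `your-project-id.your-dataset.country_scores`
GROUP BY Match_Time
ORDER BY Total_Goals DESC;
```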

Scenario 2 output partial extract

Scenario 3 / Keep 5 rows but reduce other structures to single field

In our final scenario we will reduce the data down to 2 columns, one representing the Match Date, and the other concatenating all the countries together into a comma separated single string field, i.e. to indicate which teams played on each date. We will discard the Match Time and Score fields.

This scenario will be unique in that we will be actually manipulating the data into a less useful structure from a data point of view, simply to demonstrate further techniques and because we can!

Scenario 3 expected data structure

SQL Code
Below is SQL we can use to convert the base sample table to the required scenario output:

SELECT Match_Date,
  ARRAY_TO_STRING(ARRAY(
    SELECT Name FROM UNNEST(Country_1)
    UNION ALL
    SELECT Name FROM UNNEST(Country_2)
    ORDER BY Name
  ), ', ') AS Country_List
FROM `your-project-id.your-dataset.sample_data`
ORDER BY Match_Date;

Explanation
This structure is built by first creating a new array per Match Date containing all of the teams that played on that date, then flattening that array down to a string with a given delimiter. Both steps are detailed below with links to the Google docs:

  • The new array is created using the ARRAY function, which takes a SQL subquery as its parameter to unnest each Country struct, pull out just the Name field, union the two sets together, and finally order the elements alphabetically by country name.
  • This array is then rolled up into a single string using the ARRAY_TO_STRING function, whose second parameter provides the delimiter to use, in this case a comma.
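As an aside, a correlated subquery with STRING_AGG reaches the same result more directly, since it aggregates straight to a delimited string. An alternative sketch, not the approach used above:

```sql
-- STRING_AGG aggregates the names directly to a comma-separated string,
-- with the ordering specified inside the aggregate itself.
SELECT Match_Date,
  (SELECT STRING_AGG(Name, ', ' ORDER BY Name)
   FROM (SELECT Name FROM UNNEST(Country_1)
         UNION ALL
         SELECT Name FROM UNNEST(Country_2))) AS Country_List
FROM `your-project-id.your-dataset.sample_data`
ORDER BY Match_Date;
```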
Scenario 3 output partial extract

Conclusion . . .

Working with records and arrays in BigQuery seems difficult initially but it’s just another way of thinking about data and how it’s stored.

It can be very easy to create Cartesian joins or get yourself in a muddle as you try and unnest multiple arrays, or worse still deep multi-layered nested arrays, ouch!!

Knowing when and where to unnest, and becoming familiar with all the array functions and syntax needed will help expedite your query writing and ensure you write your SQL efficiently.

. . . And further thoughts

All of the examples above make use of fairly established BigQuery data types i.e. repeated fields and records, where a fixed schema is known and exists.

But what if we don’t have a fixed schema? It’s unknown initially? Or it changes? How can we load this data into BigQuery and query it?

In the past I’ve got around this by dumping full JSON structures into a single field and performing schema-on-read queries using BigQuery functions like JSON_QUERY and JSON_VALUE, to pull out elements and unnest arrays etc.
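A rough sketch of that schema-on-read pattern follows; the table name and JSON paths here are purely illustrative:

```sql
-- Illustrative only: `raw_events` holds full JSON documents in a single
-- STRING column, and fields are extracted at query time.
SELECT
  JSON_VALUE(payload, '$.match.date')          AS match_date,
  JSON_VALUE(payload, '$.match.teams[0].name') AS home_team
FROM `your-project-id.your-dataset.raw_events`;
```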

And BigQuery has now introduced a native JSON data type for this scenario, which Google says will make JSON queries easy to use and cost-efficient.

I hope to blog about both these methods in a later post.

Happy unnesting!

P.S. Big thanks to my CTS colleagues tim.ellissmith, James NG & Jen Calland for their help with this publication.

About CTS

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!


Cloud Data Warehouse Architect & Data Engineer | UK Based | https://www.linkedin.com/in/leedoolan77 | Thoughts are my own and not of my employer