Transforming Nested JSONs into an ARRAY of STRUCTs in BigQuery

Leigha Jarett
Google Cloud - Community
2 min readJun 3, 2021

Sometimes your data lands in BigQuery as a nested JSON string. One example might be an a column that has one key for each item — with details about the item nested as shown below.

{
"item_1": {
"cost": 100,
"name": "My Product",
"category": "My Category"
},
"item_2": {
"cost": 150,
"name": "My Other Product",
"category": "My Other Category"
}
}

This format can be difficult to work with in SQL. Let’s consider looking at the average cost for each item category. To extract the nested values, we need to know the actual key we’re indexing. For example, if we want to grab the cost for item_1 we could leverage the JSON_EXTRACT_SCALAR function:

SELECT JSON_EXTRACT_SCALAR(items, '$.item_1.cost')
FROM my_table

But to grab all of the category and cost fields we would need to know each item key, which can be tedious or even unreasonable depending on the number of items.

SELECT category, AVG(cost) as average_cost
FROM
(SELECT
JSON_EXTRACT_SCALAR(items, '$.item_1.category') as category,
JSON_EXTRACT_SCALAR(items, '$.item_1.cost') as cost,
FROM my_table
UNION ALL
SELECT
JSON_EXTRACT_SCALAR(items, '$.item_2.category') as category,
JSON_EXTRACT_SCALAR(items, '$.item_2.cost') as cost,
FROM my_table)
GROUP BY 1

What we really want is to restructure our data so that instead of having nested JSONs, we have an ARRAY of STRUCTS.

[
{
"cost": 100,
"name": "My Product",
"category": "My Category"
},
{
"cost": 150,
"name": "My Other Product",
"category": "My Other Category"
}
]

An ARRAY represents a repeated field, meaning each object in the list is kind of like its own record (or row) in a table. When our data is restructured into an ARRAY, we can easily leverage the UNNEST function to essentially split each object into its own row — which we can then query using standard SQL.

SELECT category, cost
FROM my_table, UNNEST(items)

If there is some common naming structure we could potentially get by using the SPLIT function with some regex to break each item JSON into its own object in the ARRAY. But this might not work for every use case.

A better option may be to use a user defined function to iterate through each key in the JSON. Below I defined a temporary Javascript UDF that grabs all the keys in the object (e.g. item_1, item_2) and iterates through each one to select the JSON and add it to an array.

I didn’t want to have to define the schema for the nested STRUCT so I actually just left the output as an ARRAY of STRINGs. But, we can use BigQuery’s JSON functions to grab the field we’re interested in and aggregate up like we would if the STRING was itself a STRUCT!

CREATE TEMP FUNCTION unnest_json(str STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
var obj = JSON.parse(str);
var keys = Object.keys(obj);
var arr = [];
for (i = 0; i < keys.length; i++) {
arr.push(JSON.stringify(obj[keys[i]]));
}
return arr;
""";
SELECT
JSON_EXTRACT_SCALAR(itms,'$.category') as category, AVERAGE(JSON_EXTRACT_SCALAR(itms,'$.cost')) as cost from my_table, UNNEST(unnest_json(items)) as itms
GROUP BY 1

For more BigQuery content, follow me on LinkedIn and Twitter @leighajarett!

--

--