Extracting JSON key value pairs in BigQuery

Stephan Meyn
Google Cloud - Community
3 min read · Oct 24, 2022

Short Intro to JSON in BigQuery

BigQuery introduced the ability to work with JSON structures a while ago. It supports both string values that contain JSON as well as a native JSON data type. You can find a good introduction here.

BigQuery offers the ability to query JSON (and strings containing JSON) using path expressions such as this:

JSON_EXTRACT(json_field, "$.path.to.my.data") 

Or, using the shorthand dot notation available for the JSON data type:

jsonfield.path.to.my.data

There are a number of convenience functions for extracting data out of JSON, but there are also a few constraints.

The Challenge

Recently I had a situation where we needed to extract entries from a JSON dictionary such as these:

{"temp": 30, "elevation": 120}
{"Ambience": "high", "sound": "fair"}
{"temp": 10, "sound": "medium"}

The naive way would be to use:

JSON_EXTRACT(json_expr, "$.temp")

Unfortunately the keys were not known ahead of time, so it was impossible to hard-code a JSON path.

A first approach started out with a list of possible keys:

DECLARE wanted_keys array<STRING> 
DEFAULT ARRAY["temp","Ambience", "sound"];

Then one could iterate over wanted_keys to extract the values:

SELECT
  city_record.city,
  wanted_key.name,
  JSON_EXTRACT(city_record.payload, '$.' || wanted_key.name)

As it turns out, we cannot build the path argument of the JSON_EXTRACT function dynamically. It has to be a string literal or a query parameter, i.e. a constant known before the query runs.

So we introduced a custom JavaScript function (UDF) to extract a value based on a dynamic key:

CREATE TEMP FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_key STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    var obj = JSON.parse(json);
    if (obj && json_key && obj.hasOwnProperty(json_key)) {
      return JSON.stringify(obj[json_key]);
    }
    return null;
  } catch (e) {
    return null;
  }
""";
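The body of this UDF can be sanity-checked outside BigQuery. Here is a plain-JavaScript sketch of the same logic (the camel-case function name is mine, for illustration only): parse the string, look up a single top-level key, and return its value re-serialized as JSON, or null on any failure.

```javascript
// Guarded single-key lookup, mirroring the UDF body:
// returns the JSON-serialized value for `key`, or null if the key is
// missing, the input is not valid JSON, or the input is empty.
function customJsonExtract(json, key) {
  try {
    const obj = JSON.parse(json);
    if (obj && key && Object.prototype.hasOwnProperty.call(obj, key)) {
      return JSON.stringify(obj[key]);
    }
    return null;
  } catch (e) {
    return null;
  }
}

customJsonExtract('{"temp": 30, "elevation": 120}', 'temp'); // '30'
customJsonExtract('{"temp": 30}', 'sound');                  // null
customJsonExtract('not json', 'temp');                       // null
```

Note that the value comes back as a JSON string ('30', '"fair"'), matching what JSON_EXTRACT returns for scalar values.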

The script can now iterate over both the data records and the wanted keys, extracting one row at a time:

FOR city_record IN (SELECT city, payload FROM DATA)
DO
  FOR wanted_key IN (SELECT * FROM UNNEST(wanted_keys) AS name)
  DO
    IF city_record.payload LIKE '%"' || wanted_key.name || '"%'
    THEN
      INSERT INTO results (
        SELECT city_record.city, wanted_key.name,
          CUSTOM_JSON_EXTRACT(city_record.payload, wanted_key.name));
    END IF;
  END FOR;
END FOR;

This works fine, but it has one glaring problem: every INSERT INTO results statement inside this double loop is a separate BigQuery job! If you process 1,000 rows with 3 possible keys, you start up to 3,000 BigQuery jobs, and they are executed sequentially. All of BigQuery's power of parallel processing is gone!

A Better Solution

The better solution turned out to be to:

  • create a JavaScript function that turns the dictionary into a list of key-value pairs
  • cross join these with the original row to create one row per key
  • join these rows with the wanted_keys list

-- extract all key-value pairs as an array from a JSON dict
-- input: JSON string containing a dictionary
-- returns: list of STRUCT<key, value>
CREATE TEMP FUNCTION EXTRACT_KV_PAIRS(json_str STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>>
LANGUAGE js AS """
  try {
    const json_dict = JSON.parse(json_str);
    return Object.entries(json_dict).map(
      ([k, v]) => ({ key: k, value: JSON.stringify(v) })
    );
  } catch (e) {
    return [{ key: 'error', value: String(e) }];
  }
""";
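Again, the conversion at the heart of this UDF is easy to verify in plain JavaScript. This sketch (illustrative, not the deployed function) shows how Object.entries turns a dictionary into the ARRAY<STRUCT<key, value>> shape the SQL expects:

```javascript
// Mirror of EXTRACT_KV_PAIRS: turn a JSON dictionary string into an
// array of {key, value} objects, with each value re-serialized as JSON.
function extractKvPairs(jsonStr) {
  try {
    const dict = JSON.parse(jsonStr);
    return Object.entries(dict).map(
      ([k, v]) => ({ key: k, value: JSON.stringify(v) })
    );
  } catch (e) {
    // surface parse failures as a sentinel row instead of failing the query
    return [{ key: 'error', value: String(e) }];
  }
}

extractKvPairs('{"temp": 30, "elevation": 120}');
// returns [{ key: 'temp', value: '30' }, { key: 'elevation', value: '120' }]
```

Returning a sentinel "error" row on bad input is a design choice: it keeps one malformed payload from failing the whole query, while leaving a visible trace in the output.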
-- make our sample data
WITH make_table AS (
  SELECT "NYC" AS city, '{"temp": 30, "elevation":120}' AS payload
  UNION ALL
  SELECT "SF", '{"Ambience": "high", "sound":"fair"}'
  UNION ALL
  SELECT "LA", '{"temp": 10, "sound":"medium"}'
),
-- make the sample wanted keys
wanted_keys AS (
  SELECT * FROM UNNEST(ARRAY["temp", "Ambience", "sound"]) AS wanted
),
-- turn the dictionary into a list
extracted_pairs AS (
  SELECT city, EXTRACT_KV_PAIRS(payload) AS kv_list
  FROM make_table
),
-- now expand the list into individual rows
all_pairs AS (
  SELECT city, kv_pair.key AS key, kv_pair.value AS value
  FROM extracted_pairs
  CROSS JOIN UNNEST(kv_list) AS kv_pair
)
-- now restrict it to those rows
-- that match the keys we are looking for
SELECT city, key, value
FROM all_pairs
JOIN wanted_keys ON key = wanted
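The overall shape of the pipeline (extract pairs, expand to one row per key, filter by the wanted keys) can be simulated end-to-end in plain JavaScript, using the same sample data. This is only a model of what the SQL does, not part of the BigQuery solution:

```javascript
// Sample rows and wanted keys matching the query above.
const rows = [
  { city: 'NYC', payload: '{"temp": 30, "elevation":120}' },
  { city: 'SF',  payload: '{"Ambience": "high", "sound":"fair"}' },
  { city: 'LA',  payload: '{"temp": 10, "sound":"medium"}' },
];
const wantedKeys = ['temp', 'Ambience', 'sound'];

// extracted_pairs + CROSS JOIN UNNEST + JOIN wanted_keys, in one pass:
// one output row per (city, key) where the key is in the wanted list.
const result = rows.flatMap(({ city, payload }) =>
  Object.entries(JSON.parse(payload))
    .filter(([key]) => wantedKeys.includes(key))
    .map(([key, value]) => ({ city, key, value: JSON.stringify(value) }))
);
// five rows: NYC/temp, SF/Ambience, SF/sound, LA/temp, LA/sound
// ("elevation" is dropped because it is not a wanted key)
```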

This solution runs at full BigQuery speed. Where the previous solution would run in the range of minutes or even hours, this one completes in a few seconds.
