Extracting JSON key value pairs in BigQuery
Short Intro to JSON in BigQuery
BigQuery introduced the ability to deal with JSON structures a while ago. It supports both string values that represent JSON structures and a native JSON data type. You can find a good introduction here.
BigQuery offers the ability to query JSON (and strings containing JSON) using path expressions such as this:
JSON_EXTRACT(json_field, "$.path.to.my.data")
Or, in short, with the native JSON type:
jsonfield.path.to.my.data
There are a number of convenience functions for extracting data out of JSON, but there are also a few constraints.
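To illustrate what such a path lookup does, here is a minimal sketch in plain JavaScript (illustrative only; the function name `extractPath` is mine, and BigQuery's JSON_EXTRACT implements much more of the JSONPath grammar):

```javascript
// Walk a "$.path.to.my.data" style path through a parsed JSON object
// and return the value as a JSON-encoded string, like JSON_EXTRACT does.
function extractPath(jsonStr, path) {
  // strip the leading "$." and walk the object one key at a time
  const keys = path.replace(/^\$\.?/, "").split(".").filter(Boolean);
  let node = JSON.parse(jsonStr);
  for (const key of keys) {
    if (node == null || typeof node !== "object") return null;
    node = node[key];
  }
  return node === undefined ? null : JSON.stringify(node);
}

extractPath('{"path": {"to": {"my": {"data": 42}}}}', "$.path.to.my.data"); // "42"
```

Note that the result comes back JSON-encoded, so string values keep their quotes; this matters later when comparing extracted values.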
The Challenge
Recently I had a situation where we needed to extract the entries of JSON dictionaries whose keys varied from record to record:
{
"temp": 30, "elevation":120
}
{
"Ambience": "high", "sound":"fair"
}
{
"temp": 10, "sound":"medium"
}
The naive way would be to use:
JSON_EXTRACT(json_expr, "$.temp")
Unfortunately the keys were only known at runtime, so it was impossible to hard-code a JSON path.
A first approach started out with a list of possible keys:
DECLARE wanted_keys array<STRING>
DEFAULT ARRAY["temp","Ambience", "sound"];
Then one could iterate over wanted_keys to extract the values:
SELECT
city_record.city,
wanted_key.name,
JSON_EXTRACT(city_record.payload, '$.'||wanted_key.name)
As it turns out, we cannot dynamically build the path argument of the JSON_EXTRACT function: it has to be a string literal or a query parameter, i.e. it must be constant.
So we introduced a custom JavaScript UDF to extract a value based on a dynamic key:
CREATE TEMP FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING) RETURNS STRING
LANGUAGE js AS """
try {
  var obj = JSON.parse(json);
  // look the key up directly instead of evaluating a constant path
  if (obj && json_path && obj.hasOwnProperty(json_path)) {
    return JSON.stringify(obj[json_path]);
  }
  return null;
} catch (e) { return null; }
""";
The query can now iterate over both the data records and the wanted keys and extract one row at a time:
FOR city_record IN (SELECT city, payload FROM DATA)
DO
FOR wanted_key IN (SELECT * FROM UNNEST(wanted_keys) AS name)
DO
IF city_record.payload LIKE '%"'||wanted_key.name||'"%'
THEN
INSERT INTO results (
SELECT city_record.city, wanted_key.name,
CUSTOM_JSON_EXTRACT(city_record.payload, wanted_key.name));
END IF;
END FOR;
END FOR;
This works, but it has one glaring problem: every INSERT INTO results
statement inside this double loop is a separate BigQuery job! If you process 1,000 rows with 3 possible keys, you will start up to 3,000 BigQuery jobs, and they will be executed sequentially. All of BigQuery's power of parallel processing is gone!
A Better Solution
The better solution turned out to be:
- create a JavaScript UDF that turns the dictionary into a list of key-value pairs
- cross join these with the original row to create one row per key
- join these rows with the wanted_keys list
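Before writing the SQL, the three steps can be sketched in plain JavaScript on the sample data (variable names here are mine):

```javascript
// Sketch of the pipeline: explode each payload into key/value pairs,
// one row per pair, then keep only the keys we are interested in.
const rows = [
  { city: "NYC", payload: '{"temp": 30, "elevation":120}' },
  { city: "SF",  payload: '{"Ambience": "high", "sound":"fair"}' },
  { city: "LA",  payload: '{"temp": 10, "sound":"medium"}' },
];
const wantedKeys = new Set(["temp", "Ambience", "sound"]);

// steps 1 + 2: turn each dictionary into pairs and expand to one row per pair
const allPairs = rows.flatMap(({ city, payload }) =>
  Object.entries(JSON.parse(payload)).map(
    ([key, value]) => ({ city, key, value: JSON.stringify(value) })));

// step 3: the join with wanted_keys is just a filter here
const result = allPairs.filter((r) => wantedKeys.has(r.key));
// → e.g. { city: "NYC", key: "temp", value: "30" }, … (elevation is dropped)
```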
-- extract all key value pairs as an array from a json dict
-- input: json string with a dictionary
-- returns: list of struct <key, value>
CREATE TEMP FUNCTION EXTRACT_KV_PAIRS(json_str STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>>
LANGUAGE js AS """
try {
  const json_dict = JSON.parse(json_str);
  return Object.entries(json_dict).map(
    ([k, v]) => ({key: k, value: JSON.stringify(v)}));
} catch (e) { return [{key: "error", value: String(e)}]; }
""";
-- make our sample data
WITH
make_table AS(
SELECT "NYC" as City, '{"temp": 30, "elevation":120}' as payload
UNION ALL
SELECT "SF" , '{"Ambience": "high", "sound":"fair"}'
UNION ALL
SELECT "LA" , '{"temp": 10, "sound":"medium"}'
),
-- make the sample wanted keys
wanted_keys AS(
SELECT * FROM
UNNEST(ARRAY["temp","Ambience", "sound"])
AS wanted
),
-- turn the dictionary into a list
extracted_pairs as (
SELECT city,
EXTRACT_KV_PAIRS(payload) as kv_list
FROM make_table
),
-- now expand the list into individual rows
all_pairs AS(
SELECT city, kv_pair.key AS key, kv_pair.value AS value
FROM extracted_pairs
CROSS JOIN UNNEST (kv_list) as kv_pair
)
-- now restrict it to those rows
-- that match the keys we are looking for
SELECT city, key, value
FROM all_pairs
JOIN wanted_keys on key = wanted
This solution runs at full BigQuery speed: where the previous solution would run in the range of minutes or even hours, this one finishes in a few seconds.