Extracting JSON key value pairs in BigQuery
Short Intro to JSON in BigQuery
BigQuery introduced the ability to deal with JSON structures a while ago. It supports both string values that represent JSON structures and a native JSON data type. You can find a good introduction here.
BigQuery offers the ability to query JSON (and strings containing JSON) using path expressions such as this:
JSON_EXTRACT(json_field, "$.path.to.my.data")
Or, in short, with the native JSON type:
jsonfield.path.to.my.data
There are a number of convenience functions for extracting data out of JSON, but there are also a few constraints.
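To illustrate what such a path lookup does, here is a minimal sketch in plain JavaScript (illustrative only; the function name `extractPath` is mine, and BigQuery's JSON_EXTRACT implements much more of the JSONPath grammar):

```javascript
// Walk a "$.path.to.my.data" style path through a parsed JSON object
// and return the value as a JSON-encoded string, like JSON_EXTRACT does.
function extractPath(jsonStr, path) {
  // strip the leading "$." and walk the object one key at a time
  const keys = path.replace(/^\$\.?/, "").split(".").filter(Boolean);
  let node = JSON.parse(jsonStr);
  for (const key of keys) {
    if (node == null || typeof node !== "object") return null;
    node = node[key];
  }
  return node === undefined ? null : JSON.stringify(node);
}

extractPath('{"path": {"to": {"my": {"data": 42}}}}', "$.path.to.my.data"); // "42"
```

Note that the result comes back JSON-encoded, so string values keep their quotes; this matters later when comparing extracted values.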
The Challenge
Recently I had a situation where we needed to extract the entries of JSON dictionaries whose keys varied from record to record:
{
"temp": 30, "elevation":120
}
{
"Ambience": "high", "sound":"fair"
}
{
"temp": 10, "sound":"medium"
}
The naive way would be to use:
JSON_EXTRACT(json_expr, "$.temp")
Unfortunately the keys were only known at runtime, so it was impossible to hard-code a JSON path.
A first approach started out with a list of possible keys:
DECLARE wanted_keys array<STRING>
DEFAULT ARRAY["temp","Ambience", "sound"];
Then one could iterate over wanted_keys to extract the values:
SELECT
city_record.city,
wanted_key.name,
JSON_EXTRACT(city_record.payload, '$.'||wanted_key.name)
As it turns out, we cannot dynamically build the path argument of the JSON_EXTRACT function: it has to be a string literal or a query parameter, i.e. it must be constant.
So we introduced a custom JavaScript UDF to extract a value based on a dynamic key:
CREATE TEMP FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING) RETURNS STRING
LANGUAGE js AS """
try {
  var obj = JSON.parse(json);
  // look the key up directly instead of evaluating a constant path
  if (obj && json_path && obj.hasOwnProperty(json_path)) {
    return JSON.stringify(obj[json_path]);
  }
  return null;
} catch (e) { return null; }
""";
The query can now iterate over both the data records and the wanted keys and extract one row at a time:
FOR city_record IN (SELECT city, payload FROM DATA)
DO
FOR wanted_key IN (SELECT * FROM UNNEST(wanted_keys) AS name)
DO
IF city_record.payload LIKE '%"'||wanted_key.name||'"%'
THEN
INSERT INTO results (
SELECT city_record.city, wanted_key.name,
CUSTOM_JSON_EXTRACT(city_record.payload, wanted_key.name));
END IF;
END FOR;
END FOR;
This works, but it has one glaring problem: every INSERT INTO results
statement inside this double loop is a separate BigQuery job! If you process 1,000 rows with 3 possible keys, you will start up to 3,000 BigQuery jobs, and they will be executed sequentially. All of BigQuery's power of parallel processing is gone!
A Better Solution
The better solution turned out to be:
- create a JavaScript UDF that turns the dictionary into a list of key-value pairs
- cross join these with the original row to create one row per key
- join these rows with the wanted_keys list
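Before writing the SQL, the three steps can be sketched in plain JavaScript on the sample data (variable names here are mine):

```javascript
// Sketch of the pipeline: explode each payload into key/value pairs,
// one row per pair, then keep only the keys we are interested in.
const rows = [
  { city: "NYC", payload: '{"temp": 30, "elevation":120}' },
  { city: "SF",  payload: '{"Ambience": "high", "sound":"fair"}' },
  { city: "LA",  payload: '{"temp": 10, "sound":"medium"}' },
];
const wantedKeys = new Set(["temp", "Ambience", "sound"]);

// steps 1 + 2: turn each dictionary into pairs and expand to one row per pair
const allPairs = rows.flatMap(({ city, payload }) =>
  Object.entries(JSON.parse(payload)).map(
    ([key, value]) => ({ city, key, value: JSON.stringify(value) })));

// step 3: the join with wanted_keys is just a filter here
const result = allPairs.filter((r) => wantedKeys.has(r.key));
// → e.g. { city: "NYC", key: "temp", value: "30" }, … (elevation is dropped)
```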
-- extract all key value pairs as an array from a json dict
-- input: json string with a dictionary
-- returns: list of struct <key, value>
CREATE TEMP FUNCTION EXTRACT_KV_PAIRS(json_str STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>>
LANGUAGE js AS """
try {
  const json_dict = JSON.parse(json_str);
  return Object.entries(json_dict).map(
    ([k, v]) => ({key: k, value: JSON.stringify(v)}));
} catch (e) { return [{key: "error", value: String(e)}]; }
""";
-- make our sample data
WITH
make_table AS(
SELECT "NYC" as City, '{"temp": 30, "elevation":120}' as payload
UNION ALL
SELECT "SF" , '{"Ambience": "high", "sound":"fair"}'
UNION ALL
SELECT "LA" , '{"temp": 10, "sound":"medium"}'
),
-- make the sample wanted keys
wanted_keys AS(
SELECT * FROM
UNNEST(ARRAY["temp","Ambience", "sound"])
AS wanted
),
-- turn the dictionary into a list
extracted_pairs as (
SELECT city,
EXTRACT_KV_PAIRS(payload) as kv_list
FROM make_table
),
-- now expand the list into individual rows
all_pairs AS(
SELECT city, kv_pair.key AS key, kv_pair.value AS value
FROM extracted_pairs
CROSS JOIN UNNEST (kv_list) as kv_pair
)
-- now restrict it to those rows
-- that match the keys we are looking for
SELECT city, key, value
FROM all_pairs
JOIN wanted_keys on key = wanted
This solution runs at full BigQuery speed: where the previous solution would run in the range of minutes or even hours, this one finishes in a few seconds.