Working with JSON data in BigQuery

Using the new JSON data type

BigQuery recently announced support for JSON as a native data type. Let's explore what this means and what it makes possible for us.


JSON support in BigQuery is not new

First of all, JSON support in BigQuery is not new. For example, BigQuery has had JSON querying and JSON generation functions for a long time. You can do:

SELECT
  JSON_QUERY(
    '{"user_id": 3, "name": "Alice", "address": {"street": "A Ave", "city": "Aville"}}',
    "$.address.city"),
  TO_JSON_STRING(STRUCT(5 AS user_id, STRUCT('Bob' AS name)))

and this returns the extracted city ("Aville") and a JSON string generated from the STRUCT.

What’s new is that JSON is a data type (plus a set of functions to work with that data type) — think of how Geography is a type, and how that opened up the entire universe of ST_* functions instead of us having to muck around with latitude and longitude calculations ourselves.

JSON data type

You can create a field that is of type JSON in two ways:*

  • By calling PARSE_JSON on a string in JSON format. Use SAFE.PARSE_JSON to have BigQuery return NULL instead of raising an error if the string contains a syntax error.
  • By calling TO_JSON on a STRUCT

This illustrates both approaches:

WITH input_table AS (
  SELECT SAFE.PARSE_JSON('{"user_id": 3, "name": "Alice", "address": {"street": "A Ave", "city": "Aville"}}') AS user
  UNION ALL
  SELECT TO_JSON(STRUCT(5 AS user_id, 'Bob' AS name,
                        STRUCT('B Blvd' AS street, 'Bville' AS city) AS address)) AS user
)
SELECT
  user,
  user['name'],
  INT64(user.user_id),
  user.address.street
FROM input_table
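To see what SAFE.PARSE_JSON buys you, try feeding it a malformed string. A minimal sketch (the input strings here are made up for illustration):

```sql
SELECT
  SAFE.PARSE_JSON('{"user_id": 3')  AS bad_json,   -- malformed: returns NULL
  SAFE.PARSE_JSON('{"user_id": 3}') AS good_json   -- valid: returns a JSON value
```

Without the SAFE. prefix, the malformed string would make the entire query fail with an error.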

The code above also illustrates how to work with the JSON data type:

  • user is a column, so you can select it and get back the entire object.
  • You can pull out pieces of the JSON using subscript syntax such as user['name'].
  • You can also extract data using dot syntax. This returns a JSON value, which you can convert to the type you want: INT64(user.user_id).
  • You can chain dots to reach into nested structures, up to 500 levels deep.

If you have arrays, you can access the elements using [0], [1], etc.
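For example, here is a small sketch (with a made-up document) that combines array subscripts, dot syntax, and the conversion functions:

```sql
WITH t AS (
  SELECT PARSE_JSON('{"user_id": 7, "orders": [{"item": "pen", "qty": 2}, {"item": "ink", "qty": 1}]}') AS doc
)
SELECT
  doc.orders[0].item,           -- JSON value: "pen"
  STRING(doc.orders[0].item),   -- converted to a SQL STRING
  INT64(doc.orders[1].qty)      -- converted to a SQL INT64
FROM t
```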

*There is a third way to create a JSON data type — if you have a table with a column that is defined to be of type JSON, you can INSERT a string into that column and get auto-conversion:

INSERT INTO NEW_TABLE_WITH_JSON_TYPE
SELECT * FROM OLD_TABLE_WITH_STRINGS;
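Spelled out with explicit DDL, that might look like the following sketch (the table and column names are hypothetical):

```sql
-- Destination table with a column declared as type JSON
CREATE TABLE advdata.users_json (user JSON);

-- user_string is a STRING column holding JSON text;
-- it is auto-converted to JSON on insert
INSERT INTO advdata.users_json
SELECT user_string FROM advdata.users_strings;
```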

The JSON data type is more performant

So, if we were already able to query and generate JSON, what’s the point of the JSON data type?

Performance (and by implication, cost).

When you create a column of type JSON, the data is stored in parsed form. So, you get all the benefits of a columnar database.

In order to try this out, let’s create a new BigQuery dataset and simulate a table with JSON values. In one column, we’ll store the JSON as a string and in the other, we’ll store it as a JSON type:

-- START Random generation function
CREATE TEMP FUNCTION MapChar(c INT64) AS (
  CASE
    WHEN c BETWEEN 0 AND 9 THEN 48 + c   -- 0 to 9
    WHEN c BETWEEN 10 AND 35 THEN 55 + c -- A to Z
    ELSE 61 + c                          -- a to z
  END
);
CREATE TEMP FUNCTION random_string(id INT64) AS ((
  SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(MapChar(MOD(c, 62))))
  FROM UNNEST(TO_CODE_POINTS(MD5(CAST(RAND() + id AS STRING)))) AS c
));
-- END Random generation function
CREATE OR REPLACE TABLE advdata.simulated_json AS
WITH input_data AS (
  SELECT STRUCT(id AS user_id,
                STRUCT(random_string(id) AS name, random_string(id) AS city) AS address) AS user
  FROM UNNEST(GENERATE_ARRAY(1, 1000*1000)) AS id
)
SELECT
  TO_JSON(user) AS user_json,
  TO_JSON_STRING(user) AS user_json_string
FROM input_data

The query above generates one million STRUCTs named user, populated with random strings, and stores each one both as a JSON value and as a JSON string.

Let’s write a query the “old” way:

CREATE TEMPORARY FUNCTION INT64(x STRING) AS (
  CAST(JSON_VALUE(x) AS INT64)
);
SELECT AVG(INT64(JSON_QUERY(user_json_string, "$.user_id")))
FROM advdata.simulated_json

The result, as you'd expect, is 500,000.5, but more interesting is the amount of data processed: 80 MB.

That is the size of the entire JSON string column: to extract a single field, BigQuery has to read and parse every complete string.

Now, let’s write the query the new way, taking advantage of the JSON data type and dot syntax to parse it:

SELECT AVG(INT64(user_json.user_id))
FROM advdata.simulated_json

This time, the query will pull only the user_id “sub-column” and end up processing only 7.6 MB of data!

So, if you store logs or other data that arrive as JSON strings in a column whose type is JSON (and not STRING), querying becomes much more efficient.

It is also less expensive since you are processing less data (if you are paying on-demand). If you are paying flat-rate, you use your slots for a shorter time period, so you can do more with the slots you have.

Enjoy!

Thanks to Chris Crosbie for helpful comments and suggestions on this post.
