Querying JSON using SQL — A comparison of Synapse, BigQuery, Databricks and Snowflake

A practical comparison of schema-on-read support for JSON in popular big data processing engines

Sindre Grindheim
Glitni
6 min read · May 2, 2022

This article was updated on February 25th, 2023, to reflect the support for native JSON types in Google BigQuery.

Whether you’re building a cloud data warehouse or a data lakehouse, you’re bound to end up with loads of JSON files in your data lake, as they’re frequently used for logging data and as output from APIs. Knowing how to analyze these semi-structured files quickly using SQL is an essential skill as a data worker.

However, each of the popular data processing engines implements its SQL dialect with its own variations. This article aims to give a practical comparison of how the same information is extracted using a select few tools, namely Synapse Serverless SQL pool, BigQuery, Databricks SQL and Snowflake.

Case

In order to have a fair comparison, we will be using the same dataset for all data processing engines. The case is some store inventory data courtesy of the Databricks documentation, although slightly simplified for this article.

This dataset was selected because it contains a fair variation of data structures, including both maps of maps and arrays.
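
The full file is not reproduced here, but a minimal, illustrative sketch of what the inventory JSON might look like, inferred from the fields referenced throughout this article (owner, store.book with author, title and reader, store.fruit and store.bicycle), could be:

{
  "owner": "amy",
  "zip code": "94025",
  "store": {
    "book": [
      { "author": "Nigel Rees", "title": "Sayings of the Century" },
      { "author": "Herman Melville", "title": "Moby Dick", "reader": [ { "age": 25, "name": "bob" } ] }
    ],
    "fruit": [
      { "type": "apple", "weight": 8 },
      { "type": "pear", "weight": 9 }
    ],
    "bicycle": { "color": "red", "price": 19.95 }
  }
}

The values are placeholders; the real dataset in the Databricks documentation contains more books and fields.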

What we’d like to do is list the owner of the store along with all the books, where we’re interested in the author and title. In addition, should there be any readers, we’d like to output those too, but in a raw, unprocessed form.

This means that the output should contain one row per book, with the columns Owner, Author, Title and Readers.

Synapse — T-SQL

First out is Synapse Serverless SQL pool, which comes with every Azure Synapse Analytics workspace and is used to query data residing in data lakes. The SQL dialect is T-SQL, Microsoft’s ANSI-compliant dialect of the SQL language.

In T-SQL, JSON is not a native data type, so one must typically use some, or all, of the following functions to extract the required data elements:

  • json_value: extracts a scalar value from a JSON string
  • json_query: extracts an object or an array from a JSON string
  • openjson: a table-valued function that parses a JSON string and returns the objects and properties as rows and columns
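
To illustrate the difference between them on their own, a tiny, made-up example could look like this:

-- JSON_VALUE returns a scalar, JSON_QUERY returns an object or array as a JSON fragment
SELECT
    JSON_VALUE('{"owner": "amy"}', '$.owner')   AS owner_scalar,
    JSON_QUERY('{"book": []}', '$.book')        AS book_array;

-- OPENJSON turns an array into a rowset, one row per element
SELECT value FROM OPENJSON('["first", "second"]');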

When combining the above functions and applying them to the raw data we end up with the following code:

WITH raw_json AS (
    SELECT '<full store_inventory.json content>' AS raw
)
SELECT
    JSON_VALUE(raw, '$.owner')    AS Owner,
    JSON_VALUE(value, '$.author') AS Author,
    JSON_VALUE(value, '$.title')  AS Title,
    JSON_QUERY(value, '$.reader') AS Readers
FROM raw_json,
    -- OPENJSON explodes the book array; its default schema exposes key, value and type,
    -- where value holds one book object per row
    OPENJSON(
        (SELECT * FROM raw_json),
        '$.store.book'
    )

Clearly, this isn’t particularly intuitive if this is the first time you’re using these functions.

But before we look at the engines that have long treated JSON as a native data type, let’s look at how BigQuery’s implementation has evolved.

Google BigQuery

BigQuery is a serverless, multicloud data warehouse provided by Google. Historically, the JSON support offered by BigQuery had a lot of similarities with the T-SQL functions. However, BigQuery now supports JSON as a native data type, which radically simplifies the use of JSON in BigQuery.

Due to the now native support, elements are extracted from fields of the JSON type using dot or bracket notation. There are three ways to create JSON values, illustrated in the short example after this list:

  • Create a JSON literal using SQL
  • parse_json: converts a string to JSON
  • to_json: converts a SQL type to JSON
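
A small, hypothetical illustration of the three approaches (the values are made up):

SELECT
    JSON '{"owner": "amy"}'          AS from_literal,    -- JSON literal
    PARSE_JSON('{"owner": "amy"}')   AS from_string,     -- string to JSON
    TO_JSON(STRUCT('amy' AS owner))  AS from_sql_struct  -- SQL struct to JSON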

In addition, the following functions will come in handy for more advanced use cases (a short illustration follows the list):

  • unnest: returns an array as a table, with one row per element
  • json_query_array: extracts an array and returns it as an ARRAY<JSON>
  • json_value_array: extracts an array of scalar values and returns it as an ARRAY<STRING>
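
For instance, unnesting a simple array of scalars could look like this (again with made-up values):

SELECT title
FROM UNNEST(
    JSON_VALUE_ARRAY(JSON '{"titles": ["Moby Dick", "Sayings of the Century"]}', '$.titles')
) AS title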

This results in the following for our example:

WITH raw_json AS (
    SELECT PARSE_JSON('<full store_inventory.json content>') AS raw
)
SELECT
    raw.owner   AS Owner,
    book.author AS Author,
    book.title  AS Title,
    book.reader AS Readers
FROM
    raw_json,
    -- JSON_QUERY_ARRAY returns the book array as ARRAY<JSON>; UNNEST gives one row per book
    UNNEST(
        JSON_QUERY_ARRAY(raw.store.book)
    ) AS book

As you can see, this is many times easier than the old implementation of JSON support in BigQuery (which is now removed from this article), as well as that of T-SQL.

The implementations in Databricks SQL and Snowflake are closely related to this modern way of treating JSON as a native type. But since each engine has its own variations, some of the functions differ.

Databricks SQL

Next is Databricks SQL, the technology built by Databricks to do “data warehousing on the lakehouse”. In 2022, Databricks SQL adopted the ANSI standard SQL dialect. JSON is natively supported by Databricks, and the change to ANSI made querying JSON data in a modern manner even simpler.

Due to the native support, most elements can be extracted using a natural syntax with either dot notation or bracket notation. A simple example of both would be raw:store.fruit[0]['type'], which gets the type of the first fruit (apple); a query-form version is sketched after the list below. However, for more complex types, the following functions come in handy:

  • from_json: returns a struct value with the provided schema
  • lateral view: used in conjunction with explode to fetch complex types and apply them to the original row
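
As a small sketch, and assuming the same raw_json CTE and the data outline from earlier, the notation example above could be written as a query like this:

WITH raw_json AS (
    SELECT '<full store_inventory.json content>' AS raw
)
SELECT
    raw:store.fruit[0].type       AS first_fruit,  -- dot notation
    raw:store['bicycle']['color'] AS bike_color    -- bracket notation
FROM raw_json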

So how would a Databricks implementation of our inventory data look?

WITH raw_json AS (
    SELECT '<full store_inventory.json content>' AS raw
)
SELECT
    raw:owner   AS Owner,
    book:author AS Author,
    book:title  AS Title,
    book:reader AS Readers
FROM
    raw_json
    -- from_json turns the book array into an array of JSON strings; explode gives one row per book
    LATERAL VIEW EXPLODE(FROM_JSON(raw:store.book, 'array<string>')) AS book

This structure is immediately easier to read and understand. And since this is a very simple test case, the benefit only becomes more apparent as the complexity increases.

Side note: Although the basic JSON functions are simple to use, Databricks needs to know the schema before you can use from_json. The trick here is knowing that each level in a JSON file is typically either a map or an array, so using 'array<string>' or 'map<string,string>' as the schema covers the cases where you need from_json to handle the more complex elements, as sketched below.
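
As a hypothetical illustration, treating the bicycle object from the data sketch above as a generic map could look like this:

WITH raw_json AS (
    SELECT '<full store_inventory.json content>' AS raw
)
SELECT
    bicycle['color'] AS color,
    bicycle['price'] AS price
FROM (
    -- parse the bicycle object as a generic string-to-string map
    SELECT FROM_JSON(raw:store.bicycle, 'map<string,string>') AS bicycle
    FROM raw_json
) AS b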

Snowflake

Finally, let’s look at Snowflake, a data platform powered by an elastic performance engine for “near-unlimited scale”.

Like Databricks, Snowflake has support for both dot notation and bracket notation when traversing semi-structured data, and they work in the same way.

However, an important distinction between the two is how they handle element names that do not conform to the SQL identifier rules, e.g. when there are spaces in the name. In these cases, you need to use double quotes in Snowflake where you would use a backtick (`) in Databricks. An example of this is raw:"zip code".

Some other useful functions to be aware of are listed below, with a short illustration after the list:

  • parse_json: interprets a string as a JSON document, returning the native variant value
  • lateral flatten: flattens compound values into multiple rows
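
As a small, hypothetical illustration of lateral flatten on its own (again assuming the data sketch from earlier), the index and value columns it produces can be queried directly:

WITH raw_json AS (
    SELECT parse_json('<full store_inventory.json content>') AS raw
)
SELECT
    f.index,                  -- position of the element in the array
    f.value:title AS title    -- the element itself, as a variant
FROM raw_json,
    lateral flatten(input => raw:store.book) f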

And applied to our example it becomes:

WITH raw_json AS (
    SELECT parse_json('<full store_inventory.json content>') AS raw
)
SELECT
    raw:owner    AS Owner,
    value:author AS Author,
    value:title  AS Title,
    value:reader AS Readers
FROM
    raw_json,
    -- flatten produces one row per book; its value column holds the book element
    lateral flatten(input => raw:store.book)

As you can see, this is very similar to both the BigQuery and Databricks SQL versions, but with a few variations. First of all, we needed to parse the JSON since it was stored as a string, and the implementation of flattening is slightly different, with the book elements exposed through the value column rather than an explicit alias.

Final notes

Traditionally, JSON parsing in databases has been slow, as JSON was not a native type and you had to rely on several cumbersome functions to get exactly what you wanted.

However, this is changing in the newer data processing engines, with BigQuery now also having made the leap to the modern approach.

JSON is finally being treated like a native data type, and is thus easily available not only to SQL experts, but to data analysts at all levels.
