Clickjhouse JSON

Working with JSON in Clickhouse

Denys Golotiuk
DataDenys
Published in
5 min readOct 26, 2022

--

There’s plenty of cases when we can’t define data structure in advance due to its dynamic nature. Data objects can have different set of features from time to time or depending on other circumstances. In such cases we can use JSON to save dynamic data and Clickhouse provides handy tools to work with that.

Storing JSON data

The simplest (and single as of now) way to store JSON object is to create String typed column and store JSON text representation inthere:

CREATE TABLE test_string ( `t` DateTime, `v` String )
ENGINE = MergeTree ORDER BY t

We’re going to save JSON data under v column:

INSERT INTO test_string VALUES(now(), '{"name":"Joe","age":95}')

Clickhouse provides several functions to work with JSON data, including validating JSON data itself.

Validating and checking keys

We can simply check if JSON is valid using isValidJSON function:

SELECT v, isValidJSON(v) FROM test_string

Which returns 1 (or 0) if JSON is valid (or not):

We can also check if object contains certain key (or any of given keys) which is useful during data cleansing procedures:

SELECT
JSONHas(v, 'name') AND JSONHas(v, 'rating') AS is_valid,
count(*)
FROM test_string GROUP BY is_valid

Here we require our JSON field to have name and rating keys defined to be marked as valid:

As we can see, our table contains 2 invalid values.

Extracting values

In most cases we want to operate with JSON attributes (keys) values, which can be done in multiple ways. First of all there is a function to extract typed values based on key names:

SELECT
JSONExtract(v, 'name', 'String'),
JSONExtract(v, 'rating', 'UInt32')
FROM test_string LIMIT 5

Here we extract name key value as string and rating key value as [unsigned 4 bytes] integer from v column value (which stores JSON):

We can extract scalar types (like String, Int/UInt* or Float*) or complex structures, like Array or Tuple (or combinations of any):

SELECT JSONExtract('{"val": [1,2,3,4]}', 'val', 'Array(UInt8)')[2]

Which will return second integer value from val array (which is a JSON key):

Another option to extract value (without having to think about types) is to use JSON_VALUE function:

SELECT JSON_VALUE(v, '$.name') FROM test_string LIMIT 5

Which, again, will simply list value of name key of v JSON column:

Using JSON key values for indexing

As we know, Clickhouse allows using function to build sorting keys based on. That’s why we can use JSON extraction functions in indexes to optimize certain queries:

CREATE TABLE test_index ( `t` Int64, `v` String )
ENGINE = MergeTree ORDER BY JSONExtractUInt(v, 'rating')

Here, we’ve used extracted rating key value as a sorting key, so the relevant queries will be quite efficient:

SELECT count(*) FROM test_index
WHERE JSONExtractUInt(v, 'rating') = 140970729

Clickhouse will answer in the very efficient way in this case since it uses index for that:

On the contrary, if we filter on the JSON key which is not under index, full table scan follows. For example, the same query on the same table/data, but with a different sort index results in the following:

250x rows scanned, 250x RAM used and 35x time spent. Not good.

JSON keys or separate columns?

If you know your JSON field is going to have exactly certain keys of exactly certain types, it is better to move them out of JSON value into separate columns before writing to Clickhouse. Doing so will save space (because of saving data using relevant types instead of strings) and improve performance (since no JSON extraction overhead will happen).

Experimental JSON object type

Clickhouse has experimental JSON type which is too early to consider for production, but let’s take a look at that. Since the feature is experimental we have to:

SET allow_experimental_object_type = 1

Now we can use JSON type while creating tables:

CREATE TABLE test_json ( `t` DateTime, `v` JSON )
ENGINE = MergeTree ORDER BY t

First cool thing about this type, is that we can directly use object notation in queries:

SELECT v.name, v.rating FROM test_json LIMIT 5

And Clickhouse knows what to do:

Second cool thing about this type is that it takes less space (~30% saved in our case) than String field usage:

But one downside here is the speed of inserting data. If we compare insertion speed of the same data into String and JSON columns we can see almost 10 fold difference:

Anyway, this is still experimental, so hopefully Clickhouse team will document (in details) and release this soon.

Summary

Working with JSON in Clickhouse is as simple as using String field for data and then using set of JSON* functions to check and extract key values from JSON fields:

SELECT JSONExtract(json_col, 'key', 'String') FROM table

We can use extraction functions as indexes (or part of indexes) to improve query performance, but consider moving those to separate columns instead and leave only dynamic set of keys for storing in JSON columns.

--

--

Denys Golotiuk
DataDenys

Data-intensive apps engineer, tech writer, opensource contributor @ github.com/mrcrypster