Working with JSON in Clickhouse
There’s plenty of cases when we can’t define data structure in advance due to its dynamic nature. Data objects can have different set of features from time to time or depending on other circumstances. In such cases we can use JSON to save dynamic data and Clickhouse provides handy tools to work with that.
Storing JSON data
The simplest (and single as of now) way to store JSON object is to create String
typed column and store JSON text representation inthere:
CREATE TABLE test_string ( `t` DateTime, `v` String )
ENGINE = MergeTree ORDER BY t
We’re going to save JSON data under v
column:
INSERT INTO test_string VALUES(now(), '{"name":"Joe","age":95}')
Clickhouse provides several functions to work with JSON data, including validating JSON data itself.
Validating and checking keys
We can simply check if JSON is valid using isValidJSON
function:
SELECT v, isValidJSON(v) FROM test_string
Which returns 1
(or 0
) if JSON is valid (or not):
We can also check if object contains certain key (or any of given keys) which is useful during data cleansing procedures:
SELECT
JSONHas(v, 'name') AND JSONHas(v, 'rating') AS is_valid,
count(*)
FROM test_string GROUP BY is_valid
Here we require our JSON field to have name
and rating
keys defined to be marked as valid:
As we can see, our table contains 2 invalid values.
Extracting values
In most cases we want to operate with JSON attributes (keys) values, which can be done in multiple ways. First of all there is a function to extract typed values based on key names:
SELECT
JSONExtract(v, 'name', 'String'),
JSONExtract(v, 'rating', 'UInt32')
FROM test_string LIMIT 5
Here we extract name
key value as string and rating
key value as [unsigned 4 bytes] integer from v
column value (which stores JSON):
We can extract scalar types (like String
, Int/UInt*
or Float*
) or complex structures, like Array
or Tuple
(or combinations of any):
SELECT JSONExtract('{"val": [1,2,3,4]}', 'val', 'Array(UInt8)')[2]
Which will return second integer value from val
array (which is a JSON key):
Another option to extract value (without having to think about types) is to use JSON_VALUE
function:
SELECT JSON_VALUE(v, '$.name') FROM test_string LIMIT 5
Which, again, will simply list value of name
key of v
JSON column:
Using JSON key values for indexing
As we know, Clickhouse allows using function to build sorting keys based on. That’s why we can use JSON extraction functions in indexes to optimize certain queries:
CREATE TABLE test_index ( `t` Int64, `v` String )
ENGINE = MergeTree ORDER BY JSONExtractUInt(v, 'rating')
Here, we’ve used extracted rating
key value as a sorting key, so the relevant queries will be quite efficient:
SELECT count(*) FROM test_index
WHERE JSONExtractUInt(v, 'rating') = 140970729
Clickhouse will answer in the very efficient way in this case since it uses index for that:
On the contrary, if we filter on the JSON key which is not under index, full table scan follows. For example, the same query on the same table/data, but with a different sort index results in the following:
250x rows scanned, 250x RAM used and 35x time spent. Not good.
JSON keys or separate columns?
If you know your JSON field is going to have exactly certain keys of exactly certain types, it is better to move them out of JSON value into separate columns before writing to Clickhouse. Doing so will save space (because of saving data using relevant types instead of strings) and improve performance (since no JSON extraction overhead will happen).
Experimental JSON object type
Clickhouse has experimental JSON type which is too early to consider for production, but let’s take a look at that. Since the feature is experimental we have to:
SET allow_experimental_object_type = 1
Now we can use JSON
type while creating tables:
CREATE TABLE test_json ( `t` DateTime, `v` JSON )
ENGINE = MergeTree ORDER BY t
First cool thing about this type, is that we can directly use object notation in queries:
SELECT v.name, v.rating FROM test_json LIMIT 5
And Clickhouse knows what to do:
Second cool thing about this type is that it takes less space (~30%
saved in our case) than String
field usage:
But one downside here is the speed of inserting data. If we compare insertion speed of the same data into String
and JSON
columns we can see almost 10 fold difference:
Anyway, this is still experimental, so hopefully Clickhouse team will document (in details) and release this soon.
Summary
Working with JSON in Clickhouse is as simple as using String
field for data and then using set of JSON*
functions to check and extract key values from JSON fields:
SELECT JSONExtract(json_col, 'key', 'String') FROM table
We can use extraction functions as indexes (or part of indexes) to improve query performance, but consider moving those to separate columns instead and leave only dynamic set of keys for storing in JSON columns.