Mastering Data Manipulation: ClickHouse and JSON Unraveled for Analysts and Developers!

Vedran Rukavina
ReversingLabs Engineering
18 min read · Sep 15, 2023
(Image source: https://pixabay.com/illustrations/analytics-information-innovation-3088958/)

Introduction

Unlocking the potential of data is a significant goal for businesses worldwide. As data volume and complexity increase, the need for efficient data manipulation is evident. ClickHouse and JSON form a dynamic duo that can revolutionize data work.

In this digital age, data holds supreme importance, and knowing how to harness its power is crucial for analysts, programmers, developers, and IT professionals.

This blog explores ClickHouse, an advanced database management system, and its capabilities for handling JSON data structures.

Let’s delve into the world of ClickHouse and JSON to unleash the true potential of your data.

What is JSON?

JSON (JavaScript Object Notation) is a widely used and versatile file format for storing and transmitting data objects. It employs a human-readable text format consisting of attribute-value pairs and arrays, making it ideal for a broad range of data interchange needs, particularly in web applications. Although it originated in JavaScript, JSON is now supported by virtually all modern programming languages because JSON-formatted data is easy to generate and parse.

However, despite its popularity, working with JSON data can be challenging, especially with semi-structured information from sources like IoT, web, and mobile platforms. These sources often lack a conventional relational structure, making it difficult to explore JSON data using traditional business intelligence (BI) and analytics tools. Solutions like ClickHouse are needed to address these challenges efficiently.

What is ClickHouse?

ClickHouse is a column-oriented database management system (DBMS) designed for online analytical processing (OLAP). ClickHouse excels at handling large-scale data, outperforming its counterparts with remarkable speed. It effortlessly processes billions of rows and tens of gigabytes of data per server, per second. In this blog, we’ll explore the latest version (at the time of writing), specifically 23.6.2 revision 54464, along with the ClickHouse client version 23.6.2.18.

Functions in ClickHouse

In ClickHouse, functions play a crucial role in addressing various data processing scenarios. There are two main types: regular functions and aggregate functions, each serving a distinct purpose. Regular functions operate independently on each row, producing results unaffected by other rows. In contrast, aggregate functions take a collective approach, relying on the entire dataset to accumulate values. This fundamental difference enables ClickHouse to efficiently handle a wide range of data processing tasks. To explore ClickHouse functions further, you can find comprehensive resources available at [2].
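
As a quick illustration (using the built-in system.numbers table), a regular function produces one value per input row, while an aggregate function folds the whole input into a single value:

-- Regular function: one result per row
SELECT number, number * 2 AS doubled
FROM system.numbers
LIMIT 3;

-- Aggregate function: a single result for the whole set (0 + 1 + 2 = 3)
SELECT sum(number) AS total
FROM (SELECT number FROM system.numbers LIMIT 3);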

JSON functions

ClickHouse’s JSON functions provide powerful tools to efficiently handle JSON data. These functions rely on strong assumptions about the JSON structure, trading generality for speed and simplicity; this applies in particular to the lightweight simpleJSON* (visitParam*) family used later in this blog. A key assumption is that the field name given as a function argument must be a constant. ClickHouse also assumes the field name is encoded canonically within the JSON and can be found at any nesting level. When multiple fields match, the functions use the first occurrence. Finally, there should be no space characters outside of string literals in the JSON.
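
A minimal sketch of the first-occurrence rule with a deliberately ambiguous document — the simpleJSON* functions scan the raw text, so the nested occurrence wins because it appears first:

SELECT simpleJSONExtractString('{"a":{"name":"first"},"name":"second"}', 'name') AS value;
-- Returns 'first': the field name is matched at any nesting level,
-- and the first occurrence in the document takes priority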

In this blog, we’ll explore practical examples of ClickHouse’s JSON functions, showcasing their versatility and utility. We’ll focus on the functions available in the dedicated server version used during the writing of this blog. By the end of this exploration, you’ll have a comprehensive understanding of how to use ClickHouse’s JSON functions to unlock the full potential of your JSON data.

Real-world example — Structured vs Unstructured

In data management, the decision between structured and unstructured data can profoundly affect how information is stored, accessed, and analyzed. To illustrate this essential difference, we’ll examine a real-world example using NGINX logs in JSON format. These logs provide an ideal case to discuss standard techniques for handling JSON data.

Our dataset comprises a simple logging entry, represented below:

{
    "@timestamp": 893964617,
    "clientip": "40.135.0.0",
    "request": {
        "method": "GET",
        "path": "/images/hm_bg.jpg",
        "version": "HTTP/1.0"
    },
    "status": 200,
    "size": 24736
}

In this blog, we’ll explore various approaches for storing and processing data, including structured and unstructured data handling methods. By examining the nuances of each approach, we aim to provide you with the knowledge and tools to make informed decisions when working with diverse datasets. Whether you’re a data enthusiast, seasoned analyst, or aspiring data scientist, join us on this enlightening journey as we unravel the intricacies of structured and unstructured data and learn how to navigate the data landscape effectively.

Structured data

Structured data brings order and predictability. With a fixed schema for the JSON, we gain optimal performance: we can choose column codecs, add data-skipping indexes, and exploit the primary key. Join us as we explore the power of structured data in this blog.
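
For example, here is a hypothetical sketch (the table name, codecs, and index are illustrative, not part of the dataset used below) of the kind of control a fixed schema gives us:

CREATE TABLE IF NOT EXISTS http_codecs_example
(
    `timestamp` DateTime CODEC(Delta, ZSTD),
    `size` UInt32 CODEC(T64, ZSTD),
    `path` String,
    INDEX path_idx path TYPE tokenbf_v1(4096, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY timestamp;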

Stored as Structured Data using Nested

First, we will create a table like this:

CREATE TABLE IF NOT EXISTS default.http_structured_nested
(
    `@timestamp` Int32,
    `clientip` IPv4,
    `request` Nested(method LowCardinality(String), path String, version LowCardinality(String)),
    `status` UInt16,
    `size` UInt32,
    `timestamp` DateTime DEFAULT toDateTime(`@timestamp`)
)
ENGINE = MergeTree
ORDER BY (status, timestamp)
SETTINGS index_granularity = 8192;

Then, we will insert data into http_structured_nested. Note that, with the default setting flatten_nested = 1, a Nested column is stored as a set of parallel arrays, which is why each extracted request value below is wrapped in a single-element array:

SET input_format_import_nested_json = 1;

INSERT INTO http_structured_nested (`@timestamp`, clientip, request.method, request.path, request.version, status, size) SELECT
    `@timestamp`,
    clientip,
    [request['method']],
    [request['path']],
    [request['version']],
    status,
    size
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/http/documents-01.ndjson.gz', 'JSONEachRow', '`@timestamp` Int32, clientip IPv4, request Map(String, String), status UInt16, size UInt32');

Now we check how many records we have in our table and display the first 5 rows:

SELECT count() AS `# of messages`
FROM http_structured_nested;

-- Output
┌─# of messages─┐
│      10000000 │
└───────────────┘

SELECT *
FROM http_structured_nested
LIMIT 5;

-- Output

┌─@timestamp─┬─clientip───┬─request.method─┬─request.path─────────────────┬─request.version─┬─status─┬──size─┬───────────timestamp─┐
│ 893964617 │ 40.135.0.0 │ ['GET'] │ ['/images/hm_bg.jpg'] │ ['HTTP/1.0'] │ 200 │ 24736 │ 1998-04-30 19:30:17 │
│ 893964653 │ 232.0.0.0 │ ['GET'] │ ['/images/hm_bg.jpg'] │ ['HTTP/1.0'] │ 200 │ 24736 │ 1998-04-30 19:30:53 │
│ 893964672 │ 26.1.0.0 │ ['GET'] │ ['/images/hm_bg.jpg'] │ ['HTTP/1.0'] │ 200 │ 24736 │ 1998-04-30 19:31:12 │
│ 893964679 │ 247.37.0.0 │ ['GET'] │ ['/french/splash_inet.html'] │ ['HTTP/1.0'] │ 200 │ 3781 │ 1998-04-30 19:31:19 │
│ 893964687 │ 252.0.0.0 │ ['GET'] │ ['/images/hm_bg.jpg'] │ ['HTTP/1.0'] │ 200 │ 24736 │ 1998-04-30 19:31:27 │
└────────────┴────────────┴────────────────┴──────────────────────────────┴─────────────────┴────────┴───────┴─────────────────────┘

We are ready to perform some analysis. For example, we can query data for a fixed time period and with some additional conditions like:

SELECT
    status,
    request.method[1] AS method,
    count() AS `# number of messages`
FROM http_structured_nested
WHERE (status >= 400) AND ((timestamp >= '1998-01-01 00:00:00') AND (timestamp <= '1998-06-01 00:00:00'))
GROUP BY
    method,
    status
ORDER BY `# number of messages` DESC;

-- Output

┌─status─┬─method──┬─# number of messages─┐
│    404 │ GET     │                11267 │
│    404 │ HEAD    │                  276 │
│    500 │ GET     │                  160 │
│    500 │ POST    │                  115 │
│    400 │ GET     │                   81 │
│    400 │ POST    │                   64 │
│    404 │ OPTIONS │                   32 │
│    500 │ OTHER   │                   19 │
│    404 │ POST    │                   12 │
│    404 │ OTHER   │                    3 │
│    500 │ HEAD    │                    2 │
└────────┴─────────┴──────────────────────┘

Note: Array indices in ClickHouse are 1-based.
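
A one-line illustration:

SELECT [10, 20, 30][1] AS first_element;
-- Returns 10: index 1 is the first element, not the second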

Stored as Structured Data using Tuples

We can represent the nested object request as a Tuple. This provides functionality comparable to Nested.

First, we will create a table like this:

CREATE TABLE IF NOT EXISTS http_structured_tuples
(
    `@timestamp` Int32,
    `clientip` IPv4,
    `request` Tuple(method LowCardinality(String), path String, version LowCardinality(String)),
    `status` UInt16,
    `size` UInt32,
    `timestamp` DateTime DEFAULT toDateTime(`@timestamp`)
)
ENGINE = MergeTree
ORDER BY (status, request, timestamp);

Then, we will insert data into http_structured_tuples like:

INSERT INTO http_structured_tuples (`@timestamp`, clientip, request, status, size) SELECT
    `@timestamp`,
    clientip,
    (request['method'], request['path'], request['version']),
    status,
    size
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/http/documents-01.ndjson.gz', 'JSONEachRow', '`@timestamp` Int32, clientip IPv4, request Map(String, String), status UInt16, size UInt32');

Now we check how many records we have in our table and display the first 5 rows:

SELECT count() AS `# of messages`
FROM http_structured_tuples;

-- Output

┌─# of messages─┐
│      10000000 │
└───────────────┘

SELECT *
FROM http_structured_tuples
LIMIT 5;

-- Output

┌─@timestamp─┬─clientip───┬─request────────────────┬─status─┬─size─┬───────────timestamp─┐
│ 893966408 │ 23.0.0.0 │ ('GET','/','HTTP/1.0') │ 200 │ 8712 │ 1998-04-30 20:00:08 │
│ 893966414 │ 35.0.0.0 │ ('GET','/','HTTP/1.0') │ 200 │ 8712 │ 1998-04-30 20:00:14 │
│ 893966414 │ 232.41.0.0 │ ('GET','/','HTTP/1.0') │ 200 │ 8749 │ 1998-04-30 20:00:14 │
│ 893966425 │ 47.0.0.0 │ ('GET','/','HTTP/1.0') │ 200 │ 8712 │ 1998-04-30 20:00:25 │
│ 893966436 │ 55.0.0.0 │ ('GET','/','HTTP/1.0') │ 200 │ 8712 │ 1998-04-30 20:00:36 │
└────────────┴────────────┴────────────────────────┴────────┴──────┴─────────────────────┘
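
Named Tuple elements can be accessed by name; the dot notation used in the next query and the tupleElement function are equivalent. A quick sketch:

SELECT
    request.method AS dot_syntax,
    tupleElement(request, 'method') AS function_syntax
FROM http_structured_tuples
LIMIT 1;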

We are ready to perform some analysis. For example, we can query data for a fixed time period and with some additional conditions like:

SELECT
    status,
    request.method AS method,
    count() AS `# number of messages`
FROM http_structured_tuples
WHERE (status >= 400) AND ((timestamp >= '1998-01-01 00:00:00') AND (timestamp <= '1998-06-01 00:00:00'))
GROUP BY
    method,
    status
ORDER BY `# number of messages` DESC;

-- Output

┌─status─┬─method──┬─# number of messages─┐
│    404 │ GET     │                11267 │
│    404 │ HEAD    │                  276 │
│    500 │ GET     │                  160 │
│    500 │ POST    │                  115 │
│    400 │ GET     │                   81 │
│    400 │ POST    │                   64 │
│    404 │ OPTIONS │                   32 │
│    500 │ OTHER   │                   19 │
│    404 │ POST    │                   12 │
│    404 │ OTHER   │                    3 │
│    500 │ HEAD    │                    2 │
└────────┴─────────┴──────────────────────┘

Stored as Structured Data using Maps

Maps offer a simple way to represent nested structures.
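
One behavior worth noting: accessing a key that is absent from a Map returns the default value of the value type rather than an error. A quick sketch:

SELECT map('method', 'GET')['path'] AS missing_key;
-- Returns '' (the default value for String)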

First, we will create a table like this:

CREATE TABLE IF NOT EXISTS http_structured_maps
(
    `@timestamp` Int32,
    `clientip` IPv4,
    `request` Map(String, String),
    `status` UInt16,
    `size` UInt32,
    `timestamp` DateTime DEFAULT toDateTime(`@timestamp`)
)
ENGINE = MergeTree
ORDER BY (status, request, timestamp);

Then, we will insert data into http_structured_maps like:

INSERT INTO http_structured_maps (`@timestamp`, clientip, request, status, size) SELECT
    `@timestamp`,
    clientip,
    request,
    status,
    size
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/http/documents-01.ndjson.gz', 'JSONEachRow', '`@timestamp` Int32, clientip IPv4, request Map(String, String), status UInt16, size UInt32');

Now we check how many records we have in our table and display the first 5 rows:

SELECT count() AS `# of messages`
FROM http_structured_maps;

-- Output

┌─# of messages─┐
│      10000000 │
└───────────────┘

SELECT *
FROM http_structured_maps
LIMIT 5;

-- Output

┌─@timestamp─┬─clientip───┬─request──────────────────────────────────────────┬─status─┬─size─┬───────────timestamp─┐
│ 893966408 │ 23.0.0.0 │ {'method':'GET','path':'/','version':'HTTP/1.0'} │ 200 │ 8712 │ 1998-04-30 20:00:08 │
│ 893966414 │ 35.0.0.0 │ {'method':'GET','path':'/','version':'HTTP/1.0'} │ 200 │ 8712 │ 1998-04-30 20:00:14 │
│ 893966414 │ 232.41.0.0 │ {'method':'GET','path':'/','version':'HTTP/1.0'} │ 200 │ 8749 │ 1998-04-30 20:00:14 │
│ 893966425 │ 47.0.0.0 │ {'method':'GET','path':'/','version':'HTTP/1.0'} │ 200 │ 8712 │ 1998-04-30 20:00:25 │
│ 893966436 │ 55.0.0.0 │ {'method':'GET','path':'/','version':'HTTP/1.0'} │ 200 │ 8712 │ 1998-04-30 20:00:36 │
└────────────┴────────────┴──────────────────────────────────────────────────┴────────┴──────┴─────────────────────┘

We are ready to perform some analysis. For example, we can query data for a fixed time period and with some additional conditions like:

SELECT
    status,
    request['method'] AS method,
    count() AS `# number of messages`
FROM http_structured_maps
WHERE (status >= 400) AND ((timestamp >= '1998-01-01 00:00:00') AND (timestamp <= '1998-06-01 00:00:00'))
GROUP BY
    method,
    status
ORDER BY `# number of messages` DESC;

-- Output

┌─status─┬─method──┬─# number of messages─┐
│    404 │ GET     │                11267 │
│    404 │ HEAD    │                  276 │
│    500 │ GET     │                  160 │
│    500 │ POST    │                  115 │
│    400 │ GET     │                   81 │
│    400 │ POST    │                   64 │
│    404 │ OPTIONS │                   32 │
│    500 │ OTHER   │                   19 │
│    404 │ POST    │                   12 │
│    404 │ OTHER   │                    3 │
│    500 │ HEAD    │                    2 │
└────────┴─────────┴──────────────────────┘

Unstructured data

Handling data using the structured approach is often not viable for users with dynamic JSON that is either subject to change or whose schema is not well understood.

Stored as Unstructured Data using String

For absolute flexibility, users can simply store JSON as Strings, using functions to extract fields as required. This represents the opposite extreme of handling JSON as a structured object.

First, we will create a table like this:

CREATE TABLE IF NOT EXISTS http_unstructured_string
(
    `message` String
)
ENGINE = MergeTree
ORDER BY tuple();

Then, we will insert data into http_unstructured_string like:

INSERT INTO http_unstructured_string SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/http/documents-01.ndjson.gz', 'JSONAsString', 'message String');

Now we check how many records we have in our table and display the first 5 rows:

SELECT count() AS `# of messages`
FROM http_unstructured_string;

-- Output

┌─# of messages─┐
│      10000000 │
└───────────────┘

SELECT *
FROM http_unstructured_string
LIMIT 5;

-- Output

┌─message─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ {"@timestamp":893964617,"clientip":"40.135.0.0","request":{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"},"status":200,"size":24736} │
│ {"@timestamp":893964653,"clientip":"232.0.0.0","request":{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"},"status":200,"size":24736} │
│ {"@timestamp":893964672,"clientip":"26.1.0.0","request":{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"},"status":200,"size":24736} │
│ {"@timestamp":893964679,"clientip":"247.37.0.0","request":{"method":"GET","path":"/french/splash_inet.html","version":"HTTP/1.0"},"status":200,"size":3781} │
│ {"@timestamp":893964682,"clientip":"247.37.0.0","request":{"method":"GET","path":"/images/hm_nbg.jpg","version":"HTTP/1.0"},"status":304,"size":0} │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

We are ready to perform some analysis. For example, we can query data for a fixed time period and with some additional conditions like:

SELECT
    simpleJSONExtractUInt(message, 'status') AS status,
    simpleJSONExtractString(message, 'method') AS method,
    count() AS `# number of messages`
FROM http_unstructured_string
WHERE (status >= 400) AND ((toDateTime(simpleJSONExtractUInt(message, '@timestamp')) >= '1998-01-01 00:00:00') AND (toDateTime(simpleJSONExtractUInt(message, '@timestamp')) <= '1998-06-01 00:00:00'))
GROUP BY
    method,
    status
ORDER BY `# number of messages` DESC;

-- Output

┌─status─┬─method──┬─# number of messages─┐
│    404 │ GET     │                11267 │
│    404 │ HEAD    │                  276 │
│    500 │ GET     │                  160 │
│    500 │ POST    │                  115 │
│    400 │ GET     │                   81 │
│    400 │ POST    │                   64 │
│    404 │ OPTIONS │                   32 │
│    500 │ OTHER   │                   19 │
│    404 │ POST    │                   12 │
│    404 │ OTHER   │                    3 │
│    500 │ HEAD    │                    2 │
└────────┴─────────┴──────────────────────┘

Similarly, we can use the JSON* family of functions to achieve the same results. In the example above we used the simpleJSON* functions (aliases of the visitParam* functions), which offer potentially superior performance over the JSON* family, primarily by making strict assumptions about the structure and format of the JSON.
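
For comparison, here is a sketch of the same query using the JSON* functions. Because they parse the document properly, the nested field must be addressed through its full path:

SELECT
    JSONExtractUInt(message, 'status') AS status,
    JSONExtractString(message, 'request', 'method') AS method,
    count() AS `# number of messages`
FROM http_unstructured_string
WHERE (status >= 400) AND ((toDateTime(JSONExtractUInt(message, '@timestamp')) >= '1998-01-01 00:00:00') AND (toDateTime(JSONExtractUInt(message, '@timestamp')) <= '1998-06-01 00:00:00'))
GROUP BY
    method,
    status
ORDER BY `# number of messages` DESC;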

Stored as Unstructured Data using Pairwise Arrays

Pairwise arrays provide a balance between the flexibility of representing JSON as Strings and the performance of a more structured approach. The schema is flexible in that new fields can be added at the root without any table changes.
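
The access pattern pairs indexOf on the keys array with positional access on values; a minimal, self-contained sketch (remember the 1-based indexing, and that indexOf returns 0 for a missing key, which yields the default value):

WITH ['@timestamp', 'status'] AS keys, ['893964617', '200'] AS values
SELECT
    values[indexOf(keys, 'status')] AS status,   -- '200'
    values[indexOf(keys, 'missing')] AS absent;  -- '' (index 0 -> default value)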

First, we will create a table like this:

CREATE TABLE IF NOT EXISTS http_unstructured_pairwise_arrays
(
    `keys` Array(String),
    `values` Array(String)
)
ENGINE = MergeTree
ORDER BY tuple();

Then, we will insert data into http_unstructured_pairwise_arrays like:

INSERT INTO http_unstructured_pairwise_arrays SELECT
    arrayMap(x -> (x.1), JSONExtractKeysAndValues(message, 'String')) AS keys,
    arrayMap(x -> (x.2), JSONExtractKeysAndValues(message, 'String')) AS values
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/http/documents-01.ndjson.gz', 'JSONAsString', 'message String');

Now we check how many records we have in our table and display the first 5 rows:

SELECT count() AS `# number of messages`
FROM http_unstructured_pairwise_arrays;

-- Output

┌─# number of messages─┐
│             10000000 │
└──────────────────────┘

SELECT *
FROM http_unstructured_pairwise_arrays
LIMIT 5;

-- Output

┌─keys────────────────────────────────────────────────┬─values────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ['@timestamp','clientip','request','status','size'] │ ['893964617','40.135.0.0','{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"}','200','24736'] │
│ ['@timestamp','clientip','request','status','size'] │ ['893964653','232.0.0.0','{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"}','200','24736'] │
│ ['@timestamp','clientip','request','status','size'] │ ['893964672','26.1.0.0','{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"}','200','24736'] │
│ ['@timestamp','clientip','request','status','size'] │ ['893964679','247.37.0.0','{"method":"GET","path":"/french/splash_inet.html","version":"HTTP/1.0"}','200','3781'] │
│ ['@timestamp','clientip','request','status','size'] │ ['893964682','247.37.0.0','{"method":"GET","path":"/images/hm_nbg.jpg","version":"HTTP/1.0"}','304','0'] │
└─────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

We are ready to perform some analysis. For example, we can query data for a fixed time period and with some additional conditions like:

SELECT
    toUInt16(values[indexOf(keys, 'status')]) AS status,
    simpleJSONExtractString(values[indexOf(keys, 'request')], 'method') AS method,
    count() AS `# number of messages`
FROM http_unstructured_pairwise_arrays
WHERE (status >= 400) AND ((toDateTime(values[indexOf(keys, '@timestamp')]) >= '1998-01-01 00:00:00') AND (toDateTime(values[indexOf(keys, '@timestamp')]) <= '1998-06-01 00:00:00'))
GROUP BY
    method,
    status
ORDER BY `# number of messages` DESC;

-- Output

┌─status─┬─method──┬─# number of messages─┐
│    404 │ GET     │                11267 │
│    404 │ HEAD    │                  276 │
│    500 │ GET     │                  160 │
│    500 │ POST    │                  115 │
│    400 │ GET     │                   81 │
│    400 │ POST    │                   64 │
│    404 │ OPTIONS │                   32 │
│    500 │ OTHER   │                   19 │
│    404 │ POST    │                   12 │
│    404 │ OTHER   │                    3 │
│    500 │ HEAD    │                    2 │
└────────┴─────────┴──────────────────────┘

Advantages and disadvantages of the approaches described so far

Hybrid Approach using Materialized Views — one more to go

The approaches outlined above are not mutually exclusive. While parsing JSON fields into structured columns offers the best query performance, it also potentially incurs the highest insertion overhead if done in ClickHouse. Practically, it is also sometimes impossible due to dirty or variable data, or even a potentially unknown schema. Conversely, keeping the JSON as Strings or using pairwise arrays, while flexible, significantly increases query complexity and makes accessing the data the preserve of someone with ClickHouse expertise.

As a compromise, users can use a hybrid approach, initially representing the JSON as a string and extracting columns as required. While not essential, Materialized Views can assist with this.

The general concept here is to exploit a table with the Null engine for receiving inserts. This table engine doesn’t store any data and acts purely as a “buffer” feeding the materialized view. For each inserted block, the materialized view triggers, performs the required processing, and inserts rows into a target table that we can in turn query. In cases where we need to update the schema to extract a new field from the blob, we simply update our table schema and then modify the materialized view accordingly to extract the field. The described approach acts much like an ETL pipeline.

First, we create our Null-engine table for receiving inserts:

CREATE TABLE IF NOT EXISTS hybrid_approach_buffer
(
    `message` String
)
ENGINE = Null;

Our target MergeTree table contains a subset of the fields — those we are confident will occur in the JSON string. Note that we retain a String field message for the remaining data, which can be queried with the JSON* functions if required.

CREATE TABLE hybrid_approach_target
(
    `message` String,
    `method` String,
    `status` UInt16,
    `size` UInt32,
    `timestamp` DateTime
)
ENGINE = MergeTree
ORDER BY (status, timestamp);

Our materialized view in turn extracts the fields that have been declared in the target table schema.

CREATE MATERIALIZED VIEW hybrid_approach_target_mv TO hybrid_approach_target AS
SELECT
    message,
    JSONExtractString(JSONExtractString(message, 'request'), 'method') AS method,
    toUInt16(JSONExtractInt(message, 'status')) AS status,
    toUInt32(JSONExtractInt(message, 'size')) AS size,
    toDateTime(JSONExtractUInt(message, '@timestamp')) AS timestamp
FROM hybrid_approach_buffer;

Then, we insert data into hybrid_approach_buffer like:

INSERT INTO hybrid_approach_buffer SELECT message
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/http/documents-01.ndjson.gz', 'JSONAsString', 'message String');

Now we check how many records we have in our target table and display the first 5 rows:

SELECT count() AS `# of messages`
FROM hybrid_approach_target;

-- Output

┌─# of messages─┐
│      10000000 │
└───────────────┘

SELECT *
FROM hybrid_approach_target
LIMIT 5;

-- Output

┌─message─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─method─┬─status─┬──size─┬───────────timestamp─┐
│ {"@timestamp":893964617,"clientip":"40.135.0.0","request":{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"},"status":200,"size":24736} │ GET │ 200 │ 24736 │ 1998-04-30 19:30:17 │
│ {"@timestamp":893964653,"clientip":"232.0.0.0","request":{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"},"status":200,"size":24736} │ GET │ 200 │ 24736 │ 1998-04-30 19:30:53 │
│ {"@timestamp":893964672,"clientip":"26.1.0.0","request":{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"},"status":200,"size":24736} │ GET │ 200 │ 24736 │ 1998-04-30 19:31:12 │
│ {"@timestamp":893964679,"clientip":"247.37.0.0","request":{"method":"GET","path":"/french/splash_inet.html","version":"HTTP/1.0"},"status":200,"size":3781} │ GET │ 200 │ 3781 │ 1998-04-30 19:31:19 │
│ {"@timestamp":893964687,"clientip":"252.0.0.0","request":{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"},"status":200,"size":24736} │ GET │ 200 │ 24736 │ 1998-04-30 19:31:27 │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┴────────┴───────┴─────────────────────┘

We are ready to perform some analysis. For example, we can query data for a fixed time period and with some additional conditions like:

SELECT
    status,
    method,
    count() AS `# number of messages`
FROM hybrid_approach_target
WHERE (status >= 400) AND (timestamp >= '1998-01-01 00:00:00') AND (timestamp <= '1998-06-01 00:00:00')
GROUP BY
    method,
    status
ORDER BY `# number of messages` DESC;

-- Output

┌─status─┬─method──┬─# number of messages─┐
│    404 │ GET     │                11267 │
│    404 │ HEAD    │                  276 │
│    500 │ GET     │                  160 │
│    500 │ POST    │                  115 │
│    400 │ GET     │                   81 │
│    400 │ POST    │                   64 │
│    404 │ OPTIONS │                   32 │
│    500 │ OTHER   │                   19 │
│    404 │ POST    │                   12 │
│    404 │ OTHER   │                    3 │
│    500 │ HEAD    │                    2 │
└────────┴─────────┴──────────────────────┘

Suppose we later wish to extract the clientip field from our JSON blob into a new client_ip column. First, we update our target table like:

ALTER TABLE hybrid_approach_target
ADD COLUMN `client_ip` IPv4;

Using the setting allow_experimental_alter_materialized_view_structure we can modify our Materialized View:

SET allow_experimental_alter_materialized_view_structure = 1;

ALTER TABLE hybrid_approach_target_mv
MODIFY QUERY
SELECT
    message,
    JSONExtractString(JSONExtractString(message, 'request'), 'method') AS method,
    toUInt16(JSONExtractInt(message, 'status')) AS status,
    toUInt32(JSONExtractInt(message, 'size')) AS size,
    toIPv4(JSONExtractString(message, 'clientip')) AS client_ip,
    toDateTime(JSONExtractUInt(message, '@timestamp')) AS timestamp
FROM hybrid_approach_buffer;

-- update existing data
ALTER TABLE hybrid_approach_target
    UPDATE client_ip = toIPv4(JSONExtractString(message, 'clientip')) WHERE 1 = 1;

Note that this feature is experimental. Alternatively, you can drop the view using DROP VIEW and recreate it; however, this does require pausing insertions.
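
A sketch of that alternative, pausing inserts into hybrid_approach_buffer first and reusing the view definition from above:

DROP VIEW hybrid_approach_target_mv;

CREATE MATERIALIZED VIEW hybrid_approach_target_mv TO hybrid_approach_target AS
SELECT
    message,
    JSONExtractString(JSONExtractString(message, 'request'), 'method') AS method,
    toUInt16(JSONExtractInt(message, 'status')) AS status,
    toUInt32(JSONExtractInt(message, 'size')) AS size,
    toIPv4(JSONExtractString(message, 'clientip')) AS client_ip,
    toDateTime(JSONExtractUInt(message, '@timestamp')) AS timestamp
FROM hybrid_approach_buffer;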

Finally, we can display the first 5 rows:

SELECT *
FROM hybrid_approach_target
LIMIT 5;

-- Output

┌─message─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─method─┬─status─┬──size─┬───────────timestamp─┬─client_ip──┐
│ {"@timestamp":893964617,"clientip":"40.135.0.0","request":{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"},"status":200,"size":24736} │ GET │ 200 │ 24736 │ 1998-04-30 19:30:17 │ 40.135.0.0 │
│ {"@timestamp":893964653,"clientip":"232.0.0.0","request":{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"},"status":200,"size":24736} │ GET │ 200 │ 24736 │ 1998-04-30 19:30:53 │ 232.0.0.0 │
│ {"@timestamp":893964672,"clientip":"26.1.0.0","request":{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"},"status":200,"size":24736} │ GET │ 200 │ 24736 │ 1998-04-30 19:31:12 │ 26.1.0.0 │
│ {"@timestamp":893964679,"clientip":"247.37.0.0","request":{"method":"GET","path":"/french/splash_inet.html","version":"HTTP/1.0"},"status":200,"size":3781} │ GET │ 200 │ 3781 │ 1998-04-30 19:31:19 │ 247.37.0.0 │
│ {"@timestamp":893964687,"clientip":"252.0.0.0","request":{"method":"GET","path":"/images/hm_bg.jpg","version":"HTTP/1.0"},"status":200,"size":24736} │ GET │ 200 │ 24736 │ 1998-04-30 19:31:27 │ 252.0.0.0 │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┴────────┴───────┴─────────────────────┴────────────┘

BONUS — Storage Usage

To check storage usage we can execute a query like this:

SELECT
    columns.table,
    parts.rows,
    parts.latest_modification,
    parts.disk_size,
    parts.primary_keys_size,
    parts.engine,
    parts.bytes_size,
    columns.compressed_size,
    columns.uncompressed_size,
    columns.ratio,
    columns.compression_percentage
FROM
(
    SELECT
        table,
        formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
        formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
        sum(data_compressed_bytes) / sum(data_uncompressed_bytes) AS ratio,
        round(100 - ((sum(data_compressed_bytes) * 100) / sum(data_uncompressed_bytes)), 3) AS compression_percentage
    FROM system.columns
    WHERE database = currentDatabase()
    GROUP BY table
) AS columns
LEFT JOIN
(
    SELECT
        table,
        sum(rows) AS rows,
        max(modification_time) AS latest_modification,
        formatReadableSize(sum(bytes)) AS disk_size,
        formatReadableSize(sum(primary_key_bytes_in_memory)) AS primary_keys_size,
        any(engine) AS engine,
        sum(bytes) AS bytes_size
    FROM system.parts
    WHERE active AND (database = currentDatabase())
    GROUP BY
        database,
        table
) AS parts USING (table)
ORDER BY parts.bytes_size DESC;

-- Output

┌─table─────────────────────────────┬─────rows─┬─latest_modification─┬─disk_size──┬─primary_keys_size─┬─engine────┬─bytes_size─┬─compressed_size─┬─uncompressed_size─┬───────────────ratio─┬─compression_percentage─┐
│ hybrid_approach_target │ 10000000 │ 2023-07-24 09:18:14 │ 256.14 MiB │ 7.18 KiB │ MergeTree │ 268584714 │ 256.12 MiB │ 1.65 GiB │ 0.15151770600093792 │ 84.848 │
│ http_unstructured_string │ 10000000 │ 2023-07-24 08:44:45 │ 209.50 MiB │ 0.00 B │ MergeTree │ 219673798 │ 209.49 MiB │ 1.48 GiB │ 0.13794516080394234 │ 86.205 │
│ http_unstructured_pairwise_arrays │ 10000000 │ 2023-07-24 08:54:14 │ 181.33 MiB │ 0.00 B │ MergeTree │ 190142149 │ 181.32 MiB │ 1.49 GiB │ 0.11864996363757062 │ 88.135 │
│ http_structured_nested │ 10000000 │ 2023-07-24 08:14:44 │ 135.23 MiB │ 7.18 KiB │ MergeTree │ 141801775 │ 135.20 MiB │ 534.87 MiB │ 0.25277885437730807 │ 74.722 │
│ http_structured_maps │ 10000000 │ 2023-07-24 08:36:44 │ 106.15 MiB │ 147.16 KiB │ MergeTree │ 111302978 │ 106.11 MiB │ 830.48 MiB │ 0.12776830300782893 │ 87.223 │
│ http_structured_tuples │ 10000000 │ 2023-07-24 08:27:53 │ 104.54 MiB │ 52.57 KiB │ MergeTree │ 109618320 │ 104.51 MiB │ 458.58 MiB │ 0.22788808896433202 │ 77.211 │
│ hybrid_approach_buffer │ 0 │ 1970-01-01 00:00:00 │ │ │ │ 0 │ 0.00 B │ 0.00 B │ nan │ nan │
│ hybrid_approach_target_mv │ 0 │ 1970-01-01 00:00:00 │ │ │ │ 0 │ 0.00 B │ 0.00 B │ nan │ nan │
└───────────────────────────────────┴──────────┴─────────────────────┴────────────┴───────────────────┴───────────┴────────────┴─────────────────┴───────────────────┴─────────────────────┴────────────────────────┘

Or by columns:

SELECT
    table,
    name,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
    sum(data_compressed_bytes) / sum(data_uncompressed_bytes) AS ratio,
    round(100 - ((sum(data_compressed_bytes) * 100) / sum(data_uncompressed_bytes)), 3) AS compression_percentage
FROM system.columns
WHERE database = currentDatabase()
GROUP BY
    table,
    name
ORDER BY ratio DESC;

-- Output

┌─table─────────────────────────────┬─name────────────┬─uncompressed_size─┬─compressed_size─┬─────────────────ratio─┬─compression_percentage─┐
│ http_structured_tuples │ @timestamp │ 38.15 MiB │ 36.11 MiB │ 0.946519525 │ 5.348 │
│ http_structured_maps │ @timestamp │ 38.15 MiB │ 36.11 MiB │ 0.946519525 │ 5.348 │
│ http_structured_maps │ timestamp │ 38.15 MiB │ 36.11 MiB │ 0.946519525 │ 5.348 │
│ http_structured_tuples │ timestamp │ 38.15 MiB │ 36.11 MiB │ 0.946519525 │ 5.348 │
│ http_structured_maps │ clientip │ 38.15 MiB │ 29.85 MiB │ 0.7825496 │ 21.745 │
│ http_structured_tuples │ clientip │ 38.15 MiB │ 29.85 MiB │ 0.7825496 │ 21.745 │
│ hybrid_approach_target │ size │ 38.15 MiB │ 24.19 MiB │ 0.63411565 │ 36.588 │
│ http_structured_nested │ size │ 38.15 MiB │ 24.18 MiB │ 0.6339797 │ 36.602 │
│ hybrid_approach_target │ client_ip │ 38.15 MiB │ 23.51 MiB │ 0.61624615 │ 38.375 │
│ http_structured_nested │ clientip │ 38.15 MiB │ 23.51 MiB │ 0.61624615 │ 38.375 │
│ http_structured_nested │ request.version │ 9.57 MiB │ 3.40 MiB │ 0.35546045482188965 │ 64.454 │
│ http_structured_nested │ request.path │ 267.79 MiB │ 73.51 MiB │ 0.2745214905649135 │ 72.548 │
│ http_unstructured_pairwise_arrays │ values │ 1.05 GiB │ 179.39 MiB │ 0.1675882981894168 │ 83.241 │
│ http_unstructured_string │ message │ 1.48 GiB │ 209.49 MiB │ 0.13794516080394234 │ 86.205 │
│ hybrid_approach_target │ message │ 1.48 GiB │ 203.00 MiB │ 0.13366800866544679 │ 86.633 │
│ hybrid_approach_target │ timestamp │ 38.15 MiB │ 4.98 MiB │ 0.1306476 │ 86.935 │
│ http_structured_nested │ @timestamp │ 38.15 MiB │ 4.98 MiB │ 0.1306476 │ 86.935 │
│ http_structured_nested │ timestamp │ 38.15 MiB │ 4.98 MiB │ 0.1306476 │ 86.935 │
│ http_structured_tuples │ size │ 38.15 MiB │ 552.36 KiB │ 0.014140525 │ 98.586 │
│ http_structured_maps │ size │ 38.15 MiB │ 552.36 KiB │ 0.014140525 │ 98.586 │
│ hybrid_approach_target │ method │ 38.18 MiB │ 363.84 KiB │ 0.009307128908970036 │ 99.069 │
│ http_structured_nested │ request.method │ 85.86 MiB │ 558.95 KiB │ 0.006357472386761622 │ 99.364 │
│ http_structured_tuples │ request │ 286.92 MiB │ 1.82 MiB │ 0.0063261270751326605 │ 99.367 │
│ http_structured_maps │ request │ 658.82 MiB │ 3.42 MiB │ 0.0051897670827434446 │ 99.481 │
│ hybrid_approach_target │ status │ 19.07 MiB │ 87.49 KiB │ 0.0044795 │ 99.552 │
│ http_structured_tuples │ status │ 19.07 MiB │ 87.49 KiB │ 0.0044795 │ 99.552 │
│ http_structured_maps │ status │ 19.07 MiB │ 87.49 KiB │ 0.0044795 │ 99.552 │
│ http_structured_nested │ status │ 19.07 MiB │ 87.49 KiB │ 0.0044795 │ 99.552 │
│ http_unstructured_pairwise_arrays │ keys │ 457.76 MiB │ 1.93 MiB │ 0.004211497916666667 │ 99.579 │
│ hybrid_approach_buffer │ message │ 0.00 B │ 0.00 B │ nan │ nan │
│ hybrid_approach_target_mv │ method │ 0.00 B │ 0.00 B │ nan │ nan │
│ hybrid_approach_target_mv │ status │ 0.00 B │ 0.00 B │ nan │ nan │
│ hybrid_approach_target_mv │ size │ 0.00 B │ 0.00 B │ nan │ nan │
│ hybrid_approach_target_mv │ timestamp │ 0.00 B │ 0.00 B │ nan │ nan │
│ hybrid_approach_target_mv │ message │ 0.00 B │ 0.00 B │ nan │ nan │
└───────────────────────────────────┴─────────────────┴───────────────────┴─────────────────┴───────────────────────┴────────────────────────┘

Final thoughts

In conclusion, working with JSON data structures presents various challenges and use cases. ClickHouse emerges as a robust solution, offering a range of functions to tackle these challenges effectively. However, handling JSON data can be approached in different ways, each with its own advantages and limitations. Making the most suitable choice depends on multiple factors, making the decision-making process nuanced and complex. By being well-informed about the strengths and weaknesses of each approach, we can make optimal choices for our specific requirements. This blog has explored ClickHouse’s power and different strategies for handling JSON data, providing valuable insights to optimize our data-driven efforts. Let’s embrace this knowledge and continue to adapt and innovate as the data landscape evolves in our pursuit of data excellence.

References

[1] https://clickhouse.com/

[2] https://clickhouse.com/docs/en/sql-reference/functions/

[3] https://en.wikipedia.org/wiki/JSON
