Sitemap
DataDenys

Technical articles on creating, scaling, optimizing and securing big data applications

Press enter or click to view image in full size
Clickhouse secondary index columns queries

Improving Clickhouse query performance by tuning key order

4 min readJul 13, 2022

--

Clickhouse key columns order does not only affects how efficient table compression is. Given primary key storage structure Clickhouse can faster or slower execute queries that use key columns but skip index prefix (first column(s) in key). Efficient key order might also improve join queries performance, especially for large tables. Let’s see how to measure efficiency and decide here.

Test table and data

Let’s start from creating and generate data for table to run our test on:

CREATE TABLE default.test (`a` String, `b` String, `c` String)
ENGINE = MergeTree ORDER BY (a, b, c)
INSERT INTO test SELECT
round(number / 100000), round(number / 1000), number
FROM numbers(50000000)

We’ve used round function to get different number of unique values in each column, which is also know as cardinality:

SELECT    uniq(a),    uniq(b),    uniq(c) FROM test

Which gives us:

Press enter or click to view image in full size

As we have the same columns order in our order key (a->b->c) let’s benchmark how fast search will be executed on each column:

Press enter or click to view image in full size

As we can see performance for each query results in scanning different amount of rows to get answer. Column a requires only 16k records to scan, while column c results in full table scan. That’s because data is stored using the following structure (read more on data storage here):

Press enter or click to view image in full size

Binary search and generic exclusion algorythms

When we use all columns from primary key or first column(s) from it (prefix), Clickhouse can use efficient algorythm to locate necessary granules quickly based on primary key:

Press enter or click to view image in full size

But in case we skip first column(s) from primary key and use only right part of it (secondary columns from index), Clickhouse uses algorythm:

Press enter or click to view image in full size

The approach is pretty simple. As we know Clickhouse primary key is organized into granules. Clickhouse will try to filter out granules that will not contain target values for sure:

Press enter or click to view image in full size

Clickhouse tries to exclude granules that will not contain searched values. That’s done by checking min and max values of target column in each granule than skip granules that have target value absent in min…max range.

Generic exclusion performance

Generic exclusion will perform better when more granules can be skipped. And this can be done if index prefix has some amount of duplicate values. And the bigger this amount the more granules we can skip.

That’s why, having low cardinality columns come first in key, will results in good efficiency for queries on secondary index columns. Let’s compare performance for case when we use high-cardinality column as a first column in index:

INSERT INTO test (b, c, a) SELECT
round(number / 100000),
round(number / 1000),
number
FROM numbers(50000000)

We’ve inserted 50 million records into test table table, but this time we’ve made a column a high cardinality column:

SELECT    uniq(a),    uniq(b),    uniq(c) FROM test┌──uniq(a)─┬─uniq(b)─┬─uniq(c)─┐
│ 50148092 │ 501 │ 50001 │
└──────────┴─────────┴─────────┘

Now we can test performance of the query on b column:

Press enter or click to view image in full size

As we can see, this time Clickhouse was unable to efficiently filter out granules and had to do full table scan to generate results. So having absolutely same data in table and executing same query results in 10x performance difference based on the key columns order:

Press enter or click to view image in full size

Summary

If you are not sure about order key columns, put columns with lower cardinality first and columns with higher cardinality last, this will ensure best performance for search on secondary index columns.

--

--

DataDenys
DataDenys

Published in DataDenys

Technical articles on creating, scaling, optimizing and securing big data applications

Denys Golotiuk
Denys Golotiuk

Written by Denys Golotiuk

I write about data, machine learning, and its practical implementation. github.com/mrcrypster https://datachild.net golotyuk@gmail.com

No responses yet