ClickHouse secondary index column queries

Improving ClickHouse query performance by tuning key column order

Denys Golotiuk
DataDenys

--

The order of key columns in ClickHouse affects more than just how efficiently a table compresses. Because of how the primary key is stored, ClickHouse can execute queries faster or slower when they use key columns but skip the index prefix (the first column(s) of the key). An efficient key order can also improve join performance, especially for large tables. Let’s see how to measure this efficiency and decide on an order.

Test table and data

Let’s start by creating a table and generating data to run our tests on:

CREATE TABLE default.test (`a` String, `b` String, `c` String)
ENGINE = MergeTree ORDER BY (a, b, c);

INSERT INTO test SELECT
    round(number / 100000), round(number / 1000), number
FROM numbers(50000000);

We’ve used the round function to get a different number of unique values in each column, which is also known as cardinality:

SELECT uniq(a), uniq(b), uniq(c) FROM test

Which gives us (uniq is an approximate aggregate, so expect values close to these):
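┌─uniq(a)─┬─uniq(b)─┬──uniq(c)─┐
│     501 │   50001 │ 50148092 │
└─────────┴─────────┴──────────┘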

As our order key uses the same column order (a → b → c), let’s benchmark how fast a search executes on each column.
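The original benchmark queries aren’t shown here, but equality searches of the following form illustrate the idea (the constants are arbitrary values from each column’s range):

SELECT count() FROM test WHERE a = '42';
SELECT count() FROM test WHERE b = '4242';
SELECT count() FROM test WHERE c = '424242';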

As we can see, each query scans a different number of rows to produce its answer. Searching on column a requires scanning only about 16 thousand rows, while searching on column c results in a full table scan. That’s because the data is stored using the following structure (read more on data storage here):

Binary search and generic exclusion algorithms

When we use all columns of the primary key, or its first column(s) (a prefix), ClickHouse can use an efficient binary search over the primary key to locate the necessary granules quickly.
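One way to observe this (assuming your ClickHouse version is recent enough to support EXPLAIN indexes = 1) is to ask the server how many granules the primary key selects for a prefix query:

EXPLAIN indexes = 1
SELECT count() FROM test WHERE a = '42';

The PrimaryKey section of the output reports the number of selected granules out of the total.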

But when we skip the first column(s) of the primary key and use only the right part of it (the secondary index columns), ClickHouse has to use a generic exclusion search algorithm.

The approach is pretty simple. As we know, ClickHouse organizes data into granules and stores the key values at each granule boundary. ClickHouse tries to exclude granules that cannot possibly contain the searched value: where the prefix column(s) keep the same value across a granule, the secondary column is sorted within it, so its boundary values act as a min…max range, and the granule can be skipped when the searched value falls outside that range.
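We can use the same EXPLAIN indexes = 1 trick (again, assuming your version supports it) to see how many granules survive the exclusion check for a secondary column search:

EXPLAIN indexes = 1
SELECT count() FROM test WHERE b = '4242';

With the low-cardinality column a first in the key, only a small share of all granules should remain selected after exclusion.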

Generic exclusion performance

Generic exclusion performs better when more granules can be skipped, and that is only possible where the index prefix contains duplicate values: the more repeated prefix values in a row, the more granules keep a constant prefix (and thus a narrow secondary-column range), and the more of them we can skip.

That’s why having low-cardinality columns come first in the key results in good efficiency for queries on secondary index columns. Let’s compare this to the case where a high-cardinality column comes first in the index (truncating the table first, so the results below reflect only the new data):

-- remove the previous data so the table holds only the re-shuffled rows
TRUNCATE TABLE test;

INSERT INTO test (b, c, a) SELECT
    round(number / 100000),
    round(number / 1000),
    number
FROM numbers(50000000);

We’ve again inserted 50 million records into the test table, but this time made a the high-cardinality column:

SELECT uniq(a), uniq(b), uniq(c) FROM test

┌──uniq(a)─┬─uniq(b)─┬─uniq(c)─┐
│ 50148092 │     501 │   50001 │
└──────────┴─────────┴─────────┘

Now we can test the performance of a query on the b column.
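The original query isn’t shown, but it’s an equality search along these lines (the constant '42' is an arbitrary value from b’s new range):

SELECT count() FROM test WHERE b = '42';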

As we can see, this time ClickHouse was unable to efficiently filter out granules and had to do a full table scan to generate the results: since a is now almost unique, virtually no granule keeps a constant prefix, so none can be excluded. So exactly the same data and exactly the same query can show a 10x performance difference depending on the key column order.
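Running EXPLAIN indexes = 1 on this query (again, version permitting) should make the difference visible: the PrimaryKey step now selects nearly all granules instead of a handful:

EXPLAIN indexes = 1
SELECT count() FROM test WHERE b = '42';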

Summary

If you are not sure about the order of key columns, put lower-cardinality columns first and higher-cardinality columns last. This ensures the best performance for searches on secondary index columns.
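As a sketch of how that rule applies to our second dataset (where a became the high-cardinality column), a key like the following would restore fast searches on the secondary columns; the test_reordered table name is just an illustration:

CREATE TABLE default.test_reordered (`a` String, `b` String, `c` String)
ENGINE = MergeTree ORDER BY (b, c, a);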
