Using projections to speed up queries in ClickHouse

Denys Golotiuk
DataDenys
Oct 28, 2022 · 7 min read


ClickHouse is efficient enough that most analytical queries execute fast without any extra optimization work. But if you have frequent queries that consume a lot of resources, it’s a good idea to help ClickHouse find more efficient ways to execute them. First of all, make sure your primary key is well tuned. After that, consider using projections to speed up specific queries.

Reasons queries are slow

Let’s imagine we have the following table:

CREATE TABLE test ( a UInt32, b UInt32, c UInt32, d UInt32 )
ENGINE = MergeTree ORDER BY (a, b)
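
If you want to follow along, here’s a minimal way to populate the table. The post says the test table holds 100 million rows, and this INSERT simply mirrors the one used later in the article for the ingestion benchmark (the exact value distributions are an assumption on my part):

INSERT INTO test SELECT
round(number / 100000),
round(number / 100),
round(rand32() / 1000000),
round(rand32() / 1000)
FROM numbers(100000000)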

The primary key will be used to efficiently execute queries that filter on the a column (or on a together with b):

SELECT count(*) FROM test WHERE a = 1000

This query will use the primary key to limit scanning:

But filtering on d or c will lead to a full table scan:

SELECT count(*) FROM test WHERE c = 1000

Which results in the following (our test table has 100 million rows):

This happens because our table data is sorted (and stored) only by the a and b columns:

Because the c column is not part of the primary key, ClickHouse has no choice but to scan the entire table to find what we’re looking for.
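
A quick way to see this difference for yourself (my own check, not something shown in the original post) is to run both filters through EXPLAIN indexes = 1 and compare how many parts and granules survive primary key pruning; the filter on a should report far fewer selected granules than the filter on c:

EXPLAIN indexes = 1 SELECT count(*) FROM test WHERE a = 1000;
EXPLAIN indexes = 1 SELECT count(*) FROM test WHERE c = 1000;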

Using projections

Projections automatically build additional (usually sorted) datasets for a table, which ClickHouse can then use to execute specific queries faster. So a projection is basically something like an index in a row-oriented database, but more flexible.

To create a projection on a table, we specify the expression to build it from:

ALTER TABLE test ADD projection c (SELECT c ORDER BY c);
ALTER TABLE test MATERIALIZE projection c;

Here we’ve created a projection named c for the SELECT c ORDER BY c expression.

Now ClickHouse has all values from the c column sorted and stored in a separate structure called a projection. If we create a projection on a table that already contains data, we need the MATERIALIZE operation to build the projection for the existing data. Newly arriving data will be processed and stored in the projection automatically.
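
One small practical note of my own (not from the original post): MATERIALIZE PROJECTION runs in the background as a mutation, so on a large table it’s worth confirming it has finished before benchmarking:

SELECT command, is_done
FROM system.mutations
WHERE table = 'test';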

Let’s check the previous query now that the projection has been created:

As we can see, our query became dramatically faster and consumed ~1000x fewer resources.
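
To put your own numbers on such a comparison (an extra tip, not from the post; it assumes query logging is enabled, which it is by default), you can compare rows and bytes read for the two runs in system.query_log:

SELECT query, read_rows, read_bytes, query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish' AND query LIKE '%FROM test WHERE c = 1000%'
ORDER BY event_time DESC
LIMIT 5;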

Picking projections for queries

We can have any [reasonable] number of projections per table. ClickHouse will try to pick the best projection for each query. Say we have the following query:

SELECT count(*) FROM test WHERE c = 1000 AND d > 1000

Again, ClickHouse has to scan the entire dataset to find the answer:

We can create the following projection to speed things up in this case:

ALTER TABLE test ADD projection d_c (SELECT c, d ORDER BY c, d);
ALTER TABLE test MATERIALIZE projection d_c;

Now the query is executed in a much more efficient way:

The good thing here is that the d_c projection will also cover filtering on the c column alone (as c comes first in the ORDER BY c, d expression). Let’s first remove the c projection:

ALTER TABLE test DROP PROJECTION c

Now let’s make sure the d_c projection is used for our query. By using EXPLAIN we can check which projection a query uses:

EXPLAIN json = 1, indexes = 1
SELECT count(*) FROM test WHERE c = 1000

Which gives us the following:
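
Another way to verify this after the fact (my aside, not part of the original post, and assuming a reasonably recent ClickHouse version) is the projections column in system.query_log, which lists the projections each query actually used:

SELECT query, projections
FROM system.query_log
WHERE type = 'QueryFinish' AND query LIKE '%FROM test%'
ORDER BY event_time DESC
LIMIT 5;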

Aggregate projections

A cool thing about projections is that they can use aggregates and grouping. Let’s see how the following query performs (we still have the d_c projection on the table):

SELECT count(*), c FROM test WHERE c > 10 AND c < 20 GROUP BY c

We can see that ClickHouse had to scan roughly as many rows as were collapsed by the aggregation (9 × 23.3k ≈ 210k, while 295k rows were scanned, which is fairly close):

But we can actually create an aggregate projection to improve things:

ALTER TABLE test ADD projection c_count (SELECT count(*), c GROUP BY c);
ALTER TABLE test MATERIALIZE projection c_count;

Now the query is ~8 times more efficient, since the new projection is used:

Here our projection stores aggregated data that is maintained on the fly (using an AggregatingMergeTree backend):
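
If you’re curious what actually gets stored for a projection (my addition, not from the post, and assuming a ClickHouse version that exposes this system table), the parts backing projections are visible in system.projection_parts:

SELECT name, parent_name, rows, bytes_on_disk
FROM system.projection_parts
WHERE table = 'test' AND active;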

In fact, the processed rows figure is only that big (43k) because of the default granule size of 8192 rows, meaning ClickHouse reads at least one full granule per matching key.
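
For reference (my note, not the author’s), that granule size is the MergeTree index_granularity setting, which defaults to 8192 rows and can be checked with:

SELECT name, value FROM system.merge_tree_settings WHERE name = 'index_granularity';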

Picking columns for projection

Let’s look at the following query:

SELECT count(*), max(b) FROM test WHERE c = 1000

Since we have projections covering the c column, but none that includes the b column, ClickHouse will still have to read the entire table to execute this query:

This is obviously bad and can be fixed by creating a more relevant projection:

ALTER TABLE test ADD projection c_b (SELECT max(b), count(*) GROUP BY c);
ALTER TABLE test MATERIALIZE projection c_b;

Now our query performs much better:

This is important: it means we have to add every column used by the query to the projection. This is different from row-based database indexes, which reference entire rows. A columnar database doesn’t even know what a row is, which is why a projection is not an index. A projection is just a part of the original table’s data stored in a different form that is better suited to certain queries.
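
To make that concrete, here is a small illustration of my own (the column choice is mine, not from the post): with the projections defined so far, the first query below can be served from c_b, while the second touches column a, which no projection contains, and therefore falls back to a full table scan:

SELECT count(*), max(b) FROM test WHERE c = 1000; -- can be served by the c_b projection
SELECT max(a) FROM test WHERE c = 1000;           -- a is in no projection, so full scan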

Before speeding up [all] slow queries

Having a “slow” [analytical] query is not always as bad as wasting resources on maintaining additional structures (like projections) at write time. If a query is executed a couple of times per day and takes a couple of minutes (or even hours) to produce results, do you really need to speed it up? If the table structure is fine and the primary key is tuned (which already means it serves most querying cases well), maybe there’s a better way to handle some of the slow queries, like scheduled precaching (so the UI stays fast)?
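
As a rough sketch of that precaching idea (entirely my own example, not from the post; the summary table name and refresh schedule are made up), a scheduled job could materialize the heavy result into a small table that the UI reads instantly:

CREATE TABLE test_summary ( c UInt32, cnt UInt64 )
ENGINE = MergeTree ORDER BY c;

-- refresh on a schedule (cron, a dashboard scheduler, etc.)
TRUNCATE TABLE test_summary;
INSERT INTO test_summary SELECT c, count(*) AS cnt FROM test GROUP BY c;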

Why consider such things when we could just make all queries fast? Because read-time optimization always comes with write-time overhead. For example, let’s compare how ingestion throughput changes when we feed data into the original table with no projections versus a table with 3 projections. We’ll insert 50 million rows into each table:

INSERT INTO test SELECT
round(number / 100000),
round(number / 100),
round(rand32() / 1000000),
round(rand32() / 1000)
FROM numbers(50000000)

The original table was filled in 3 seconds:

But the table with projections took ClickHouse 7x longer to fill with the same 50 million rows:

This doesn’t mean we should never use projections. Instead, we need to carefully analyze both ends (read and write performance) to make sure the whole system stays performant, not just a single query. So the best candidate for optimization is usually a query that is frequent (popular) enough to have a negative impact on the whole system.

From this perspective, it is more important to optimize a query that takes 1 second but runs every couple of seconds than a 5-minute query that runs once a day.
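
A rough back-of-envelope illustration (my numbers, purely for scale): a 1-second query fired every 2 seconds keeps about 1 / 2 = 50% of that capacity busy around the clock, while a 5-minute query run once a day consumes roughly 300 / 86,400 ≈ 0.35% of the day.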

Summary

Projections can be used to improve query performance when the primary key can’t handle it. To use a projection, we only need to create it, and ClickHouse will automatically pick it up for relevant queries:

ALTER TABLE tbl ADD projection p1 (SELECT col1 ORDER BY col1);

Projections can also be created on expressions, using functions and aggregations:

ALTER TABLE tbl 
ADD projection p2 (SELECT col2, count(*), MAX(col1) GROUP BY col2);

Note that each projection degrades data ingestion performance, so use projections when the system is significantly affected by slow queries. In other cases, consider (pre)caching or asking your analysts to be patient :)
