
How to improve ClickHouse table compression

Denys Golotiuk
Published in DataDenys · 4 min read · Jul 11, 2022


As we know, ClickHouse compresses column data so that tables take up little space on disk. Compression efficiency depends on the data and on the table structure we choose. Let’s see how picking a good ordering key can improve table compression by a factor of 10 or even more.

Why does compression differ?

ClickHouse stores each column’s data in a separate file on disk (with a .bin extension). Those files are already compressed. But the important thing here is that ClickHouse sorts the records in those files according to the ORDER BY key we chose during table creation.
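For illustration, consider a minimal table like this (the column names here are just placeholders):

CREATE TABLE example (`column1` String, `column2` String, `column3` String)
ENGINE = MergeTree ORDER BY (column1, column2, column3)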

In this case the column1 column comes first in the ORDER BY key, so all of its values are stored sorted. But column2 and column3 will have their values sorted only within groups of equal values in the preceding key columns (the key prefix).

And exactly this fact can dramatically affect compression efficiency. Why? Because, in short, repeated values that sit next to each other compress much better than randomly distributed ones.

We can see the difference by generating two files, one with ordered numbers and one with the same numbers in random order:

clickhouse-client -q "SELECT round(number/100) FROM numbers(100000)" > ordered.csv
clickhouse-client -q "SELECT round(number/100) FROM numbers(100000) ORDER BY rand()" > unordered.csv

Here we generate 100k numbers from 0 to 1000 and store them in the corresponding files (the second one is randomly shuffled). Their sizes are equal:

ls -la ordered.csv 
389150 Jul 11 19:00 ordered.csv
ls -la unordered.csv
389150 Jul 11 19:00 unordered.csv
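To double-check that both files contain exactly the same values, just in a different order, we can sort them and diff the results (a quick sanity check, not required for the experiment):

sort -n ordered.csv > a.sorted
sort -n unordered.csv > b.sorted
diff a.sorted b.sorted   # no output means identical contents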

Now let’s compress them:

gzip ordered.csv 
gzip unordered.csv

And finally we can see that the ordered file compresses roughly 76 (!) times better than the unordered one:

ls -la ordered.csv.gz 
2125 Jul 11 19:00 ordered.csv.gz
ls -la unordered.csv.gz
162629 Jul 11 19:00 unordered.csv.gz
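By the way, gzip can report these numbers itself; listing both archives shows the compression ratio per file (exact output varies by gzip version):

gzip -l ordered.csv.gz unordered.csv.gz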

How compression depends on key column order

Let’s create 2 tables of the same structure but with a different key column order:

CREATE TABLE test1 (`a` String, `b` String)
ENGINE = MergeTree ORDER BY (b, a);

CREATE TABLE test2 (`a` String, `b` String)
ENGINE = MergeTree ORDER BY (a, b);

The first table is ordered by the b column first and then by the a column. The second table is ordered by the a column first and then by the b column.

Let’s insert the same data into both tables:

INSERT INTO test1 SELECT round(number/1000), number FROM numbers(1000000);
INSERT INTO test2 SELECT * FROM test1;

Here we insert the generated numbers into both columns. For the a column we divide the values by 1000 to make sure this column has repeating values (this is the column we will compare compression for).
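One caveat: compression stats are tracked per data part, so before comparing the tables it may help to force pending merges to finish (an optional step, and it can be heavy on large tables):

OPTIMIZE TABLE test1 FINAL;
OPTIMIZE TABLE test2 FINAL;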

Let’s check the difference in compression levels between the 2 tables:

SELECT
    table AS Table,
    name AS Column,
    formatReadableSize(data_uncompressed_bytes) AS Uncompressed,
    formatReadableSize(data_compressed_bytes) AS Compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 0) AS Ratio
FROM system.columns
WHERE ((table = 'test1') OR (table = 'test2')) AND (name = 'a')

And find out that:

[Table image: difference in compression depending on key column order]

As we can see, the test2 table shows a better compression level for the a column. That’s the table that has the a column first in its order key. Why? Remember that we inserted repeated values into a and unique values into b:

SELECT uniq(a), uniq(b) FROM test2

┌─uniq(a)─┬─uniq(b)─┐
│    1001 │  995890 │
└─────────┴─────────┘

The number of unique values (also called cardinality) is much higher for the b column than for the a column. That means that having the a column (lower cardinality) first in the order key results in all of its repeated values being stored next to each other (sorted).

And in case the a column comes second, its repeated values are no longer grouped together and can’t be compressed as efficiently.

As we can see, having columns with lower cardinality come first in the order key can result in better compression levels for data on disk.
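As a cross-check, we can also compare the total on-disk size of each table using the system.parts table (counting only active parts; exact numbers will vary):

SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS Compressed
FROM system.parts
WHERE active AND (table IN ('test1', 'test2'))
GROUP BY table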

Summary

Pay attention when picking the key column order, because this will affect how efficiently ClickHouse can compress data. Placing columns with lower cardinality (a lower ratio of unique values to total values) first will result in better compression ratios. Consider query performance needs as well, since the same key also determines how data is read.
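When in doubt, a quick way to estimate which columns have lower cardinality before choosing a key order is to compare the share of unique values per column, as in this rough sketch against the test2 table from above:

SELECT
    uniq(a) / count() AS a_uniqueness,
    uniq(b) / count() AS b_uniqueness
FROM test2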
