How to improve ClickHouse table compression
ClickHouse uses columnar compression, so tables take up little space on disk. Compression efficiency depends on the data and on the table structure we choose. Let's see how picking a good ordering key can improve table compression by a factor of 10 or even more.
Why can compression differ?
ClickHouse stores each column's data in a separate file on disk (these files have a .bin extension). Those files are already compressed. But the important thing here is that ClickHouse stores records in those files according to the ORDER BY key we chose during table creation:
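For illustration, a table like the one implied here could be defined as follows (the table name and column types are placeholders, not from the original post):

```sql
CREATE TABLE example
(
    `column1` String,
    `column2` String,
    `column3` String
)
ENGINE = MergeTree
-- records in each column's .bin file are stored sorted by this key
ORDER BY (column1, column2, column3)
```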
In this case the column1 column comes first in the ORDER BY key, so all of its values will be stored sorted. But column2 and column3 will have their values sorted only within the key prefix (the values from the previous key columns come first).
And exactly this fact can dramatically affect compression efficiency. Why? Because, in short, repeated values stored next to each other compress much better than randomly distributed ones.
We can see the difference by generating two files with random and ordered numbers:
clickhouse-client -q "SELECT round(number/100) FROM numbers(100000)" > ordered.csv
clickhouse-client -q "SELECT round(number/100) FROM numbers(100000) ORDER BY rand()" > unordered.csv
Here we generate 100k numbers between 0 and 1000 and store them in the corresponding files (the second one is randomly shuffled). Their sizes are equal:
ls -la ordered.csv
389150 Jul 11 19:00 ordered.csv
ls -la unordered.csv
389150 Jul 11 19:00 unordered.csv
Now let’s compress them:
gzip ordered.csv
gzip unordered.csv
And finally we can see that the compressed ordered file is about 76 (!) times smaller than the unordered one:
ls -la ordered.csv.gz
2125 Jul 11 19:00 ordered.csv.gz
ls -la unordered.csv.gz
162629 Jul 11 19:00 unordered.csv.gz
How compression depends on key column order
Let’s create 2 tables of the same structure with different key columns order:
CREATE TABLE test1 (`a` String, `b` String)
ENGINE = MergeTree ORDER BY (b, a)

CREATE TABLE test2 (`a` String, `b` String)
ENGINE = MergeTree ORDER BY (a, b)
The first table is ordered by the b column first and then by a. The second table is ordered by a first and then by b.
Let's insert the same data into both tables:
INSERT INTO test1 SELECT round(number/1000), number FROM numbers(1000000);
INSERT INTO test2 SELECT * FROM test1
Here we insert generated numbers into both columns. We divide values by 1000 for the a column to make sure this column has repeating values (we will compare compression for this column).
Let's check the difference in compression levels between the 2 tables:
SELECT
table AS Table,
name AS Column,
formatReadableSize(data_uncompressed_bytes) AS Uncompressed,
formatReadableSize(data_compressed_bytes) AS Compressed,
round(data_uncompressed_bytes / data_compressed_bytes, 0) AS Ratio
FROM system.columns
WHERE ((table = 'test1') OR (table = 'test2')) AND (name = 'a')
Running it, we find that the test2 table shows a much better compression ratio for the a column. This is the table that had a first in the ORDER BY key. Why? Remember that we inserted repeated values into a and unique values into b:
SELECT uniq(a), uniq(b) FROM test2

┌─uniq(a)─┬─uniq(b)─┐
│    1001 │  995890 │
└─────────┴─────────┘
The number of unique values (also called cardinality) is much higher for the b column than for the a column. That means that having the a column (lower cardinality) first in the order key results in all of its repeated values being stored next to each other (sorted).
But when the a column comes second, its repeated values are no longer grouped together and can't be compressed as efficiently.
As we can see, having columns with lower cardinality come first in the order key can result in better compression of data on disk.
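To make this concrete, here is a hypothetical table applying the rule (the table name, columns, and cardinality assumptions are invented for illustration): country has few distinct values, user_id has many, and event_id is unique, so the lowest-cardinality column goes first:

```sql
CREATE TABLE events
(
    `country` String,   -- low cardinality: goes first in the key
    `user_id` UInt64,   -- medium cardinality
    `event_id` UInt64   -- unique per row: goes last
)
ENGINE = MergeTree
ORDER BY (country, user_id, event_id)
```

Keep in mind that the ORDER BY key also defines the primary index by default, so it should agree with the columns you filter on most often.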
Summary
Pay attention when picking the key column order, because it affects how efficiently ClickHouse can compress data. Placing columns with lower cardinality (a lower ratio of unique values to total values) first will result in better compression ratios. Consider query performance needs as well, since the same key drives indexing.
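As a practical follow-up, the system.columns query used above can be generalized to audit compression for every column of a table at once (test2 is used here; substitute your own table name):

```sql
SELECT
    name AS Column,
    formatReadableSize(data_uncompressed_bytes) AS Uncompressed,
    formatReadableSize(data_compressed_bytes) AS Compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS Ratio
FROM system.columns
WHERE (database = currentDatabase()) AND (table = 'test2')
ORDER BY Ratio DESC
```

Columns with unexpectedly low ratios are good candidates for moving earlier in the ordering key.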