Calculating unique values in ClickHouse

Memory-efficient calculation of unique values on large datasets in ClickHouse

Denys Golotiuk
DataDenys

--

If you are dealing with an analytical DB, calculating the number of unique values in a dataset is a frequent task. Sometimes tables are so large that ClickHouse cannot fit the intermediate state (which the standard procedure for counting uniques requires) into memory to answer your question. But ClickHouse has tools to work around such cases, so let's see what they are…

Table with 500 million random string values

We have prepared a test table of the following structure:
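The original table definition is not reproduced here, so below is a minimal sketch that matches the queries used later in this post: a MergeTree table called test with a single String column s (the engine and ordering key are assumptions):

CREATE TABLE test
(
    s String
)
ENGINE = MergeTree
ORDER BY s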

The table is populated with half a billion randomly generated string values.
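The exact population query is not shown in the original either; one way to generate this kind of data (assuming 10-character random strings) is:

INSERT INTO test
SELECT randomString(10)
FROM numbers(500000000)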

Unique values calculation problem

In order to calculate the number of unique values in a table, we usually use:

SELECT count(DISTINCT s) FROM test

Then our DB basically does the following:

  • Creates an empty list of unique values (in memory),
  • Checks each value from the table and, if it is not in the list yet, adds it,
  • Returns the count of items in that temporary list.

So we will need (roughly) as much memory as the number of unique values in the table multiplied by the average value length.

For large tables with a lot of distinct values we can end up with out-of-memory errors, as ClickHouse aborts the query once it hits the memory limit (the max_memory_usage setting).

We can quickly fix that (for debugging purposes only, never in production) with:

SET max_memory_usage = [very large number in bytes]
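For example, to allow queries in the current session up to roughly 40 GB of RAM (the exact number here is an arbitrary illustration, pick one that fits your server):

SET max_memory_usage = 40000000000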

Now let’s find out how much memory was consumed by this query:

SELECT formatReadableSize(memory_usage)
FROM system.query_log
WHERE (type = 'QueryFinish') AND (query LIKE 'SELECT count(DISTINCT s) FROM test%')
ORDER BY event_time DESC
LIMIT 1
Query id: 4c12c727-a6c2-414e-8c42-e09213415fcc

┌─formatReadableSize(memory_usage)─┐
│ 20.02 GiB                        │
└──────────────────────────────────┘

Wow, this is even more than our entire table takes on disk:

SELECT formatReadableSize(sum(bytes_on_disk))
FROM system.parts
WHERE active AND (table = 'test')

Query id: ad9959da-3daa-4a74-84b5-5cb1c500a3b5

┌─formatReadableSize(sum(bytes_on_disk))─┐
│ 6.54 GiB                               │
└────────────────────────────────────────┘

That’s because the DISTINCT calculation keeps every value uncompressed in memory, while the data on disk is compressed.
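To see how much the column data expands once decompressed, you can compare the compressed and uncompressed sizes reported by system.columns (the exact numbers will depend on your data):

SELECT
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
FROM system.columns
WHERE table = 'test'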

Unique values the right way

ClickHouse has several functions that optimize memory consumption by calculating hashes of values and storing them in memory-efficient data structures. This not only saves memory, but also increases speed impressively. The price for the improvement is that we lose precision, but let's see whether that is really a problem.

Adaptive sampling

This algorithm is used by the uniq() function and gives us:
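The measured result from the original screenshot is not reproduced here, but the query itself is simply:

SELECT uniq(s) FROM test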

In our test it was ~30x faster than the DISTINCT approach.

HyperLogLog

This approach allows some tuning, as you can specify the HLL precision (17 by default). The algorithm is used by the uniqCombined() function and gives:
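Again, the measured output is not reproduced; the queries look like this (the precision level 20 below is just an illustration):

SELECT uniqCombined(s) FROM test

SELECT uniqCombined(20)(s) FROM test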

In our test it was ~25x faster than the DISTINCT approach and a little slower than adaptive sampling.

DISTINCT vs uniq() vs uniqCombined()

Let’s compare the 3 approaches in terms of precision, speed and memory consumption. We have used the HLL algorithm with 3 different precision levels: 12 (min), 17 (default) and 20 (max):
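The original comparison table is not reproduced here; a single query that collects all of the estimates for such a comparison (uniqExact() is the exact counterpart of count(DISTINCT ...), so it needs just as much memory) could look like this:

SELECT
    uniqExact(s) AS exact,
    uniq(s) AS adaptive_sampling,
    uniqCombined(12)(s) AS hll_12,
    uniqCombined(17)(s) AS hll_17,
    uniqCombined(20)(s) AS hll_20
FROM test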

In our measurements the optimized functions were about 1000x more efficient in terms of memory consumption. Average precision was around 99.8%, but using HLL with the maximum precision level (20) resulted in 99.95% without much increase in memory usage or processing time.

Summary

In cases where you do not need exact precision, use uniqCombined() instead of count(DISTINCT ...) to get the number of unique values in a table. You can tune this method by using precision levels from 12 to 20, where 20 gives the highest precision:

SELECT uniqCombined(17)(column) FROM table
