Estimating duplicates and deduplicating data in ClickHouse
ClickHouse can quickly estimate the number of duplicate rows in a table, as well as remove duplicates so that only unique rows remain. A popular approach is to GROUP BY the relevant columns
and insert the resulting data into a new table, but this becomes a challenge when tables are large and disks are nearly full. ClickHouse has efficient built-in ways to handle both tasks.
Test data
Let’s assume we have the following table:
CREATE TABLE default.test
(
`a` UInt32,
`b` UInt32,
`c` UInt32
)
ENGINE = MergeTree
ORDER BY tuple()
We’ve populated it with 100 million rows of random data:
INSERT INTO test
SELECT rand32() / 100000, rand32() / 10000, rand32() / 1000
FROM numbers(100000000)
Estimating the number of unique rows
The ClickHouse uniq()
function accepts multiple arguments, so the number of unique rows can be estimated with the following query:
SELECT uniq(a,b,c) FROM test
This gives us an approximate number of unique rows (based on all 3 columns) in our table.
Hack: if we need to find unique rows based on all columns in the table, we can simply pass *
instead of the column names as the argument to the uniq()
function:
SELECT uniq(*) FROM test
To find the fraction of unique rows, we can use the following:
SELECT uniq(*)/count(*) FROM test
This returns approximately 0.043, which means only about 4.3%
of the rows in the table are unique, while the remaining 95.7%
is duplicated data.
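The same estimate can be expressed as an absolute count of excess duplicate rows. This is a sketch against the same test table; since uniq() is an approximation, the result is an estimate rather than an exact count:

```sql
-- Estimated number of rows that are duplicates of another row
SELECT count() - uniq(*) AS estimated_duplicate_rows
FROM test
```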
Removing duplicates
ClickHouse has a built-in tool to remove duplicates from MergeTree
tables: the DEDUPLICATE
option of the OPTIMIZE
statement:
OPTIMIZE TABLE test FINAL DEDUPLICATE
We need FINAL
here so that data is deduplicated even when it already sits within a single part. It took something like 7
seconds in our case.
Now we can verify that the table no longer contains duplicates by re-running the uniqueness check:
SELECT uniq(*)/count(*) FROM test
This doesn't return exactly 1,
since uniq()
is an approximation. To get an exact figure, we can use the following (memory-consuming) query:
SELECT count(distinct a,b,c)/count(*) FROM test;
And this will prove we no longer have duplicates in our table: the query returns exactly 1.
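As an alternative to count(DISTINCT ...), ClickHouse also provides the uniqExact() aggregate function, which computes the exact number of distinct values (with similar memory cost, since it keeps all distinct values in state):

```sql
-- Exact fraction of unique rows, equivalent to count(DISTINCT a, b, c) / count(*)
SELECT uniqExact(a, b, c) / count(*) FROM test
```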
Deduplicating based on a subset of columns
If data needs to be deduplicated based on only some of the columns (instead of all of them), we can add a BY
expression:
OPTIMIZE TABLE test FINAL DEDUPLICATE BY a,b;
Here, duplicates have been removed based on the (a, b)
columns only, instead of all columns.
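The BY expression of OPTIMIZE ... DEDUPLICATE also accepts column matchers, which is convenient on wide tables. For example, to deduplicate by every column except c:

```sql
-- Deduplicate using all columns except c
OPTIMIZE TABLE test FINAL DEDUPLICATE BY * EXCEPT c;
```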
Summary
The fraction of unique rows in a table can be quickly estimated using the uniq()
function:
SELECT uniq(*)/count(*) FROM table
If this returns a value far below 1, the table contains a lot of duplicates. Duplicated rows can be removed using the following query:
OPTIMIZE TABLE test FINAL DEDUPLICATE
This will leave only unique rows in the table.