Deduplicating data in ClickHouse

Estimating duplicates and deduplicating data in ClickHouse

Denys Golotiuk
DataDenys

--

ClickHouse has a way to quickly estimate the number of duplicates in a table, as well as to remove duplicates so that only unique rows remain. A popular solution is to GROUP BY the relevant columns and insert the resulting data into a new table, but this becomes a challenge when tables are large and disks are almost full. ClickHouse has efficient built-in ways to handle both tasks.
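For reference, the GROUP BY approach mentioned above looks something like this (a sketch, assuming a table with columns a, b, c; test_dedup is a hypothetical target table name):

CREATE TABLE test_dedup
ENGINE = MergeTree ORDER BY tuple()
AS SELECT a, b, c FROM test
GROUP BY a, b, c

This requires enough free disk space to hold a second copy of the unique rows, which is exactly the problem on nearly full disks.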

Test data

Let’s assume we have the following table:

CREATE TABLE default.test
(
`a` UInt32,
`b` UInt32,
`c` UInt32
)
ENGINE = MergeTree
ORDER BY tuple()

We’ve populated it with 100 million rows of random data:

INSERT INTO test
SELECT rand32() / 100000, rand32() / 10000, rand32() / 1000
FROM numbers(100000000)

Estimating the number of unique rows

The ClickHouse uniq() function can accept multiple arguments, so estimating the number of unique rows can be done with the following query:

SELECT uniq(a,b,c) FROM test

This gives us an approximate number of unique rows (based on 3 columns) in our table.

Hack: if we need to find unique rows based on all columns of a table, we can simply use * instead of listing all column names as an argument of the uniq() function:

SELECT uniq(*) FROM test

To find a fraction of unique rows, we can use the following:

SELECT uniq(*)/count(*) FROM test

In our case this returns roughly 0.043, which means we only have something like 4.3% unique rows in the table, while the remaining 95.7% is duplicated data.

Removing duplicates

ClickHouse has a built-in tool to remove duplicates from MergeTree tables, using the DEDUPLICATE option of the OPTIMIZE statement:

OPTIMIZE TABLE test FINAL DEDUPLICATE

We need FINAL here to deduplicate data even if it's already within a single part. The query took something like 7 seconds in our case.
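A side effect of OPTIMIZE ... FINAL is that all active parts of the table are merged into a single part. We can verify this through the system.parts system table (a sketch, assuming the table lives in the default database):

SELECT count()
FROM system.parts
WHERE database = 'default' AND table = 'test' AND active

After the OPTIMIZE above, this should typically return 1.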

Now let's make sure we no longer have duplicates in the table.

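The check is the same uniq()-based estimate as before; after deduplication it should return a value very close to 1:

SELECT uniq(a, b, c) / count(*) FROM test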
This doesn't give us exactly 100%, since uniq() is an approximation. To get an accurate calculation, we can use the following (memory-consuming) query:

SELECT count(DISTINCT a, b, c) / count(*) FROM test;

And this will confirm that we no longer have duplicates in our table.

Deduplicating based on a subset of columns

If data needs to be deduplicated based on some columns only (instead of all columns), we can add a BY expression:

OPTIMIZE TABLE test FINAL DEDUPLICATE BY a,b;

Here, we have removed duplicates based on the (a, b) columns only, instead of all columns.
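When a table has many columns, listing the whole deduplication key can be tedious. ClickHouse also allows excluding columns from the key instead (a sketch; here we deduplicate by every column except c):

OPTIMIZE TABLE test FINAL DEDUPLICATE BY * EXCEPT c

For each group of rows with equal values in the chosen columns, a single row is kept.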

Summary

The fraction of unique rows in a table can be quickly estimated using the uniq() function:

SELECT uniq(*)/count(*) FROM table

If this gives a value far less than 1, the table contains a lot of duplicates. Duplicated rows can be removed using the following query:

OPTIMIZE TABLE test FINAL DEDUPLICATE

This will leave only unique rows in the table.
