In a few words: many databases use probabilistic data structures such as HyperLogLog (HLL for short) for unique/distinct count calculations. It is used in Spark, Elasticsearch, Flink, Postgres, BigQuery and Redis, to name a few (the curious reader can follow the links to explore the different HLL implementations). But usually you can apply it only once, as an aggregate function: for example, compute the number of unique users per month, get a single number and be happy with it. You cannot access or store the internal representation of the structure, so you cannot reuse pre-aggregated data even if you want to. In ClickHouse this is possible, because the HLL state is consistent and can be merged.
ClickHouse is blazingly fast and built around the idea of working with raw data rather than pre-aggregating it beforehand. But let's run an experiment. Say we need to calculate some metric over the unique users of the last month.
The idea: pre-aggregate per day, then merge the results. This is the so-called bucket approach: later you can merge only the last 30 daily buckets to get the statistic for the last month, or the last 7 for the last week.
Create our pre-aggregation table:
CREATE TABLE events_unique (
    date Date,
    group_id String,
    client_id String,
    event_type String,
    product_id String,
    value AggregateFunction(uniq, String)
) ENGINE = MergeTree(date, (group_id, client_id, event_type, product_id, date), 8192);
Here I declare that my aggregate is AggregateFunction(uniq, String), i.e. we are interested in a uniq metric calculated over a String column (for further optimization you would probably want to use FixedString or binary data).
Here I'm cheating when inserting the data, because the dataset isn't public:
INSERT INTO events_unique
SELECT date, group_id, client_id, event_type, product_id, uniqState(visitor_id) AS value
FROM events
GROUP BY date, group_id, client_id, event_type, product_id;
Checking this approach on your own problem is left as homework.
Smoke test that it’s working:
SELECT uniqMerge(value) FROM events_unique GROUP BY product_id;
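The bucket approach from above can be expressed directly as a query. A sketch, assuming daily buckets as inserted into events_unique (the 7-day window is my own illustrative choice):

```sql
-- Merge only the last 7 daily HLL states per product to get
-- the unique visitor count for the past week.
SELECT product_id, uniqMerge(value) AS weekly_uniques
FROM events_unique
WHERE date >= today() - 7
GROUP BY product_id;
```

Swapping the 7 for 30 gives the monthly figure from the same stored states, without touching the raw events table.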
Result on original table:
SELECT uniq(visitor_id) AS c FROM events WHERE (client_id = 'aaaaaaaa') AND (event_type = 'click') AND (product_id = 'product1') AND (date >= '2017-01-20') AND (date < '2017-02-20');
│ 457954 │
1 rows in set. Elapsed: 0.948 sec. Processed 13.22 million rows, 1.61 GB (13.93 million rows/s., 1.70 GB/s.)
Result on pre-aggregated table:
SELECT uniqMerge(value) AS c FROM events_unique WHERE (client_id = 'aaaaaaaa') AND (event_type = 'click') AND (product_id = 'product1') AND (date >= '2017-01-20') AND (date < '2017-02-20');
│ 457954 │
1 rows in set. Elapsed: 0.050 sec. Processed 39.39 thousand rows, 8.55 MB (781.22 thousand rows/s., 169.65 MB/s.)
As a result we get roughly a 20x improvement in processing time (0.948 s vs. 0.050 s), while reading about 300 times fewer rows.
In practice it is more convenient to use a materialized view with the AggregatingMergeTree engine instead of the manual approach in our example, which was shown for educational purposes. Take a look at the extensive documentation of the AggregatingMergeTree engine (with more examples of uniqMerge and uniqState), the main force behind Yandex.Metrica.
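As a sketch of that materialized-view variant: the table and column names match the example above, the view name is my own, and the engine parameters mirror the MergeTree declaration used earlier.

```sql
-- The materialized view keeps the aggregate states up to date automatically:
-- ClickHouse applies uniqState to every block inserted into the source table,
-- and AggregatingMergeTree merges the states in the background.
CREATE MATERIALIZED VIEW events_unique_mv
ENGINE = AggregatingMergeTree(date, (group_id, client_id, event_type, product_id, date), 8192)
AS SELECT
    date, group_id, client_id, event_type, product_id,
    uniqState(visitor_id) AS value
FROM events
GROUP BY date, group_id, client_id, event_type, product_id;
```

Queries against the view use the same uniqMerge(value) pattern as the manual table.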
To sum up
ClickHouse allows you to store aggregate state inside the database, not only in your application, which can lead to interesting performance optimizations and new use cases.
P.S. In the feedback it was mentioned that aggregate state is not that unique and is also present in some other products, for example Snowflake. But I haven't dealt with Snowflake in my practice.