Optimizing timeseries data in ClickHouse
ClickHouse is a powerful solution for generic analytics, but it can also be used as timeseries storage. Though there are plenty of dedicated timeseries databases available, like Redis TimeSeries, ClickHouse copes with this task quite well and saves you from installing yet another tool. To get the most out of ClickHouse when working with timeseries, we need to use custom compression codecs and set up a data retention policy. Let's see how to do that.
Timeseries compression
Timeseries data is usually a (big) set of time:value pairs. If we try to save that data in a table, we would have a structure similar to:
CREATE TABLE timeseries(`ts` DateTime, `value` UInt64)
ENGINE = MergeTree ORDER BY ts
This is a standard ClickHouse table engine. Let's populate this table with 100M rows and see how much space that takes:
INSERT INTO timeseries SELECT now() - number, number FROM numbers(100000000)
This will generate a series of numbers, one per second, starting from the current time and going into the past. Let's find out this table's size:
SELECT formatReadableSize(sum(bytes))
FROM system.parts WHERE active AND table = 'timeseries';
Which gives us 766 MB.
Compression codecs
As timeseries data is usually pairs of time and value, both of which are basically numbers, general-purpose compression like gzip is not the best fit for such data. A better approach is delta encoding: instead of saving the full value for each data point, we save only the change relative to the previous value in the series. This allows using only a few bytes (or even bits) to store each value.
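To see why this works, we can inspect the deltas directly. Here is a quick sketch using ClickHouse's runningDifference function against the timeseries table from above (runningDifference operates per data block, so this is for illustration only, not for production queries):

```sql
-- Deltas of a slowly changing series are tiny and compress well
SELECT ts, value, runningDifference(value) AS delta
FROM (SELECT ts, value FROM timeseries ORDER BY ts LIMIT 5)
```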
ClickHouse supports multiple codecs that leverage this basic approach to compress timeseries data. The first is T64, which crops unused high bits of integer values (including date/time types). The DoubleDelta and Gorilla codecs are efficient when timeseries data changes only slightly from point to point (because they operate on deltas instead of the values themselves).
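These specialized codecs can also be chained with a general-purpose codec, so the compact delta stream gets compressed further. A sketch (this particular combination is an example, not a universal recommendation):

```sql
CREATE TABLE timeseries_chained
(`ts` DateTime CODEC(DoubleDelta, LZ4), `value` UInt64 CODEC(Gorilla, LZ4))
ENGINE = MergeTree ORDER BY ts
```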
Let's check how that works compared to standard storage without these codecs. The first table will use the T64 codec:
CREATE TABLE timeseries_t64
(`ts` DateTime CODEC(T64), `value` UInt64 CODEC(T64) )
ENGINE = MergeTree ORDER BY ts
And another table will use DoubleDelta for the time column and Gorilla for the value column:
CREATE TABLE timeseries_ddg
(`ts` DateTime CODEC(DoubleDelta), `value` UInt64 CODEC(Gorilla) )
ENGINE = MergeTree ORDER BY ts
Now let's check how much space they take with the same data, compared to the uncompressed table:
INSERT INTO timeseries_t64 SELECT * FROM timeseries;
INSERT INTO timeseries_ddg SELECT * FROM timeseries;
As the measurements show, the DoubleDelta/Gorilla-compressed data takes 4 times less space than the default storage mode, which is a dramatic saving in disk usage.
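The per-table sizes can be compared with the same system.parts query we used earlier, grouped by table name:

```sql
SELECT table, formatReadableSize(sum(bytes)) AS size
FROM system.parts
WHERE active AND table LIKE 'timeseries%'
GROUP BY table
```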
Note that we used very gradual data (time changes second by second, the value increments one by one). Let's run the same test on randomly distributed data points:
TRUNCATE TABLE timeseries;
TRUNCATE TABLE timeseries_t64;
TRUNCATE TABLE timeseries_ddg;
INSERT INTO timeseries
SELECT * FROM generateRandom('ts DateTime, value UInt64')
LIMIT 100000000;
INSERT INTO timeseries_t64 SELECT * FROM timeseries;
INSERT INTO timeseries_ddg SELECT * FROM timeseries;
The generateRandom() function produces randomly distributed data. In this case, the picture is different:
We still see better results with the T64 and DoubleDelta/Gorilla codecs, but they only reduce the used space by about 10% and 20% respectively compared to the default storage method. In production we'll see results closer to the first test, because timeseries usually reflect the somewhat continuous nature of the time and values being stored.
Data retention policy
ClickHouse also supports powerful data retention operations, which are configured at table definition time using the TTL clause. Let's say we want to keep only 2 days of data:
CREATE TABLE timeseries
(`ts` DateTime CODEC(T64), `value` UInt64 CODEC(T64))
ENGINE = MergeTree ORDER BY ts TTL ts + INTERVAL 2 DAY DELETE
This asks ClickHouse to remove all data from the timeseries table where the ts column value is older than 2 days from now.
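A TTL can also be added to or changed on an existing table via ALTER TABLE (the 7-day interval here is just an example):

```sql
ALTER TABLE timeseries MODIFY TTL ts + INTERVAL 7 DAY
```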
This can be combined with aggregating materialized views to create compacted versions of the timeseries for historical data instead of just removing it (known as compaction in Redis TimeSeries):
CREATE MATERIALIZED VIEW timeseries_days
ENGINE = AggregatingMergeTree()
ORDER BY date POPULATE AS
SELECT toDate(ts) AS date, sumState(value) AS value
FROM timeseries GROUP BY date;
This creates the timeseries_days materialized view, which automatically aggregates all data from the timeseries table based on the given query, with the help of the AggregatingMergeTree engine.
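When the view stores aggregate states via sumState (the idiomatic AggregatingMergeTree pattern), reading the daily totals back requires the matching -Merge combinator:

```sql
SELECT date, sumMerge(value) AS daily_total
FROM timeseries_days
GROUP BY date
ORDER BY date
```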
Summary
When storing timeseries data in ClickHouse, specialized column codecs should be used to achieve better data compression:
CREATE TABLE timeseries_ddg
(`ts` DateTime CODEC(DoubleDelta), `value` UInt64 CODEC(Gorilla) )
ENGINE = MergeTree ORDER BY ts
Using AggregatingMergeTree and materialized views allows saving historical data in compacted form. Combining this with TTL settings on the main table, so that old data is removed automatically, will save a lot of disk space:
CREATE TABLE timeseries
(`ts` DateTime CODEC(T64), `value` UInt64 CODEC(T64))
ENGINE = MergeTree ORDER BY ts TTL ts + INTERVAL 2 DAY DELETE