Optimizing timeseries data in ClickHouse
ClickHouse is a powerful solution for generic analytics, but it can also be used as timeseries storage. Though there are plenty of dedicated timeseries databases available, like Redis TimeSeries, ClickHouse copes with this task quite well and saves you from installing yet another tool. To get the most out of ClickHouse when working with timeseries, we need to use custom compression codecs and set up a data retention policy. Let's see how to do that.
Timeseries compression
Timeseries data is usually a (big) set of time:value pairs. If we try to save that data in a table, we would have a structure similar to:
CREATE TABLE timeseries(`ts` DateTime, `value` UInt64)
ENGINE = MergeTree ORDER BY ts
This is a standard ClickHouse table engine. Let's populate this table with 100M rows and see how much space that takes:
INSERT INTO timeseries SELECT now() - number, number FROM numbers(100000000)
This will generate a series of numbers, one per second, starting from the current time and going into the past. Let's find out this table's size:
SELECT formatReadableSize(sum(bytes))
FROM system.parts WHERE active AND table = 'timeseries';
Which gives us 766 MB.
Compression codecs
As timeseries data is usually pairs of time and value, both of which are basically numbers, general-purpose compression like gzip is not the best fit for such data. A better approach is delta encoding: instead of saving the full value for each data point, we save only the change relative to the previous value in the series. This allows using only a few bytes (or even bits) to store each value.
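To see why this works, we can inspect the deltas directly. Here is a quick sketch using ClickHouse's runningDifference function against the timeseries table from above (runningDifference operates per data block, so this is for illustration only, not for production queries):

```sql
-- Deltas of a slowly changing series are tiny and compress well
SELECT ts, value, runningDifference(value) AS delta
FROM (SELECT ts, value FROM timeseries ORDER BY ts LIMIT 5)
```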
ClickHouse supports multiple codecs that leverage this basic approach to compress timeseries data. The first is T64, which crops unused high bits of integer values (including date/time types). The DoubleDelta and Gorilla codecs are efficient when timeseries data changes only slightly from point to point (because they operate on deltas instead of the values themselves).
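These specialized codecs can also be chained with a general-purpose codec, so the compact delta stream gets compressed further. A sketch (this particular combination is an example, not a universal recommendation):

```sql
CREATE TABLE timeseries_chained
(`ts` DateTime CODEC(DoubleDelta, LZ4), `value` UInt64 CODEC(Gorilla, LZ4))
ENGINE = MergeTree ORDER BY ts
```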
Let's check how that works compared to standard storage without these codecs. The first table will use the T64 codec:
CREATE TABLE timeseries_t64
(`ts` DateTime CODEC(T64), `value` UInt64 CODEC(T64) )
ENGINE = MergeTree ORDER BY ts
And another table will use DoubleDelta for the time column and Gorilla for the value column:
CREATE TABLE timeseries_ddg
(`ts` DateTime CODEC(DoubleDelta), `value` UInt64 CODEC(Gorilla) )
ENGINE = MergeTree ORDER BY ts
Now let's check how much space they take with the same data, compared to the uncompressed table:
INSERT INTO timeseries_t64 SELECT * FROM timeseries;
INSERT INTO timeseries_ddg SELECT * FROM timeseries;
As the measurements show, the DoubleDelta/Gorilla-compressed data takes 4 times less space than the default storage mode, which is a dramatic saving in disk usage.
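The per-table sizes can be compared with the same system.parts query we used earlier, grouped by table name:

```sql
SELECT table, formatReadableSize(sum(bytes)) AS size
FROM system.parts
WHERE active AND table LIKE 'timeseries%'
GROUP BY table
```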
Note that we used very gradual data (time changes second by second, the value increments one by one). Let's run the same test on randomly distributed data points:
TRUNCATE TABLE timeseries;
TRUNCATE TABLE timeseries_t64;
TRUNCATE TABLE timeseries_ddg;
INSERT INTO timeseries
SELECT * FROM generateRandom('ts DateTime, value UInt64')
LIMIT 100000000;
INSERT INTO timeseries_t64 SELECT * FROM timeseries;
INSERT INTO timeseries_ddg SELECT * FROM timeseries;
The generateRandom() function produces randomly distributed data. In this case, the picture is different:
We still see better results with the T64 and DoubleDelta/Gorilla codecs, but they only reduce the used space by about 10% and 20% respectively compared to the default storage method. In production we'll see results closer to the first test, because timeseries usually reflect the somewhat continuous nature of the time and values being stored.
Data retention policy
ClickHouse also supports powerful data retention operations, which are configured at table definition time using the TTL clause. Let's say we want to keep only 2 days of data:
CREATE TABLE timeseries
(`ts` DateTime CODEC(T64), `value` UInt64 CODEC(T64))
ENGINE = MergeTree ORDER BY ts TTL ts + INTERVAL 2 DAY DELETE
This asks ClickHouse to remove all data from the timeseries table where the ts column value is older than 2 days from now.
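A TTL can also be added to or changed on an existing table via ALTER TABLE (the 7-day interval here is just an example):

```sql
ALTER TABLE timeseries MODIFY TTL ts + INTERVAL 7 DAY
```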
This can be combined with aggregating materialized views to create compacted versions of the timeseries for historical data instead of just removing it (known as compaction in Redis TimeSeries):
CREATE MATERIALIZED VIEW timeseries_days
ENGINE = AggregatingMergeTree()
ORDER BY date POPULATE AS
SELECT toDate(ts) AS date, sumState(value) AS value
FROM timeseries GROUP BY date;
This creates the timeseries_days materialized view, which automatically aggregates all data from the timeseries table based on the given query, with the help of the AggregatingMergeTree engine.
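When the view stores aggregate states via sumState (the idiomatic AggregatingMergeTree pattern), reading the daily totals back requires the matching -Merge combinator:

```sql
SELECT date, sumMerge(value) AS daily_total
FROM timeseries_days
GROUP BY date
ORDER BY date
```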
Summary
When storing timeseries data in ClickHouse, specialized column codecs should be used to achieve better data compression:
CREATE TABLE timeseries_ddg
(`ts` DateTime CODEC(DoubleDelta), `value` UInt64 CODEC(Gorilla) )
ENGINE = MergeTree ORDER BY ts
Using AggregatingMergeTree and materialized views allows saving historical data in compacted form. Combining this with TTL settings on the main table, so that old data is removed automatically, will save a lot of disk space:
CREATE TABLE timeseries
(`ts` DateTime CODEC(T64), `value` UInt64 CODEC(T64))
ENGINE = MergeTree ORDER BY ts TTL ts + INTERVAL 2 DAY DELETE