VictoriaMetrics: achieving better compression than Gorilla for time series data
Everybody who read Gorilla: A fast, scalable, in-memory time series database paper from Facebook, knows that:
- Time series data may be compressed in order to save RAM space, disk space and disk bandwidth.
- Gorilla compression became standard in TSDB world — Prometheus uses Gorilla, InfluxDB uses Gorilla, M3 is inspired by Gorilla. The only outlier here is TimescaleDB, which doesn’t compress time series data yet.
Is it possible to achieve better compression rates for time series data than Gorilla?
Gorilla compression algorithm
Let’s look at the original Gorilla algorithm in order to understand its’ strengths and weaknesses. The algorithm consists of the following steps:
- To group data by individual time series.
- To sort (timestamp, value) data points for each time series by timestamp.
- To encode subsequent timestamps using delta-of-delta encoding. This substitutes big timestamps with small deviation values for the interval between subsequent data points. The deviation values are encoded with smaller number of bits comparing to the original timestamps.
- To XOR binary representation of subsequent floating point values. Gorilla paper clams this usually gives binary values with big number of leading and trailing zero bits, which may be effectively packed into smaller number of bits.
Gorilla gives 3x-8x compression for typical time series data, i.e. it compresses each 16-byte (timestamp, value) data point to 2-5 bytes. The compression ratio varies depending on randomness of the original data — higher randomness results in lower compression ratio.
Gorilla compression analysis
The Gorilla compression algorithm looks solid except of the last step — XOR’ing subsequent floating point values. Let’s look on XOR results for subsequent values in the range
[0.0 … 1.1] with step
0.0, a=0000000000000000, b=3FB999999999999A, a^b=3FB999999999999A
0.1, a=3FB999999999999A, b=3FC999999999999A, a^b=0070000000000000
0.2, a=3FC999999999999A, b=3FD3333333333334, a^b=001AAAAAAAAAAAAE
0.3, a=3FD3333333333334, b=3FD999999999999A, a^b=000AAAAAAAAAAAAE
0.4, a=3FD999999999999A, b=3FE0000000000000, a^b=003999999999999A
0.5, a=3FE0000000000000, b=3FE3333333333333, a^b=0003333333333333
0.6, a=3FE3333333333333, b=3FE6666666666666, a^b=0005555555555555
0.7, a=3FE6666666666666, b=3FE9999999999999, a^b=000FFFFFFFFFFFFF
0.8, a=3FE9999999999999, b=3FECCCCCCCCCCCCC, a^b=0005555555555555
0.9, a=3FECCCCCCCCCCCCC, b=3FEFFFFFFFFFFFFF, a^b=0003333333333333
1.0, a=3FEFFFFFFFFFFFFF, b=3FF1999999999999, a^b=001E666666666666
As you can see, the majority of
a^b results have small number of trailing zeros. This means that the compression ratio for typical floating point values isn’t so great as advertised in the Gorilla paper. You can verify the results and play with your own data sets on this page.
Improving time series compression
The obvious improvement is to convert floating point values to integers before applying XOR encoding. The following table contains XOR results for subsequent integer values in the range
[0 … 11] with step
0, a=0000000000000000, b=0000000000000001, a^b=0000000000000001
1, a=0000000000000001, b=0000000000000002, a^b=0000000000000003
2, a=0000000000000002, b=0000000000000003, a^b=0000000000000001
3, a=0000000000000003, b=0000000000000004, a^b=0000000000000007
4, a=0000000000000004, b=0000000000000005, a^b=0000000000000001
5, a=0000000000000005, b=0000000000000006, a^b=0000000000000003
6, a=0000000000000006, b=0000000000000007, a^b=0000000000000001
7, a=0000000000000007, b=0000000000000008, a^b=000000000000000F
8, a=0000000000000008, b=0000000000000009, a^b=0000000000000001
9, a=0000000000000009, b=000000000000000A, a^b=0000000000000003
a^b results are much better from compression point of view — they contain a lot of leading zeros, so they may be effectively packed into smaller number of bits. Verify the table and play with your own data on this page.
How to convert
[0.0 … 1.1] series with step
[0 … 11] series with step
1? Multiply by
10! Any floating-point series may be converted into integer series by multiplying by
N is the maximum number of decimal digits after the point across all the values in the time series. The only problem is the result may exceed 64 bits — default integer size used in modern computers. How to deal with it? Normalize the integer by dividing by
M is the minimum value that allows fitting all the time series values into 64 bits and removing common trailing decimal zeros.
Why multiplying and dividing by
10^N instead of
2^N as in the standard floating-point encoding scheme used in modern computers? Because we are humans and prefer rounding metric values to decimal points, not binary points :) This opens up opportunities for better compression ratio as we saw above.
Gauges and Counters
There are two main types of time series exist:
- Gauges — arbitrary values that may go up and down at any time —
- Counters — non-decreasing sequences such as
total requests processed,
total bytes read,
total kilometers passed, etc.
Counters may be converted to Gauges by applying a single delta-encoding step. It substitutes absolute value by the speed the value grows. Since the speed is usually smaller than the absolute value, this reduces the number of bits required for storing the time series, thus improving compression ratio for Counters.
Using general-purpose compression
The scheme above provides better compression ratio for floating-point values comparing to the original Gorilla algorithm. But the compression ratio may be improved even more by applying general-purpose compression on top of the encoded values. General-purpose compression algorithms such as zstd are good at compressing repeated patterns and low-entropy data. Such patterns and data is quite common after applying Gorilla-like encoding to time series data. The only downside is increased CPU usage required for general-purpose compression. But usually the increased CPU usage is negligible comparing to CPU usage spent in other parts of the time series database code.
Gorilla compression from Facebook may be improved using simple techniques described above:
- Converting floating-point values to integer values by applying
- Converting Counters to Gauges by applying delta-encoding.
- Applying general-purpose compression algorithms on top of the encoded data.
These techniques give better compression ratio for VictoriaMetrics comparing to competitors — it compresses typical node_exporter time series data to 0.4 bytes per data point. This is 10x times better than 4 bytes per data point for the same data in Prometheus, which uses the original Gorilla compression algorithm.
Update: we open-sourced VictoriaMetrics!