# VictoriaMetrics: achieving better compression than Gorilla for time series data

Everybody who read Gorilla: A fast, scalable, in-memory time series database paper from Facebook, knows that:

Is it possible to achieve better compression rates for time series data than Gorilla?

# Gorilla compression algorithm

• To group data by individual time series.
• To sort (timestamp, value) data points for each time series by timestamp.
• To encode subsequent timestamps using delta-of-delta encoding. This substitutes big timestamps with small deviation values for the interval between subsequent data points. The deviation values are encoded with smaller number of bits comparing to the original timestamps.
• To XOR binary representation of subsequent floating point values. Gorilla paper clams this usually gives binary values with big number of leading and trailing zero bits, which may be effectively packed into smaller number of bits.

Gorilla gives 3x-8x compression for typical time series data, i.e. it compresses each 16-byte (timestamp, value) data point to 2-5 bytes. The compression ratio varies depending on randomness of the original data — higher randomness results in lower compression ratio.

# Gorilla compression analysis

`0.0, a=0000000000000000, b=3FB999999999999A, a^b=3FB999999999999A0.1, a=3FB999999999999A, b=3FC999999999999A, a^b=00700000000000000.2, a=3FC999999999999A, b=3FD3333333333334, a^b=001AAAAAAAAAAAAE0.3, a=3FD3333333333334, b=3FD999999999999A, a^b=000AAAAAAAAAAAAE0.4, a=3FD999999999999A, b=3FE0000000000000, a^b=003999999999999A0.5, a=3FE0000000000000, b=3FE3333333333333, a^b=00033333333333330.6, a=3FE3333333333333, b=3FE6666666666666, a^b=00055555555555550.7, a=3FE6666666666666, b=3FE9999999999999, a^b=000FFFFFFFFFFFFF0.8, a=3FE9999999999999, b=3FECCCCCCCCCCCCC, a^b=00055555555555550.9, a=3FECCCCCCCCCCCCC, b=3FEFFFFFFFFFFFFF, a^b=00033333333333331.0, a=3FEFFFFFFFFFFFFF, b=3FF1999999999999, a^b=001E666666666666`

As you can see, the majority of `a^b` results have small number of trailing zeros. This means that the compression ratio for typical floating point values isn’t so great as advertised in the Gorilla paper. You can verify the results and play with your own data sets on this page.

# Improving time series compression

`0, a=0000000000000000, b=0000000000000001, a^b=00000000000000011, a=0000000000000001, b=0000000000000002, a^b=00000000000000032, a=0000000000000002, b=0000000000000003, a^b=00000000000000013, a=0000000000000003, b=0000000000000004, a^b=00000000000000074, a=0000000000000004, b=0000000000000005, a^b=00000000000000015, a=0000000000000005, b=0000000000000006, a^b=00000000000000036, a=0000000000000006, b=0000000000000007, a^b=00000000000000017, a=0000000000000007, b=0000000000000008, a^b=000000000000000F8, a=0000000000000008, b=0000000000000009, a^b=00000000000000019, a=0000000000000009, b=000000000000000A, a^b=0000000000000003`

Now `a^b` results are much better from compression point of view — they contain a lot of leading zeros, so they may be effectively packed into smaller number of bits. Verify the table and play with your own data on this page.

How to convert `[0.0 … 1.1]` series with step `0.1` to `[0 … 11]` series with step `1`? Multiply by `10`! Any floating-point series may be converted into integer series by multiplying by `10^N`, where `N` is the maximum number of decimal digits after the point across all the values in the time series. The only problem is the result may exceed 64 bits — default integer size used in modern computers. How to deal with it? Normalize the integer by dividing by `10^M` where `M` is the minimum value that allows fitting all the time series values into 64 bits and removing common trailing decimal zeros.

Why multiplying and dividing by `10^N` instead of `2^N` as in the standard floating-point encoding scheme used in modern computers? Because we are humans and prefer rounding metric values to decimal points, not binary points :) This opens up opportunities for better compression ratio as we saw above.

# Gauges and Counters

• Gauges — arbitrary values that may go up and down at any time — `memory usage`, `cpu usage`, `speed`, `temperature`, `pressure`, etc.
• Counters — non-decreasing sequences such as `total requests processed`, `total bytes read`, `total kilometers passed`, etc.

Counters may be converted to Gauges by applying a single delta-encoding step. It substitutes absolute value by the speed the value grows. Since the speed is usually smaller than the absolute value, this reduces the number of bits required for storing the time series, thus improving compression ratio for Counters.

# Conclusions

• Converting floating-point values to integer values by applying `10^X` multiplier.
• Converting Counters to Gauges by applying delta-encoding.
• Applying general-purpose compression algorithms on top of the encoded data.

These techniques give better compression ratio for VictoriaMetrics comparing to competitors — it compresses typical node_exporter time series data to 0.4 bytes per data point. This is 10x times better than 4 bytes per data point for the same data in Prometheus, which uses the original Gorilla compression algorithm.

P.S. Stay tuned — VictoriaMetrics will be open sourced soon. Both single-node and cluster versions :) In the mean time read other articles about VictoriaMetrics.

Update: we open-sourced VictoriaMetrics!

Written by

Written by

## Aliaksandr Valialkin

#### Founder and core developer at VictoriaMetrics 