The World’s First Middle-Out Compression for Time-series Data — Part 2
Since the dawn of time mankind hath sought to make things smaller…
In the previous part we presented the results of our new Middle-out compression algorithm for time-series data. Today we are going to dive deep into the technical details.
The Middle-out Part
We all know, how the Middle-out compression works at the HBO show , but how do we actually apply this to the time-series data?
Take a look at the following schema. There is an input vector of a data. This vector is divided to Middle-out segments. For simplicity, in the image there are only four of a segments, but compression actually uses eight. This is because the AVX-512 vector instructions can process up to 8 doubles at once. If we used a single precision numbers, then we’d have 16 Middle-out segments and proceed 16 elements at a time.
The first elements of each segment (denoted by blue) are referencing values. These values are stored as-is in the front of a compressed data.
As long as we have multiple referencing points, we can compute diffs between the following values in parallel. That allows us to iterate through all Middle-out segments simultaneously, in the same block of the SIMD instructions. Within each iteration, the current values are XORed with the previous ones and the results of this operation are stored in one data block.
If there is a reminder of the division of the length of the input vector by the count of Middle-out blocks, values from that reminder are stored uncompressed at the end of compressed data.
Of course the uncompressed header and footer add some space overhead, but we assume that the overhead would be amortized among at least a few thousands values.
Each data block starts with a bit mask determining what values are the same as the previous ones. If the corresponding bit is set, then we know that the value didn’t change. If all of the bits are set, no further information is stored — all values are the same as the previous ones. In that case, we are able to store an unchanged value in just single bit.
Next, we store the right offsets (trailing zeros) rounded down to bytes. As long as we are addressing whole bytes, we need only 3 bits to store that offset. Only the offsets respective to the changed values are stored. Then the max length (within this block) of the non-zero XORed value is stored. This length holds bytes as well, and we know the max length could not be 0; that would mean all the values are the same as the previous ones and we wouldn’t store this information at all. Hence we need to store only lengths from 1 to 8 that can be stored in 3 bits. We store length only once per block, because we assume that all compressed data would have, in general, the same characteristics, i.e. changing by approximately the same value. All these values encoded in 3 bits (offsets and length) are conjoined and an optional padding is inserted if needed to reach the byte boundary.
Lastly, the XORs parts are stored at the end of the the data block.
How This Compression Benefits from the AVX-512
We have two implementations, one using vector instructions and the other don’t. And we see AVX-512 offers great speedup, in fact, it is the first instruction set which offers all vector instructions essential for this algorithm.
First of all, it is a vector leading zeros count instruction. AVX2 does not offer such instruction, so the data would have to be moved back and forth between scalar and vector registers.
Gather instructions (loading data to a vector from a different locations of a memory) was already introduced in the AVX2, but its counterpart — Scatter (storing data) has finally arrived and been incorporated in the AVX-512. Did you know that the scatters can actually be used as compression instructions? One simple trick is to write to overlapping addresses. It is guaranteed that these stores will be executed in one particular order. Not using this optimization, compression crossing vector elements boundaries would be far more expensive.
The next step forward in the AVX-512 is a support for masked instructions with a dedicated mask registers. Bit mask allows to specify elements on which computation should be performed. This allows us to ged rid of a unpredictable branching in performance critical loops.
There is even an instruction actually called compress. This instruction contiguously store only selected elements from a source vector to another vector, or to a memory location.
We tuned our implementations to squeeze out every bit of performance. We made sure not even a single instruction is wasted. For instance, vector implementation of the inner loop for decompression has only about 80 instructions total. That is about 10 CPU instructions per a decompressed value.
As it turns out, the middle-out approach can have a significant benefits in data processing and we are going to investigate other areas where it could be applied. If You have any challenging task that could benefit from the vector instructions, feel free to contact us.
Source codes of the vectorized and the scalar implementation can be found on the Github .