Data Compression in Aerospike

Rahul Ranjan · Wind of Change · Jan 30, 2022
[Figure: AS cache in an application]

Aerospike is a fast, open-source, distributed key-value NoSQL database that provides both persistent and in-memory storage. With persistent storage, data is written to SSDs, and to optimise that storage Aerospike ships with several lossless compression strategies such as LZ4, Snappy and Zstandard. But here is the catch: these compression strategies are available only for the persistent layer. So what about in-memory caches?

Well, that, my friend, is up to us.

In-memory caching means that instead of writing cache data to an external storage device, the data stays in main memory. This gives applications low latency and high throughput, which is why the strategy is increasingly common in large-scale applications and microservices. It is therefore worth optimising for space here as well; otherwise the system will choke under high load.

Parameters to consider while selecting the compression algorithm:

  • Data type → Not all compression algorithms yield the same level of optimisation for every data type.
  • Data usage frequency → It is important to understand whether the data to be compressed is read-heavy or a mix of read and write operations. In the former case, we can choose an algorithm like BZip, which gives an excellent compression ratio at the cost of high compression time.
  • Latency → For static data, e.g. the homepage of your app, we can use extreme levels of compression. This will certainly hurt the compression-latency metric, but the approach is acceptable because compression is a one-time activity. It should not, however, increase decompression latency, as that would slow the application and degrade the user experience.
  • System resources → Compression and decompression are resource-intensive operations (CPU, heap, memory), so tracking these metrics is also important.
  • Compression level → Algorithms like GZip and Brotli provide several levels of compression. The idea is to give the developer the flexibility to select the level that suits the application, depending on latency, system-resource usage and compression ratio (see the sketch after this list).
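
To make the level and strategy trade-off concrete, here is a minimal Java sketch using java.util.zip.Deflater. It compresses the same payload at level 1, the default level 6 and level 9, and also with the FILTERED and HUFFMAN_ONLY strategies, which are presumably what the "GZip Filtered" and "GZip Huffman" variants below refer to. The sample payload is made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class CompressionLevels {

    // Compress with a chosen deflate level (1-9) and strategy.
    static byte[] deflate(byte[] input, int level, int strategy) {
        Deflater deflater = new Deflater(level);
        deflater.setStrategy(strategy);   // DEFAULT_STRATEGY, FILTERED or HUFFMAN_ONLY
        deflater.setInput(input);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up, repetitive JSON payload purely for illustration.
        byte[] payload = "{\"user\":42,\"items\":[1,2,3]}".repeat(200).getBytes();

        System.out.println("original size     : " + payload.length);
        System.out.println("level 1 (fastest) : " + deflate(payload, Deflater.BEST_SPEED, Deflater.DEFAULT_STRATEGY).length);
        System.out.println("level 6 (default) : " + deflate(payload, 6, Deflater.DEFAULT_STRATEGY).length);
        System.out.println("level 9 (best)    : " + deflate(payload, Deflater.BEST_COMPRESSION, Deflater.DEFAULT_STRATEGY).length);
        System.out.println("level 9, filtered : " + deflate(payload, Deflater.BEST_COMPRESSION, Deflater.FILTERED).length);
        System.out.println("huffman-only      : " + deflate(payload, Deflater.BEST_COMPRESSION, Deflater.HUFFMAN_ONLY).length);
    }
}
```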

To achieve this, we need to compress data at the application layer before writing it to the cache. We can employ any one of the several compression strategies available.
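
As a minimal sketch of this pattern, the helpers below gzip a serialized payload before an Aerospike put and gunzip it after a get, using the Aerospike Java client and java.util.zip. The host, namespace, set and bin names are hypothetical, and any of the codecs discussed above could replace gzip.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedCacheClient {

    // Compress the serialized payload before it leaves the application layer.
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    // Decompress after reading the record back from the cache.
    static byte[] decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical host, namespace, set and key, purely for illustration.
        try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
            Key key = new Key("in_memory_ns", "user_profiles", "user:42");
            byte[] json = "{\"id\":42,\"name\":\"...\"}".getBytes();

            // Write the compressed blob as a single byte[] bin.
            client.put(null, key, new Bin("payload", compress(json)));

            // Read it back and decompress in the application.
            Record record = client.get(null, key);
            byte[] restored = decompress((byte[]) record.getValue("payload"));
            System.out.println(new String(restored));
        }
    }
}
```

Storing the compressed payload as a single blob bin keeps the read path to one get plus one in-process decompress; the trade-off is that the server can no longer inspect or query individual fields inside that bin.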

I ran some tests to compare the performance of these algorithms on sample JSON data. GZip at compression level 6 was chosen as the baseline, since it is the most widely used compression strategy, and LZ4, Snappy, GZip Huffman, GZip Filtered and GZip Best Compression were compared against it. I built a multi-threaded scheduler that continuously reads data from a JSON file of ~108 KB and compresses and decompresses it.
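
A rough sketch of that kind of harness is shown below (reconstructed for illustration, not the exact code): a fixed thread pool repeatedly compresses and decompresses the JSON payload, records per-operation latencies and reports p50/p90. The file path, thread count and iteration count are placeholders, and only the deflate level-6 baseline is shown; the other codecs slot into the same loop.

```java
import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressionBenchmark {

    static byte[] deflate(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        d.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] input) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!inf.finished()) out.write(buf, 0, inf.inflate(buf));
        inf.end();
        return out.toByteArray();
    }

    // Nearest-rank percentile over a sorted list of nanosecond samples.
    static long percentile(List<Long> sortedNanos, double p) {
        int idx = (int) Math.ceil(p * sortedNanos.size()) - 1;
        return sortedNanos.get(Math.max(idx, 0));
    }

    public static void main(String[] args) throws Exception {
        // Placeholder path to the sample JSON document.
        byte[] json = Files.readAllBytes(Path.of("sample.json"));

        int threads = 8, iterationsPerThread = 1000;   // placeholder load parameters
        List<Long> compressNanos = Collections.synchronizedList(new ArrayList<>());
        List<Long> decompressNanos = Collections.synchronizedList(new ArrayList<>());

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < iterationsPerThread; i++) {
                    long t0 = System.nanoTime();
                    byte[] compressed = deflate(json, 6);   // level-6 baseline
                    long t1 = System.nanoTime();
                    inflate(compressed);
                    long t2 = System.nanoTime();
                    compressNanos.add(t1 - t0);
                    decompressNanos.add(t2 - t1);
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);

        Collections.sort(compressNanos);
        Collections.sort(decompressNanos);
        System.out.printf("compress   p50=%.2f ms  p90=%.2f ms%n",
                percentile(compressNanos, 0.50) / 1e6, percentile(compressNanos, 0.90) / 1e6);
        System.out.printf("decompress p50=%.2f ms  p90=%.2f ms%n",
                percentile(decompressNanos, 0.50) / 1e6, percentile(decompressNanos, 0.90) / 1e6);
    }
}
```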

[Chart: compressed data sizes]
[Chart: p50 compression & decompression latencies]
[Chart: p90 compression & decompression latencies]

PS: BZip gave a very good compression ratio for our sample JSON data, but its compression latencies are not sustainable: they range from 100 to 10,000 times those of the other strategies.
