To Compress or Not to Compress — A Zarr Question

Ariel Lubonja
4 min read · Apr 23, 2023

This article assumes you already know what the Zarr library is about; I wrote about it a few months ago.

If you have used Zarr, you know how easy Zarr makes it to compress and decompress your data — compression is seamlessly integrated and requires absolutely no change in your code or usage. But should you use it?

TL;DR: Compression is worth it if your disk has slow read bandwidth (e.g. a spinning hard disk, network disk, or cold-storage option) and your data is at least 2x compressible. If your architecture uses solid-state drives or faster, storing data uncompressed allows faster access.

Let’s dive into the experiments. Bear with me as I write a bit of context for the results to make sense.

The Data

Image Credit: International Energy Agency

My data is boundary-layer turbulence data from the National Center for Atmospheric Research. It will soon be available on the Johns Hopkins Turbulence Database, with which I am affiliated. Each point in 3D space is described by 6 variables: energy, pressure, temperature, and the three velocity components. These variables are stored as separate arrays, so we have six 3D arrays of 32-bit floats, 32 GB each. We are exclusively interested in read speed.

Turbulence Visualization (Velocity Component u)

Experiment Method — Sequential Chunk-Aligned Read-Only Access

In my experiments, I randomly pick a chunk to start from, then sequentially read increasingly large cubes of data, where each edge of the cube has length cube_root_length. The chunks are of size 256³ and each array is 2048³. I picked 3 of the 6 variables because they have significantly different compression ratios:

Arrays Information. Image Credit: Author
  • The energy field (‘e’) is the most compressible, at a 2.5 storage ratio. It should see the largest read-speed benefit from Zarr compression.
  • Temperature (‘t’) is moderately compressible at 1.8.
  • w-velocity is the w-component of the velocity vector (velocity in 3D is a vector of length 3, whose components are stored separately). It is the least compressible at 1.3.

Terminology — Cold vs. Warm Access

  • Cold-access experiments are meant to reflect first-time read performance. Since data you access is cached by the OS and filesystem, repeatedly reading the same file is several times faster. To avoid this in the cold-access experiments, I read 32 GB of unrelated data (one of the other data arrays) before each run. This did not remove all caching effects, but the results are only 20% slower than a truly fresh access and, most importantly, they are consistent across experiments.
  • Warm access is what you would expect: it is repeatedly reading the same piece of data.

Results — Cold Access

Note how the blue line (compressed access time) shifts above the orange one (uncompressed) as you scroll through the images, which shows the effect of lowering an array's compression ratio.

Energy - 2.5 Storage Ratio. Image Credit: Author
Temperature - 1.8 Storage Ratio. Image Credit: Author
w-Velocity - 1.3 Storage Ratio. Image Credit: Author

Results — Warm Access

The results show compressed access being slightly slower (or at best just as fast) for repeated (warm) access. This is because cached data has a much higher read speed than data on disk. Compression reduces read time (there is less data to read) but adds decompression overhead, so we expect its benefit to shrink as effective read speed increases.
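A back-of-the-envelope model makes this trade-off concrete. With disk bandwidth B, decompression throughput D (decompressed bytes per second), and storage ratio r, reading S bytes uncompressed costs S/B, while reading it compressed costs S/(rB) + S/D; compression therefore wins when B < D(1 - 1/r). The throughput figure below is an illustrative assumption, not a measurement from the experiments.

```python
def breakeven_bandwidth(decomp_mb_s, storage_ratio):
    """Disk bandwidth (MB/s) below which compressed reads win,
    from S/(r*B) + S/D < S/B  =>  B < D * (1 - 1/r)."""
    return decomp_mb_s * (1 - 1 / storage_ratio)

# Assumed decompression throughput of ~1000 MB/s (illustrative only):
decomp = 1000
thresholds = {name: breakeven_bandwidth(decomp, r)
              for name, r in [("e", 2.5), ("t", 1.8), ("w", 1.3)]}
```

Under this assumption, the break-even bandwidth for the energy field (ratio 2.5) is 600 MB/s, comfortably above a spinning disk but below a typical NVMe SSD, which matches the cold-access results: the less compressible the array, the lower the bandwidth at which compression stops paying off.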

Summary

These experiments show the trade-off of storing compressed vs. uncompressed scientific data using Zarr.

Compression is worth it if your disk has slow read bandwidth (e.g. a spinning hard disk, network disk, or cold-storage option). If your architecture uses solid-state drives or faster, storing data uncompressed allows faster access.

Here’s the notebook I used for the experiments — link


Ariel Lubonja

I am a PhD student in Computer Science at Johns Hopkins University. Area: High Performance Computing, Graph Machine Learning