Cloud-Performant NetCDF4/HDF5 Reading with the Zarr Library

Published in pangeo · Feb 26, 2020

Rich Signell (USGS), Aleksandar Jelenak (HDF Group) and John Readey (HDF Group)

The Zarr format is gaining popularity for storing multidimensional array data on the Cloud, in part because reading Zarr data with the Zarr library is significantly faster than reading NetCDF4/HDF5 data with the NetCDF4/HDF5 libraries. Zarr stores each chunk of a dataset as a separate object in Cloud object storage, making it efficient for clusters of CPUs to access the data in parallel. It also consolidates all the metadata in a single location, so the metadata can be retrieved with just one read.
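For readers less familiar with Zarr, here is a minimal sketch (not from the post) of that consolidated-metadata idea, using the zarr-python v2 API; the store path, variable name, shapes and chunking below are made up for illustration.

```python
import zarr

# Create a small Zarr store with one chunked variable (hypothetical names/shapes).
store = zarr.DirectoryStore("example.zarr")
root = zarr.group(store=store, overwrite=True)
root.create_dataset("zeta", shape=(720, 100_000), chunks=(10, 100_000), dtype="f4")

# Consolidate every .zgroup/.zarray/.zattrs entry into a single .zmetadata key...
zarr.consolidate_metadata(store)

# ...so a reader needs only one request to learn the full layout of the store.
ds = zarr.open_consolidated(store)
print(ds["zeta"].chunks)
```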

The Zarr library is used to access multiple chunks of Zarr data in parallel. But what if Zarr were used to access multiple chunks of an HDF5 file via byte-range requests in parallel? And what if the HDF5 chunk locations were stored in the Zarr metadata, thus requiring just one read? If we did these two things could we access HDF5 format data as efficiently as Zarr format data using the Zarr library?

The answer from our initial testing is: YES!

We started with an existing ocean simulation in NetCDF4 format (which is an HDF5 file with NetCDF4 conventions) containing a 53GB (28GB compressed) water level variable split into 11MB chunks (6MB compressed). We used Xarray to convert the dataset to Zarr format using the same chunking, compression and filter options. We then created the augmented Zarr metadata with the chunk locations and tried reading both the Zarr and HDF5 data using Xarray and a Dask cluster with 60 CPUs.
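The conversion step might look roughly like the sketch below, assuming a local copy of the file; the file names, variable name (“zeta”), dask chunking and compression level are placeholders rather than the exact values used in the experiment.

```python
import xarray as xr
import zarr

# Open the NetCDF4/HDF5 file lazily with dask, reuse its HDF5 chunk shape,
# and write a Zarr store with comparable compression and consolidated metadata.
ds = xr.open_dataset("ocean_model.nc", chunks={"time": 10})

src_chunks = ds["zeta"].encoding.get("chunksizes")   # chunk shape of the source variable
ds["zeta"].encoding.clear()                          # drop NetCDF-specific encoding before the Zarr write

encoding = {"zeta": {"chunks": src_chunks,
                     "compressor": zarr.Zlib(level=4)}}   # compression level is an assumption
ds.to_zarr("ocean_model.zarr", encoding=encoding, consolidated=True)
```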

Here’s the Zarr library reading Zarr format:

Zarr library reading Zarr format data.
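The screenshot is not reproduced here, but the read looked roughly like this sketch; the cluster setup, bucket and variable name are placeholders (the actual run used a 60-CPU dask-kubernetes cluster on the Pangeo AWS hub).

```python
import fsspec
import xarray as xr
from dask.distributed import Client

client = Client()                                    # stand-in for a dask-kubernetes cluster

store = fsspec.get_mapper("s3://example-bucket/ike/ocean_model.zarr", anon=True)
ds = xr.open_zarr(store, consolidated=True)          # one request for all the metadata

# A reduction over the whole variable forces every chunk to be read in parallel.
result = ds["zeta"].mean(dim="time").compute()
```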

And here’s the Zarr library reading HDF5 format:

Zarr library reading NetCDF4/HDF5 format data.
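Again the screenshot is not reproduced; the sketch below shows the general shape of the call, but it depends on the modified Zarr and Xarray libraries described below, and the FileChunkStore constructor and all paths shown are assumptions rather than the exact API of those experimental branches.

```python
import fsspec
import xarray as xr
from zarr.storage import FileChunkStore              # only in the modified Zarr library

# Placeholders throughout; the FileChunkStore signature here is an assumption.
meta = fsspec.get_mapper("s3://example-bucket/ike/zeta-meta")          # augmented .zmetadata
h5file = fsspec.open("s3://example-bucket/ike/ocean_model.nc", "rb").open()
chunks = FileChunkStore(meta, chunk_source=h5file)

ds = xr.open_zarr(meta, consolidated=True, chunk_store=chunks)
result = ds["zeta"].mean(dim="time").compute()       # byte-range reads straight from the HDF5 file
```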

The time it takes to open both the Zarr and HDF5 datasets is short (less than a few seconds), and the read times for the two methods are about the same (22–24 seconds). In fact, running the notebook repeatedly over several days shows more variation from run to run than between the two methods.

If we read the HDF5 file using the standard HDF5 library (using h5py), it takes about twice as long to read the data (41 seconds).

HDF5 library reading NetCDF4/HDF5 format data.
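For comparison, the HDF5-library baseline can be sketched with the h5netcdf engine (which reads through h5py); the bucket, file name, variable name and chunking are placeholders, and the exact code used is in the “ike_3ways” notebook linked below.

```python
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
f = fs.open("s3://example-bucket/ike/ocean_model.nc", "rb")

ds = xr.open_dataset(f, engine="h5netcdf", chunks={"time": 10})
result = ds["zeta"].mean(dim="time").compute()       # every chunk decoded by the HDF5 stack
```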

The Zarr-HDF5 connector was enabled by slight modifications to both the Zarr and Xarray libraries. We added a FileChunkStore class to the Zarr library that enables reading chunks from formats other than Zarr (here HDF5). And we made a tiny modification to Xarray to enable the chunk_store argument on the open_zarr method.

The FileChunkStore is a general concept that could be used not only for NetCDF4/HDF5 data, but for any data store; all that is needed are the chunk locations to enable the byte-range requests. For HDF5, this was greatly facilitated by the chunk query API introduced in the HDF5 1.10.5 library. The only use of the HDF5 library in this example was in the Python script that generated the augmented .zmetadata file that the Zarr library reads (look for JSON keys with .zchunkstore for the chunk file locations). When we access the data, we are just grabbing ranges of bytes through the Zarr interface, so nothing from the HDF5 library is being used. This means we avoid the HDF5 library’s issues with concurrent multi-threaded reads.
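As an illustration of what that script does, the sketch below uses h5py’s low-level wrappers for the HDF5 1.10.5 chunk query API to collect chunk byte offsets and sizes; the file name, variable name and JSON field names are illustrative, not the exact schema written by the real script.

```python
import json
import h5py

# Collect the byte offset and size of every chunk of one variable, keyed the way
# Zarr names chunks ("0.0", "0.1", ...). Requires h5py >= 2.10 built against HDF5 >= 1.10.5.
chunk_index = {}
with h5py.File("ocean_model.nc", "r") as f:
    dset = f["zeta"]
    for i in range(dset.id.get_num_chunks()):
        info = dset.id.get_chunk_info(i)              # StoreInfo: chunk_offset, byte_offset, size
        key = ".".join(str(c // s) for c, s in zip(info.chunk_offset, dset.chunks))
        chunk_index[key] = {"offset": info.byte_offset, "size": info.size}

# Information like this ends up under the .zchunkstore keys of the augmented .zmetadata.
print(json.dumps(chunk_index, indent=2)[:300])
```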

The full notebook example is in the “ike_3ways” Jupyter notebook which can be run on the Pangeo AWS Binder, which gives you the ability to start a dask-kubernetes cluster.

The links to the modified Zarr and Xarray libraries can be found in the Binder environment.yml file.

If you want to try this approach with your own NetCDF/HDF5 data, you can create your own chunk store .zmetadata by running this script, just substituting your own file location. This script requires the modified Zarr library. In our example it took about 25s to create a 0.5MB enhanced metadata file from the 28GB HDF5 file.

It appears that for this test case, many CPUs reading concurrently from the same object is as efficient as reading from separate objects. In addition to this test, we tried another dataset with eight data variables instead of one and bigger chunk sizes (about 100MB). The results were the same: reading Zarr data and reading NetCDF4/HDF5 data with the Zarr library took about the same time, while reading NetCDF4/HDF5 data with the HDF5 library took about twice as long.

We have only tested this approach on AWS, but based on maximum storage request rates, it should work on Google Cloud Storage and Microsoft Azure as well. In our example each request takes about 350ms (see the Dask Performance Report for more detailed information), so with 60 CPUs we make about 170 requests per second. Even if the requests took only 100ms, cluster sizes of up to 550 CPUs could be used before hitting the Amazon S3 limit of 5,500 GET requests per second per prefix. The maximum request rate limit for other cloud providers is currently 5000 per second for Google Cloud Storage and 500 per second for Microsoft Azure Blob Storage, both comfortably larger than our test rate of about 170 requests per second.
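The arithmetic behind those numbers, for anyone who wants to adjust it for their own cluster:

```python
# Request rates from the paragraph above.
cpus = 60
seconds_per_request = 0.35
print(round(cpus / seconds_per_request))   # ~171 concurrent byte-range requests per second

# With 100 ms requests, how many CPUs fit under S3's 5,500 GET/s per-prefix limit?
requests_per_cpu = 1 / 0.100               # 10 requests per second per CPU
print(5500 / requests_per_cpu)             # 550 CPUs
```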

It’s worth pointing out that reading the metadata in one shot and then making byte-range requests for the chunks is the same pattern used by Cloud-Optimized GeoTIFF, the de facto standard for storing 2D geographic raster data on the Cloud, and the cloud-optimized format that scored highest in a recent EOSDIS study. Here we are simply applying the approach to N-dimensional array data, allowing direct access to the data in Cloud storage without using any services (other than S3 itself). This is in contrast to existing approaches to cloud-performant HDF5 data such as the HDF Group’s HSDS service and the Hyrax data service.

We are enthusiastic about the possibility of accessing NetCDF4/HDF5 files in a cloud-performant way without a service in the middle; however, some existing NetCDF4/HDF5 datasets may not perform well if their chunk sizes are too small. If chunks are less than a few MB, read times can drop below 100ms, at which point Dask’s per-task overhead starts to dominate and performance suffers. The Pangeo project has found that chunk sizes in the range of 10–200MB work well with Dask reading from Zarr data in cloud storage, and Amazon’s Best Practices for S3 recommend making concurrent requests for byte ranges of an object at the granularity of 8–16 MB. Re-chunking of existing NetCDF4/HDF5 files can be done with Xarray, or with C-based command-line tools such as the NCO utility ncks, the NetCDF utility nccopy or the HDF5 utility h5repack. Of course, new NetCDF4/HDF5 files can simply be created with cloud-appropriate chunk sizes.
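A rechunking sketch with Xarray follows; the file names, variable and dimension names, target chunk shape and compression settings are placeholders, and the command-line tools mentioned above can do the same job without going through Python.

```python
import xarray as xr

# Read a file with small chunks and rewrite it with larger, cloud-friendly chunks.
ds = xr.open_dataset("small_chunks.nc", chunks={"time": 1000})

target = (min(144, ds.sizes["time"]), ds.sizes["node"])   # aim for roughly 10-200 MB per chunk
encoding = {"zeta": {"zlib": True, "complevel": 4, "chunksizes": target}}
ds.to_netcdf("bigger_chunks.nc", encoding=encoding, engine="netcdf4")
```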

One might ask: if we need to rechunk the data anyway, why not just write Zarr? The Zarr format has the benefit of being a simple, performant format that is being developed bottom-up, by the community, for the community. And Unidata is currently working to add Zarr as a storage format for the NetCDF C library, which will allow any NetCDF-based code to read and write Zarr. For now, however, it seems useful to have an approach that reads NetCDF4/HDF5 files in the Cloud with performance comparable to the Zarr format, and that decouples the Zarr interface from the Zarr data format. We intend to use this approach to store appropriately chunked NetCDF4/HDF5 files on Cloud storage, which will keep them accessible to existing software that reads NetCDF and HDF5, including NetCDF-Java based services such as THREDDS and ERDDAP.

The flexibility and simplicity of the Zarr library allow creative modifications without significant effort; all we added was the ability to read chunks from specified file locations. Similarly, in the blog post “Arrays on the Fly”, Rachel Prudden from the Met Office Informatics Lab was able to override just a few Zarr methods to grab data from a collection of NetCDF files. These capabilities may eventually be added to the HDF5 C library, but that will take significant effort and time, while the Zarr library made this experiment straightforward: the HDF Group performed this USACE/USGS-contracted work in less than two weeks.

We have only explored NetCDF4/HDF5 access with Xarray, and we have not addressed collections or grouped data, so these tests may not have exposed other potential issues.

We still need to work with the Zarr and Xarray communities to figure out the best way to incorporate this functionality in the existing libraries, deal with collections of NetCDF4/HDF5 files and more, but we are excited by this initial experiment.

Richard Signell is a research oceanographer turned Pangeo advocate and the sole member of Open Science Computing, LLC.