Cloud-Performant NetCDF4/HDF5 with Zarr, Fsspec, and Intake

Richard Signell · Published in pangeo · Dec 14, 2020 · 6 min read
Maximum water levels from the Hurricane Ike test case used here. Just a pretty picture. :)

NetCDF4/HDF5 files can be read from cloud storage just as effectively as newer cloud-optimized formats if you follow a few simple steps…

By Rich Signell (USGS), Martin Durant (Anaconda) and Aleksandar Jelenak (HDF Group)

NetCDF4/HDF5 files, self-describing binary formats for multidimensional data, have existed for many years and are commonly used across science and engineering. They provide rich metadata and the ability to chunk data into small pieces, often with compression and filtering options that allow data providers to balance file size with performance. NetCDF4/HDF5 files were designed for file systems, however, and they suffer performance issues when accessed from cloud object storage with their standard libraries. The Zarr format was designed specifically to overcome these issues: it uses the same basic data model as netCDF4/HDF5, but stores the metadata in a single JSON object and each chunk as a separate object. This allows unlimited dataset sizes and parallel writes as well as reads, and as a result Zarr is now widely used for cloud-optimized storage of multidimensional data.

There are still many large collections of netCDF4/HDF5 files, and for certain organizations and projects there may be a mandate for the production of netCDF4/HDF5 files. It would be nice if there were a way to read them more efficiently in the Cloud!

In a previous post, we showed that netCDF4/HDF5 files can be used effectively for analysis in the Cloud by chunking the data appropriately (~10–200MB chunks) and extracting the chunk byte ranges into a metadata file. We then used a modified version of the Zarr library to read chunks in parallel directly from the netCDF4/HDF5 file with no loss in performance over the equivalent Zarr dataset.

While the results were exciting, the approach we used required modifications to the Zarr library, modifications that are still under discussion for inclusion in the official code base.

We recently realized that we can achieve the same functionality by enhancing the Fsspec package, allowing it to create a Mapper that works with the existing Zarr library.

This Fsspec enhancement is a new backend called ReferenceFileSystem. For the metadata format, which we had previously chosen ad hoc, we now propose a simple standard:

{
  "key0": "data",
  "key1": ["protocol://target_url", 10000, 100]
}

where:
key0 specifies data stored directly as text, and
key1 refers to (1) the URL of the data file, (2) the offset of the chunk within that file (in bytes), and (3) the length of the data item (in bytes).

For example, Zarr data can be represented in this specification as:

{
  ".zgroup": "{\"zarr_format\": 2}",
  ".zattrs": "{\"Conventions\": \"UGRID-1.0.0\", ...}",
  "x/.zattrs": "{\"_ARRAY_DIMENSIONS\": [\"node\"], ...}",
  "x/.zarray": "{\"chunks\": [9228245], \"compressor\": null, \"dtype\": \"f8\", ...}",
  "x/0": ["s3://bucket/path/file.nc", 294094376, 73825960]
}

With this new enhancement to Fsspec (version 0.8.5), we can use the existing Zarr library to read netCDF4/HDF5 files efficiently in the Cloud. All that is needed is a one-time creation of the above metadata file by a data provider or user, which can be accomplished with a simple script. If the metadata file is stored in the Cloud, reproducible workflows can be constructed that reference just that file, since it contains the references to the netCDF4/HDF5 files.

To open a netCDF4/HDF5 file with the Zarr library, the user calls fsspec.get_mapper with “reference://” as the protocol and specifies the location of the metadata file. This returns a mapper object that can be used directly with Zarr. Here’s an example using our Hurricane Ike reference dataset:

Opening a NetCDF4/HDF5 file using Xarray, Zarr and Fsspec. Note that only the metadata file is referenced. The location of the actual NetCDF4/HDF5 file is referenced in the metadata.
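For reference, the code in that example boils down to a few lines. Here is a minimal sketch; the bucket path below is illustrative (the actual Hurricane Ike reference JSON lives in a requester-pays bucket in us-west-2), and the keyword names follow the Fsspec documentation for ReferenceFileSystem:

import fsspec
import xarray as xr

# Illustrative location of the chunk-reference metadata file (not the real path)
metadata_url = "s3://bucket/path/ike_metadata.json"

# "reference://" selects the new ReferenceFileSystem backend
mapper = fsspec.get_mapper(
    "reference://",
    fo=metadata_url,                            # the chunk-reference metadata file
    target_protocol="s3",                       # protocol for reading the metadata file itself
    target_options={"requester_pays": True},
    remote_protocol="s3",                       # protocol of the referenced NetCDF4/HDF5 file
    remote_options={"requester_pays": True},    # needed for this requester-pays bucket
)

# The mapper behaves like a Zarr store, so the standard Zarr/Xarray machinery applies
ds = xr.open_zarr(mapper, consolidated=False)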

To make access even easier, we can use Intake, creating a catalog entry that contains the dataset specific code above:

An Intake catalog with two datasets representing the Hurricane Ike simulation: the first using the new ReferenceFileSystem backend for Fsspec, and the second using a standard Zarr format dataset.
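Opening such a catalog from Python is then just a couple of lines; a sketch, with an illustrative catalog path and entry names:

import intake

cat = intake.open_catalog("ike_catalog.yml")   # illustrative catalog path
list(cat)                                      # e.g. ['ike-hdf5-reference', 'ike-zarr'] (illustrative names)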

This allows a user to simply open an Intake catalog and work with the NetCDF4/HDF5 file and the Zarr dataset in an identical manner:

Using an Intake catalog to open both HDF5 and Zarr datasets with a Dask cluster (30 workers/60 cores) and computing the maximum water level from 53GB of data. The performance with HDF5 is not significantly different than with Zarr.
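A minimal sketch of that comparison, assuming the water-level variable is named "zeta" (as in ADCIRC output) and reusing the illustrative catalog path and entry names from above:

import intake
from dask.distributed import Client

client = Client()                                  # stand-in for the 30-worker/60-core cluster

cat = intake.open_catalog("ike_catalog.yml")       # illustrative path, as above
for name in ["ike-hdf5-reference", "ike-zarr"]:    # illustrative entry names
    ds = cat[name].to_dask()
    max_wl = ds["zeta"].max(dim="time").compute()  # assumes the water-level variable is "zeta"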

As expected, the performance with netCDF4/HDF5 is not significantly different than with Zarr. We can also modify the Intake catalog to further optimize performance for both datasets. We note that the chunks in the stored Hurricane Ike data are rather small, only 11.36 MB uncompressed.

We can therefore increase the size of the chunks used by the workflow by specifying their sizes in the catalog; here we specify 30 time steps per chunk instead of the original 10.
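The catalog change simply forwards a chunks argument to Xarray; the equivalent override written directly in Python would look roughly like the sketch below (illustrative metadata path as before; this re-chunks the Dask graph rather than the stored file):

import fsspec
import xarray as xr

# Re-open via the reference mapper, but ask Dask for 30 time steps per chunk
mapper = fsspec.get_mapper("reference://", fo="s3://bucket/path/ike_metadata.json",
                           remote_protocol="s3", remote_options={"requester_pays": True})
ds = xr.open_zarr(mapper, consolidated=False).chunk({"time": 30})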

Now when we open the dataset, Xarray shows us a chunk size of 34.07 MB and 30 steps in the time dimension, as expected.

With this larger chunk size, we see a significant increase in performance, with execution times dropping from about 30 s to about 20 s.

It’s exciting that we can use the current Zarr and Fsspec libraries with this approach to efficiently access netCDF4/HDF5 files in the Cloud. There is, however, the extra step of creating the metadata file. To facilitate this task we have created an Fsspec reference maker repository which contains the script we used to create the metadata for the Hurricane Ike example used here. This can be used as a template for creating Zarr-compatible metadata from a single netCDF4/HDF5 file. We will be adding an additional script for creating metadata from a collection of netCDF4/HDF5 files, as this is a common way that large datasets are archived.
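As an illustration, generating the reference metadata for a single file looks roughly like the following sketch. The class and module names are taken from the fsspec reference maker repository at the time of writing and may change, and the file path is illustrative:

import json
import fsspec
from fsspec_reference_maker.hdf import SingleHdf5ToZarr   # name/location may change as the repo evolves

url = "s3://bucket/path/file.nc"                           # illustrative NetCDF4/HDF5 file
so = dict(requester_pays=True, default_fill_cache=False)   # s3fs options for a requester-pays bucket

# Scan the HDF5 file once, recording each chunk's offset and length
with fsspec.open(url, "rb", **so) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Write the resulting reference metadata for later use with "reference://"
with open("ike_metadata.json", "w") as out:
    json.dump(refs, out)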

There is also an example notebook in the repository that reproduces the figures in the blog post here. You can run this notebook on binder right now. Note that for a fair comparison you should let the cluster fully spin up before comparing the two cases. Also, if you run this example in your own environment, you will need AWS credentials since the data is in a “requester-pays” bucket. This also means you will incur a nominal charge for the demo if you are not running in AWS region us-west-2.

There are a few other things you need to know to use this functionality.

You need to install Fsspec ≥ 0.8.5 and s3fs ≥ 0.5.2; you can use or adapt this environment.yml file for Conda. Also, the compression and filter options used in the netCDF4/HDF5 file need to be understood by Zarr, and some are not yet supported. We used Zlib encoding with no filters for the Hurricane Ike test case, which is supported by both netCDF4/HDF5 and Zarr. Here’s the full encoding:

Encoding in the Hurricane Ike simulation NetCDF4/HDF5 file.
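If you want to check what encoding your own file uses before building the reference metadata, a quick local check with Xarray works; the file and variable names below are illustrative:

import xarray as xr

ds = xr.open_dataset("local_copy.nc")   # illustrative local copy of the NetCDF4/HDF5 file
print(ds["zeta"].encoding)              # shows zlib/complevel, chunk sizes, and any filters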

Note that while Zarr is multi-language and the metadata specification is in JSON, only this Python-Fsspec implementation uses it at present.

Looking to the future, we are excited to see how others will use this chunk specification to optimize reading of other types of data in the Cloud. And while we’ve proposed a generic specification that does not depend on changes to the Zarr library, the Zarr core protocol version 3 might formally specify a mechanism for referencing chunks in other datasets. That will not obviate the usefulness of the specification here, which could be used for any library, not just Zarr.

Disclaimer: Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
