Using Kerchunk with uncompressed NetCDF 64-bit offset files: Cloud-optimized access to HYCOM Ocean Model output on AWS Open Data

Richard Signell · Published in pangeo · Jul 19, 2024
A global slice of Sea Water Temperature from the HYCOM Reanalysis, extracted directly from AWS Open Data S3 object storage in about 6s using a single CPU.

By Rich Signell (Open Science Computing, LLC) and Martin Durant (Anaconda)

Several years ago we wrote about a cloud-optimized approach to reading NetCDF4/HDF5 files from object storage using Kerchunk. The Kerchunk approach is now being used widely for improving access to datasets stored in NetCDF4/HDF5, GRIB2, and GeoTIFF, which all store arrays as compressed chunks of data.

In this post we explore using Kerchunk to read data that is not compressed: the 63,341 NetCDF 64-bit Offset Format (CDF-2) files (285TB) that constitute the HYCOM reanalysis dataset on AWS Open Data.

Generating the Virtual Dataset

Here we describe the notebook we used to generate the references for the virtual dataset. If you just want to access the virtual dataset see the next section, Accessing the Virtual Dataset.

Although CDF-2 files are not explicitly chunked, the entire data for each variable in each file acts like a single chunk, since it resides in one contiguous byte range. For HYCOM, 3D data variables are dimensioned [time: 1, lat: 3251, lon: 4500] and 4D variables are dimensioned [time: 1, depth: 40, lat: 3251, lon: 4500]. The data is stored as short integers (with scale_factor and add_offset attributes), so the 3D variable chunks in each single-time-value file have a size of [3251 * 4500 * 2 bytes]/1e6 = 29.26MB, while the 4D variable chunks are 40x larger, 1.17GB.

The 1.17GB 4D chunks are too large to be read efficiently (chunks of 10–200MB are preferred), but because the data is not compressed, we can split the data into smaller chunks virtually by creating references to “subchunks”. Using the kerchunk.utils.subchunk function, we split the 4D variables along vertical levels, so that each subchunk is a single layer. This makes it possible to read both 3D and 4D variables in 29.26MB chunks, and makes parallel reading of 4D variables with the Zarr library much more efficient. (Note that some changes to the subchunk function were needed to allow it to work on the first non-singleton dimension, so kerchunk≥0.2.6 is required.)
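To make this concrete, here is a minimal sketch of this step (not the notebook's exact code): it builds references for a single CDF-2 file with kerchunk.netCDF3.NetCDF3ToZarr and then splits each 4D variable into 40 single-layer subchunks. The S3 path and the variable names are illustrative assumptions.

```python
import kerchunk.netCDF3
import kerchunk.utils

url = "s3://hycom-example-bucket/2010/hycom_glby_2010010100.nc"  # placeholder path
so = {"anon": True}  # the AWS Open Data bucket is public

# Scan the uncompressed CDF-2 file and record byte-range references
refs = kerchunk.netCDF3.NetCDF3ToZarr(url, storage_options=so).translate()
store = refs.get("refs", refs)  # work with the flat reference mapping

# Split each ~1.17GB 4D chunk into 40 single-layer subchunks of ~29MB
# (kerchunk >= 0.2.6 is needed to subchunk the first non-singleton dimension)
for var in ["water_temp", "salinity", "water_u", "water_v"]:  # assumed variable names
    store = kerchunk.utils.subchunk(store, var, 40)
```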

For this dataset, the references for each file are identical, except for the time coordinate value and the referenced URL. We can take advantage of this to generate a single Python reference dict object for the first file and then use it as a template for the rest of the 63,341 files. The result is that we can generate references for all the files with a single CPU in less than 30s!
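The templating trick looks roughly like the sketch below. The [url, offset, length] triples are how Kerchunk stores byte-range references; the encoded time value passed in for each file is left as a hypothetical argument here.

```python
import copy

def refs_for_file(template, url, time_chunk):
    """Clone the template references, pointing byte ranges at `url` and
    replacing the inlined time-coordinate chunk (hypothetical sketch)."""
    refs = copy.deepcopy(template)
    for key, val in refs.items():
        if isinstance(val, list) and len(val) == 3:  # [url, offset, length]
            refs[key] = [url, val[1], val[2]]
    refs["time/0"] = time_chunk  # encoded time value for this particular file
    return refs

# all_refs = [refs_for_file(store, u, t) for u, t in zip(file_urls, time_chunks)]
```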

The next step is to generate the combined references and store them in Kerchunk’s Parquet format, which both saves space and makes reading more efficient. Because we don’t have write access to the HYCOM bucket on AWS, we push the references to ESIP’s bucket on the Open Storage Network, which has advantages similar to AWS Open Data: users don’t need credentials and access to the references is free (no egress fees).
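Sketched with assumed dimension names and a placeholder output path, the combine-and-write step might look like this (MultiZarrToZarr and kerchunk.df.refs_to_dataframe are the relevant Kerchunk pieces; the exact options in the notebook may differ):

```python
from kerchunk.combine import MultiZarrToZarr
import kerchunk.df

# Concatenate the per-file references along time (dimension names assumed)
mzz = MultiZarrToZarr(
    all_refs,                        # list of reference dicts from the template step
    concat_dims=["time"],
    identical_dims=["lat", "lon", "depth"],
    remote_protocol="s3",
    remote_options={"anon": True},
)
combined = mzz.translate()

# Write the combined references in Kerchunk's Parquet format;
# the Open Storage Network target path below is a placeholder
kerchunk.df.refs_to_dataframe(combined, "s3://example-osn-bucket/hycom/hycom.parq")
```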

The virtual Xarray dataset resulting from the combined, sub-chunked references.

Accessing the Virtual Dataset

The virtual dataset can be opened using Xarray’s kerchunk engine, but we can simplify access for users by creating an Intake catalog. In our notebook that creates the Intake version 2 catalog, we open the dataset using the kerchunk engine and then create the catalog programmatically.
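Opening the virtual dataset then looks something like the sketch below; the reference URL is a placeholder for the Parquet references on OSN, and the storage options are typical settings for a public reference filesystem rather than the notebook's exact values.

```python
import xarray as xr

ds = xr.open_dataset(
    "s3://example-osn-bucket/hycom/hycom.parq",  # placeholder reference location
    engine="kerchunk",
    chunks={},                                   # lazy, Dask-backed arrays
    storage_options={
        "remote_protocol": "s3",
        "remote_options": {"anon": True},  # the HYCOM data bucket is public
        "target_options": {"anon": True},  # the reference bucket needs no credentials
    },
)
```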

In the example data access notebook, we first demonstrate that accessing a global field at a specific level and time is fast, because you only need to load one chunk of data — no cluster required!
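Continuing the sketch above, pulling a single global surface field touches only one ~29MB subchunk (the variable name is assumed):

```python
# One time step, one depth level: a single ~29MB read from S3
field = ds["water_temp"].isel(time=0, depth=0).load()
field.plot(figsize=(12, 5))
```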

While it’s fast to load a field of data at a particular level, there is no way to overcome the fact (without rechunking the data and creating a new copy) that there is a single time step per file, and therefore a single time step per chunk.

To load a time series at a point, or any other subset which requires numerous chunks of data to be loaded, a Dask cluster can be used to parallelize the process and speed things up.

In the example access notebook, we use a Coiled.io cluster with 100 small workers (200 threads) on AWS in the region where the data resides (us-west-2) to extract a complete time series at a specific geographic point. Running in parallel, this process extracts a single data value from each of the 63,341 files in about 8 minutes. We can speed this up by telling the Dask workers to load multiple time steps for each task. If we open the dataset with chunks={'time':20} we can do the same work in under 3 minutes, at a cost of about 20 cents. Specifying larger chunks doesn’t speed things up, and actually causes problems, because we start to exceed the memory capacity of the small cheap instance types (t4g.small instances, ARM processors with 2GB RAM and two threads).
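A hedged sketch of that parallel extraction is below; the Coiled keyword arguments, reference URL, coordinates, and variable name approximate the setup described above rather than reproducing the notebook exactly.

```python
import coiled
import xarray as xr

# Cluster in the same region as the data, on small, cheap ARM instances
cluster = coiled.Cluster(
    n_workers=100,
    region="us-west-2",
    worker_vm_types=["t4g.small"],
)
client = cluster.get_client()

ds = xr.open_dataset(
    "s3://example-osn-bucket/hycom/hycom.parq",  # placeholder reference location
    engine="kerchunk",
    chunks={"time": 20},                         # 20 time steps (files) per Dask task
    storage_options={
        "remote_protocol": "s3",
        "remote_options": {"anon": True},
        "target_options": {"anon": True},
    },
)

# Surface time series at the grid point nearest an illustrative location
series = ds["water_temp"].isel(depth=0).sel(lon=299.0, lat=39.0, method="nearest")
ts = series.compute()  # the 63,341 chunk reads run in parallel on the cluster
```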

Full time series extracted at a specific longitude, latitude location from the Kerchunk-generated virtual dataset. We used a Coiled.io generated Dask cluster of 100 workers (200 threads) in the same region as the data (us-west-2). Reading chunks of data in parallel, the task was completed in under 3 minutes at a cost of about 20 cents.

There is a huge benefit to being able to create clusters in the region where the data resides, making the workflow both faster and cheaper. This is something that Coiled.io makes very easy.

The non-rendered notebooks, along with the conda environment files used to create the client notebook and Coiled cluster environments, are in this GitHub repo.

We hope this post is useful not only for users of HYCOM data but for others who seek to make reading of datasets in object storage more efficient and effective.
