Arrays on-the-fly

Rachel Prudden
Met Office Informatics Lab
Jan 28, 2020

Working in the Informatics Lab often involves working with very large multidimensional datasets. The Pangeo ecosystem has great tools for working with this kind of data (such as xarray and Iris); however, getting to the point of being able to use these tools can be a painful process.

One solution to this problem is a library called Zarr, which is great at providing clean and intuitive cloud-native data access. However, not all the datasets we work with are stored as Zarr. A lot of the datasets we use at the Met Office are stored as NetCDF files. Converting these datasets to Zarr is a nuisance for current workflows, which are based on accessing individual files. It's also very expensive for large datasets, as all the data has to be streamed through memory when it's written to the Zarr data store.

What we’d really like to be able to do is construct an array interface on the fly from a bunch of NetCDF files. This feels like it should be possible: we know the file naming conventions, the array shapes they contain, and how they need to be stitched together. But until now we haven’t been able to find a way to make it work cleanly.

Luckily, it turns out Zarr is designed in a way that makes this pretty easy. After playing around a bit, I found I only needed to override three methods to get it working.

Demo

To demonstrate, I'll use a small dataset made up of files that each contain different time steps from a single model run. This means we'll only need to define chunks over a single dimension. We'll construct a Zarr array for a single variable (in this case wet bulb potential temperature, or WBPT).

NetCDF dataset

Each WBPT cube has four dimensions: a pressure dimension (defining the altitude) of size three, two spatial dimensions of sizes 548 and 421, and a time dimension.

Unfortunately, the time dimension doesn’t have a consistent size: the first cube (003) contains four time steps, while all the later cubes contain three. This can be handled with a little extra ingenuity, which I may return to in a later post. For simplicity, in this post we’ll pretend the first file doesn’t exist.

Having understood our dataset, we can construct a Zarr Array which reads from it. To do this we will override three methods:

  • DirectoryStore._fromfile
  • Array._chunk_key
  • Array._load_metadata

DirectoryStore._fromfile

The _fromfile method takes a string argument and returns a data blob. Instead of reading the raw bytes of a file directly, we get it to use Iris to read in the WBPT cube from each file.
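As a rough sketch of what this might look like (the variable constraint and dtype here are my assumptions, and the details depend on the zarr version):

```python
import iris
import numpy as np
import zarr


class WBPTStore(zarr.DirectoryStore):
    """A DirectoryStore whose "files" are NetCDF files read via Iris."""

    def _fromfile(self, fn):
        # Instead of returning the raw bytes of the file, load the WBPT cube
        # with Iris and return its data as bytes. With no compressor or
        # filters set on the Array, Zarr treats these bytes as one whole chunk.
        cube = iris.load_cube(fn, "wet_bulb_potential_temperature")  # assumed variable name
        return np.ascontiguousarray(cube.data, dtype="float32").tobytes()
```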

Array._chunk_key

The _chunk_key method takes an array of coordinates and returns a string. We use it to construct the appropriate filename, beginning at 006 and progressing in increments of three.
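A minimal sketch, assuming one chunk per file along the time dimension and a hypothetical file naming pattern:

```python
import zarr


class WBPTArray(zarr.Array):
    """An Array whose chunk keys map to NetCDF file names."""

    def _chunk_key(self, chunk_coords):
        # chunk_coords[0] indexes the chunk along the time dimension.
        # Map it to the forecast file names, which start at 006 and go
        # up in steps of three: 006, 009, 012, ...
        step = 6 + 3 * chunk_coords[0]
        return "wbpt_{:03d}.nc".format(step)  # hypothetical file naming pattern
```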

Array._load_metadata

The _load_metadata method would usually read array metadata from a file in the store called .zarray or similar. We instead use it to set the attributes of our Array directly.
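Continuing the sketch above (and building on its imports), this might be set along the following lines; the exact private attributes required depend on the zarr version, and the shape and chunking assume a (time, pressure, y, x) ordering:

```python
class WBPTArray(zarr.Array):
    # (continuing the sketch above; _chunk_key as already defined)

    def _load_metadata(self):
        # Set the array attributes directly instead of reading a .zarray file.
        # Three files of three time steps each, one chunk per file.
        self._shape = (9, 3, 548, 421)
        self._chunks = (3, 3, 548, 421)
        self._dtype = np.dtype("float32")
        self._fill_value = np.nan
        self._order = "C"
        self._compressor = None  # chunks are raw, uncompressed bytes
        self._filters = None
```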

And that’s it! Now we can use our patched versions of DirectoryStore and Array to load our data. Notice that this constructs an array of length nine in its first dimension out of our three files, each containing three time steps.

Load array
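For illustration, the loading step might look roughly like this, using the classes sketched above (the directory path is made up):

```python
# Illustrative usage of the patched classes.
store = WBPTStore("data/wbpt")            # directory containing the NetCDF files
wbpt = WBPTArray(store, read_only=True)   # _load_metadata supplies shape, chunks, dtype

print(wbpt.shape)   # (9, 3, 548, 421): three files of three time steps each
print(wbpt.chunks)  # (3, 3, 548, 421): one chunk per NetCDF file
```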

Finally, we’ll do a quick plot of some data to make sure it’s being accessed correctly:

Plot timesteps
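A quick check along these lines would do (the indices and styling are illustrative):

```python
import matplotlib.pyplot as plt

# Plot the lowest pressure level at a few time steps; each read pulls
# the relevant chunk lazily from the corresponding NetCDF file.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, t in zip(axes, [0, 4, 8]):
    ax.imshow(wbpt[t, 0], origin="lower")
    ax.set_title("time step {}".format(t))
plt.show()
```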

Success! We have an array we can work with and index as normal, which can lazily retrieve data from our NetCDF files as needed.

Discussion

It seems we can have our cake and eat it by constructing Zarr arrays on-the-fly on top of existing NetCDF datasets. Even better, this method should enable different users to define quite different Zarr arrays on top of the same underlying dataset.

There are a couple of remaining issues with this approach. Firstly, the method described above will not handle files containing different-sized blocks. This should be doable without introducing any major complexity, but it would make the whole thing a bit messier.

Another potential issue which needs further investigation is performance. I haven’t tested this, but it’s likely that replacing the direct data read with an Iris load will slow things down quite a bit. On the other hand, Iris is not integral to the general approach: any other method for reading data from files could be dropped in as appropriate. For example, using the netCDF4 library to read in data might turn out to be faster.

What’s good about this approach is how it lets us construct a friendly interface to work with very large and relatively messy datasets, without the upfront cost of converting to Zarr or the ongoing cost of storing multiple formats. For this reason, I think it can be a useful tool within the distributed data ecosystem.

