To the cloud and back again

Building a Zarr from bespoke data formats using Xarray and Iris

Kevin Donkers
Met Office Informatics Lab
Apr 6, 2020 · 7 min read


This blogpost is part of a series on Analysis Ready Data, including representing thousands of geoscience files as a single TileDB dataset. This research was supported by the ERDF through the Impact Lab.

Zarr is a data format that has gotten a lot of attention from the data science and geoscience communities of late. We have written a number of blog posts on it ourselves. This blogpost will explore how to convert a large dataset consisting of many files in a bespoke geoscience data format to a single Zarr store.

For those unfamiliar with Zarr, it is a specification for how to store gridded data in a key-value interface (such as Amazon S3 object store), where each chunk of data is a separate value with a corresponding key indicating its position in the full dataset. This allows highly parallel data access where many CPUs can simultaneously read different parts of the same dataset. Zarr is also a Python library implementation of this specification that allows you to read and write data in a Zarr store. The metadata is stored in a single location, requiring only one read to understand how the chunks relate to form a complete dataset. This vastly reduces the number of read operations needed to access data compared to storing large datasets as, for instance, thousands of NetCDF files, where the metadata for each chunk is stored in that chunk.
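As a tiny illustration of that layout, here is a sketch using the zarr-python 2 API and a local directory store (the array, chunk sizes and path are made up): it writes a 4×4 array in 2×2 chunks and prints the keys that end up in the store.

import numpy as np
import zarr

store = zarr.DirectoryStore('example.zarr')   # could equally be an S3-backed store
z = zarr.create(shape=(4, 4), chunks=(2, 2), dtype='f8', store=store)
z[:] = np.arange(16).reshape(4, 4)

# Each 2x2 chunk is stored under its own key ('0.0', '0.1', '1.0', '1.1'),
# alongside a single '.zarray' metadata key describing the whole array
print(sorted(store.keys()))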

This has brought cloud-native, highly parallelisable data access to common (geo)science data tools, such as Xarray, improving the ease and speed of processing large datasets. Xarray can also be used to load the popular NetCDF and GRIB file formats.
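For example (placeholder filenames; reading GRIB assumes the cfgrib engine is installed):

import xarray as xr

ds_nc = xr.open_dataset('my_file.nc')                      # NetCDF via the default engine
ds_gb = xr.open_dataset('my_file.grib', engine='cfgrib')   # GRIB via the cfgrib engine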

Iris, an open source geoscience tool developed at the Met Office, also reads NetCDF and GRIB files, but more broadly allows us to access less common geoscience data formats like the Met Office-specific NIMROD and Post Process (PP) formats.
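Again with placeholder filenames, and relying on Iris to detect the file format:

import iris

cubes_pp = iris.load('my_file.pp')         # Met Office Post Process format
cubes_nimrod = iris.load('radar.nimrod')   # NIMROD radar format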

Can we, using Iris and Xarray, convert 1.2 TB of data stored as 5850 PP files to a Zarr dataset?

Answer: Yes! (with caveats…)

It’s simple, right?

Xarray boasts both a DataArray.from_iris() method and a Dataset.to_zarr() method. Surely the simplest way to convert an Iris-readable file format to Zarr would be:

import iris
import xarray as xr

# Load the file as an Iris Cube, convert to an Xarray DataArray,
# wrap it in a Dataset and write that straight out as a Zarr store
cube = iris.load_cube('my_file.pp')
da = xr.DataArray.from_iris(cube)
ds = da.to_dataset()
ds.to_zarr('my_zarr_store')

Sloooooow down there, Shirley! There are a number of reasons why working with “real” data isn’t that simple:

  1. Each individual file may logically contain many data variables which cannot be merged into a single Iris Cube, but instead must be loaded into a CubeList.
  2. The original dataset is not one file but many thousands of files. We can’t just load all these files into a single Xarray Dataset then write to a Zarr because this will (unfortunately) explode the memory of our computer.
  3. Xarray Datasets and Zarr stores rely on unique names for coordinate and data variables, but a CubeList can contain many Cubes and coordinates with the same names. Therefore we need to normalise how these variables are named in order to create a working Dataset before we write to Zarr.
    (This step is the most bespoke part of the pipeline — the solution here is very specific to the data you are processing and requires a lot of manual interrogation upfront.)
  4. The size of the data in the original PP files may not be the same as the chunk sizes we want in our Zarr. Therefore we will need to rechunk the data before writing.

Paradigm for storing

With these complexities in mind, we decided that the following paradigm would be the best/simplest way of organising the data and metadata in the Zarr to represent the whole dataset:

  1. Store all the data variables in the same Zarr store, along with their coordinate arrays. If more than one data variable share a common coordinate array, consolidate the metadata so that they point at the same instance of that coordinate array.
  2. If the original PP files contain more than one data variable with the same name, rename these data variables to include any vertical dimension (e.g. _at_pressure_) and the statistical method applied to the data (e.g. _mean_) in the name itself.
  3. If two data variables have coordinates that are named the same but have different values (e.g. the lat-lon grid for wind data is different to that of air temperature), rename the coordinates with a unique name by appending an _n, where n is an integer.
  4. Chunk the data variables such that each Zarr chunk is ~50MB. This means that we will chunk the time coordinate into 200 time steps, keep all the latitude and longitude points together in one chunk, and slice up any vertical axis into size 1 chunks. This ensures that, regardless of whether a data variable has a vertical coordinate or not, the chunk sizes across all data variables will be the same when written to a Zarr store.
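As a rough sanity check on point 4, here is a sketch for estimating the uncompressed size of one chunk. The chunk shape and dtype below are placeholders to substitute with your own grid sizes; note that Zarr compresses chunks by default, so the stored chunks are typically smaller than this estimate.

import numpy as np

chunk_shape = (200, 1, 300, 280)          # (time, height, latitude, longitude)
itemsize = np.dtype('float32').itemsize   # bytes per stored value
chunk_mb = np.prod(chunk_shape) * itemsize / 1e6
print(f"~{chunk_mb:.0f} MB per uncompressed chunk")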

There are of course other paradigms you could use, like writing each data variable to a separate Zarr store, or interleaving unique coordinates into denser coordinate arrays and mapping all the data variables onto them*. However, we found the paradigm described above worked well for our use case.

* This latter method is particularly undesirable because it pads most of the data with NaNs where the data variable values don’t map, “bloating” the data.

Method

The preferred paradigm above can be written into a data pipeline (using Python) as follows:

  1. Create a list of all the PP filenames you want to process.
import os

filepath = '/data/model_output/pp_dataset/'
files = sorted(os.listdir(filepath))
FILEPATHS = [os.path.join(filepath, file) for file in files]

2. Load the first few files into an Iris CubeList.

import iris

cubelist0 = iris.load(FILEPATHS[0:4])

3. Rename the cubes and cube coordinates such that they all have unique names. This part is more than a couple of lines long, but straightforward.

# See below for rename_cubelist() function
rename_cubelist(cubelist0)

Code at: gist.github.com/kaedonkers/dc4e5f19ae5855447b62f3b818aff77a
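For a flavour of what such a function does, here is a minimal sketch of the cube-renaming part. The gist above handles more cases (including renaming duplicate coordinates, point 3 of the paradigm), and the coordinate names checked here are illustrative only.

from collections import Counter

def rename_cubelist(cubelist):
    """Rename cubes in place so that every cube in the CubeList has a unique name."""
    name_counts = Counter(cube.name() for cube in cubelist)
    for cube in cubelist:
        if name_counts[cube.name()] > 1:
            new_name = cube.name()
            # Append the statistical method, if any (e.g. 'mean')
            if cube.cell_methods:
                new_name += '_' + cube.cell_methods[0].method
            # Append the name of a vertical coordinate, if present (e.g. 'at_pressure')
            for coord in cube.coords(dim_coords=True):
                if coord.name() in ('pressure', 'height', 'model_level_number'):
                    new_name += '_at_' + coord.name()
            cube.rename(new_name)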

4. Convert into an Xarray Dataset.

import xarray as xr

dalist = [xr.DataArray.from_iris(cube) for cube in cubelist0]
ds = xr.merge(dalist)

5. Rechunk the Dataset to fit your Zarr chunking policy.

CHUNKS = {'time': 200,
          'height': 1,
          'latitude': 300,
          'longitude': 280}
ds = ds.chunk(CHUNKS)

6. Write this Dataset out to a Zarr store.

ZARR_STORE = '/my_zarr_store'
ds.to_zarr(ZARR_STORE, mode='w', consolidated=True)

When writing to a Zarr with Xarray, mode='w' makes sure a new Zarr store is written, overwriting anything already at that path. Coordinates shared by multiple data variables are already stored only once in the merged Dataset, and consolidated=True additionally makes the metadata for the whole dataset available in the top-level .zmetadata file, so it can be read in a single operation.

7. Repeat steps 2–5 with the remaining filenames, this time appending to the Zarr store along the time axis.

start = 4
step = 4
stop = len(FILEPATHS)

for i in range(start, stop, step):
    cubelist = iris.load(FILEPATHS[i:i+step])
    rename_cubelist(cubelist)
    dalist = [xr.DataArray.from_iris(cube) for cube in cubelist]
    ds = xr.merge(dalist).chunk(CHUNKS)
    ds.to_zarr(ZARR_STORE, append_dim='time', consolidated=True)

Here is a Jupyter Notebook example of the pipeline used to convert nine years of hourly frequency data stored as 329 files:

Code at: gist.github.com/kaedonkers/028a70aa439ae7b0d4d9e2429990eb3b

And back again

Great, now the data is in Zarr. How do we use it? Thankfully, Xarray can read as well as write Zarrs.

ds_z = xr.open_zarr(ZARR_STORE)

If you prefer to use Iris for your data analysis you can pull out a DataArray from the Dataset and convert that to an Iris Cube:

cube_z = ds_z['surface_pressure'].to_iris()

Or convert the whole Xarray Dataset to an Iris Cubelist:

cubelist_z = iris.cube.CubeList()
for da in ds_z.data_vars:
    cubelist_z.append(ds_z[da].to_iris())

Note that iterating over xr.Dataset.data_vars returns the names of the data variables as strings, not the data variables themselves.

Issues and improvements

Through developing this data pipeline we came across the following issues:

  • Time gaps
    Unlike loading files with Iris, Xarray doesn’t check for dimension contiguity when converting from an Iris Cube or writing to Zarr, so you will need to do that yourself if contiguity is important. We didn’t build time gap checking into our pipeline, but this is probably how you would do it:
# Pseudo(-ish)code: nan_dataset() is a placeholder for a function that builds
# an all-NaN Dataset with the right shape and metadata for that timestep
for time in timesteps:
    if time in cube:
        ds = xr.DataArray.from_iris(cube).to_dataset()
    else:
        ds = nan_dataset(time)

    ds.to_zarr('my_zarr_store', append_dim='time')
  • xr.DataArray.from_iris()
    The fact that Xarray developers have built in two-way conversion to/from Iris is a testament both to their collaborative thinking and to the benefits of working in an open source, scientific community like Pangeo. The data and core metadata are converted really well, but we noticed a couple of details that are missed:
    - Coordinate bounds are lost. Xarray devs are aware of this one.
    - Time coordinates are converted from CFTime objects in Iris to datetime objects in Xarray.
    - Coordinate systems on spatial coordinates are lost, meaning that projection information of model data is not retained. See this issue on why representing coordinate reference systems in Xarray is non-trivial.
  • xr.merge()
    Scalar coordinates (e.g. height) are a common piece of metadata on Iris cubes. However, when using xr.merge() on a list of DataArrays converted from Iris Cubes, the scalar coordinates from the original cubes end up attributed to all of the data variables in the merged Dataset, not just the ones they originally belonged to (a small illustration follows this list).
  • xr.Dataset.to_zarr()
    We encountered an error when appending an Xarray Dataset to a Zarr with a group.
  • Inconsistent file sizes:
    Something we did not cover here is that while the vast majority of the original PP files consistently represented ten days of data, a handful of them (~12 in 5850) represent more or fewer than ten days of data. This meant that writing a PP file to the Zarr could never be guaranteed to “line-up” with the chunking pattern, leaving a leading incomplete chunk being shuttled along by subsequent writes. Zarr and Xarray deal with this gracefully for us, but it limited our ability to parallelise the write process without encountering collisions. This may be the subject of a future blogpost.
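To make the xr.merge() point above concrete, here is a minimal, made-up illustration:

import numpy as np
import xarray as xr

# 'air_temperature' carries a scalar 'height' coordinate; 'wind_speed' does not
temp = xr.DataArray(np.zeros(3), dims='time', name='air_temperature',
                    coords={'height': 1.5})
wind = xr.DataArray(np.zeros(3), dims='time', name='wind_speed')

merged = xr.merge([temp, wind])
# 'height' becomes a Dataset-level scalar coordinate, so it now also appears
# on 'wind_speed' even though the original wind data never had it
print(merged['wind_speed'].coords)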

Not so simple then…

But not too complicated either. So long as you understand your data well enough to make judicious decisions about how to structure the metadata in a Zarr, and can work smoothly with Iris and Xarray, it is straightforward to do. And while it is a non-trivial amount of work, it may well save you (and your users) time and effort in the long run.

Many thanks to Rich Signell for his insight and example pipeline, which heavily influenced the method described in this blogpost.
