Continuously extending Zarr datasets

David Brochart
Apr 18 · 4 min read

The Pangeo Project has been exploring the analysis of climate data in the cloud. Our preferred format for storing data in the cloud is Zarr, due to its favorable interaction with object storage. Our first Zarr cloud datasets were static, but many real operational datasets need to be continuously updated, for example, extended in time. In this post, we will show how we can play with Zarr to append to an existing archive as new data becomes available.

The problem with live data

Earth observation data which originates from e.g. satellite-based remote sensing is produced continuously, usually with a latency that depends on the amount of processing that is required to generate something useful for the end user. When storing this kind of data, we obviously don’t want to create a new archive from scratch each time new data is produced, but instead append the new data to the same archive. If this is big data, we might not even want to stage the whole dataset on our local hard drive before uploading it to the cloud, but rather directly stream it there. The nice thing about Zarr is that the simplicity of its store file structure allows us to hack around and address this kind of issue. Recent improvements to Xarray will also ease this process.

Download the data

Let’s take TRMM 3B42RT as an example dataset (near real time, satellite-based precipitation estimates from NASA). It is a precipitation array ranging from latitudes 60°N-S with resolution 0.25°, 3-hour, from March 2000 to present. It’s a good example of a rather obscure binary format, hidden behind a raw FTP server.

Files are organized on the server in a particular way that is specific to this dataset, so we must have some prior knowledge of the directory structure in order to fetch them. The following function uses the aria2 utility to download files in parallel.

Create an Xarray Dataset

In order to create an Xarray Dataset from the downloaded files, we must know how to decode the content of the files (the binary layout of the data, its shape, type, etc.). The following function does just that:

Now we can have a nice representation of (a part of) our dataset:

And plot e.g. the accumulated precipitation:

Store the Dataset to local Zarr

This is where things start to get a bit tricky. Because the Zarr archive will be uploaded to the cloud, it must already be chunked reasonably. There is a ~100 ms overhead associated with every read from cloud storage. To amortize this overhead, chunks must be bigger than 10 MiB. If we want to have several chunks fit comfortably in memory so that they can be processed in parallel, they must not be too big either. With today’s machines, 100 MiB chunks are advised. This means that for our dataset, we can concatenate 100 / (480 * 1440 * 4 / 1024 / 1024) ~ 40 dates into one chunk. The Zarr will be created with that chunk size.

Also, Xarray will choose some encodings for each variable when creating the Zarr archive. The most special one is for the time variable, which will look something like that (content of the .zattrs file):

It means that the time coordinate will actually be encoded as an integer representing the number of “hours since 2000–03–01 12:00:00”. When we create new Zarr archives for new datasets, we must keep the original encodings. The create_zarr function takes care of all that:

Upload the Zarr to the cloud

The first time the Zarr is created, it contains the very beginning of our dataset, so it must be uploaded as is to the cloud. But as we download more data, we only want to upload the new data. That’s where the clear and simple implementation of data and metadata as separate files in Zarr comes handy: as long as the data is not accessed, we can delete the data files without corrupting the archive. We can then append to the “empty” Zarr (but still valid and appearing to contain the previous dataset), and upload only the necessary files to the cloud.

One thing to keep in mind is that some coordinates (here lat and lon) won’t be affected by the append operation. Only the time coordinate and the DataArray which depends on the time dimension (here precipitation) need to be extended. Also, we can see that there will be a problem with the time coordinate: its chunks will have a size of 40. That was the intention for the precipitation variable, but because the time variable is a 1-D array, it will be much too small. So we empty the time variable of its data for now, and it will be uploaded later with the right chunks.

Repeat

Now that we have all the pieces, it is just a matter of putting them together in a loop. We take care of the time coordinate by uploading in one chunk at the end.

The following code allows to resume an upload, so that you can wait for new data to appear on the FTP server and launch the script again:

Conclusion

This post showed how to stream data directly from a provider to a cloud storage bucket. It actually serves two purposes:

  • for data that is produced continuously, we hacked around the Zarr data store format to efficiently append to an existing dataset.
  • for data that is bigger than your hard drive, we only stage a part of the dataset locally and have the cloud store the totality.

An in-progress pull request will give Xarray the ability to directly append to Zarr stores. Once that feature is ready, this process may become simpler.

pangeo

A community platform for big data geoscience

David Brochart

Written by

Spatial hydrology. Python. Often found on a bike.

pangeo

pangeo

A community platform for big data geoscience