Publishing Xarray Datasets via a Zarr-compatible REST API

Joe Hamman
Published in pangeo, Mar 9, 2020

Xpublish is a new Xarray extension that makes it easy to publish datasets via a Zarr-compatible REST API. You can test drive Xpublish now in this Binder or install it from Conda or PyPI:

$ conda install -c conda-forge xpublish
# or
$ pip install xpublish

Xpublish enables sharing of Xarray datasets via a web application. The data in the Xarray datasets (on the server side) can be backed by Dask to facilitate on-demand computation of derived datasets. The basic usage is as follows:

Server-side: datasets are published using the serve() method on an Xarray Dataset accessor (rest):

>>> ds.rest.serve(host="0.0.0.0", port=9000)

Client-side: datasets are accessed using any Zarr HTTPStore, such as the HTTPFileSystem provided by the Filesystem Spec (fsspec) project. Here’s an example using Xarray, Zarr, and fsspec on the client side:

In [1]: import xarray as xr
In [2]: import zarr
In [3]: from fsspec.implementations.http import HTTPFileSystem
In [4]: fs = HTTPFileSystem()
In [5]: http_map = fs.get_mapper('http://0.0.0.0:9000')
# open as a zarr group
In [6]: zg = zarr.open_consolidated(http_map, mode='r')
# or open as another xarray dataset
In [7]: ds = xr.open_zarr(http_map, consolidated=True)
# (Or by any other HTTPStore, e.g. zarr.js)
...
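
To make the client side more concrete, here is a rough sketch of what an HTTP-backed key-value mapper does: each Zarr key (such as .zmetadata or a chunk key) becomes one GET request against the server. This is a simplified, standard-library-only illustration, not fsspec’s implementation; the class name ReadOnlyHTTPStore and its url_for method are hypothetical.

```python
from collections.abc import Mapping
from urllib.error import HTTPError
from urllib.request import urlopen


class ReadOnlyHTTPStore(Mapping):
    """Hypothetical minimal store: every key lookup is one HTTP GET."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def url_for(self, key):
        # A Zarr key such as "air/0.0.0" maps directly onto a URL path.
        return f"{self.base_url}/{key}"

    def __getitem__(self, key):
        try:
            with urlopen(self.url_for(key)) as response:
                return response.read()
        except HTTPError as err:
            # Zarr-style stores signal a missing key with KeyError.
            if err.code == 404:
                raise KeyError(key) from err
            raise

    def __iter__(self):
        # Plain HTTP has no directory listing, so keys cannot be enumerated;
        # this is why consolidated metadata is useful for HTTP-served data.
        raise NotImplementedError("keys cannot be listed over plain HTTP")

    def __len__(self):
        raise NotImplementedError("keys cannot be counted over plain HTTP")


store = ReadOnlyHTTPStore("http://0.0.0.0:9000/")
print(store.url_for(".zmetadata"))
print(store.url_for("air/0.0.0"))
```

With a server running, store[".zmetadata"] would return the raw bytes of the consolidated metadata. The variable name air is only a placeholder.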

How does this work?

Figure: Xpublish’s wiring diagram. The Zarr client connects to an Xpublish server via an HTTPStore. Xpublish serves data generated with the Xarray and Dask APIs through a simple REST API.

When we called ds.rest.serve() above, we started a Uvicorn server running a FastAPI application. That application provided two important endpoints: .zmetadata and /{var}/{key}. The .zmetadata key returns Zarr’s consolidated metadata as a JSON dictionary and /{var}/{key} returns a single chunk of compressed data.
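
The two endpoints can be sketched without any web framework. The example below is a minimal illustration, not Xpublish’s actual FastAPI code: the handle function, the variable air, its values, and the zlib compressor are all assumptions chosen to keep the sketch self-contained. The metadata layout follows the Zarr v2 consolidated-metadata convention.

```python
import json
import struct
import zlib

# One hypothetical variable "air": 4 float64 values stored as 2 chunks of 2.
CHUNKS = {
    "air": {
        "0": zlib.compress(struct.pack("<2d", 1.0, 2.0), 1),
        "1": zlib.compress(struct.pack("<2d", 3.0, 4.0), 1),
    }
}

# Consolidated metadata: every .zgroup/.zarray/.zattrs document in one JSON.
ZMETADATA = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "air/.zarray": {
            "zarr_format": 2,
            "shape": [4],
            "chunks": [2],
            "dtype": "<f8",
            "compressor": {"id": "zlib", "level": 1},
            "fill_value": None,
            "filters": None,
            "order": "C",
        },
        "air/.zattrs": {"_ARRAY_DIMENSIONS": ["time"]},
    },
}


def handle(path):
    """Return (status, content_type, body) for a GET request path."""
    key = path.strip("/")
    if key == ".zmetadata":
        return 200, "application/json", json.dumps(ZMETADATA).encode()
    # Anything else is treated as /{var}/{key}: one compressed chunk.
    var, _, chunk_key = key.partition("/")
    try:
        return 200, "application/octet-stream", CHUNKS[var][chunk_key]
    except KeyError:
        return 404, "text/plain", b"not found"


status, _, body = handle("/air/1")
print(status, struct.unpack("<2d", zlib.decompress(body)))
```

In this sketch the chunks are precomputed bytes; in Xpublish the chunk would instead be produced from the Xarray dataset when the request arrives.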

Xpublish also provides a handful of convenience endpoints that don’t involve Zarr but may be otherwise useful. For example, the root endpoint (/) returns Xarray’s new HTML repr:

Screenshot from an Xpublish endpoint showing Xarray’s new HTML repr.

Other REST API endpoints include:

  • /: returns xarray’s HTML repr.
  • /keys: returns a list of variable keys, equivalent to list(ds.variables).
  • /info: returns a JSON dictionary summary of a Dataset’s variables and attributes, similar to ds.info() or ncdump -h.
  • /dict: returns a JSON dictionary of the Dataset schema, equivalent to ds.to_dict(data=False).
  • /versions: returns a JSON dictionary of the versions of Python, Xarray, FastAPI and related libraries on the server side, similar to xr.show_versions().
  • /docs: Interactive Swagger API documentation.
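
The JSON endpoints listed above can be queried with nothing more than the standard library. The helper names below (endpoint_url, fetch_json) are hypothetical, and the sketch assumes /keys is served as JSON like the others; the server address is the one used in the earlier example.

```python
import json
from urllib.request import urlopen

# Endpoints from the list above that are expected to return JSON.
JSON_ENDPOINTS = ("keys", "info", "dict", "versions")


def endpoint_url(base_url, name):
    """Build the URL for one of the JSON convenience endpoints."""
    if name not in JSON_ENDPOINTS:
        raise ValueError(f"not a JSON endpoint: {name!r}")
    return f"{base_url.rstrip('/')}/{name}"


def fetch_json(base_url, name, timeout=10):
    """GET an endpoint and decode its JSON body (needs a running server)."""
    with urlopen(endpoint_url(base_url, name), timeout=timeout) as response:
        return json.load(response)


print(endpoint_url("http://0.0.0.0:9000", "info"))
```

With a server running, fetch_json("http://0.0.0.0:9000", "versions") would return the server-side version information as a Python dictionary.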

Applications

Now that we’ve gone over the mechanics of how Xpublish works, let’s discuss a few potential applications. Because the end user interfaces with Xpublish endpoints just like any Zarr dataset available via an HTTPStore, our examples are largely targeting data providers.

  1. Serving derived data: We often produce multiple derived versions of the same base dataset, with varying levels of processing applied. For example, NASA’s IceBridge data are offered at EOS levels 0, 1, 2, and 3. Each successive level represents more processing of the raw instrument data. If we had a script that used Xarray and Dask to produce the level 3 data from a lower-level input, we could serve this data product without having to pre-compute or store the data. With new sensors like NISAR coming online soon, this approach to generating intermediate-level data products on demand has the potential to generate real savings in storage costs.
  2. Xarray and Zarr as a data API: There are many reasons you may prefer a web API over a static dataset. You may want to broker access to a dataset, perhaps as a way to track data usage or even to generate revenue in some way. A related use case comes in serving derived data products where the underlying data sources change frequently. In these cases, it may be preferable to share data via an API rather than a static dataset.
  3. Serving aggregated data collections: Xarray is quite good at generating aggregate datasets sourced from many individual granules (i.e., files). Much like OPeNDAP or THREDDS, Xpublish can serve collections of granules from a single endpoint.
  4. Computational backend to big-data visualization: There are now (at least) two JavaScript implementations of Zarr. Combining something like zarr.js with Xpublish would make for an exciting front-end project.
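
The idea behind the first application, serving a derived product without pre-computing or storing it, can be sketched as a mapping whose chunks are computed only when requested. This is a conceptual illustration under simplifying assumptions, not how Xpublish is implemented (there, Dask provides the lazy computation); the class DerivedChunks and the kelvin-to-Celsius “level 2” product are hypothetical.

```python
from collections.abc import Mapping


class DerivedChunks(Mapping):
    """Chunks of a derived product, computed from base chunks on request."""

    def __init__(self, base_chunks, transform):
        self._base = base_chunks
        self._transform = transform
        self.computed = []  # records which chunk keys were actually computed

    def __getitem__(self, key):
        values = self._base[key]  # KeyError propagates for unknown chunks
        self.computed.append(key)
        return [self._transform(v) for v in values]

    def __iter__(self):
        return iter(self._base)

    def __len__(self):
        return len(self._base)


# Hypothetical lower-level data in kelvin, split into two chunks.
level1 = {"0": [273.15, 283.15], "1": [293.15, 303.15]}

# The derived product is defined by a transform; nothing is computed yet.
level2 = DerivedChunks(level1, lambda kelvin: round(kelvin - 273.15, 2))

print(level2.computed)  # no chunk has been computed so far
print(level2["1"])      # only the requested chunk is computed
print(level2.computed)
```

Only the chunks a client asks for are ever computed, so storage is needed for the lower-level data alone.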

Discussion and what’s left to do

We think the ability to share derived Xarray datasets via a web API is really exciting. We want to emphasize, however, that this is merely a prototype. Xarray + Dask + FastAPI + Zarr is a unique combination of libraries, and we fully expect the internals of Xpublish to evolve significantly as we explore the application space and pursue performance and usability improvements. Though we’ve identified a few obvious applications, we’re still evaluating the extent of the application space for this project. We expect Xpublish to require additional infrastructure to enable deployment on a variety of cloud and local computing systems. On the performance side, we’re working on some known bottlenecks related to asynchronous computation of Dask arrays and server-side caching. If you have a use case that could utilize Xpublish or would like to contribute to the project, get in touch via the GitHub repo.

Acknowledgments

Building Xpublish was a team effort. I want to specifically thank Anderson Banihirwe (NCAR), Landung Setiawan (UW), and Ryan Abernathey (Columbia). The development of Xpublish was supported in part by NASA-ACCESS grant #80NSSC18M0156.
