Streaming with Zarr

Joe Hamman · Published in pangeo · Feb 28, 2020

Simulation models continue on their path to the exascale — motivating the search for a new paradigm for model data analysis. We explore the use of Zarr (a data format for chunked, compressed, N-dimensional arrays) and Redis (a streaming database) to support streaming data from simulation models to analysis environments.


Weather and climate simulation models are producing data at an ever-increasing rate. These models are traditionally run in large HPC computing facilities and make extensive use of fast parallel file systems to archive their model history. As models have continued to grow in size (higher resolution, more simulated variables, etc.), a natural pressure has emerged to balance model runtime, output data volume, and downstream applications. As a result, many simulated variables are never examined or are only archived in some aggregated form (e.g. averaged over multiple model timesteps). We think these constraints, driven almost entirely by the cost of storing large volumes of data on disk, are limiting our ability to fully utilize our models.

If we can’t output the full contents of our model simulations, what can we do to better leverage the information in our models? One option is to take our analysis to the model. Often this means writing complex analysis code inside the runtime environment of each model. This approach is fraught with obvious challenges, most notably the inefficiency of writing one-off analysis code in a low-level language (e.g. C or Fortran). Another option is to stream data to some ephemeral cache and let scientists do their analysis while the model runs. This is the topic of our exploration here.

We want to note that later this year, Zarr support is expected in the NetCDF-C API (NetCDF is the widely used archival format for weather and climate model data). This is an exciting advancement because it will allow us to begin testing a host of new model data applications and workflows, including streaming NetCDF-Zarr datasets directly from weather and climate models to new storage media like cloud object storage and streaming services.

Zarr and Redis

Redis (REmote DIctionary Server) is a distributed, in-memory key-value database with clients in many languages (including C and Python). Zarr can use Redis as a storage backend through its MutableMapping store interface, persisting its chunked, compressed, N-dimensional arrays as key-value pairs. Here we demonstrate how these two technologies can be used together. For our demonstration, we have deployed a simple simulation model (Conway’s Game of Life). Check out this Binder to access an interactive demonstration.

Server-side (simulation model)

First, start a Redis server:

$ redis-server redis.conf --loglevel verbose --port 7777

We’ve configured the memory eviction policy of our Redis server in our redis.conf file:

maxmemory 5mb
maxmemory-policy allkeys-lfu

Our Redis server will hold up to 5 MB of data and will evict the least-frequently-used (LFU) keys once that limit is reached. Real-world applications would obviously want to increase the maxmemory setting substantially and may want to configure the Redis server further.

Next, we start our simulation model, telling it where it can find the Redis server:

$ python3 conway.py --port 7777

The run loop inside conway.py is quite easy to read. We first connect to our Redis server using Zarr’s RedisStore. Then we create a Zarr group inside of which we will place our stream of model-generated arrays. Finally, we begin the time loop, adding a new key to the group after each update.

A snippet from our “simulation model” — Conway’s Game of Life: https://github.com/jhamman/streaming-zarr/blob/master/conway.py
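For reference, a minimal sketch of that loop might look like the following. The update() helper, which advances the board one generation, and the array sizes are stand-ins; see the linked conway.py for the real implementation:

import time

import numpy as np
import zarr

# connect to the Redis server we started above
store = zarr.RedisStore(port=7777)

# a group to hold one array per model timestep
root = zarr.group(store=store, overwrite=True)

# random initial state for the Game of Life board
state = np.random.randint(0, 2, size=(100, 100), dtype='uint8')

for step in range(1000):
    state = update(state)  # advance one generation (stand-in helper)
    # stream this timestep to Redis as a new Zarr array keyed by step number
    root.array(str(step), state, overwrite=True)
    time.sleep(0.1)  # pretend each timestep takes some time to compute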

This will begin streaming data — one timestep at a time — to our Redis server. Next, we’ll head to our analysis environment.

Client-side (analysis environment)

Connecting to our Redis server on the client side is the same as above:

store = zarr.RedisStore(port=7777)

And any key can be read from the store using Zarr’s open_array function:

key = "1"  # first timestep
data = zarr.open_array(store=store, mode='r', path=key)

From here, we are all set to begin our analysis. In our Binder, we walk through a handful of applications that use this setup, including a live visualization of the model as it runs: a short animation of Conway’s Game of Life evolving in real time.
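One way to build such a live view is to poll the store for new keys and redraw. The Binder example uses Bokeh; the matplotlib loop below is just one minimal sketch of the idea:

import matplotlib.pyplot as plt
import zarr

store = zarr.RedisStore(port=7777)
group = zarr.open_group(store=store, mode='r')

plt.ion()
fig, ax = plt.subplots()
im = None
last_seen = -1

while True:
    # find the newest generation the model has written so far
    steps = sorted(int(k) for k in group.array_keys())
    if steps and steps[-1] > last_seen:
        last_seen = steps[-1]
        frame = group[str(last_seen)][:]
        if im is None:
            im = ax.imshow(frame, cmap='binary')
        else:
            im.set_data(frame)
        ax.set_title(f'generation {last_seen}')
    plt.pause(0.1)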

The Binder also includes some basic data analysis that aggregates across many model timesteps. In this case, we simply calculate the total population at each timestep (generation) in the model simulation:
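A sketch of that calculation, summing the live cells in each generation still resident in the cache (remember, our 5 MB Redis server may already have evicted the oldest generations):

import zarr

store = zarr.RedisStore(port=7777)
group = zarr.open_group(store=store, mode='r')

# live-cell count per generation currently held in the cache
population = {int(k): int(group[k][:].sum()) for k in group.array_keys()}

for step in sorted(population):
    print(step, population[step])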

Conclusion

This has been a quick demonstration of how we may use Zarr to support streaming data from simulation models. A few parting thoughts:

  • Although we’ve used Redis as our streaming cache in this example, we acknowledge there are other streaming services out there that may provide advantages for certain applications. Fortunately, developing new MutableMapping interfaces for Zarr is easy enough (see the sketch after this list) that we can test other options as they come along.
  • Our example has highlighted the use-case where we want to stream data out of a simulation model to avoid writing data to disk. There are other applications where this setup may be useful; examples such as monitoring ongoing simulations that require some sort of manual intervention or sharing of derived data between applications are two use cases that come to mind.
  • Finally, this new approach to incremental data analysis will require some new tools for effectively processing continuous streams of data. We’re excited to try out the streamz package in this space and look forward to seeing what other tools are out there.
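To illustrate how small that MutableMapping interface is, here is a toy, dict-backed store of our own; Zarr’s real stores (RedisStore included) implement essentially these same methods against a backing service:

from collections.abc import MutableMapping

class ToyStore(MutableMapping):
    """A minimal Zarr store: string keys map to bytes values."""

    def __init__(self):
        self._db = {}  # a real store would talk to Redis, S3, etc.

    def __getitem__(self, key):
        return self._db[key]

    def __setitem__(self, key, value):
        self._db[key] = value

    def __delitem__(self, key):
        del self._db[key]

    def __iter__(self):
        return iter(self._db)

    def __len__(self):
        return len(self._db)

Because Zarr accepts any MutableMapping as a store, an instance of this class works anywhere one is expected, e.g. zarr.group(store=ToyStore()).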

Acknowledgements

This work was inspired by an early prototype developed with Noah Brenowitz, James Douglass, and Aji John at the 2017 UW GeoHackweek. Philipp Rudiger contributed to the Binder Bokeh animation.
