Zarr pyramids at scale

Raphael Hagen
Published in pangeo
Jul 15, 2024

(by Raphael Hagen and Max Jones)

For several years we’ve been developing a toolkit at CarbonPlan for data-driven maps on the web. Our toolkit supports dynamic, customized visualization of N-dimensional datasets by leveraging the Zarr data format and multi-scale pyramids, which are downsampled versions of the original dataset. Last year, we released a set of improvements to the carbonplan/maps library to simplify data pre-processing requirements. Now, in partnership with Development Seed, we’re excited to announce a new set of features that improve the user experience when generating pyramids with our ndpyramid Python library.

Multi-scale pyramids are useful for visualizing high-resolution earth science data: they let you quickly view a lower-resolution global map and then smoothly increase the resolution as you zoom in. However, multi-scale pyramids can be difficult to generate at scale because many existing geospatial reprojection tools do not support distributed processing. The ndpyramid Python library provides utilities for multi-scale Zarr pyramid generation. We’ve recently implemented new features in ndpyramid to support scalable Zarr pyramid generation and developed a Pangeo Forge extension to run pyramid generation at scale.
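
To make the idea of "downsampled versions of the original dataset" concrete, here is a minimal sketch in plain NumPy. It is not how ndpyramid builds pyramids (ndpyramid resamples or reprojects labelled xarray data); it only assumes a small, evenly divisible 2-D grid and simple 2x2 block averaging, and the function names are hypothetical.

```python
import numpy as np


def coarsen_by_two(field: np.ndarray) -> np.ndarray:
    """Halve the resolution of a 2-D array by averaging each 2x2 block."""
    rows, cols = field.shape
    if rows % 2 or cols % 2:
        raise ValueError("both dimensions must be even")
    return field.reshape(rows // 2, 2, cols // 2, 2).mean(axis=(1, 3))


def build_pyramid(field: np.ndarray, levels: int) -> list[np.ndarray]:
    """Return `levels` arrays ordered from coarsest (level 0) to finest."""
    pyramid = [field]
    for _ in range(levels - 1):
        pyramid.append(coarsen_by_two(pyramid[-1]))
    return pyramid[::-1]


# Hypothetical 8x8 grid standing in for a high-resolution dataset.
full_resolution = np.arange(64, dtype=float).reshape(8, 8)
pyramid = build_pyramid(full_resolution, levels=3)
shapes = [level.shape for level in pyramid]  # [(2, 2), (4, 4), (8, 8)]
```

A map client would load the coarsest level first and fetch finer levels only for the region in view.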

A carbonplan/maps demo can be found here: https://maps.demo.carbonplan.org/

We’ve added a new method, pyramid_resample, to improve multi-scale pyramid generation in ndpyramid. It uses the pyresample library to parallelize pyramid generation with Dask, which gives speed-ups of more than 5x for bilinear interpolation. These performance improvements can help reduce cloud processing costs and avoid out-of-memory issues.

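from ndpyramid import pyramid_resample

# ds is an xarray Dataset with "lon" and "lat" coordinates
resampled_pyramid = pyramid_resample(
    ds,
    x="lon",
    y="lat",
    levels=2,
    resampling="bilinear",
)
# write every pyramid level to a single Zarr store
resampled_pyramid.to_zarr("pyramid.zarr")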
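
As a rough guide to what the levels argument implies for output size, the sketch below assumes the common web-map tiling convention in which zoom level z spans 2^z tiles per side, with a tile width of 128 pixels. Both the tile size and the helper name are assumptions made for illustration, so check the ndpyramid documentation for the actual defaults.

```python
def level_size(level: int, pixels_per_tile: int = 128) -> int:
    """Pixels per side at a zoom level, assuming 2**level tiles per side."""
    return pixels_per_tile * 2**level


# With levels=2 there would be two levels, 0 and 1, under this convention.
sizes = {level: level_size(level) for level in range(2)}
# sizes == {0: 128, 1: 256}

# Each extra level quadruples the pixel count of the finest level.
growth = level_size(3) ** 2 // level_size(2) ** 2  # 4
```

Because the finest level dominates the total size, adding one more level roughly quadruples the work, which is where distributed resampling starts to matter.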

In addition to these performance improvements, we built an extension to the open-source Pangeo Forge project. Pangeo Forge is a tool for creating Zarr stores from archival file formats such as NetCDF, GRIB, and TIFF. It strings together composable transforms into pipelines, which can be run on large-scale data processing frameworks such as Apache Spark and Google Dataflow. Our extension library, pangeo-forge-ndpyramid, allows you to create pyramids within the Pangeo Forge framework. It contains a new transform, StoreToPyramid, which creates pyramids from existing Zarr stores or collections of archival files.

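import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenWithXarray
from pangeo_forge_ndpyramid import StoreToPyramid

# pattern is a Pangeo Forge FilePattern describing the input files
recipe = (
    beam.Create(pattern.items())
    | OpenWithXarray()
    | StoreToPyramid()
)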
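
For readers new to the `|` syntax, the toy sketch below shows the general idea of chaining composable transforms with Python's operator overloading. It is not Apache Beam's or Pangeo Forge's implementation; the Transform class, the stage names, and the dictionary records are all hypothetical stand-ins.

```python
from typing import Callable, Iterable


class Transform:
    """Toy stand-in for a pipeline transform; not Apache Beam's implementation."""

    def __init__(self, func: Callable, label: str):
        self.func = func
        self.label = label

    def __ror__(self, items: Iterable) -> list:
        # `items | transform` applies the transform to every element.
        return [self.func(item) for item in items]


# Hypothetical stages named after the steps in the text.
open_file = Transform(lambda path: {"source": path, "levels": []}, "open")
make_pyramid = Transform(lambda ds: {**ds, "levels": [0, 1]}, "pyramid")

# Each stage consumes the previous stage's output, left to right.
results = ["a.nc", "b.nc"] | open_file | make_pyramid
```

Because each stage only depends on the elements it receives, a real runner is free to distribute those elements across workers, which is what lets the same recipe scale from a laptop to a cluster.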

Detailed demo notebooks for both are linked here: ndpyramid and pangeo-forge-ndpyramid. We’re excited about supporting the community’s use of multi-scale Zarr pyramids and future development in this area. If you’re also interested in these problems, please reach out to us at hello@carbonplan.org.
