Zarr pyramids at scale
(by Raphael Hagen and Max Jones)
For several years we’ve been developing a toolkit at CarbonPlan for data-driven maps on the web. Our toolkit supports dynamic, customized visualization of N-dimensional datasets by leveraging the Zarr data format and multi-scale pyramids, which are downsampled versions of the original dataset. Last year, we released a set of improvements to the carbonplan/maps library to simplify data pre-processing requirements. Now, in partnership with Development Seed, we’re excited to announce a new set of features that improve the user experience when generating pyramids with our ndpyramid Python library.
Multi-scale pyramids are useful for visualizing high-resolution earth science data, as they allow you to quickly view a lower resolution global map as well as smoothly increase the resolution as you zoom in. However, multi-scale pyramids can be difficult to generate at scale because many existing geospatial reprojection tools do not support distributed processing. The ndpyramid Python library provides utilities for multi-scale Zarr pyramid generation. We’ve recently implemented new features in ndpyramid to support scalable Zarr pyramid generation and developed a Pangeo Forge extension to run pyramid generation at scale.
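To make the idea of pyramid levels concrete, here is a minimal sketch of how grid size grows with zoom level. It assumes a quadtree-style layout in which each level doubles the number of tiles per side, and it uses 128 pixels per tile purely as an illustrative value; the function name level_shapes is hypothetical and not part of ndpyramid.

```python
def level_shapes(levels, pixels_per_tile=128):
    """Return (level, tiles_per_side, pixels_per_side) for each zoom level.

    Assumes a quadtree-style pyramid: level 0 is a single tile covering the
    whole map, and every further level doubles the tile count per side.
    """
    shapes = []
    for level in range(levels):
        tiles = 2 ** level
        shapes.append((level, tiles, tiles * pixels_per_tile))
    return shapes


for level, tiles, pixels in level_shapes(3):
    print(f"level {level}: {tiles}x{tiles} tiles, {pixels}x{pixels} pixels")
```

Because the pixel count per side doubles with each level, the total number of pixels quadruples, which is one reason generating the deepest levels of a pyramid dominates the processing cost.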
We’ve added a new method, pyramid_resample, to improve multi-scale pyramid generation in ndpyramid. It uses the pyresample library to parallelize pyramid generation with Dask, which gives speed-ups of more than 5x for bilinear interpolation. These performance improvements can help reduce cloud processing costs and avoid out-of-memory issues.
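from ndpyramid import pyramid_resample

# `ds` is assumed to be an existing xarray Dataset with "lon" and "lat" coordinates.
resampled_pyramid = pyramid_resample(
    ds,
    x="lon",
    y="lat",
    levels=2,
    resampling="bilinear",
)
# Write every pyramid level to a single Zarr store.
resampled_pyramid.to_zarr("pyramid.zarr")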
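For readers unfamiliar with the "bilinear" option used above, the following is a small pure-Python sketch of bilinear interpolation at a single fractional grid position. It is only an illustration of the technique; it is not how pyresample or ndpyramid implement resampling, and the function name bilinear_sample is hypothetical.

```python
def bilinear_sample(grid, row, col):
    """Sample a 2D list (at least 2x2) at fractional (row, col) indices."""
    n_rows, n_cols = len(grid), len(grid[0])
    # Clamp the top-left corner so the 2x2 neighbourhood stays inside the grid.
    r0 = min(max(int(row), 0), n_rows - 2)
    c0 = min(max(int(col), 0), n_cols - 2)
    # Fractional offsets from the top-left corner of the neighbourhood.
    fr, fc = row - r0, col - c0
    # Interpolate along columns on the two rows, then between the rows.
    top = grid[r0][c0] * (1 - fc) + grid[r0][c0 + 1] * fc
    bottom = grid[r0 + 1][c0] * (1 - fc) + grid[r0 + 1][c0 + 1] * fc
    return top * (1 - fr) + bottom * fr


coarse = [[0.0, 10.0], [20.0, 30.0]]
center = bilinear_sample(coarse, 0.5, 0.5)
print(center)
```

Each output pixel depends only on a small neighbourhood of input pixels, which is what makes this kind of resampling a good fit for chunk-wise parallel execution.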
In addition to these performance improvements, we built an extension to the open-source Pangeo Forge project. Pangeo Forge is a tool for creating Zarr stores from archival file formats such as NetCDF, GRIB, and TIFF. It strings together composable transforms into pipelines, which can be run on large-scale data processing frameworks such as Apache Spark and Google Cloud Dataflow. Our extension library, pangeo-forge-ndpyramid, allows you to create pyramids in the Pangeo Forge framework. It contains a new transform, StoreToPyramid, which creates pyramids from existing Zarr stores or collections of archival files.
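import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenWithXarray
from pangeo_forge_ndpyramid import StoreToPyramid

# `pattern` is assumed to be a Pangeo Forge file pattern defined elsewhere.
recipe = (
    beam.Create(pattern.items())
    | OpenWithXarray()
    | StoreToPyramid()
)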
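The `|` chaining in the recipe above may look unusual if you have not used Apache Beam. As a rough illustration of how composable transforms can be strung together with that operator, here is a toy sketch in plain Python. The Transform class and the open_file and coarsen stages are hypothetical stand-ins; unlike Beam, which builds a deferred graph that a runner executes, this sketch runs each stage eagerly on a list.

```python
class Transform:
    """A toy pipeline stage that applies a function to every item."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, items):
        # Invoked for `items | transform` when `items` is a plain list.
        return [self.fn(item) for item in items]


# Hypothetical stages: "open" a file name into a dict, then keep every
# second value as a crude stand-in for downsampling.
open_file = Transform(lambda name: {"source": name, "values": [1, 2, 3, 4]})
coarsen = Transform(lambda ds: {**ds, "values": ds["values"][::2]})

result = ["a.nc", "b.nc"] | open_file | coarsen
print(result)
```

Since each stage only needs to know the shape of its input and output, a new stage can be appended to an existing pipeline without changing the stages before it.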
Detailed demo notebooks for both are linked here: ndpyramid and pangeo-forge-ndpyramid. We’re excited about supporting the community’s use of multi-scale Zarr pyramids and future development in this area. If you’re also interested in these problems, please reach out to us at hello@carbonplan.org.