Spotlight: Zarr

Tim Bonnemann
Open-Source Science (OSSci)
Jun 12, 2024
Screenshot of the Zarr homepage

Welcome to our new Spotlight Series, where we get to know science-focused open-source software projects in order to better understand how Open-Source Science (OSSci) can add value and help accelerate scientific research through better open source. Once again, this interview was conducted asynchronously via a shared doc. Enjoy!

Hello, thanks for joining us! Who are you? What do you do? And how did you get into open-source scientific software?

Josh: I’m Josh Moore, a research software engineer. I’ve been working in the bioimaging space for about 20 years, primarily focused on image data management. One of our biggest struggles is the number of file formats (>160), and most of those are not easily accessible online. Zarr has been great for motivating the community with the possibility of a common (cloud) access mechanism.

Sanket: Hi! My name is Sanket Verma. I hail from Delhi, India. I’m the community manager for Zarr and have been in this position for the past 1.5+ years. Before joining Zarr, I was heavily involved with the community building and fostering of PyData culture throughout the Indian region. Through my efforts at PyData and NumFOCUS, I was introduced to open-source software and the need for sustainability, accessibility and reproducibility of scientific projects. I utilized OSS during my previous technical roles, which ranged from research to product and startup-based organizations.

What is Zarr? How did it originate?

Sanket: Zarr is a specification for a data storage format. Zarr lets you store large N-dimensional arrays, commonly known as tensors, on local and distributed systems such as cloud storage. The specification has software implementations in at least eight programming languages, including C, C++, Java, JavaScript, Julia, Python, R, and Rust. Zarr was founded in 2015 by Alistair Miles, a genomic scientist working with mosquitoes. While working with vast datasets of malaria mosquito genomes, Alistair chose to break a large monolithic file into small, equally sized parts (also known as chunks). This makes it possible to read only the parts that are needed instead of loading the entire dataset into memory, which allows Zarr to work with larger-than-memory datasets.

A Zarr file can be stored in local and cloud storage, which is super convenient when working with datasets in the hundreds of GBs or even TBs.

Zarr’s chunking strategy uses a key-value interface. In simple words, the key is the chunk’s name, and the value is its contents. Any system that can store key-value pairs can store data in Zarr, which makes Zarr agnostic to a wide range of storage and computing systems across the globe. It only took a few years before Zarr gained traction. Today it is widely recognized and used in geospatial science, biomedicine, genomics, data science, embedded systems and many more fields.
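To make the chunk-as-key-value idea concrete, here is a minimal sketch in plain Python. It is not the zarr library and not the real Zarr storage layout: the function names, the "row.col" key scheme, and the use of a dict as the store are all assumptions made for illustration only.

```python
# Toy illustration of chunked key-value storage.
# NOT the zarr library or the real Zarr format; all names are hypothetical.
import struct


def write_chunked(store, data, chunk_shape):
    """Split a 2-D list of floats into chunks; store each chunk under its own key."""
    n_rows, n_cols = len(data), len(data[0])
    c_rows, c_cols = chunk_shape
    for i in range(0, n_rows, c_rows):
        for j in range(0, n_cols, c_cols):
            block = [row[j:j + c_cols] for row in data[i:i + c_rows]]
            flat = [value for row in block for value in row]
            key = f"{i // c_rows}.{j // c_cols}"  # assumed key scheme
            store[key] = struct.pack(f"<{len(flat)}d", *flat)


def read_value(store, shape, chunk_shape, row, col):
    """Read one element by fetching only the chunk that contains it."""
    c_rows, c_cols = chunk_shape
    ci, cj = row // c_rows, col // c_cols
    raw = store[f"{ci}.{cj}"]  # a single key lookup, no other chunk is touched
    # In this toy layout, chunks at the right edge may be narrower.
    width = min(c_cols, shape[1] - cj * c_cols)
    values = struct.unpack(f"<{len(raw) // 8}d", raw)
    return values[(row % c_rows) * width + (col % c_cols)]


# A 4 x 6 array split into 2 x 3 chunks gives four keys.
data = [[float(r * 10 + c) for c in range(6)] for r in range(4)]
store = {}  # any mapping from keys to bytes would work here
write_chunked(store, data, (2, 3))
value = read_value(store, (4, 6), (2, 3), 3, 4)
```

Because the store only needs to map keys to bytes, the dict could in principle be swapped for a directory of files or a cloud object store without changing the read and write logic, which is the property described above.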

To learn more:

“Monolithic vs. chunked” by Henning Falk

Could you give an example of how Zarr is being used in practice?

Josh: Yes, check out our growing list of Zarr adopters and these lists to find data examples in a number of areas:

Here are a few articles:

Recent book chapter:

Tools

How is your project funded? And how do you manage to sustain it?

Sanket: Zarr has received funding from CZI twice under the EOSS (Essential Open Source Software for Science) program. The sustainability of an open-source project is always a good discussion and a challenging task. We carefully plan and use the funds for several essential activities, such as contracting with developer organizations for feature development, hiring a dedicated community manager, and working on project maturity tasks like trademarking.

We also have a budget to participate in developer programs like Google Summer of Code and Outreachy to onboard new contributors.

What are the key challenges that keep you up at night, both in your day-to-day work on Zarr and longer term?

Josh: Long-term stability of the file format. Will users be able to read their data in 10 or even 50 years?

Sanket: The Zarr community is growing every day. Diverse and flourishing communities are a sign of a healthy and evolving open-source project, but they come with challenges. Amidst this growth, we must ensure that the community is open and welcoming for everyone, regardless of background, race, gender, etc.

When many discussions happen over GitHub, we need to ensure that the discussions are concluded promptly, and the issues are resolved.

Among all the critical discussions, we must also ensure that the participants and community adhere to the code of conduct.

The long-term challenge for Zarr is working towards standardising the data storage format across the various domains that use Zarr. The optimistic goal is that the majority adopt Zarr for their data storage needs.

Also, while we’re working on or discussing a new feature or proposal, it sometimes takes considerable time to reach a consensus. This is good because we don’t want decisions made hastily. But it is also a long-term challenge: we need to be able to reach a consensus in a timely manner.

Looking ahead at the next couple of years or so, where do you see your project headed? What are your aspirations?

Sanket: Over the next couple of years, we’ll aim to expand the Zarr community and extend our reach to newer domains with an appetite for robust formats for large datasets. Micro-communities of users and developers arising inside the project and spearheading various initiatives are signs of a healthy open-source project, and we’d very much like to focus on that.
One of the outcomes of micro-communities is folks self-organising meetups/sprints for Zarr, which has already started, and we encourage and support those initiatives.

Examples:

On the technical side, we’ll keep working on the evolution of the Zarr specification. V3, the latest version of the Zarr specification, was accepted by the steering and implementation council on May 15, 2023, and implementations exist in various programming languages. Since adoption, we’ve observed growing user demand for new and exciting features, which will be met by extending the existing specification. The Zarr Enhancement Proposal (ZEP) process outlines how changes should be added to the specification in a well-defined manner. Driving the evolution of the specification and improving the current ZEP process would be helpful for both the users and maintainers of the project.

Zarr-Python currently interoperates with Dask and Xarray, so you can create Zarr arrays using these libraries. With time, we’d like to improve and expand the interoperability with other projects in the PyData ecosystem. Zarr-Python is also one of the core projects under the Scientific Python umbrella. Scientific Python is an active community comprising projects and developers from NumPy, pandas, scikit-learn, Xarray, etc. In the future, we plan to adopt the SPECs and engage in the shared practices around packaging, CI and API design.

Last but not least, who are your ideal contributors? And how can people get involved?

Josh: Obviously, any active contributor is an ideal contributor. One of the interesting twists to contributing to Zarr, as with any specification (Parquet, Arrow, HDF, etc.), is that the stability of the format takes precedence over library development.

Sanket: Agreed, anyone using the project and willing to contribute is an ideal contributor. I’ve seen quality PRs when there’s a desire to contribute to the code base, add features, and fix bugs. Familiarity with the code base also plays an essential role in contribution.

Users of data formats similar to Zarr, such as HDF5, NetCDF, and N5, have regularly contributed to Zarr in various ways, including fixing the docs, attending community meetings, and working on new ZEPs.

There are several ways to contribute to Zarr.

  • If you’d like to be involved with Zarr-Python, please refer to the contributing guide.
  • If you’d like to be involved with ZEPs, please refer to open issues.
  • If you’d like to be involved with numcodecs, please refer to the guide here.

Alternatively, you can join our community meetings, which take place every two weeks. We also have ZEP meetings where discussions mainly focus on the Zarr specification.

Every two weeks, we host Zarr office hours for the community. All the meetings are on our public calendar.

Thank you both very much!

Tim Bonnemann
Open-Source Science (OSSci)

Intersection of community & participation. Currently @IBMResearch. Wannabe trailrunner.