The Pangeo Pattern

Joe Hamman
Published in pangeo
Nov 7, 2019

Ideas included in this post are based on numerous conversations that have occurred across the Pangeo community. Ryan Abernathey, Guillaume Eynard Bontemps, Joe Hamman, Chris Holdgraf, Fernando Perez, Niall Robinson, Matthew Rocklin, Richard Signell, and Amanda Tan Lehr (alphabetized) made specific contributions to the development of these thoughts.

We kicked off the Pangeo Project three years ago this week. Our initial meeting was held in New York at Columbia University, where a small group of scientists and software developers gathered to discuss how software projects like Xarray and Dask could provide the foundational elements for a new approach to data intensive geoscientific analysis (more info in this blog post). In the time since, the Pangeo Project has developed into a multifaceted community project with some novel patterns of development and interaction. Those patterns, which cover the development of both a community and open source software, are the topic of this blog post. We hope it provides useful reflection for existing members of the Pangeo community and an effective template for other scientific domains looking to replicate elements of the Pangeo Project.

Today, we define Pangeo as a community promoting open, reproducible, and scalable geoscience. As a community, we’ve worked to establish new approaches to building data-science platforms, including interactive environments (like JupyterHub), new cloud-optimized data formats (like Zarr), and new tools for data analysis at scale (like Xarray and Dask). We have also worked to build a diverse community of researchers and developers that focus on general computational challenges in the domain sciences. This focus on common challenges has brought us into contact with other open source communities (like Project Jupyter) and research domains (like solar physics and biology).

The Pangeo effort is fundamentally concerned with optimizing the workflow of scientists, which is affected by various obstacles. Pangeo resisted the temptation to solve this problem by creating a monolithic end-to-end data platform, since such platforms have proved brittle in the past. Instead, Pangeo contributes to component projects and provides recipes (via Infrastructure as Code, IaC) that describe how these components can be composed into different flavors of platform for different use cases. Beyond producing effective solutions, this approach has the added benefit of allowing the broad Pangeo community to work on separate problems.

We have previously written about the technical design principles that underpin the development of the Pangeo project. Here, we focus on the central patterns that tie the Pangeo community together. There may be more, but we’ve chosen to focus on the following three, expanding on them in the sections below:

  1. Focus on solving fundamental problems in scientific research,
  2. Contribute to an ecosystem of modular open source software projects that support open science, and
  3. Build interdisciplinary teams that cut across academia, government and industry.

Solve fundamental problems

Identifying common challenges that face broad swaths of a scientific community can be a difficult task. With Pangeo, we’ve focused on promoting open, reproducible, and scalable science. While those issues are not unique to any one scientific field, they were clearly challenges facing the geosciences. We explicitly state three project goals:

  1. Foster collaboration across the geosciences: This meant creating an organization that solicited participation from a wide range of geoscience disciplines and from software developers beyond the confines of academia. This also meant promoting and contributing to open science initiatives and open source projects that serve the geoscience community.
  2. Support the development of domain-specific packages: Taking inspiration from the astronomy community’s Astropy project, we were interested in developing something similar for the geosciences. Rather than developing a single project like Astropy, we opted to support the development of many smaller projects that share common data structures (e.g. Xarray objects). This allowed us to continue developing the underlying tools while encouraging lower commitment development of domain-specific packages.
  3. Improve the scalability of the scientific software tools for interactive analysis: The geosciences, like many other data intensive research domains, are facing an onslaught of new data products at ever increasing data volumes. This challenge was set to topple many of our scientific ambitions. Our approach has been to improve the integration of data analysis tools like Jupyter, Xarray, and Pandas with tools like Dask for parallel processing, in order to support interactive computational analysis on big data.
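The chunk-and-combine idea behind the third goal can be sketched in a few lines of plain Python. This is only an illustration of the general pattern, not Dask's or Xarray's implementation: it uses the standard library's thread pool as a stand-in for a parallel scheduler, and the function names (chunked, partial_stats, parallel_mean) and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def partial_stats(chunk):
    """Reduce one chunk to the (sum, count) pair needed to rebuild a mean."""
    return sum(chunk), len(chunk)


def parallel_mean(values, chunk_size=1000, max_workers=4):
    """Compute a mean by reducing chunks independently, then combining."""
    if not values:
        raise ValueError("cannot take the mean of an empty sequence")
    # Each chunk is reduced on its own, so no worker needs the full dataset.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_stats, chunked(values, chunk_size)))
    # Combine the small per-chunk results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    # Hypothetical stand-in for a large time series of temperatures.
    temperatures = [15.0 + (i % 10) * 0.5 for i in range(10_000)]
    print(parallel_mean(temperatures, chunk_size=2_500))
```

Threads are used here only to keep the sketch short and dependency-free. In practice, a library like Dask takes care of the chunking and scheduling, and can run the per-chunk work across threads, processes, or a cluster.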

Open source software and open science

The open source scientific software tradition is built on the ideas that we stand on the shoulders of giants and that transparent, collaborative development yields a more capable software stack. Indeed, this is where the Pangeo project found its beginnings. A group of scientists and software developers had been openly collaborating on the development of the Xarray and Dask libraries, both open source scientific Python projects. For the scientists in this group, these projects were vital elements in their research machinery, and contributing to them was also seen as a way to build capacity in their own research area. For the software developers, these scientific applications provided validation and feedback to improve design.

Today, this is still at the core of what Pangeo is about. We work on a wide variety of open source software projects. Many of these projects are agnostic to the scientific domains we work in (e.g. Jupyter, Dask), while others are developed to address specific scientific problems. We’ve recently put together an Awesome List of Open Atmospheric, Ocean, and Climate Science tools. It’s encouraging to look back and see how the ecosystem has grown over the past few years.

Out of these efforts come five patterns of use and development:

  1. Reuse: Resist the temptation to build something new. Often it’s possible to simply use an existing tool with just a few modifications or customizations.
  2. Integrate: The open source scientific software ecosystem is most effective when the individual components are integrated together. Using and building tools that are modular means that each piece is useful alone, and together they are greater than the sum of their parts.
  3. Document: Integration is often simply a documentation effort. Providing the community with clear documentation on how to use individual pieces of the stack and how to combine them for a particular use-case grows the impact that we have for our community.
  4. Demonstrate: Provide examples of how new developments, integrations, or platforms serve real world science applications. For Pangeo, this has meant setting up public infrastructure (e.g. binder.pangeo.io) that facilitates interactive exploration of geoscientific data and software tools.
  5. Contribute upstream: By pushing new features, enhancements, bug fixes and documentation upstream, we allow progress in the Pangeo Project to be progress for the scientific community. We also increase the likelihood that scientists from other fields will use and contribute to these tools, enhancing both the functionality and sustainability of the ecosystem as a whole.

Interdisciplinary teams and communities

Many of the problems facing domain scientists today are not easily solved in isolation. Pangeo has managed to build a community that includes members from academia, government research institutions (e.g. USGS, NASA, UK Met Office), non-profits, and private companies (e.g. Google, Anaconda, NVIDIA). The make-up of this coalition is important because, in the process of growing the community, we’ve been able to tap into a wide variety of use cases, applications, and expertise. Even among the geoscientists involved with Pangeo there is a wide range of disciplines, including climate science, hydrology, oceanography, and geophysics. The same is true among the developers and engineers involved with the project, who bring a wide range of skills from DevOps to data visualization to distributed computing.

Pangeo has also been fortunate to acquire grant funding from various sources (e.g. the Alfred P. Sloan Foundation, NSF, and NASA). This funding has been a crucial catalyst in the development of the project, allowing sustained effort across the project at a scale otherwise not possible. We also view grant funding as a tool for growing the community in new directions, in terms of both development objectives and collaborators.

Pangeo has also worked to engage with other developer communities like Project Jupyter and other scientific domains like biology and astronomy. We’ve made significant efforts to work with these communities on their terms, looking for places where collaboration will serve both communities well. We’ve seen this work through collaboration on shared upstream projects (e.g. Binder) and through demonstrations of the Pangeo approach in new domains (check out this demonstration using Pangeo on astronomy data).

Conclusions

It is important to acknowledge at this point that the Pangeo Project is still defining itself and that these patterns will likely continue to evolve. We’ve written this blog post for two reasons. First, it was an opportunity to reflect on the patterns we’ve developed over the past few years. Second, we think this might be a good conversation starter for other communities (e.g. biology, neuroscience, astronomy) that are potential candidates for a Pangeo-like effort.

Joe Hamman is an editor for pangeo. Tech director at @carbonplan and climate scientist at @NCAR. @xarray_dev / @pangeo_data dev.