Pangeo 2.0

New Funding and New Directions

Ryan Abernathey
pangeo
11 min read · Dec 22, 2020

On behalf of the Pangeo Steering Council

TL;DR: Pangeo is launching new efforts in education, cloud data storage, and machine learning. We are pivoting away from directly operating cloud-based JupyterHubs and migrating this role to new service providers in this space.

A Brief History of Pangeo

The Pangeo project effectively began in 2016 with a workshop at Columbia University. The schedule is still online and is fun to review. The workshop was an exciting mix of science and technology, a dynamic that continues to characterize Pangeo today. The mission for Pangeo developed at that workshop has stood the test of time:

Our mission is to cultivate an ecosystem in which the next generation of open-source analysis tools for ocean, atmosphere and climate science can be developed, distributed, and sustained. These tools must be scalable in order to meet the current and future challenges of big data, and these solutions should leverage the existing expertise outside of the geoscience community.

An important aspect to note is that Pangeo was born truly as a grassroots effort. It was not developed top-down by any funding agency, big institution, or university. It was about scientists coming together to solve real-world challenges related to data-intensive science. This community spirit will always remain central to Pangeo, even as the project evolves and matures.

From the workshop also emerged the first successful Pangeo-related grant proposal, to the NSF EarthCube program, which provided $1.2M to LDEO, NCAR, and Anaconda. This was followed quickly by a NASA ACCESS award to UW, NCAR, and Anaconda. The community grew rapidly.

Logo soup of organizations involved in Pangeo in some way. Nothing is implied by the relative positioning of the logos. (Apologies to those who were left out…it can be hard to keep track!)

One very serendipitous development early on in the project was an almost accidental entry into cloud computing. When we first proposed to EarthCube in 2017, cloud was not on our radar. We proposed to develop tools for working with big datasets on traditional high-performance computing (HPC) systems such as NCAR’s Cheyenne, as well as some servers deployed locally at Columbia University. The program managers asked us to trim our budget, so we eliminated the server purchase and instead asked to be included in the NSF BIGDATA pilot program, a partnership with commercial cloud providers (now discontinued) which granted cloud computing credits directly to NSF-sponsored research projects. We received $100K of credits on Google Cloud Platform in Nov. 2017.

With these credits in hand, we quickly realized that cloud offered unprecedented ability to experiment with new modes of computing and data access. We spun up a cloud-based JupyterHub and began experimenting with Kubernetes, Dask, and cloud object storage. Despite being wholly unprepared to operate a production service, we did not limit use of these resources to funded members of our project; instead, we opened the hubs to basically anyone who wanted to play around with cloud computing and data. We went through three distinct phases of deployment on Google Cloud: the original pangeo.pydata.org JupyterHub, then ocean.pangeo.io, and now us-central1-b.gcp.pangeo.io. Our current service is described on our website as Pangeo Cloud. Oh yeah, and we also run customized BinderHubs on both Google Cloud and AWS.

This way of working has been incredibly productive for development of the “cloud-native” style of climate data analysis. At the core of the process has been a dialog between practicing scientists, infrastructure builders, and open-source software developers which has enabled rapid, week-by-week iteration and improvement of all aspects of the cloud-based workflow.

The Pangeo development process

This period has led to some major technological innovations that will have long-lasting impacts on climate data science (and beyond), including:

  • The evolution of file formats and tools for storing climate-style data in cloud object storage. Matthew Rocklin’s initial post on the drawbacks of HDF5 in the cloud helped drive the adoption of and experimentation with new cloud-native array formats like Zarr, TileDB, and Cloud Optimized GeoTIFF. Pangeo implemented the Zarr-Xarray integration, which, together with our Cloud Data Guide, made it easier for scientists to bring their data to the cloud in analysis-ready form. Many user-supplied datasets are now cataloged in catalog.pangeo.io. These technologies are now achieving much broader adoption, and they are driving new public climate datasets on Google Cloud and AWS. Zarr is on track to become an OGC Community Standard. As an added bonus, we finally figured out how to efficiently read archives of netCDF/HDF5 data in the cloud. We have also worked to improve tools for loading and cataloging data, including fsspec, intake, intake-stac, and intake-esm.
  • Dramatic improvements to the Dask experience in the cloud (and on HPC). This has of course been a much broader effort, but the Pangeo community has contributed significantly to the development of Dask Cloud Provider, Dask Gateway, Dask LabExtension, and Dask Jobqueue, all of which simplify the deployment and management of Dask clusters in different contexts. Pangeo science users also tend to push the limits of Dask + Xarray in terms of computational complexity. Continuous iterative improvement via GitHub issues and pull-requests has slowly but steadily improved Dask and Xarray performance and reliability.
  • Contribution to the development of interactive visualization tools such as the HoloViz suite, which allows interaction with massive data through the use of Datashader and Bokeh.
  • Development of a rich ecosystem of software packages which leverage the foundations provided by Xarray and Dask to provide advanced analysis capabilities for the ocean / weather / climate domain.
  • Innovation around the sharing and publication of reproducible, real-world Jupyter notebooks that use big data in the cloud via binderbot and gallery.pangeo.io.
  • The operation of a sophisticated CI system for automatically building Docker images with complete Pangeo environments for use in our various cloud hubs and binders.
  • Dozens of educational / training events around the world. (partial list, Pangeo YouTube playlist)
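The Dask + Xarray pattern at the heart of many of these workflows can be sketched in a few lines. This is a minimal illustration using fabricated in-memory data (assuming xarray and dask are installed); real Pangeo workloads apply the same pattern to multi-terabyte datasets in cloud object storage, backed by a distributed cluster:

```python
import numpy as np
import xarray as xr

# Fabricated stand-in for a year of daily 1-degree global fields.
da = xr.DataArray(
    np.random.rand(365, 180, 360),
    dims=("time", "lat", "lon"),
    name="sst",
)

lazy = da.chunk({"time": 73})  # wrap in dask chunks; nothing is computed yet
clim = lazy.mean("time")       # builds a lazy task graph, not a result
result = clim.compute()        # triggers (potentially distributed) execution

print(result.shape)            # (180, 360)
```

The key design point is laziness: the computation is described first and executed later, which lets Dask schedule work across many machines and stream chunks from object storage instead of loading everything into memory.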

Despite the diversity of these activities, an increasing amount of energy from Pangeo core contributors has recently been devoted to operating cloud-based Jupyter infrastructure: the JupyterHubs on Google Cloud and AWS that now support hundreds of active users. The project has become a victim of the unexpected success of these services, which we originally developed as experimental prototypes. Pangeo as currently constituted is simply not well suited to operating production-grade Jupyter-in-the-cloud services: the project is very informally organized, lacking any formal legal identity, HR department, etc. We cannot, for example, staff a helpdesk or provide a service-level agreement.

So going forward, we are going to try to wind down our current mode of operating JupyterHubs, seeking a transition to a more sustainable path! If you currently use our cloud Jupyter services, don’t worry! — they are not going away. Below we enumerate some options for those who want to keep using Pangeo in the cloud.

New Infrastructure Providers

A Pangeo-style cloud environment is more than just a vanilla JupyterHub — it also means access to Dask clusters on demand, plus a specialized software environment. When we started operating JupyterHubs in the cloud three years ago, there were few commercial options available for purchasing these services, and we had to roll our own. Now the situation has changed. Some exciting new companies have recently launched to provide Jupyter together with scalable Dask clusters in the cloud. These include:

  • Coiled: Founded by the Dask creators, Coiled provides Dask as a service to both individuals and enterprises.
  • Saturn Cloud: one of the first to offer Jupyter + Dask as a service.

Both of these are venture-backed companies, and both have current / former Pangeo collaborators on staff. The cloud providers themselves are also showing signs of providing similar services. The UK Met Office informatics lab has been working closely with the Microsoft Azure team to streamline the Pangeo experience there. Microsoft’s newly announced Planetary Computer project team includes long-time Pangeo contributor Tom Augspurger, and we’re excited to see what they build!

Another new player, aimed more at academic research and education, is the International Interactive Computing Collaboration (2i2c). A non-profit, 2i2c was founded by core members of Jupyter and Pangeo. 2i2c’s mission is “to accelerate research and discovery, and to empower education to be more accessible, intuitive, and enjoyable.” Its main activities are as follows:

2i2c provides managed hubs for data science in research and engineering communities. They are tailored for the communities they serve and 💯 open source.

2i2c develops, supports, leads, and advocates for open source tools in interactive computing that are created, used, and controlled by the community.

Going forward, 2i2c will eventually assume operational responsibility for Pangeo’s current cloud JupyterHub and Binder services. These services will remain free to users, thanks to ongoing support from the NSF EarthCube program. 2i2c is hiring an Open Source Infrastructure Engineer to work on this! If you’re interested in shaping the future of interactive cloud computing, please apply!

Finally, for those who wish to deploy and operate their own Pangeo-style cloud infrastructure, Quansight recently launched the excellent QHub project, which massively eases the pains of deployment and management. With QHub, groups can set up their own JupyterHubs with Dask Gateway in minutes using cloud-agnostic open-source code, and then modify environments, users, groups, and more by committing changes to a single YAML file on GitHub. In a similar vein, we have migrated the Pangeo helm chart to the Dask org, meaning its maintenance now benefits from the larger Dask community.

Shifting away from infrastructure operation will allow Pangeo to refocus on our core mission: cultivating the development of innovative open-source tools for solving challenging scientific problems.

New Pangeo Funding and Initiatives

Pangeo will continue to evolve, supported by several major new grants to collaborating institutions.

Project Pythia: Education and Training

Pangeo has massively advanced the capabilities of Python tools for the geosciences in recent years. But these tools will only realize their potential impact if scientists have access to high-quality training for learning to use them!

A new NSF EarthCube grant, awarded to NCAR and the University at Albany, will fund the development of Project Pythia: a community educational resource. The Project Pythia portal aims to provide geoscientists at any point in their career with the educational content and real-world examples needed to learn how to navigate and integrate the myriad packages within the burgeoning Scientific Python Ecosystem. Pythia will cover a range of topics from beginning Python programming to advanced subjects such as developing scalable workflows. A particular emphasis will be placed on migrating workflows to the cloud. Educational content in the Pythia portal will be developed and vetted in part through integration with graduate and undergraduate-level coursework at the University at Albany. More on Project Pythia can be found here.

Pangeo Forge: A Cloud Native Data Repository

Pangeo has blazed a trail towards a “cloud native” way of working with geoscience data. This vision is laid out in the following blog post:

The cloud native approach means avoiding data downloads and instead working directly with massive cloud-based datasets using on-demand scalable computing. We believe that cloud-native has the potential to transform scientific research, making scientists more productive, creative, and flexible.

However, cloud native science requires cloud-based data. Currently, the process of producing analysis-ready, cloud-optimized (ARCO) data is rather painstaking and manual. Our existing Pangeo cloud data catalog is difficult to update.

One option is to wait for data providers such as NASA and NOAA to begin providing their data in ARCO format. Within Pangeo, however, we are taking a more proactive approach: developing a platform for “crowdsourcing” the production of ARCO data. This project is called Pangeo Forge.

The idea of Pangeo Forge is to copy the very successful pattern of Conda Forge for crowdsourcing the curation of an analysis-ready data library. In Conda Forge, a maintainer contributes a recipe which is used to generate a conda package from a source code tarball. Behind the scenes, CI downloads the source code, builds the package, and uploads it to a repository. In Pangeo Forge, a maintainer contributes a recipe which is used to generate an analysis-ready, cloud-based copy of a dataset in a cloud-optimized format like Zarr. Behind the scenes, CI downloads the original files from their source (e.g. FTP, HTTP, or OPeNDAP), combines them using Xarray, writes out the Zarr store, and uploads it to cloud storage.
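The fetch-combine-write pipeline described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual Pangeo Forge API (the function names and URL convention here are hypothetical, and the fetch step is faked with synthetic data so the example is self-contained):

```python
import numpy as np
import xarray as xr

def fetch_source(url):
    # A real recipe would download a netCDF file over FTP/HTTP/OPeNDAP here.
    # We fabricate a one-timestep dataset instead, parsing a fake timestep
    # index from a hypothetical URL naming convention.
    t = int(url.split("-")[-1])
    return xr.Dataset(
        {"sst": (("time", "lat", "lon"), np.random.rand(1, 4, 8))},
        coords={"time": [t], "lat": np.arange(4), "lon": np.arange(8)},
    )

def run_recipe(urls):
    # Download each source file, then concatenate along the time dimension.
    datasets = [fetch_source(u) for u in urls]
    combined = xr.concat(datasets, dim="time")
    # A real recipe would now upload to cloud object storage, e.g.:
    #   combined.to_zarr("gs://some-bucket/dataset.zarr", consolidated=True)
    return combined

ds = run_recipe([f"https://example.com/sst-{i}" for i in range(3)])
```

The payoff is that the many small source files become one coherent, chunked store that downstream users can open lazily, without ever downloading the originals themselves.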

The Pangeo Forge project is just getting off the ground, but we are very excited by the possibilities. Lamont Doherty Earth Observatory recently received a major award from the NSF EarthCube project to build out Pangeo Forge, in partnership with several high-profile data providers. Stay tuned for an announcement with more detail!

Pangeo for Earth System Machine Learning

For over a year now, Pangeo’s Machine Learning Working Group has held an open, monthly meeting to discuss challenges and solutions around big data geoscientific machine learning. The conversations in this meeting have often centered around how to utilize Pangeo’s ecosystem of software and infrastructure to accelerate machine learning with the high-dimensional datasets often found in the geosciences.

Out of those conversations grew the Pangeo-ML project and a now-funded proposal to the NASA ACCESS program. A collaborative effort led by Joe Hamman among CarbonPlan, Columbia University, the University of Wisconsin-Madison, and Anaconda, the Pangeo-ML project aims to make it easier to use the same software tools for interactive data exploration and ML model development. To achieve this, the project team will improve the integration of data manipulation libraries (e.g. Xarray and Pyresample), targeting known pain points in ML applications, and will develop new high-level interfaces between Xarray and deep-learning libraries such as TensorFlow and PyTorch. The Xbatcher project is an early example of an interface tool that helps bridge the gap between Xarray and ML libraries.
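To give a flavor of the gap such interfaces bridge, here is a minimal, hypothetical batching helper in the spirit of Xbatcher (this is not Xbatcher's actual API): it slices a labeled Xarray DataArray into fixed-size batches that a deep-learning framework could consume.

```python
import numpy as np
import xarray as xr

def iter_batches(da, dim, size):
    """Yield consecutive, non-overlapping slices of length `size` along `dim`."""
    n = da.sizes[dim]
    for start in range(0, n - size + 1, size):
        yield da.isel({dim: slice(start, start + size)})

# Fabricated example data: 10 timesteps of a 4x8 field.
da = xr.DataArray(np.random.rand(10, 4, 8), dims=("time", "lat", "lon"))

batches = list(iter_batches(da, dim="time", size=2))
# Each batch keeps its labels and is ready for framework conversion,
# e.g. torch.as_tensor(batch.values) in PyTorch.
```

The value of doing this at the Xarray level, rather than on raw arrays, is that each batch carries its coordinates and metadata, so the exploration and training stages of a workflow can share the same data structures.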

Hypothetical research workflow for a scientist using Pangeo-ML. Data hosted on the cloud is integrated into a familiar ecosystem that provides extract-transform-load functionality as well as exploratory data analysis and visualization. Using the same tool sets, scientists will be able to quickly iterate on model design, training, and validation.

A Fresh Approach to our Weekly Telecons

As a consequence of our drift towards operating cloud infrastructure, our weekly Pangeo meetings have sometimes been dominated by highly technical discussions of Kubernetes Taints and Tolerations (those are real terms, I promise), rather than geoscience software and its applications. As we shift our operational capacity to 2i2c, we also look forward to refreshing the structure of our weekly meeting.

Going forward, this meeting will operate as a virtual seminar, showcasing a cool new tool, dataset, or scientific problem in need of a software / infrastructure solution. Each meeting will feature a 15 minute presentation, which will be recorded and shared on our website. We are currently planning the schedule for Spring 2021, so please don’t hesitate to sign up if you’re interested in presenting!

Full form available at https://forms.gle/hJyhsFvueMXPgqGr6

Through this transition, Pangeo will remain an open, inclusive community. Anyone interested in engaging with the project is encouraged to check out our Discourse forum, which has become the primary forum for discussion and project communication.

Stay tuned for more detailed blog posts about these various new initiatives, and Happy New Year from the Pangeo community!

Ryan Abernathey is an Associate Professor of Earth & Environmental Sciences at Columbia University. https://rabernat.github.io/