Pangeo 2.0

New Funding and New Directions

A Brief History of Pangeo

The Pangeo project effectively began in 2016 with a workshop at Columbia University. The schedule is still online and is fun to review. The workshop was an exciting mix of science and technology, a dynamic that continues to characterize Pangeo today. The mission for Pangeo developed at that workshop has stood the test of time.

Logo soup of organizations involved in Pangeo in some way. Nothing is implied by the relative positioning of the logos. (Apologies to those who were left out…it can be hard to keep track!)
The Pangeo development process
  • The evolution of file formats and tools for storing climate-style data in cloud object storage. Matthew Rocklin’s initial post on the drawbacks of HDF5 in the cloud helped drive the adoption of, and experimentation with, new cloud-native array formats like Zarr, TileDB, and Cloud-Optimized GeoTIFF. Pangeo implemented the Zarr-Xarray integration, which, together with our Cloud Data Guide, made it easier for scientists to bring their data to the cloud in analysis-ready form, and many user-supplied datasets are now publicly cataloged. These technologies are achieving much broader adoption and are driving new public climate datasets on Google Cloud and AWS. Zarr is on track to become an OGC Community Standard. As an added bonus, we finally figured out how to read archives of netCDF/HDF5 data in the cloud efficiently. We have also worked to improve tools for loading and cataloging data, including fsspec, Intake, intake-stac, and intake-esm.
  • Dramatic improvements to the Dask experience in the cloud (and on HPC). This has of course been a much broader effort, but the Pangeo community has contributed significantly to the development of Dask Cloud Provider, Dask Gateway, Dask LabExtension, and Dask Jobqueue, all of which simplify the deployment and management of Dask clusters in different contexts. Pangeo science users also tend to push the limits of Dask + Xarray in terms of computational complexity. Continuous iterative improvement via GitHub issues and pull-requests has slowly but steadily improved Dask and Xarray performance and reliability.
  • Contributions to the development of interactive visualization tools such as the HoloViz suite, which allows interaction with massive datasets through Datashader and Bokeh.
  • Development of a rich ecosystem of software packages that leverage the foundations provided by Xarray and Dask to provide advanced analysis capabilities for the ocean / weather / climate domain.
  • Innovation around the sharing and publication of reproducible, real-world Jupyter notebooks that use big data in the cloud, via tools such as binderbot.
  • The operation of a sophisticated CI system for automatically building Docker images with complete Pangeo environments for use in our various cloud hubs and binders.
  • Dozens of educational / training events around the world (see the partial list and the Pangeo YouTube playlist).

New Infrastructure Providers

A Pangeo-style cloud environment is more than just a vanilla JupyterHub — it also means access to Dask clusters on demand, plus a specialized software environment. When we started operating JupyterHubs in the cloud three years ago, there were few commercial options available for purchasing these services, and we had to roll our own. Now the situation has changed. Some exciting new companies have recently launched to provide Jupyter together with scalable Dask clusters in the cloud. These include:

  • Coiled: Founded by the Dask creators, Coiled provides Dask as a service to both individuals and enterprises.
  • Saturn Cloud: one of the first to offer Jupyter + Dask as a service.
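The "Dask clusters on demand" experience described above boils down to a few lines of user code. The sketch below uses a `LocalCluster` as a runnable stand-in; on an actual Pangeo hub the cluster would typically come from Dask Gateway instead (shown in the comment), and the array sizes here are arbitrary examples.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# Stand-in for a cloud cluster. On a Pangeo-style hub this would instead be:
#   from dask_gateway import Gateway
#   cluster = Gateway().new_cluster()
#   cluster.scale(10)
cluster = LocalCluster(n_workers=2, threads_per_worker=1, dashboard_address=None)
client = Client(cluster)

# A chunked computation is scheduled transparently across the workers,
# whether they are local processes or pods in a cloud Kubernetes cluster.
x = da.random.random((2000, 2000), chunks=(500, 500))
result = float(x.mean().compute())
print(f"mean = {result:.3f}")  # close to 0.5

client.close()
cluster.close()
```

The point is that the same user-facing code works unchanged as the cluster behind `Client` scales from a laptop to hundreds of cloud workers — that separation of interface from deployment is what the services listed above sell.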

New Pangeo Funding and Initiatives

Pangeo will continue to evolve, supported by several major new grants to collaborating institutions.

Project Pythia: Education and Training

Pangeo has massively advanced the capabilities of Python tools for the geosciences in recent years. But these tools will only realize their potential impact if scientists have access to high-quality training in how to use them!

Pangeo Forge: A Cloud Native Data Repository

Pangeo has blazed a trail towards a “cloud native” way of working with geoscience data. This vision is laid out in a separate blog post.

Pangeo for Earth System Machine Learning

For over a year now, Pangeo’s Machine Learning Working Group has held an open, monthly meeting to discuss challenges and solutions around big data geoscientific machine learning. The conversations in this meeting have often centered around how to utilize Pangeo’s ecosystem of software and infrastructure to accelerate machine learning with the high-dimensional datasets often found in the geosciences.

Hypothetical research workflow for a scientist using Pangeo-ML. Data hosted on the cloud is integrated into a familiar ecosystem that provides extract-transform-load functionality as well as exploratory data analysis and visualization. Using the same tool sets, scientists will be able to iterate quickly on model design, training, and validation.

A Fresh Approach to our Weekly Telecons

As a consequence of our drift towards operating cloud infrastructure, our weekly Pangeo meetings have sometimes been dominated by highly technical discussions of Kubernetes Taints and Tolerations (those are real terms, I promise), rather than geoscience software and its applications. As we shift our operational capacity to 2i2c, we also look forward to refreshing the structure of our weekly meeting.


