The 2018 Pangeo Developers Workshop
The first Pangeo Developers Meeting was held at NCAR in Boulder, Colorado from Aug. 13–15. The purpose of this meeting was to allow our highly distributed group of contributors to talk face to face, brainstorm ideas for the future of Pangeo, and address pressing technical challenges head on. Although the workshop was deliberately small (30 people; see full list below), we had attendees from as far as the UK, France, and Australia! The workshop consisted of a few talks by various Pangeo contributors as well as plenty of free, unstructured time for code sprints and discussion.
Below I sum up some of the most exciting outcomes of the meeting. This is a large dump of information — we plan to have some more specialized blog posts on some of the topics mentioned below in the near future.
Pangeo + Jupyter
Pangeo leans heavily on Jupyter for both its user interface (Jupyter Lab and Notebook) and its cloud architecture (based on zero2jupyterhub-k8s). So we were thrilled to have three members of the Berkeley Jupyter team attend the workshop: Ian Rose, a geophysicist and UI expert on the Jupyter Lab project; Yuvi Panda, devops guru and mastermind of all things cloud; and Fernando Perez himself, the creator and BDFL of the Jupyter project. Over the course of the workshop, the three of them lent their unique knowledge and experience to various ongoing efforts within Pangeo (detailed below). Overall we were excited to learn that Fernando sees Earth Science as a high priority for his own future research and development efforts; we are looking forward to collaborating even more closely with Project Jupyter!
Analyzing Climate Model Data on Cheyenne
Being at NCAR, we were excited to advance Pangeo’s capabilities for analyzing climate model data living on Cheyenne, NCAR’s flagship supercomputer. Matt Long, Mike Levy, and Gustavo Marques all worked on various aspects of this problem.
Matt and Mike worked on making it easy to launch Jupyter notebooks on Cheyenne by developing some custom launch scripts. (There was much discussion over how we could get CISL to provide more official support for these tools, which would make the process a lot smoother. The excellent Jupyter support at NERSC looks like a good goal to strive for.) They developed a simple example notebook for loading and analyzing data from the CESM Large Ensemble Project in parallel using dask jobqueue.
Gustavo instead focused on exporting data from the new MOM6 ocean model into zarr format (using dask jobqueue on Cheyenne) and uploading it to Google Cloud Storage. We were all impressed by how quickly Gustavo spun up on the Pangeo stack. We now have a new example notebook for MOM6 which runs on pangeo.pydata.org! Just log on and try it out.
Enhancing the Pangeo Cloud Experience
In addition to our work on traditional HPCs like Cheyenne, Pangeo has been increasingly focused on using the commercial cloud for large-scale scientific data analysis. The cloud offers many technical and social advantages over traditional HPC, including easy user access, access to scalable object storage, and the ability to scale quickly to large numbers of compute nodes. Pangeo’s experimental cloud service — pangeo.pydata.org — has been running since March. Although the service is public, it is not exactly “production ready” — our goal has been to learn about how to work with the cloud, and we have learned a ton so far!
This workshop gave us an opportunity to start upgrading the Pangeo cloud experience. A central goal is to provide more diagnostic information about the cluster to users. Building on prototypes he developed for the UK Met Office Pangeo cluster, Jacob Tomlinson worked on expanding Grafana-based real-time monitoring of cluster usage statistics. He has developed an amazing dashboard that provides all sorts of useful information (see pangeo-data/pangeo#359). The next step is to integrate this directly into Jupyter Lab as an extension.
Speaking of extensions, Ian Rose worked on a killer feature that users clearly want: integration of the dask dashboard directly into Jupyter Lab. We already have the ability to launch dask clusters interactively from notebooks thanks to tools like dask-jobqueue and dask-kubernetes, but the dashboards from these clusters appear in another browser window. This extension would be a game changer in terms of creating interactive user experiences within Jupyter Lab itself. Ian’s preliminary work is on GitHub.
Finally, Yuvi Panda continued his heroic efforts to democratize access to cloud computing for scientists, from which Pangeo has already benefitted immensely. Yuvi showed off The Littlest JupyterHub, a project which makes it dead simple to deploy a basic Jupyter Hub in a range of different circumstances. We brainstormed some ideas for what the “Littlest Pangeo” might look like. He also worked on HubPloy, a tool to simplify the deployment of Kubernetes-based Jupyter Hubs. HubPloy would solve lots of the headaches Pangeo is facing in managing cloud clusters. In our discussions, Yuvi shared his thoughts about how projects like Pangeo can maintain freedom and independence from the cloud computing giants by designing their platform in a way that avoids vendor lock-in. He summarized his advice in a blog post. (See also a related post by Matthew Rocklin.)
This post wouldn’t be complete without an acknowledgement of the important role played by Matthew Rocklin, who generally spent his time at the workshop bouncing around and lending his expertise wherever it was needed. Beyond his deep technical knowledge (Matt is the creator of Dask, a key part of the Pangeo platform), he is great at communicating with people of all backgrounds and keeping technical meetings focused and productive. Thanks Matt!
Pangeo + Binder
The Binder Project allows you to turn any Jupyter notebook stored in a GitHub repo into an actually running notebook (with all its dependencies) in the cloud; this technology is a quantum leap for scientific reproducibility.
By default, Binder launches these notebooks into a dedicated Jupyter Hub running on Google Cloud, but we would like the ability to launch directly into a Pangeo cloud deployment. This would enable the Binder notebook to take advantage of dask-based parallelism (via dask-kubernetes) and access any cloud-based data stores associated with that cluster.
Joe Hamman is pushing hard to make this a reality. With help from the Jupyter folks, he has developed a prototype solution, up on GitHub as a customized helm chart. We should soon expect to see a test Pangeo Binder service up and running! We hope these efforts will benefit the broader scientific community by expanding the flexibility of the Binder service. A longer blog post on this will be up very soon!
Pangeo + Cloud Optimized GeoTIFF
The GIS community has done a great job defining a format for “analysis ready data”: the Cloud Optimized GeoTIFF (COG). Pangeo aims to make it easy for scientists to use cloud computing for interactive, parallel analysis. So this is potentially a match made in heaven.
Dan Rothenberg and Scott Henderson worked hard on applying Pangeo cloud deployments to process COGs stored in Google Cloud Storage and Amazon S3. Dan focused on improving xarray-COG integration by developing automatic chunk-size detection for GeoTIFFs read by rasterio (see pull requests pydata/xarray#2255 and dask/dask#3878). Meanwhile, Scott developed a very cool example notebook showing off how quickly and efficiently xarray + dask + rasterio can pull and analyze data from AWS COGs in parallel. There is lots of potential here. Expect a more detailed blog post on this work in the near future!
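The idea behind automatic chunk-size detection can be sketched roughly as follows. This is an illustration only, not the code from the pull requests cited above: the function name `aligned_chunks`, its parameters, and the budget of about one million pixels per chunk are all assumptions made for this example. The sketch picks a chunk shape that is a whole multiple of the file's internal block (tile or strip) shape, so that no block has to be read and decompressed more than once.

```python
def aligned_chunks(image_shape, block_shape, target_pixels=1_000_000):
    """Return a (rows, cols) chunk shape that is a whole multiple of the
    file's internal block shape and holds roughly `target_pixels` pixels.

    Hypothetical helper for illustration; not part of xarray, dask, or rasterio.
    """
    height, width = image_shape
    block_h, block_w = block_shape

    # Number of internal blocks needed to cover the image along each axis.
    blocks_down = -(-height // block_h)   # ceiling division
    blocks_across = -(-width // block_w)

    # How many whole blocks fit in the pixel budget (at least one).
    budget = max(1, target_pixels // (block_h * block_w))

    # Grow the chunk along a row of blocks first, then stack rows of blocks.
    n_across = min(blocks_across, budget)
    n_down = min(blocks_down, max(1, budget // n_across))

    # Never report a chunk larger than the image itself.
    return (min(n_down * block_h, height), min(n_across * block_w, width))


# A 10980 x 10980 image stored in 512 x 512 tiles.
print(aligned_chunks((10980, 10980), (512, 512)))   # (512, 1536)

# The same image stored as one-row strips.
print(aligned_chunks((10980, 10980), (1, 10980)))   # (91, 10980)
```

A shape like this is the kind of value one would hand to a lazy raster reader as its chunking, so that each parallel task reads a set of complete blocks.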
Cloud Ready Data
In a recent post on medium.com, I laid out a (slightly tongue-in-cheek) vision for cloud-native data repositories, starting from the observation that the volume of scientific datasets is growing at an exponential rate and scientists are struggling to keep up.
As described clearly in a blog post by Matthew Rocklin, the science community currently lacks a good standard for cloud-optimized NetCDF / HDF type data. At the developers workshop, I summed up our positive experience so far with zarr, a promising new storage format for chunked, compressed multidimensional numeric arrays. But I noted that zarr was still far from a community standard.
So I was pleasantly surprised to learn from Ward Fisher, NetCDF Team Lead and Lead Developer at Unidata, that the NetCDF group is considering making zarr the back-end for its next-generation NetCDF library. (The other contender is TileDB, and it’s possible we will see support for both libraries within NetCDF.) This would truly be a game changer, allowing data providers to host NetCDF data in the cloud in a way that is simultaneously archive-quality and analysis-ready. We had a useful discussion of the pros and cons of zarr vs. TileDB. Overall I’m satisfied to see that Pangeo has managed to influence this important discussion.
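To make concrete why a chunked, compressed layout suits object storage, here is a toy sketch in plain Python. It is not the zarr library or its storage specification: the key naming, the use of `zlib`, the fixed 8-byte float encoding, and the helper names `write_chunked` and `read_chunk` are all assumptions chosen for illustration. A dict stands in for a cloud bucket; each chunk is compressed and stored under its own key, so a reader can fetch one chunk without touching the rest.

```python
import struct
import zlib


def write_chunked(store, name, data, chunk_shape):
    """Split a 2-D list of floats into chunks and store each one compressed.

    `store` is any mutable mapping (a dict here, standing in for a bucket).
    """
    n_rows, n_cols = len(data), len(data[0])
    c_rows, c_cols = chunk_shape
    for i in range(0, n_rows, c_rows):
        for j in range(0, n_cols, c_cols):
            block = [row[j:j + c_cols] for row in data[i:i + c_rows]]
            flat = [value for row in block for value in row]
            raw = struct.pack(f"<{len(flat)}d", *flat)
            key = f"{name}/{i // c_rows}.{j // c_cols}"
            store[key] = zlib.compress(raw)
    # A tiny metadata record so readers know the layout.
    store[f"{name}/.meta"] = {"shape": (n_rows, n_cols), "chunks": chunk_shape}


def read_chunk(store, name, chunk_index):
    """Fetch and decode a single chunk, addressed by its (row, col) index."""
    meta = store[f"{name}/.meta"]
    n_rows, n_cols = meta["shape"]
    c_rows, c_cols = meta["chunks"]
    ci, cj = chunk_index
    # Edge chunks may be smaller than the nominal chunk shape.
    rows = min(c_rows, n_rows - ci * c_rows)
    cols = min(c_cols, n_cols - cj * c_cols)
    raw = zlib.decompress(store[f"{name}/{ci}.{cj}"])
    flat = struct.unpack(f"<{rows * cols}d", raw)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


# A 4 x 6 array split into 2 x 4 chunks gives a 2 x 2 grid of chunks.
data = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
bucket = {}
write_chunked(bucket, "temperature", data, (2, 4))

print(sorted(k for k in bucket if not k.endswith(".meta")))
print(read_chunk(bucket, "temperature", (1, 1)))   # [[16.0, 17.0], [22.0, 23.0]]
```

Because every chunk lives under an independent key, many workers can read different chunks at the same time, which is the property that makes this style of layout attractive for parallel analysis in the cloud.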
Our platform aims to unlock the untapped scientific value contained in large datasets, so making it easy to discover and load data is a crucial need. During the code sprints, several of us took on the problem of how to build data catalogs for Pangeo. To this end, Rich Signell of USGS and Andrew Pawloski of Element84 worked on connecting Pangeo cloud deployments to OGC metadata services. Their example notebooks show how it’s possible to search for data from existing catalogs, connect to remote OPeNDAP endpoints or buckets with zarr datasets, then load and process the data in a streaming fashion using dask. Focusing instead on our newly created zarr cloud datastores, I worked a bit on integrating Martin Durant’s new Intake data catalog tool with Pangeo. Intake makes it easy to expose data sources to users from within Python and could help remove some friction from the data discovery / loading process. We are especially excited about the possibility of integrating these tools with Jupyter Lab, allowing users to browse and load data visually. To this end, Ian Rose helped us get started on developing custom Jupyter Lab Extensions.
Finally, Nick Mortimer of CSIRO worked on transcoding ARGO float data into zarr and uploading to cloud storage. His goal is to make it easier to process this extremely rich dataset en masse in order to train machine learning models.
Connecting with Legacy Geoscience Libraries
Pangeo aims to build a platform for the future of data-driven geoscience research, based on the modern scientific python ecosystem. But there is a huge amount of knowledge embodied by legacy geoscience software that is not easily accessible to this ecosystem. To this end, we were thrilled to have the participation of Ben Koziol of NOAA ESRL — an expert in the Earth System Modeling Framework — and Mary Haley of NCAR — leader of the NCL team. Ben is developing a proof of concept notebook for wrapping ESMF routines with dask.delayed. The success of Jiawei Zhuang’s xesmf library shows there is great appetite for ESMF-based tools that integrate well with the python ecosystem. Mary spent a long time talking to Fernando Perez about how to best integrate NCL’s powerful computational and visualization routines with python. We are excited to see where this leads.
Pangeo started out very organically, but as the project grows, we see a need for a more formal governance structure. Having some formal governance will help us develop the project in a more focused way, coordinate our relationships with other organizations, pursue funding more effectively, and ensure wider participation and diversity. Borrowing heavily from Project Jupyter, we developed an official Governance Repo. The basic idea is that, like many other open source projects, Pangeo will be governed by a steering council consisting of active project contributors. The details are still under discussion, and we welcome feedback from the broader community. A particular priority for the steering council will be to figure out how to make Pangeo a more diverse community, which our organic process so far has not managed to achieve. A starting point for this is our new official Code of Conduct, which aims to make Pangeo as welcoming as possible to people of all backgrounds.
Workshop Talks: Day 1
- 9:00–9:15: Welcome and Logistics — Kevin Paul
- 9:15–9:30: Introductions
- 9:30–10:30: The evolution of Pangeo — Ryan Abernathey
- 10:30–11:00: The Pangeo Principles — Jacob Tomlinson
Data Proximate Science Gateways
- 11:25–11:30: Intro — Kevin Paul
- 11:30–11:55: Jupyter Team — Yuvi Panda and Ian Rose
- 11:55–12:30: Discussion / more talks
Science Use Cases
- 2:00–2:35: Why I’m interested in Pangeo as a scientist — Matt Long
- 2:35–2:55: Moving satellite radar processing and analysis to the Cloud — Scott Henderson
- 2:55–3:15: Notebook Examples / Discussion — Rich Signell et al.
Analysis-Ready Data Formats
- 3:35–3:55: My Experience with Storing Xarray Datasets in the Cloud using Zarr — Ryan Abernathey
- 3:55–4:15: NetCDF’s plans for cloud-based data — Ward Fisher
Workshop Talks: Day 2
- 9:00–9:25: Summary of relevant Dask updates and challenges today — Matt Rocklin
- 9:25–9:45: Updates to MetPy based on Xarray — Ryan May
- 9:45–10:05: ESMF, OpenClimateGIS, and the Birdhouse WPS Stack: Connecting to Pangeo — Ben Koziol
- 10:05–10:25: NCL & Pangeo — Mary Haley
Federation and Sustainability
- 11:20–11:40: Pangeo and NASA’s Cloud Hosted Earth Observing System Data — Joe Hamman
- 11:40–12:00: Discussion around federation across multiple cloud / hpc platforms
- 1:30–2:15: Discussion of outreach efforts, training (tutorials), and diversity efforts — Kevin Paul
Workshop Attendees
- Adekunle Ajayi, Institut des Géosciences de l’Environnement, Université Grenoble Alpes, France
- Andrew Pawloski, Element 84
- Aurélie Albert, Université Grenoble Alpes - IGE
- Ben Koziol, NESII/CIRES/NOAA-ESRL
- Bill Ladwig, NCAR
- Chiara Lepore, Lamont Doherty Earth Observatory of Columbia University
- Daniel Rothenberg, ClimaCell
- Fernando Perez, UC Berkeley
- Gustavo Marques, NCAR
- Ian Rose, UC Berkeley (Project Jupyter/Earth and Planetary Science)
- Jacob Tomlinson, Met Office
- Jeff de La Beaujardiere, NCAR/CISL/ISD
- John Allen, Central Michigan University
- John Exby, Jupiter
- Joseph Hamman, National Center for Atmospheric Research
- Kevin Hallock, NCAR
- Kevin Paul, NCAR
- Luke Madaus, Jupiter
- Matthew Long, NCAR
- Matthew Rocklin, Anaconda Inc
- Michael Levy, NCAR
- Niall Robinson, Met Office Informatics Lab
- Nick Mortimer, CSIRO
- Rich Signell, USGS
- Rick Brownrigg, NCAR
- Ryan Abernathey, Columbia University / Lamont Doherty Earth Observatory
- Ryan May, UCAR/Unidata
- Scott Henderson, University of Washington
- Yuvi Panda, UC Berkeley / Project Jupyter