Pangeo meets Binder

Joe Hamman
Published in pangeo · 5 min read · Sep 17, 2018

Over the last year, we’ve primarily focused on Pangeo’s cloud-based JupyterHub deployment concept (see the figure below for a schematic and Matt Rocklin’s blog post from January). Our flagship deployment, pangeo.pydata.org, builds on top of Zero to JupyterHub with Kubernetes and has offered our community a chance to imagine what doing science on the cloud might look like. The key features of our JupyterHub deployment are:

  • Runs on Google Cloud Platform using Google Kubernetes Engine (GKE)
  • Provides a multi-user gateway with easy-to-use user authentication (GitHub)
  • Presents a familiar user interface in the form of Jupyter Notebooks
  • Provides scalable data storage using object stores (Google Cloud Storage)
  • Supports data-proximate, scalable computation using dask and dask-kubernetes, backed by xarray and the rest of the scientific Python ecosystem
A simple schematic describing how Pangeo envisions a data proximate science platform using Jupyter as the user interface.

The development and use of our JupyterHub deployments have taught us some important lessons, ranging from the highly social to the highly technical. Here are a few of the most relevant lessons and takeaways:

  • There is a lot of excitement across a broad range of communities around doing science using cloud computing.
  • Managing very large JupyterHub clusters is expensive. Since March 2018, we’ve had over 1000 users visit pangeo.pydata.org. This is awesome but has also yielded some unexpected maintenance costs. These costs were not just related to our monthly bill with Google, but were also borne out in the form of administrative time.
  • Many users want many different things. A byproduct of our broad engagement with the scientific computing community has been requests for many different user environments. For mostly administrative reasons, it has been difficult to satisfy many of these requests.
  • We need a way to package and share what we develop on our JupyterHubs. This becomes particularly important when the JupyterHub doesn’t exist anymore.

The sum of these lessons led us down a new development path. The first step on this path was rolling out a “Pangeo BinderHub”, the primary topic of this blog post. The second step will focus on the federation of the Pangeo JupyterHub concept across many smaller, domain-specific groups. More on that topic in a later blog post.

Binder

Binder is a tool that allows users to take a collection of Jupyter Notebooks, package them in a GitHub repository with some configuration files that describe the software and computing environment, and share them with remote users. Those remote users can then, with a single click, launch a Jupyter notebook server (or an RStudio server) and reproduce the original analysis. The Binder concept offers the scientific computing community something we’ve desperately needed: a high-level framework for packaging and distributing our computing environment, software, and analysis. These are key components that will help us improve the reuse and reproducibility of our science.
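
To make the “single click” concrete, here is a minimal sketch of how a launch link for a GitHub repository can be assembled. It assumes the `/v2/gh/<org>/<repo>/<ref>` URL layout used by BinderHub deployments such as mybinder.org; the helper name `binder_launch_url` and the `my-org/my-analysis` repository are hypothetical and only serve as illustration.

```python
from urllib.parse import quote, urlencode


def binder_launch_url(org, repo, ref, hub="https://mybinder.org", filepath=None):
    """Build a launch URL for a GitHub repository on a BinderHub deployment.

    Assumes the /v2/gh/<org>/<repo>/<ref> layout; `filepath` optionally
    names a notebook to open once the server has started.
    """
    url = f"{hub.rstrip('/')}/v2/gh/{quote(org)}/{quote(repo)}/{quote(ref, safe='')}"
    if filepath:
        url += "?" + urlencode({"filepath": filepath})
    return url


# The dask-examples repository mentioned in this post (branch name assumed).
print(binder_launch_url("dask", "dask-examples", "main"))
# https://mybinder.org/v2/gh/dask/dask-examples/main

# A hypothetical repository, opening a specific notebook on launch.
print(binder_launch_url("my-org", "my-analysis", "main",
                        filepath="notebooks/example.ipynb"))
```

Passing a different `hub` value points the same link at another BinderHub deployment.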

Example of the user interface from mybinder.org. Click the launch button and run the dask-examples!

The Binder team has written a detailed blog post describing the technical details of Binder. They have also deployed, and continue to maintain, a popular BinderHub running at mybinder.org.

Pangeo’s Binder

Over the past few months, we’ve been working on a new Binder deployment that combines some of the core components of Pangeo’s JupyterHub deployment with Binder. Today, we’re excited to announce the beta-launch of binder.pangeo.io.

A quick demo of binder.pangeo.io in action.

Our goal in developing this Binder deployment was to build on the great work of the Binder team and to add functionality for working with large scientific datasets. The main distinction, relative to mybinder.org, is the ability to deploy dask-distributed clusters via dask-kubernetes. This allows us to use a familiar user interface (a Jupyter Notebook), distribute large computations across many remote workers, and directly access large datasets in proximate cloud storage. For users that need a custom software environment to run their analysis (something that was difficult to do on pangeo.pydata.org), Pangeo’s Binder deployment will offer a much more flexible framework. Users should note that there is no persistence in the Pangeo Binder, so it is unlikely to serve as the perfect platform for day-to-day computing.
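
As a rough illustration of what “deploying dask workers on Kubernetes” involves, the sketch below builds a Kubernetes Pod specification for a single dask worker as a plain Python dictionary. It is not the configuration used on binder.pangeo.io: the helper name `dask_worker_pod_spec`, the image name, and the resource values are all assumptions made for this example.

```python
import json


def dask_worker_pod_spec(image, cpu=1, memory_gb=4, nthreads=1):
    """Return a Kubernetes Pod spec (as a dict) describing one dask worker.

    The container runs the `dask-worker` command line tool; CPU and memory
    requests are set equal to the limits so each worker gets a fixed share.
    """
    resources = {"cpu": str(cpu), "memory": f"{memory_gb}G"}
    return {
        "kind": "Pod",
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "dask-worker",
                    "image": image,  # hypothetical image name, see below
                    "args": [
                        "dask-worker",
                        "--nthreads", str(nthreads),
                        "--memory-limit", f"{memory_gb}GB",
                        "--death-timeout", "60",
                    ],
                    "resources": {
                        "limits": dict(resources),
                        "requests": dict(resources),
                    },
                }
            ],
        },
    }


# Illustrative values only: a 2-CPU, 6 GB worker using a made-up image.
spec = dask_worker_pod_spec("example.org/my-analysis:latest",
                            cpu=2, memory_gb=6, nthreads=2)
print(json.dumps(spec, indent=2))
```

dask-kubernetes releases from around the time of this post accepted a pod specification of this general shape as a worker template; newer releases use a different interface, so check the documentation of the version you have installed.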

Give it a spin

By now, we’ve hopefully piqued your interest. The icon below will let you give the Pangeo Binder a test drive with a few examples we’ve put together. These examples each use dask-distributed and most of them are accessing datasets stored on Google Cloud Storage.

Click the button above to give binder.pangeo.io a try!

Building your own Pangeo Binder enabled repository

Now it’s your turn! We developed some online documentation to help you get started packaging your own workflow. The examples referred to above are also available as a good starting point.
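
As a starting point, the sketch below writes a conda `environment.yml` into a `binder/` directory, one of the locations where Binder's build tool (repo2docker) looks for configuration files. The helper name `write_binder_environment` and the package list are assumptions for illustration; the online documentation describes what a Pangeo Binder repository actually needs.

```python
import tempfile
from pathlib import Path

# Illustrative package list (an assumption, not the official Pangeo set).
EXAMPLE_PACKAGES = [
    "python",
    "dask",
    "distributed",
    "dask-kubernetes",
    "xarray",
    "zarr",
    "gcsfs",
    "jupyterlab",
]


def write_binder_environment(repo_dir, packages, name="my-analysis",
                             channels=("conda-forge",)):
    """Write binder/environment.yml under repo_dir and return its path."""
    binder_dir = Path(repo_dir) / "binder"
    binder_dir.mkdir(parents=True, exist_ok=True)

    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {package}" for package in packages]

    env_file = binder_dir / "environment.yml"
    env_file.write_text("\n".join(lines) + "\n")
    return env_file


# Demonstrate in a temporary directory; point repo_dir at a real
# repository checkout to use the file.
with tempfile.TemporaryDirectory() as tmp:
    path = write_binder_environment(tmp, EXAMPLE_PACKAGES)
    print(path.read_text())
```

Commit the resulting `binder/environment.yml` alongside your notebooks, then pin package versions once the environment works so that later builds reproduce the same software stack.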

Future development

We are just getting started with Binder. We are excited to see how the community uses Binder for sharing Pangeo tools and workflows. In the short term, we’ll be working on some new features on binder.pangeo.io, including user authentication, user quotas, and usage analytics. We also hope to continue to push new features and bug-fixes upstream to the Jupyter and Binder ecosystems. Longer term, we’re excited to explore new ideas related to publishing scientific findings with Binder and Pangeo.

Acknowledgements (alphabetical order)

Ryan Abernathey (Columbia University), Joe Hamman (NCAR), Tim Head (Wild Tree Tech), Chris Holdgraf (UC Berkeley), Yuvi Panda (UC Berkeley), Min Ragan-Kelley (Simula Research Laboratory), and Matthew Rocklin (Anaconda Inc.) all contributed to the development of binder.pangeo.io and/or this blog post. Special thanks to the Jupyter and Binder teams for their invaluable help along the way.

The Pangeo project is currently funded through grants from the National Science Foundation and the National Aeronautics and Space Administration (NASA). Google provides compute credits on Google Compute Engine.
