pangeo

A community platform for big data geoscience

Managing Pangeo Environments for Distributed Computing

--

By Tom Augspurger, Scott Henderson, and Joe Hamman

Managing computational environments has been a persistent challenge for both administrators and users of Pangeo distributed computing infrastructure. While there are still improvements to be made, we’re reasonably happy with our current setup. This post will discuss

  1. The challenges we face,
  2. Our goals for Pangeo’s environment management system,
  3. The solutions we’ve implemented, and
  4. Guidelines for users wanting to customize an environment.

But first: what do we mean by environment management? Roughly speaking, we’re talking about the set of packages (typically Python packages with C or Fortran extension modules) available to users when they start a session on Pangeo. A session is typically, though not exclusively, a JupyterLab session.

The Challenges

(xkcd #1987, “Python Environment”: https://xkcd.com/1987/)

While Python environment management is already famously complex, we face a few additional challenges. First, we’re using Dask for distributed computing, so we need to ensure that the environments on the client, scheduler, and worker machines in the Dask cluster are compatible.

Second, the user environment has to work in a somewhat complicated “runtime” that involves UI elements like JupyterLab and distributed computing elements like Dask and Dask Gateway. In the current design, these “runtime” packages must be in the user environment. We’ve found it difficult to ensure that an environment brought by the user contains the versions of, say, Dask that are expected by the runtime. This is a subtle point worth repeating: both users and Pangeo administrators want to dictate specific package versions in the user environment. That negotiation has been one of the toughest technical problems to solve.
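The kind of check this requires can be sketched in a few lines: gather the package versions reported by the client, scheduler, and each worker, and flag any package whose versions disagree. (dask.distributed exposes a real version of this as Client.get_versions(check=True); the helper below is an illustrative reimplementation, not the actual API.)

```python
def find_version_mismatches(environments):
    """Compare package versions across machines.

    environments: mapping of host name -> {package: version}.
    Returns {package: {host: version}} for every package whose
    reported versions differ (missing packages count as None).
    """
    # Collect every package seen in any environment.
    packages = set()
    for versions in environments.values():
        packages.update(versions)

    mismatches = {}
    for package in sorted(packages):
        # Record the version (or None if absent) reported by each host.
        seen = {host: versions.get(package)
                for host, versions in environments.items()}
        if len(set(seen.values())) > 1:
            mismatches[package] = seen
    return mismatches


envs = {
    "client": {"dask": "2.15.0", "distributed": "2.15.2"},
    "scheduler": {"dask": "2.15.0", "distributed": "2.15.2"},
    "worker-0": {"dask": "2.14.0", "distributed": "2.15.2"},
}
print(find_version_mismatches(envs))
# {'dask': {'client': '2.15.0', 'scheduler': '2.15.0', 'worker-0': '2.14.0'}}
```

A mismatch like the one above (a worker running an older dask than the client) is exactly the failure mode that motivates pinning versions centrally rather than per-machine.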

Finally, we can’t tell our users to “just build a Docker image”. We’re using Kubernetes for the cloud deployments, so at the end of the day we do need a Docker image. But building that Docker image has to be Pangeo’s burden rather than the user’s. Our users are primarily scientists; we can’t expect them to be familiar with the intricacies of Docker, and to the extent possible, we’d like them not to notice environment management at all.

The Goals

In the end, we have a few requirements for our environment management system. We’d like something that

  1. Builds reproducible Docker images for the client, scheduler, and worker pods that satisfy the runtime constraints of the Pangeo deployment (e.g. the version of Dask-Gateway in the environment must be compatible with the version of Dask-Gateway running on the Kubernetes cluster).
  2. Provides extension points to customize the built image at various stages, including the set of packages installed, environment variables set, etc., in a way that’s consistent with the first requirement.
  3. Doesn’t require knowledge of Docker, but provides custom Dockerfiles as an extension point.
  4. Is reasonably simple to understand and maintain.
  5. Builds relatively small Docker images for faster startup times when the cluster is scaled.

To achieve those goals, we’ve implemented a system that’s working pretty well.

Our Setup

Our current setup involves three high-level pieces:

  1. Upstream packages
  2. Conda metapackages
  3. Pangeo Docker images

Combined, these pieces resolve to a specific environment for a user. Let’s go through them in turn.

Upstream packages: An upstream package is something like Dask or JupyterLab. These are packages in the user environment that are necessary for the “Pangeo runtime”. Pangeo administrators control these packages and want to keep them up to date with new releases. We’ve enlisted pangeo-bot to handle regular package updates: when, say, Dask issues a new release to conda-forge, pangeo-bot automatically makes a pull request to update Pangeo’s metapackages. More on pangeo-bot in a future post.

Pangeo’s metapackages: Pangeo contributors maintain several conda metapackages. A metapackage is a conda package with no files, only metadata: its dependency list pulls in a pinned set of other packages. These metapackages allow us to group dependencies and ensure consistent versions across environments.

Today, we have two metapackages available on conda-forge:

  • pangeo-dask: Packages related to using Dask on Pangeo. These packages need to be in both user and worker environments.
  • pangeo-notebook: Packages that are only needed in the user notebook environment. Here we include things like JupyterLab and the Dask-labextension.
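To make this concrete, a metapackage recipe is pure metadata. Here is a sketch of what a conda-build meta.yaml for such a package might look like; the pins and versions are illustrative placeholders, not the actual pangeo-dask recipe:

```yaml
# Illustrative metapackage recipe: no files are shipped, only run
# requirements that pin the packages Pangeo expects to be compatible.
package:
  name: pangeo-dask
  version: "2020.05.04"

build:
  number: 0
  noarch: generic

requirements:
  run:
    - dask >=2.15          # placeholder pin
    - distributed >=2.15   # placeholder pin
    - dask-gateway >=0.7   # placeholder pin
```

Installing the metapackage then drags in a consistent, tested set of runtime dependencies with a single line in an environment file.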

Pangeo’s Docker images: Pangeo deployments are running JupyterHub on Kubernetes, which requires that a user’s Jupyter environment be encapsulated in a Docker image. We have created a new repository pangeo-docker-images, which provides continuous building of reproducible Docker images that contain compatible Conda packages using the machinery described above. In brief, this repository controls:

  1. The pangeo/base-image image, which can be inherited from and customized with repo2docker-like extension points.
  2. Ready-to-use images like pangeo/pangeo-notebook and pangeo/ml-notebook, which provide a consistent environment for testing and common use-cases.

The repo2docker-like extension points in pangeo/base-image are crucial. These are configuration files specified by repo2docker, like environment.yml, postBuild, and start, that allow users to control certain stages of the Docker build process without having to understand Docker. We previously used repo2docker directly, but struggled with the complexity of the setup, the size of the built images, and Conda timeouts when updating packages. Using a custom base-image gives us more control over how conda commands are run and lets us significantly slim down the images: our base-notebook environment is now just 250 MB compressed.
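As an illustration of those extension points, here are two hypothetical repo2docker-style scripts. The file names and the exec pattern follow repo2docker’s conventions; the specific commands are placeholders, not files from the Pangeo repos. postBuild runs once while the image is built; start wraps the container’s startup command:

```bash
#!/bin/bash
# binder/postBuild: executed once at image build time.
# Placeholder example: install a JupyterLab extension into the image.
jupyter labextension install @pyviz/jupyterlab_pyviz
```

```bash
#!/bin/bash
# binder/start: wraps the container's startup command at runtime.
# Placeholder example: point the Dask dashboard link at the JupyterHub proxy.
export DASK_DISTRIBUTED__DASHBOARD__LINK="/user/{JUPYTERHUB_USER}/proxy/{port}/status"
# Hand off to the original command (required by repo2docker's start contract).
exec "$@"
```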

Customizing environments

For Pangeo Binder users, the easiest and most common customization is adding specific packages to the environment. For example, suppose we want an environment with xarray, hvplot, and sat-search. This can be done by including the following files in the binder repo:
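The original post embedded the two files here; a representative sketch follows. The image tag and the metapackage version are illustrative and should match a released pangeo-docker-images tag:

```dockerfile
# binder/Dockerfile: inherit from the Pangeo base image, which provides
# the repo2docker-like extension points (tag shown is illustrative).
FROM pangeo/base-notebook:2020.05.04
```

```yaml
# binder/environment.yml: conda packages for the custom environment.
channels:
  - conda-forge
dependencies:
  - pangeo-notebook==2020.05.04  # pins dask, jupyterlab, etc.
  - xarray
  - hvplot
  - sat-search
```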

The binder/Dockerfile ensures that you use our base image, which supports the same customization points as repo2docker. The binder/environment.yml provides a place to specify the conda packages you want in the environment. Including the pangeo-notebook==2020.05.04 metapackage ensures that the versions of dask, jupyterlab, etc. in the environment match the versions expected by the Pangeo runtime.

The pangeo-binder-template repository provides a template repository with all the necessary configuration for users to create custom environments.

For many users, however, even this level of customization is overkill. There’s a set of common packages used by the community that satisfies the needs of most users. So we’ve developed a “default binder” repository with a common set of packages that can be used by many binders. More on that in a future post.

Summary

Managing computational environments for Pangeo JupyterHubs and BinderHubs has been challenging. Distributed computing means that we have multiple machines that must share a compatible environment. Additionally, the “Pangeo runtime” requires specific versions of packages installed in the user’s environment. Finally, it’s hard to build a single environment that works perfectly for everyone, so we want to let users customize the environment at various points.

The conda metapackages managed by Pangeo ensure that the user environment has the versions of Dask, JupyterLab, etc. expected by Pangeo. Combined with Pangeo’s Docker images and their repo2docker-like extension points, they make it possible to create an environment that fits your needs.
