Deep Learning with GPUs on Pangeo
by Jeffrey Sadler, Joe Hamman, and Scott Henderson
Intro/Motivation
Deep Learning (DL) is rapidly growing as a powerful tool for scientific and commercial applications (for example, predicting COVID-19 protein structures, predicting severe weather, and real-time speech translation). To support DL modeling in the cloud, we created a DL-specific deployment of Pangeo. Pangeo DL combines the scalable processing tools and computational resources of our previous Pangeo deployments with the addition of DL-specific hardware and software, all in the cloud.
DL workflows have different requirements from the use cases Pangeo has tackled to this point. The IO-heavy use cases that Pangeo has largely focused on can leverage hundreds of CPUs through parallelization libraries like Dask (e.g., analyzing hundreds of terabytes of output from the Coupled Model Intercomparison Project Phase 6 (CMIP6)), but DL workflows commonly require a different architectural design. In particular, DL greatly benefits from more specialized hardware, like GPUs, which make DL model training much faster than CPUs alone. To support this, we've added functionality to a few Pangeo cloud deployments that gives DL modelers access to a custom environment with a GPU and GPU-enabled Python libraries.
Implementation
We have deployed Pangeo DL in both the AWS and GCP Kubernetes services. Implementing the Pangeo DL deployment consists of three major tasks:
1. Defining the Docker image with DL and GPU libraries
In addition to the staple Pangeo packages (e.g., Dask, Xarray), we need to include libraries in the Docker image to support DL. These include DL libraries like TensorFlow and PyTorch, more general Machine Learning (ML) packages like scikit-learn, and GPU-specific Python libraries like CuPy. We regularly update these images and push them to Docker Hub for use in any JupyterHub system.
docker pull pangeo/ml-notebook:latest
More details on the workflow for creating and maintaining these images can be found in this blog post.
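To make this step concrete, here is a minimal sketch (not part of the image's documentation) of what the GPU-enabled stack buys you once the image is running on a GPU node; CuPy mirrors the NumPy API but executes on the GPU:
# A minimal sketch: CuPy arrays live in GPU memory and computations
# run on the GPU. Requires a node with an attached NVIDIA GPU.
import cupy as cp

x = cp.arange(1_000_000, dtype=cp.float32)  # allocated on the GPU
total = x.sum()                             # reduction executes on the GPU
print(float(total))                         # copies the scalar back to the host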
2. Making a node pool with GPUs
Because our Pangeo JupyterHubs run on top of Kubernetes, we also need a node pool in which the nodes each have an attached GPU. Most cloud providers make this pretty easy. Here’s an example of how to do it with Google Cloud:
$ KIND="nvidia-tesla-p100"$ gcloud container node-pools create gpu-pool \ --accelerator type=${KIND},count=0 \ --zone ${ZONE} --cluster ${CLUSTER_NAME} \ --num-nodes 3 --min-nodes 0 --max-nodes 5 --enable-autoscaling
And here is a GitHub gist with the AWS configuration using the eksctl command line tool for cluster creation.
3. Configuring JupyterHub to connect with the GPU
Once GPUs are included in the node pool, the final step is to configure JupyterHub to use the new GPU resource when users launch a notebook server. The following snippet shows how this is done in the JupyterHub configuration:
singleuser:
  profileList:
    - display_name: "GPU Server"
      description: "Spawns a notebook server with access to a GPU"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: "1"
Configuration on AWS is identical except for setting an additional environment variable, described in detail here.
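Once a user spawns a server from the "GPU Server" profile, a quick sanity check (a sketch, using the PyTorch that ships in the image) confirms that the notebook can actually see the GPU that Kubernetes scheduled:
# A minimal check to run in a freshly spawned notebook: verify the
# container sees the GPU requested via nvidia.com/gpu.
import torch

assert torch.cuda.is_available(), "no GPU visible to PyTorch"
print(torch.cuda.get_device_name(0))  # e.g., "Tesla P100-PCIE-16GB"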
Using Pangeo DL
With Pangeo DL set up (GPUs attached to the node, DL/ML packages installed, JupyterHub configured to access the GPUs), we can train DL models leveraging the GPUs. For one of our models, this made training a matter of minutes with a GPU instead of hours using a CPU! Here’s a screencast of an example DL model running on Pangeo DL. The model is trained to predict streamflow in a stream segment in the Delaware River Basin given inputs such as precipitation and air temperature (link to notebook).
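The linked notebook has the full details; as a rough sketch (not the notebook's exact architecture, with random placeholder data standing in for the real forcings), a sequence model for this kind of task might look like:
# A minimal sketch of an LSTM mapping meteorological drivers (e.g.,
# precipitation and air temperature) to streamflow. The data here are
# random placeholders; shapes and layer sizes are illustrative only.
import numpy as np
import tensorflow as tf

n_samples, n_timesteps, n_features = 256, 365, 2
X = np.random.rand(n_samples, n_timesteps, n_features).astype("float32")
y = np.random.rand(n_samples, 1).astype("float32")  # streamflow target

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(n_timesteps, n_features)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# On a GPU node, TensorFlow places the LSTM on the GPU automatically;
# this is where the hours-to-minutes speedup comes from.
model.fit(X, y, epochs=2, batch_size=32)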
Now you might be thinking, "Why not just use Google's Colab?" That's a fair question. The Google Colab service provides users with an interactive notebook environment and free access to GPU resources, and it is a really useful tool for many students and scientists. However, Pangeo DL has a few advantages:
- By having our own Pangeo DL deployment, we have total control over the environment: we can install whichever libraries we want and choose the size and type of nodes and GPUs.
- Because it is a Pangeo deployment, we can take advantage of all the great things about the Pangeo deployments already implemented, including using Dask on CPU clusters to scale our pre- and post-processing workflows (a sketch of this pattern follows the list) and instant access to cloud-based data catalogs.
- Pangeo DL gives users a fully customizable JupyterLab workspace, including a file browser and the ability to run a terminal and terminal programs like IPython and Vim.
- Users can use Pangeo DL in exactly the same computational environment wherever large datasets of interest are hosted (such as CMIP6 on Google Cloud us-central1, or NCAR's CESM LENS on AWS us-west-2). This flexibility helps us avoid vendor lock-in and promotes extensibility.
- Pangeo DL is agnostic to the actual cloud provider. By building on top of Kubernetes, we’ve been able to deploy Pangeo DL on both AWS and GCP with minimal custom effort. To the user, the interface to GPUs on the cloud is the same, regardless of which vendor is providing the underlying hardware.
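To illustrate the Dask point above, here is a minimal sketch of the pre-processing pattern (the Zarr store path and variable name are hypothetical):
# A minimal sketch: scale pre-processing across a CPU Dask cluster
# before handing features to a GPU training job. The bucket path and
# variable name below are hypothetical.
import xarray as xr
from dask.distributed import Client

client = Client()  # connect to (or start) a cluster of CPU workers

ds = xr.open_zarr("gs://hypothetical-bucket/forcings.zarr")  # lazy, chunked
monthly = ds["precip"].resample(time="1M").mean()  # parallel monthly means
features = monthly.compute()  # materialize in memory for model training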
Conclusion
Pangeo DL is a customizable platform that has all the advantages of our previous Pangeo deployments and additionally offers powerful DL functionality. Adding this functionality to existing Pangeo infrastructure was straightforward thanks to cloud-agnostic platforms like Kubernetes, the open-source approach of Pangeo, and the architecture and pioneering documentation of the Zero to JupyterHub project. We think this is an exciting first step in enabling reproducible scientific analyses that require high-performance infrastructure. Look for future blog posts for more on how we are using our newly implemented Pangeo DL functionality!