Pangeo and Kubernetes, Part 2: Cluster Design

Joe Hamman
Published in pangeo · May 30, 2019

In my last post, we looked into Pangeo’s cloud costs and discussed what it looks like to budget for a typical Pangeo research cluster. In this post, I’ll present a technical and opinionated design for a typical Pangeo Kubernetes cluster. I’ll focus on the design features that impact scaling, cost, and scheduling, and discuss some recent improvements to JupyterHub, BinderHub, and Dask-Kubernetes that were implemented to improve behavior in these areas.

To review, we’re interested in deploying a Kubernetes cluster with this basic node-pool configuration:

  1. Core-pool: This is where we run things like the JupyterHub and other persistent system services (web proxies, etc.). We keep this as small as possible, just big enough to run core services.
  2. Jupyter-pool(s): This is an auto-scaling node pool where we put single-user Jupyter sessions. By autoscaling, we mean that the size of the node pool (the number of virtual machines) increases or decreases dynamically based on cluster load.
  3. Dask-pool(s): This is a second auto-scaling node pool designed to run dask-kubernetes workers. The node pool is set up to use preemptible (aka spot) instances to save on cost.

Motivation

Before we get started, let me list a few of the reasons the enhanced cluster design discussed here is needed:

  • Optimize for rapid scaling — users want clusters to scale up quickly and admins want clusters to scale down quickly when load decreases. Early versions of Pangeo on Kubernetes suffered from long scale down times (see this GitHub issue for one such example). We determined there were two issues contributing to this problem. First, Kubernetes system pods were drifting onto nodes intended for Jupyter and Dask pods, preventing seamless scale down. Second, pods were often packed into nodes inefficiently, leading to low utilization.
  • Explicit node pools — We previously used a Kubernetes feature called node selectors to control where pods were scheduled. This worked well for most pods intended to be scheduled in the core and Jupyter pools but didn’t provide any way to keep system or Dask pods where they belong. Additionally, it required users of dask-kubernetes on Pangeo deployments to implement a specific dask-worker selector in their configuration. In all, this ended up being brittle and failing too often.
  • Cost control — related to the two points above is the issue of cost control. The main point here is that we want to make sure we’re always using preemptible instances for dask-workers, and we need ways to ensure this happens.

So, given these failures, we set off on a quest to improve the infrastructure around scheduling Kubernetes pods for Pangeo deployments.

Keeping pods where they belong

Because we have (at least) three node pools with varying characteristics, we want to introduce some tools for herding pods into the right places. Ultimately, we want to make sure the JupyterHub pods stay in the core-pool, the user pods stay in the jupyter-pool, and the Dask pods stay in the dask-pool. Kubernetes has two concepts that allow us to control when, where, and why pods are scheduled to specific nodes: the first is affinity and the second is taints and tolerations.

Affinity — the concept of affinity gives us the power to attract pods to specific nodes. A pod can have three different kinds of affinity: node affinity, pod affinity, and pod anti-affinity. Each of these can be either preferred or required, which is sometimes described as a soft or hard affinity.

  • If a pod has a node affinity, it will want to be scheduled on a node with a certain label.
  • If a pod has a pod affinity, it will want to be scheduled on a node that already hosts a pod matching a provided label.
  • If a pod has a pod anti-affinity, it will want to be scheduled on a node that does not already host a pod matching a provided label (see the sketch after this list).
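
To make pod anti-affinity a bit more concrete, here is a minimal sketch of a preferred (soft) pod anti-affinity that spreads replicas of a hypothetical app: my-app pod across nodes; the label and topology key are made up for illustration and are not part of the Pangeo configuration.

# pod-anti-affinity-sketch (hypothetical app label)
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - my-app
            topologyKey: kubernetes.io/hostname

In the Pangeo clusters described here, we only rely on node affinity, so the rest of this section focuses on node labels.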

As we discussed above, node affinities can be based on a variety of markers, including node labels. In our case, we want all single-user Jupyter sessions to be attracted to the jupyter-pool, which we can accomplish by adding the hub.jupyter.org/node-purpose: user label to the jupyter-pool nodes. We then rely on the built-in node affinity settings in Zero-to-JupyterHub to do the rest.

# Jupyter-notebook-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: hub.jupyter.org/node-purpose
                operator: In
                values:
                  - user

# Jupyter-pool-nodes
metadata:
  labels:
    hub.jupyter.org/node-purpose: user

In the deployment example below, we’ll repeat this approach for the other node pools. More details on the affinity concepts can be found in the Kubernetes documentation.

Taints and Tolerations — the concepts of taints and tolerations give us the power to repel pods that aren’t intended for specific nodes. These controls are particularly useful in a few common scenarios that we run into:

  • We want to keep Kubernetes system pods in the core-pool. These pods are often long-running and can get in the way of autoscaling. By making sure all of our node pools (except the core-pool) have taints on them, we can effectively constrain system pods to the core-pool.
  • We want to limit certain pools to specific types of pods. For example, in our clusters, we want to only run dask-workers in node pools with the more cost-effective preemptible instances. By adding a taint to the dask-pool, we can repel all pods that are not dask-workers.

We will add taints to both the Jupyter and Dask pools. In the case of dask-workers, we can add a single taint (k8s.dask.org_dedicated=worker:NoSchedule) that dask-worker pods will tolerate by default. Again, the Kubernetes documentation covers this subject in detail.

# dask-kubernetes-pods
spec:
  tolerations:
    - effect: NoSchedule
      key: k8s.dask.org_dedicated
      operator: Equal
      value: worker

# dask-pool-nodes
spec:
  taints:
    - effect: NoSchedule
      key: k8s.dask.org_dedicated
      value: worker
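
The taint only repels pods that don’t belong in the dask-pool; to also attract dask-workers to those nodes, a preferred node affinity on the k8s.dask.org/node-purpose=worker label (applied in the deployment section below) can be added to the worker pod template. The snippet below is an illustrative sketch of that idea, not necessarily what dask-kubernetes applies by default.

# dask-worker-pod (optional node affinity, illustrative sketch)
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: k8s.dask.org/node-purpose
                operator: In
                values:
                  - worker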

Scheduler

The Zero-to-JupyterHub project recently (v0.8) added a series of optimizations that give administrators options for managing the scheduling of pods controlled by JupyterHub. Two of these features are particularly impactful for Pangeo workloads. First, the userScheduler option packs user pods onto the most utilized nodes when new sessions are spawned, helping the cluster scale down more efficiently. Second, the nodeAffinity option allows us to require that both user and core pods be scheduled on nodes matching the node-purpose label.

# JupyterHub-values-yaml
jupyterhub:
  scheduling:
    userScheduler:
      enabled: true
    userPods:
      nodeAffinity:
        matchNodePurpose: require
    corePods:
      nodeAffinity:
        matchNodePurpose: require
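
To roll these values out, a Helm upgrade along the following lines works; the release name, chart, and file name here are placeholders and will differ for your deployment.

# apply-the-values (release, chart, and file names are placeholders)
helm upgrade pangeo pangeo/pangeo \
  --namespace=pangeo \
  -f config.yaml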

Deploying our node-pools

The Pangeo documentation provides a step-by-step guide for setting up a Kubernetes cluster. Here, I’m simply going to extend that tutorial by adding the labels and taints described above. The rest of the setup of Kubernetes and Helm remains the same. I’ll provide an example of how to do this using GCP’s SDK but the basic pattern should be easily replicable on any Kubernetes deployment. Eventually, we’ll push most of this to Pangeo’s setup guide as well.

# core-pool
core_machine_type="n1-standard-2"
core_labels="hub.jupyter.org/node-purpose=core"
gcloud container node-pools create core-pool \
  --cluster=${cluster_name} \
  --machine-type=${core_machine_type} \
  --zone=${zone} \
  --num-nodes=2 \
  --node-labels ${core_labels}

# jupyter-pool
jupyter_machine_type="n1-highmem-16"
jupyter_taints="hub.jupyter.org_dedicated=user:NoSchedule"
jupyter_labels="hub.jupyter.org/node-purpose=user"
gcloud container node-pools create jupyter-pool \
  --cluster=${cluster_name} \
  --machine-type=${jupyter_machine_type} \
  --disk-type=pd-ssd \
  --zone=${zone} \
  --num-nodes=0 \
  --enable-autoscaling --min-nodes=0 --max-nodes=10 \
  --node-taints ${jupyter_taints} \
  --node-labels ${jupyter_labels}

# dask-pool
dask_machine_type="n1-highmem-4"
dask_taints="k8s.dask.org_dedicated=worker:NoSchedule"
dask_labels="k8s.dask.org/node-purpose=worker"
gcloud container node-pools create dask-pool \
  --cluster=${cluster_name} \
  --preemptible \
  --machine-type=${dask_machine_type} \
  --disk-type=pd-ssd \
  --zone=${zone} \
  --num-nodes=0 \
  --enable-autoscaling --min-nodes=0 --max-nodes=10 \
  --node-taints ${dask_taints} \
  --node-labels ${dask_labels}
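
After creating the pools, it’s worth verifying that the labels and taints ended up where we expect. Something like the following works (exact output formats vary by version):

# verify-labels-and-taints
gcloud container node-pools list --cluster=${cluster_name} --zone=${zone}
kubectl get nodes -L hub.jupyter.org/node-purpose -L k8s.dask.org/node-purpose
kubectl describe nodes | grep -A 2 Taints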

What had to change to make this all work

Apart from the JupyterHub user scheduler that came out in v0.8 of Zero-to-JupyterHub, not much needed to change. We did make a few small, backward-compatible changes to Dask-Kubernetes (see here) in its version 0.8, but other than that, everything was already possible. There were also some similar fixes to BinderHub (see here) to help control the scheduling of build pods on binder.pangeo.io.

Wrapping up

Now we get to sit back and watch our clusters behave. We can already see higher use of preemptible node pools on our clusters:

Time series of costs on our main Kubernetes cluster. We transitioned to the new cluster design around May 6th, at which point the usage of preemptible instances (red and green) increased.

Going forward, there are some interesting experiments we’d like to mix into these concepts. Two that come immediately to mind are:

  • using GKE’s node-auto-provisioning for automatic management of multiple node pools.
  • using the userScheduler to manage node assignment for dask-workers.

Thanks to Erik Sundell, Yuvi Panda, Tim Head, and Jacob Tomlinson for their help in the various development efforts that contributed to this work.
