Pangeo and Kubernetes, Part 1: Understanding costs

Joe Hamman
Published in pangeo
May 14, 2019

When we talk to people about Pangeo on the Cloud, we’re often asked about costs: How much does it cost? Who pays for it? What’s the long-term plan for Pangeo on the Cloud?

Before diving into the details, I think it is important that we revisit why we’re so excited about Pangeo on the Cloud. I’ll argue the most compelling aspects of the cloud fall into four categories.

  • Scale: As we move fully into the Big Data era, the cloud offers a compelling combination of large scale storage and compute. It is also accessible to everyone so we can scale access, in addition to computational workloads.
  • Community hub: We’ve found that something as simple as a JupyterHub on the Cloud is a powerful tool for bringing together a community of researchers and developers.
  • Reproducibility: It’s well known that scientific workflows are often difficult to reproduce, especially in the absence of shared infrastructure. The cloud offers the opportunity to share infrastructure, an important first step in the reproducibility quest.
  • Extensibility: The ability to share your work and let others pick up where you left off is central to the scientific enterprise. By sharing datasets and infrastructure on the cloud, we make it much easier to extend computational scientific research.

What is Pangeo on the Cloud?

In almost all cases, Pangeo on the Cloud is a JupyterHub deployment running on Kubernetes. Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. Managed Kubernetes services are offered by all of the major cloud providers (e.g. GKE, EKS, AKS). We use the Zero-to-JupyterHub with Kubernetes project’s configuration (aka Helm Chart) to deploy a custom JupyterHub with permissions that allow users to scale their computations across a dynamic cluster. As I’ll write about more in my next blog post, a typical Pangeo Kubernetes cluster has three classes of node pools (groups of virtual machines in a Kubernetes cluster).

  1. Core-pool: This is where we run things like the JupyterHub and other persistent system services (web proxies, etc.). We keep this as small as possible, just big enough to run core services.
  2. Jupyter-pool: This is an auto-scaling node pool where we put single-user Jupyter sessions. By autoscaling, we mean that the size of the node pool (number of virtual machines) increases/decreases dynamically based on cluster load.
  3. Dask-pool: This is a second auto-scaling node pool designed to run dask-kubernetes workers. The node pool is set up to use preemptible (aka spot) instances to save on cost (a sketch of how a user might target this pool is shown just below).
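To give a flavor of how the dask-pool gets used, here is a minimal sketch of launching an adaptive Dask cluster from a notebook with dask-kubernetes. The worker image, resource requests, and node-selector label are illustrative assumptions, not the exact Pangeo configuration.

```python
# A minimal sketch, assuming a node label ("node-purpose": "dask-worker")
# and a worker image; neither is the exact Pangeo configuration.
from dask.distributed import Client
from dask_kubernetes import KubeCluster

worker_pod = {
    "kind": "Pod",
    "metadata": {},
    "spec": {
        # Pin workers to the preemptible dask-pool (label name is an assumption).
        "nodeSelector": {"node-purpose": "dask-worker"},
        "restartPolicy": "Never",
        "containers": [
            {
                "name": "dask-worker",
                "image": "pangeo/pangeo-notebook:latest",  # assumed image
                "args": [
                    "dask-worker",
                    "--nthreads", "2",
                    "--memory-limit", "7GB",
                    "--death-timeout", "60",
                ],
                "resources": {
                    "requests": {"cpu": "2", "memory": "7G"},
                    "limits": {"cpu": "2", "memory": "7G"},
                },
            }
        ],
    },
}

cluster = KubeCluster.from_dict(worker_pod)
cluster.adapt(minimum=0, maximum=40)  # scale workers up and down with the load
client = Client(cluster)
```

When the adaptive cluster asks for more workers than fit on the running nodes, the Kubernetes cluster autoscaler adds preemptible VMs to the dask-pool; when the work finishes, both the workers and the VMs go away.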

The Kubernetes cluster is just the underlying infrastructure for Pangeo on the cloud. The rest of the system relies on a host of open source projects. Matthew Rocklin wrote a great blog post last year describing how all the pieces come together. Because we’re mostly interested here in how the costs break down, I’ll stop there and reference that prior blog post for further details on the software ecosystem surrounding Pangeo on the Cloud.

Pangeo on Google Cloud

As part of the NSF EarthCube award that helped kick-start the Pangeo Project, we were awarded about $100,000 of compute credits over three years with Google Cloud Platform. At the time, this seemed like a truly exorbitant amount, but as I’ll outline below, we’ve been successful at finding ways to spend the credits. We are currently in our final year, and we’re starting to plan for how to support the Pangeo Project’s cloud efforts going forward. Here’s a figure breaking down the costs in the first two years of the Pangeo Project.

Daily cost for all services used in the Pangeo Google Cloud Platform Account.

What have we been spending our credits on? We can break our spending down into four line items:

  • pangeo.pydata.org: We started with a single JupyterHub running on one large Kubernetes cluster, called pangeo.pydata.org. It was live for about a year (December 2017 through January 2019), and when we shut it down, pangeo.pydata.org had about 1200 unique user accounts, most of which were one-time visitors who heard about Pangeo at a conference or workshop.
  • binder.pangeo.io: In August 2018, we launched binder.pangeo.io (see this blog post). It was designed to be a more natural place for one-time visitors to try out the Pangeo system.
  • Pangeo-Cloud-Federation: In September 2018, we started deploying a new class of Pangeo clusters organized by research area. We’ve dubbed this system the Pangeo Cloud Federation (see this GitHub repository). As part of this effort, we’ve deployed 8 more JupyterHubs (6 on GCP, 2 on AWS). More on these individual hubs in another blog post.
  • Cloud storage: We’re storing about 56 TB of data in Google Cloud Storage.

Storage costs are relatively easy to explain ($35/day for 56 TB), so I’ll mostly focus on what it costs to run a single Pangeo JupyterHub. If you are interested in estimating the costs for storage and/or compute, I recommend spending a few minutes with a cloud cost estimator (e.g. GCP, AWS, Azure).
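As a sanity check on that storage figure, here is the back-of-the-envelope arithmetic. The per-GB rate is an assumption roughly matching regional Google Cloud Storage pricing at the time; actual rates vary by storage class and region.

```python
# Rough check on the storage line item (rate is an assumed regional GCS price).
tb_stored = 56
gb_per_tb = 1000
rate_per_gb_month = 0.02   # USD/GB/month, assumed
days_per_month = 30

monthly = tb_stored * gb_per_tb * rate_per_gb_month
print(f"~${monthly:,.0f}/month, ~${monthly / days_per_month:.0f}/day")
# ~$1,120/month, ~$37/day, in the same ballpark as the ~$35/day we observe
```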

Daily cost broken down by GKE SKU for the ocean-pangeo-io kubernetes cluster.

Above, I’m showing an example of the daily costs incurred for a single cluster (ocean.pangeo.io). This figure nicely illustrates the “bursty” nature of these scientific workloads alongside some relatively small persistent costs. From here, we’ll outline a hypothetical monthly budget for a single Pangeo cluster:

  • Core-pool: For our persistent services, we’ll fit them all onto a single n1-standard-2 (2 vCPUs, 7.5 GB memory) VM. This costs about $60/month.
  • Jupyter-pool: For our single-user Jupyter pods, we’ll choose the n1-standard-2 (2 vCPUs, 7.5 GB memory) VM and assume 20 hours of work per week. This comes out to $21/user/month.
  • Dask-pool: This pool is more difficult to budget for in the hypothetical. For illustration purposes, let’s assume a single user will scale up their Dask cluster for one hour per day, requesting 100 CPUs and 375 GB RAM. This would pencil out to about $100/user/month at standard pricing, but since we’re running this pool on preemptible instances, we expect the cost to be cut by 70%, so $30/user/month.

So if I have a research group of 8, I can expect to spend about $500/month on compute costs. When it comes to budgeting for compute-heavy workloads, it’s important to remember that whether you run your processing on 10 CPUs for 10 hours or 100 CPUs for 1 hour, the cost will be the same. The short sketch below works through both of these points.
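Here is that hypothetical budget written out; the per-pool figures are the estimates from the list above, and only the group size is a knob.

```python
# The hypothetical budget above, written out (figures taken from the text).
core_pool = 60          # USD/month, one n1-standard-2 for persistent services
jupyter_per_user = 21   # USD/user/month, ~20 hrs/week on an n1-standard-2
dask_per_user = 30      # USD/user/month, 1 hr/day of 100 preemptible CPUs

group_size = 8
monthly = core_pool + group_size * (jupyter_per_user + dask_per_user)
print(f"~${monthly}/month for a group of {group_size}")  # ~$468/month

# Compute cost scales with CPU-hours, not wall-clock time:
# 10 CPUs x 10 hours == 100 CPUs x 1 hour == 100 CPU-hours.
assert 10 * 10 == 100 * 1
```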

Lessons learned

Depending on how you slice it, we’re in our second or third iteration of the Pangeo Cloud concept. We’ve learned a lot during this time and here are a few of the highlights:

  • Use shared user storage: By default, each single-user Jupyter pod is assigned a “persistent disk”. As the name indicates, this disk is reserved for a single user and persists even after a user session ends. This is nice because when the user comes back, they can pick up where they left off. The downside of the persistent disk is that you pay for the full disk reservation, whether you use it or not. For this reason, we’ve switched to an alternative: shared NFS storage. This is a little more expensive per GB than persistent disks, but you only pay for what you use. More details on options in this space here.
  • Design to scale rapidly: In the early days, we would often run into problems getting our cluster to scale up and down efficiently. Sometimes this would result in the cluster remaining in a scaled-up state for hours, costing us money even though we weren’t using it. Recent work in the JupyterHub and Dask-kubernetes projects, along with improvements in Kubernetes itself, has lessened this pain. I’ll unpack more of this concept in the next blog post.
Daily cost broken down by GKE SKU for the dev-pangeo-io kubernetes cluster. Note the increased use of preemptible instances starting on May 6th.
  • Use telemetry: Cloud providers sell lots of services, and it can be difficult at times to track down where specific costs are coming from. We’ve been exporting our detailed billing logs to Google BigQuery, and this has proved quite useful in analyzing where costs come from (see the sketch below). Google also has a service called GKE usage metering that allows us to determine much more accurately where costs are coming from at the Kubernetes object level. We’re working on publishing all of these billing logs so everyone can take a look.
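For the curious, here is a sketch of the kind of question the BigQuery billing export lets us answer: daily cost per service and SKU. The table name is a placeholder, and the column names assume the standard GCP billing export schema, not necessarily our exact setup.

```python
# A sketch, not our exact query: daily cost per service/SKU from the billing
# export. The table name is a placeholder; columns follow the standard export.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT
      DATE(usage_start_time) AS day,
      service.description AS service_name,
      sku.description AS sku_name,
      SUM(cost) AS daily_cost
    FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`
    WHERE DATE(usage_start_time) >= '2019-04-01'
    GROUP BY day, service_name, sku_name
    ORDER BY day, daily_cost DESC
"""
for row in client.query(query).result():
    print(row.day, row.service_name, row.sku_name, round(row.daily_cost, 2))
```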

Conclusions and next stages

When we started this part of the Pangeo Project, we had relatively little experience working on the Cloud. In fact, our original budget spreadsheet had more funds allocated for storage than for compute, a pattern we quickly departed from. Hopefully I’ve helped demystify some of the issues around Pangeo Cloud costs by sharing our approach to budgeting for Kubernetes clusters.

We have about one year of compute credits with GCP remaining on our initial grant. We’re at a stage now where we have a much better handle on how the various pieces fit together, and we’re starting to look at various options for taking Pangeo’s cloud infrastructure to the next level. In the short term, this probably means submitting a follow-on grant proposal to Google (full disclosure: Amazon and Microsoft have also awarded Pangeo project members compute credits). Longer term, it’s probably up to research funding agencies like NASA and NSF to broker the dissemination of funding for cloud computing. This has the obvious advantage of opening up opportunities for bulk purchasing agreements at a scale larger than any single research team could manage. Finally, the bulk of the day-to-day maintenance of these systems is currently handled by scientists like myself. For this to scale further, we think there is likely a place for a central cloud DevOps team to help develop and manage these systems for researchers. Where such a team lives is yet to be determined.

Part 2 of this series will look into some recent improvements in how we build our Kubernetes clusters and some opinionated recommendations to improve both scaling and cost efficiency.
