Deploying JupyterHub-Ready Infrastructure with Terraform on AWS

Sebastian Alvis
Apr 30, 2020

An informative and instructive guide on how to deploy JupyterHub-ready infrastructure on AWS in 10 minutes.

Introduction

The Pangeo project’s mission is to enable open, reproducible, and scalable science. To this end, we have stood up a number of JupyterHubs on AWS, Azure, and Google Cloud to prototype distributed cloud computing for various institutions. We are adamant about utilizing technologies that are not locked into a single Cloud provider. While this blog post focuses on AWS, Terraform has support for all major Cloud providers, and therefore what is described here is also relevant for deploying equivalent infrastructure on GCP and Azure.

If you’ve been around Pangeo for a while, logged onto a JupyterHub, and run some easy distributed computations, the whole setup seems rather appealing. Unfortunately, when trying to deploy a Pangeo-style JupyterHub, many will find that there are a lot of configuration details that they don’t know how to work with. It can be easy to forget how much work goes into setting up a JupyterHub system.

Until now, the deployment of our JupyterHubs on AWS has required bespoke configuration of Kubernetes clusters and associated incoming and outgoing networking. Keeping track of all of these infrastructure configurations can be time-consuming and difficult to document. Moreover, what if other research institutions wish to replicate what we have done? There has not been an easy mechanism for us to transfer this knowledge or to enable rapid JupyterHub deployments tailored to meet specific community needs.

In comes Terraform. As an Infrastructure-as-Code language, its whole purpose is to take an infrastructure’s configuration and turn it into code, which is self-documenting, readable, and easy to send to other institutions. We can even keep different configurations for specific needs, like long-lived JupyterHubs used by a research community or temporary deployments for weeklong scientific workshops.

Note that we won’t be talking about JupyterHub deployment options. The JupyterHub really just sits on the cluster and tells the cluster how to run Jupyter notebooks and its supporting software. The infrastructure described in this post is ready for JupyterHub but doesn’t include it. Thus, I’m calling it “JupyterHub-ready infrastructure.”

This post will include a bit of background about Terraform and how I’ve used it on the Pangeo project. If you’re only interested in how to deploy the infrastructure, you can go to the Terraform deployment’s folder, which has all of the deployment instructions.

About Terraform

Terraform is an Infrastructure-as-Code language. Its job is to allow you to deploy cloud resources by writing JSON-style blocks. You have a block for each resource you want to use. For example, if you just wanted to spin up an AWS EC2 instance, you would specify a block like this:
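A minimal sketch of such a block follows. The AMI ID, instance type, and the names example and example-instance are placeholders chosen for illustration, not values from the Pangeo deployment, and the sketch assumes AWS credentials and a default region are already configured in your environment.

```hcl
# A single EC2 instance, declared as one resource block.
resource "aws_instance" "example" {
  # Placeholder AMI ID: look up a real image ID for your region before applying.
  ami           = "ami-0123456789abcdef0"
  instance_type = "t3.micro"

  tags = {
    Name = "example-instance"
  }
}
```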

A single command, terraform apply, would create the EC2 instance in your account. When you are done using it, running terraform destroy would remove the infrastructure. This ability to create, track, and remove costly cloud infrastructure is important to get the most from your cloud account.

In the Pangeo project, Terraform streamlines our work by replacing tools such as bash, AWS CLI (the AWS command-line interface), and eksctl (the EKS command-line tool). There are numerous benefits to making this switch, some of function and some of form.

Terraform has simplified our deployments in terms of technologies used.

Terraform has good spacing conventions for readability, which make its configuration nicer to look at than unregulated bash scripts with lines upon lines of AWS CLI commands. However, the greatest advantage of Terraform over bash scripts is Terraform’s idempotency, which comes from its ability to track what it has already created. When you run a Terraform command, it compares what you have already created with what you want to create, finds the difference, and deploys only that. API commands from bash scripts will just try to create the same resources in exactly the same way, erroring out when a resource already exists. Running terraform apply many times in a row causes no problems, because Terraform will just tell you everything is already created. This also makes incremental changes much easier.

Even though Terraform configuration has decent readability, it is more complicated to look at than eksctl configuration. However, Terraform has huge advantages over eksctl in that it can deploy a much wider variety of AWS resources and is provider-agnostic. While eksctl was useful for getting cluster configurations down as code, its limitations have led us to adopt Terraform.

Some nice features in Terraform are that you can reference other blocks’ values programmatically, take in external information as data blocks, output relevant information from your resources for easy access, and more. Here is an example of blocks referencing information from other blocks. The value of module.eks.cluster_id is passed into the aws_eks_cluster data source under the input name.
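A short sketch of that reference is shown here. It assumes a module block named eks is declared elsewhere in the configuration and exposes a cluster_id output; the data source label cluster is an illustrative choice.

```hcl
# Look up the EKS cluster that the "eks" module created.
# module.eks.cluster_id is resolved by Terraform before this data source is read,
# so the ordering between the two blocks is handled automatically.
data "aws_eks_cluster" "cluster" {
  name = module.eks.cluster_id
}
```

Other blocks can then read attributes such as data.aws_eks_cluster.cluster.endpoint without hard-coding any cluster details.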

Traditionally, you would keep all directly related data, resource, and output blocks in the same physical file, and have a file per logical set of blocks you need to generate. When you run a Terraform command, Terraform will look in the current directory (but won’t look recursively) and put everything together to make its execution plan. You can even collect all of these resources together and make them a reusable module, for your own or public use.

Modules greatly improve Terraform’s readability and reusability. To keep the configuration shorter, I use two modules in this infrastructure; both are linked at the end of the post. They are maintained by the Terraform team, so I trust their quality, and using them lets me focus on some of the higher-level details rather than get stuck in the weeds networking all the machines together.

The two modules group together the deployments for the VPC and the EKS cluster, respectively. I tell Terraform where to find these modules, so that when we deploy the infrastructure it can go and gather the resources in the modules. Those resources are then added to the list of resources present in my local files for deployment.

The Deployment Configuration

About the Infrastructure

A diagram of the infrastructure we will deploy.

The aws directory contains all the files for the infrastructure’s deployment. There are various inputs you can configure, such as the names of the biggest infrastructure pieces, where you want to deploy them, and who will be deploying them. All of the options and their default values are specified in variables.tf and are referenced in other files via lines such as var.cluster-name. You will specify other values by filling out your-cluster.tfvars in the deployment instructions.
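As a rough sketch of how these two files relate, the fragments below declare a few variables with defaults and then override them. Only cluster-name appears in this post; the region and profile variables, and all of the values, are assumptions made for illustration.

```hcl
# variables.tf (sketch): declare inputs and their defaults.
variable "cluster-name" {
  description = "Name given to the EKS cluster"
  type        = string
  default     = "example-cluster"
}

variable "region" {
  description = "AWS region to deploy into (hypothetical variable)"
  type        = string
  default     = "us-west-2"
}

variable "profile" {
  description = "AWS CLI profile used for the deployment (hypothetical variable)"
  type        = string
  default     = "default"
}
```

```hcl
# your-cluster.tfvars (sketch): override only the values you want to change.
cluster-name = "my-hub-cluster"
region       = "us-east-1"
```

Passing the file with terraform apply -var-file=your-cluster.tfvars applies the overrides; any variable not listed keeps its default.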

The bulk of the content is in main.tf. It specifies a fixed version of Terraform and some data sources. There are also provider blocks, which configure the plugins that Terraform downloads to interact with other software tools. You will also see blocks for both modules mentioned earlier.
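The opening of such a file might look like the sketch below. The version constraint, region, and choice of data sources are assumptions for illustration and may differ from what main.tf in the repository actually contains.

```hcl
# Pin the Terraform version so every collaborator plans with compatible tooling.
terraform {
  required_version = "~> 0.12.0" # assumed constraint
}

# Configure the AWS provider plugin.
provider "aws" {
  region = "us-west-2" # placeholder region
}

# Data sources read existing information rather than creating anything.
data "aws_caller_identity" "current" {}

data "aws_availability_zones" "available" {}
```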

The VPC module is configured to use both private and public subnets and has some settings to enable communication from private subnets to the internet. It is important to keep as much infrastructure as possible on private subnets in order to limit unauthorized access from the internet. There are also tags that help configure the interaction between the VPC and Kubernetes.
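A sketch of a VPC module block with these properties is shown below. It assumes the community terraform-aws-modules/vpc/aws module; the names, CIDR ranges, and availability zones are placeholders, and in practice you would also pin a module version.

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws" # assumed module source

  name = "example-vpc"
  cidr = "172.16.0.0/16"
  azs  = ["us-west-2a", "us-west-2b", "us-west-2c"]

  # Nodes live on the private subnets; public subnets hold load balancers and the NAT gateway.
  private_subnets = ["172.16.1.0/24", "172.16.2.0/24", "172.16.3.0/24"]
  public_subnets  = ["172.16.4.0/24", "172.16.5.0/24", "172.16.6.0/24"]

  # A NAT gateway gives private subnets outbound internet access.
  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  # Tags that let Kubernetes discover which subnets it may use.
  tags = {
    "kubernetes.io/cluster/example-cluster" = "shared"
  }

  public_subnet_tags = {
    "kubernetes.io/cluster/example-cluster" = "shared"
    "kubernetes.io/role/elb"                = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/example-cluster" = "shared"
    "kubernetes.io/role/internal-elb"       = "1"
  }
}
```

Using a single NAT gateway keeps costs down for an experiment; one gateway per availability zone is the more resilient choice for a long-lived hub.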

The EKS module tells us that the cluster will sit on the private subnets and have three sets of nodes: core, user-spot, and worker-spot. Core nodes are where the JupyterHub would be deployed. We are fine using just one kind of machine for them, so they are listed as a plain worker group. The user-spot nodes are the machines where users would interact with their notebooks. If a user wanted to deploy Dask workers to speed up their computations, the JupyterHub would put the Dask workers on worker-spot nodes.

User-spot and worker-spot nodes are configured as worker group launch templates. This is because they use two features: Spot Instances and autoscaling to zero. With Spot Instances, each launched node is the cheapest available instance among a few instance types we define. This saves money in the long run, though AWS may reclaim a node if there is considerable demand in your region (this generally happens less than 5% of the time on AWS). Autoscaling to zero is also a big cost-saver: unneeded worker and user nodes are terminated when they are not in use, so you pay for far less machine time. These features require several tags, taints, and labels, seen in their worker group configurations.
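The node groups described above can be sketched roughly as follows. This assumes an older major release of the community terraform-aws-modules/eks/aws module (the worker_groups and worker_groups_launch_template inputs were removed in later major versions) and a module block named vpc defined elsewhere. Instance types, sizes, and the version constraint are placeholders; the labels and taints follow the Zero-to-JupyterHub and Dask conventions rather than being copied from the repository.

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws" # assumed module source
  version = "~> 11.0"                       # assumed; needs a release with worker_groups inputs

  cluster_name = "example-cluster"
  vpc_id       = module.vpc.vpc_id          # assumes a module "vpc" block exists
  subnets      = module.vpc.private_subnets # cluster sits on the private subnets

  # Core nodes: one on-demand machine type, always running.
  worker_groups = [
    {
      name                 = "core"
      instance_type        = "t3.xlarge"
      asg_min_size         = 1
      asg_max_size         = 1
      asg_desired_capacity = 1
      kubelet_extra_args   = "--node-labels=hub.jupyter.org/node-purpose=core"
    }
  ]

  # Spot groups: several candidate instance types, able to scale down to zero.
  worker_groups_launch_template = [
    {
      name                    = "user-spot"
      override_instance_types = ["m5.2xlarge", "m4.2xlarge"]
      spot_instance_pools     = 2
      asg_min_size            = 0
      asg_max_size            = 10
      asg_desired_capacity    = 0
      kubelet_extra_args      = "--node-labels=hub.jupyter.org/node-purpose=user --register-with-taints=hub.jupyter.org/dedicated=user:NoSchedule"

      # With zero nodes running, the autoscaler reads labels and taints from these tags.
      tags = [
        { key = "k8s.io/cluster-autoscaler/node-template/label/hub.jupyter.org/node-purpose", value = "user", propagate_at_launch = "false" },
        { key = "k8s.io/cluster-autoscaler/node-template/taint/hub.jupyter.org/dedicated", value = "user:NoSchedule", propagate_at_launch = "false" }
      ]
    },
    {
      name                    = "worker-spot"
      override_instance_types = ["r5.2xlarge", "r4.2xlarge"]
      spot_instance_pools     = 2
      asg_min_size            = 0
      asg_max_size            = 10
      asg_desired_capacity    = 0
      kubelet_extra_args      = "--node-labels=k8s.dask.org/node-purpose=worker --register-with-taints=k8s.dask.org/dedicated=worker:NoSchedule"

      tags = [
        { key = "k8s.io/cluster-autoscaler/node-template/label/k8s.dask.org/node-purpose", value = "worker", propagate_at_launch = "false" },
        { key = "k8s.io/cluster-autoscaler/node-template/taint/k8s.dask.org/dedicated", value = "worker:NoSchedule", propagate_at_launch = "false" }
      ]
    }
  ]
}
```

The taints keep ordinary pods off the spot nodes, so a node group only grows when a user pod or Dask worker that tolerates the taint is waiting to be scheduled.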

However, we need a specific deployment to enable the cluster’s autoscaling. In autoscaler.tf, we include a module to create and link a Kubernetes service account with some IAM resources for the autoscaler and a Helm release with the autoscaler’s software. This piece will enable the cluster to bring in new nodes when users log in and release the nodes when the users leave.

As an example of persistent user storage, we provide the contents of efs.tf. The AWS Elastic File System can attach storage for a user in any availability zone, which is relevant because we have specified three implicitly with our choice of three private subnets. The EFS is its own resource and has a set of mount targets and security groups to manage its interaction with the VPC. We create a namespace for the efs-provisioner and then deploy the provisioner with Helm. The provisioner will automatically allocate a subfolder of the EFS drive for each new user and reattach it every time that user logs in.
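The AWS side of that setup might be sketched as below. It assumes a module block named vpc defined elsewhere that exposes vpc_id, vpc_cidr_block, and private_subnets; the resource names are hypothetical, and the namespace and Helm release for the provisioner are left out.

```hcl
# The file system itself.
resource "aws_efs_file_system" "home_dirs" {
  tags = {
    Name = "example-home-dirs"
  }
}

# Allow NFS traffic (port 2049) from inside the VPC only.
resource "aws_security_group" "home_dirs_sg" {
  name   = "example-home-dirs-sg"
  vpc_id = module.vpc.vpc_id

  ingress {
    from_port   = 2049
    to_port     = 2049
    protocol    = "tcp"
    cidr_blocks = [module.vpc.vpc_cidr_block]
  }
}

# One mount target per private subnet, so nodes in any availability zone can attach.
resource "aws_efs_mount_target" "home_dirs_targets" {
  count           = 3 # matches the three private subnets
  file_system_id  = aws_efs_file_system.home_dirs.id
  subnet_id       = module.vpc.private_subnets[count.index]
  security_groups = [aws_security_group.home_dirs_sg.id]
}
```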

Outputs were mentioned earlier and are present here in outputs.tf, which is a standard file for many Terraform deployments (it is present in both modules!). You can take a peek in the file if you like, but you will see its results every time you successfully run a terraform apply command.
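For a sense of what such a file holds, here is a small sketch. The output names are illustrative, and it assumes a module block named eks from a release of the community EKS module that exposes cluster_id and cluster_endpoint outputs.

```hcl
# Values printed after every successful apply.
output "cluster_name" {
  description = "Name of the EKS cluster"
  value       = module.eks.cluster_id
}

output "cluster_endpoint" {
  description = "URL of the Kubernetes API server"
  value       = module.eks.cluster_endpoint
}
```

Running terraform output later prints the same values again without changing any infrastructure.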

Deploy the Infrastructure

Full deployment instructions are located in the README file for the associated GitHub repository. Go there now if you want to deploy the infrastructure! Creating these resources will of course cost your AWS account money. This cluster configuration has cost me under $5 per day running the cluster, VPC, and core node. Fortunately, after you’ve experimented with this deployment, just run terraform destroy and you are back to zero cost!

Conclusion

Hopefully, you now have a decent grasp on why Terraform is a useful tool and how you can use it to manage cloud deployments.

If you want to launch infrastructure with a different cloud provider, know that it should be possible! As mentioned earlier, the main barrier is to find and learn how to interact with modules for that cloud provider. Good places to look for GCP and Azure are the Google Cloud and HashiCorp GitHub organizations and the Terraform repositories under Microsoft Azure’s GitHub organization.

If you want to investigate putting a JupyterHub on this infrastructure, I would recommend looking at the following GitHub repositories: hubploy, zero-to-jupyterhub-k8s, and pangeo-cloud-federation. Just remember that if you put anything onto the infrastructure, you should remove it before you run terraform destroy ….

Additionally, I would recommend reading the Zero-to-JupyterHub-K8s guide. I spent some time reading that, picked up some Terraform, and mashed the two together to test full JupyterHub deployments. Zero-to-JupyterHub makes no assumptions about your use-case but points you in the direction of possible upgrades and customizations.

This guide, on the other hand, is very opinionated. There are several choices you could make differently than I did and they are pretty easy to change if you want to do so. If you want to develop a general use-case or improve open-source JupyterHub-ready deployments, consider contributing to the terraform-deploy repository.

Acknowledgements

The work and knowledge in this post were made possible by many people and institutions, including Scott Henderson (UW), Yuvi Panda (Berkeley), Anthony Arendt (UW), Joe Hamman (NCAR), the Space Telescope Science Institute, the University of Washington’s eScience Institute, and the Pangeo project. This project was supported in part by NASA-ACCESS grant #80NSSC18M0156 and AWS Cloud Credits through https://aws.amazon.com/earth/research-credits.

GitHub Links

pangeo

A community platform for big data geoscience