Our new Pangeo architecture
Understanding Pangeo isn't essential to this post, but if you are new to Pangeo and want to learn more, a good place to start is the post "What's so cool about Pangeo" or pangeo.io.
By utilising Azure NetApp Files, a service that provides Network File System (NFS) volumes on demand, we were able to strip the Conda environments from the container image at the heart of our Pangeo service. Advantages of this include a ten-fold reduction in the container size; improved start-up and scale-up times; better environment management; and custom environments for distributed jobs.
The status quo
Our outgoing stack is a fairly typical Pangeo setup (though setups do vary a lot). In this setup you have a "beefy" container that has just about everything you need: all the Python modules you need for your science; the Jupyter Lab and server install; the OS; and the text editors and other tools you might find useful. When a user starts their session this container is spun up for them and their home directory is mounted in. This home directory is backed by some persistent storage (EBS in our case). At this point the user has a Jupyter Lab environment they can work in and somewhere persistent to keep their work. When a user wants to distribute a workload we encourage using dask-kubernetes. For each distributed worker, another copy of this container is spun up, but this time without the home directory mounted in (EBS is only mountable to one container at a time).
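For a flavour of what that distribution step looks like from a user's notebook, here is a minimal dask-kubernetes sketch. It is illustrative only: the worker-spec.yml file name and the scaling limits are placeholders, not our actual configuration.

```python
# Minimal sketch: spinning up distributed workers from a notebook with dask-kubernetes.
# In the outgoing architecture each worker pod runs the same "beefy" image as the
# notebook container, but without the user's EBS-backed home directory mounted in.
from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml("worker-spec.yml")  # placeholder pod spec for the worker pods
cluster.adapt(minimum=0, maximum=20)                # let the cluster grow and shrink with the work
client = Client(cluster)                            # Dask computations now run on the worker pods
```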
Why we wanted something different
While this setup has served us well and has a lot going for it, there are a number of niggles with it. I present the main ones below.
Conda environments are not persistent
With our current stable environment, the default install location for Conda environments is /opt/conda. This is in the file system of the running container (rather than an external mount), which means any changes to, or creations of, Conda environments are lost when your session ends. For this reason many of our notebooks are littered with %conda install XYZ -y magics that get commented in or out as and when. We could configure Conda to install environments in the user's home directory by default. This would allow environments to persist, but because of the next point it would only be partially beneficial.
Users' home directories are not visible to distributed workers
Many libraries put settings and configurations in the home directory. Simple things like default colour schemes or cloud provider credentials can cause real headaches when they aren’t available to your distributed workers. This is nearly always ‘work-around-able’ but causes pain and delay.
The bigger issue is that Python (or other) environments are not necessarily the same on the distributed workers as on the master notebook. Conda magics like those mentioned above, or pip install --user X commands, are regularly used by our users to get their environment how they want it. However, when they come to distribute the workload they discover that the workers don't share the same environment and their code fails. Again there are workarounds; my colleague Rachel Prudden is a master at creating fixed-size clusters and installing the necessary environment on each worker before proceeding with an analysis (see the sketch below). This, however, is a big pain and erases much of the exciting elasticity of a Pangeo stack, one of its key 'super powers'. Because painful hacks are needed to run distributed workloads with custom environments, we don't make the user's home the default Conda install location. We prefer the pain of transient environments to be felt up front and experienced early, rather than have it only come to light, and muddy the waters, when working on distributed code.
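For what that workaround roughly looks like, here is a hedged sketch of installing packages onto every worker of a fixed-size cluster before starting an analysis. The package names are placeholders, and the cluster is assumed to already have a worker pod spec configured elsewhere.

```python
# Sketch of the "fix the cluster size, then patch every worker" workaround.
import subprocess
import sys

from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster()     # assumes a worker pod spec is already configured for the cluster
cluster.scale(8)            # fix the size up front so every worker gets the install
client = Client(cluster)
client.wait_for_workers(8)  # make sure all the workers exist before patching them


def install(packages):
    """Run pip inside the worker's own interpreter."""
    return subprocess.run(
        [sys.executable, "-m", "pip", "install", "--user", *packages],
        capture_output=True,
    ).returncode


# Patch every current worker; a worker added later would miss out, which is
# why the cluster has to stay at a fixed size for the rest of the analysis.
client.run(install, ["some-package", "another-package"])
```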
Lack of environment versioning
With the Python environment built into the container, the only versioning of the scientific environment available is the versioning applied to the container (what tag it has in DockerHub, in our case). To illustrate the issue with this, consider the following scenario. You create Notebook-A today on container version 1.0.0. Sometime later the container version has moved on, now 1.5.0, and you create Notebook-B. Both notebooks (A and B) refer to the same kernel ('Python 3'), yet Notebook-A might now not work as expected (or at all) because in reality the environment has changed (because the container version has changed). Further, the notebook has no record of which environment it was originally created on (it just knows it used kernel 'Python 3'), so reproducing the original environment could be impossible. Even if you were fastidious enough to record which version of the container you were using, it would still be impossible to work on Notebook-A and Notebook-B simultaneously, because each time you needed to run code on the other notebook you would have to shut down your server and restart it with a different version of the container, losing all your active kernels.
Poor scaling characteristics
Due to the size of the image, it takes a long time to pull down, unpack and spin up. This makes scaling our cluster slower because any new node needs to go through this process. It is exacerbated because, when a new node starts up, many workers will often jump onto it and all try to pull the large image in parallel, making it take even longer. The sheer size of the image has also prevented us from working with some interesting technologies: we have had big problems making use of both AWS Fargate and Azure Virtual Kubelets because of it.
Our experimental architecture
The main difference in our experimental architecture is that we are utilising Azure NetApp Files, a product that offers NFS volumes as a service. The big advantage of an NFS volume is that it can be mounted into many containers simultaneously. In this architecture the users' home spaces are on the NFS, and so are all the Conda environments we need, both to run Jupyter and to perform our scientific analysis. Because we are storing our Conda environments elsewhere, our container image becomes essentially just the base miniconda3 container with a few utility scripts thrown in. This container is a ten-fold reduction in size compared to the container in our outgoing architecture.
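As a rough illustration of how the shared volumes reach the user pods, here is a hedged fragment of a JupyterHub configuration using KubeSpawner. The claim names and mount paths are made up for the example; they are not our exact configuration.

```python
# Illustrative fragment of a jupyterhub_config.py using KubeSpawner.
c = get_config()  # provided by the config loader when JupyterHub reads this file

# NFS-backed volumes: one for user home spaces, one for the shared Conda environments.
c.KubeSpawner.volumes = [
    {"name": "home", "persistentVolumeClaim": {"claimName": "nfs-home"}},
    {"name": "envs", "persistentVolumeClaim": {"claimName": "nfs-envs"}},
]
c.KubeSpawner.volume_mounts = [
    {"name": "home", "mountPath": "/home/jovyan"},             # per-user, writable
    {"name": "envs", "mountPath": "/envs", "readOnly": True},  # shared, read-only Conda envs
]
```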
How this improves on the status quo
Conda environments are either read-only or persistent
The Conda environments that we manage and provide, both for analysis and for running the UI (Jupyter Lab), are mounted as read-only. This means our users are not confused or surprised by libraries seemingly installed one minute and gone the next. Inevitably, as a result of these read-only environments, users will create new environments. We have made it the default that these are stored in the user's home directory and will therefore be persistent. This is appropriate because these environments can now be shared with the distributed workers (more on this next).
Users' home directories and Python environments are shared with distributed workers
Because NFS is mountable on many pods simultaneously, we can now share the user's home space and the Conda environments with the distributed workers. This means that the environment is the same on the distributed workers as on the notebook that initiates them, which makes it quick and easy to create distributed workloads under a custom environment.
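For a flavour of how that might look in practice, here is a hedged sketch of a dask-kubernetes worker pod that mounts the same NFS-backed volumes as the notebook and runs the worker from a shared, versioned environment. The volume names, paths and environment version are placeholders.

```python
# Sketch: workers mount the same home space and Conda environments as the notebook,
# and run from a versioned environment on the shared NFS storage.
from dask.distributed import Client
from dask_kubernetes import KubeCluster

worker_pod = {
    "kind": "Pod",
    "spec": {
        "containers": [{
            "name": "dask-worker",
            "image": "informaticslab/panzure-shared-env-notebook",  # the slim base image
            "args": [
                "/envs/datasci/0.0.2/bin/python",       # placeholder path to the shared env
                "-m", "distributed.cli.dask_worker",
            ],
            "volumeMounts": [
                {"name": "home", "mountPath": "/home/jovyan"},
                {"name": "envs", "mountPath": "/envs", "readOnly": True},
            ],
        }],
        "volumes": [
            {"name": "home", "persistentVolumeClaim": {"claimName": "nfs-home"}},
            {"name": "envs", "persistentVolumeClaim": {"claimName": "nfs-envs"}},
        ],
    },
}

cluster = KubeCluster.from_dict(worker_pod)
cluster.adapt(minimum=0, maximum=20)
client = Client(cluster)  # tasks now run in the same environment the notebook uses
```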
Better versioning of Python environments
By separating the versioning of the container from the versioning of the Conda environments, we should be able to offer better control and reproducibility of our scientific code. Let's return to our previous example of Notebook-A created today and Notebook-B written a few weeks later. In our new paradigm, environment updates are separate Conda environments, so Notebook-A may refer to kernel /env/datasci/0.0.1 and Notebook-B to /env/datasci/0.0.2. Assuming we implement a sensible naming convention (such as the above), the version of the environment used is stored in the notebook by default. It's also easier to support multiple versions of an environment simultaneously, because they are different Conda environments installed at different locations. Each environment is described by a Conda environment.yaml file that we will publish, so even if we stop hosting an environment it would be easy for a user to recreate it if they wished to continue using it.
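To make the 'stored in the notebook by default' point concrete, here is a small illustration. The metadata layout is standard Jupyter; the versioned kernel name is our hypothetical convention.

```python
# Sketch: the kernel a notebook was created with is recorded in its metadata,
# so a versioned kernel name doubles as a record of the environment used.
import json

with open("Notebook-A.ipynb") as f:        # hypothetical notebook file
    metadata = json.load(f)["metadata"]

# Prints something like "env-datasci-0.0.1" rather than just "python3".
print(metadata["kernelspec"]["name"])
```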
A final note on this is that we no longer muddy our scientific analysis environment with the UI environment that runs Jupyter Lab. We can iterate, update and manage them separately.
Faster environment builds
In the outgoing architecture, releasing a new environment meant building the whole container, which took an age. Now you just build the Conda environment you need (this can still take some time). As an added benefit you don't need to restart your server; the environment is available to everyone as and when it's built.
The downsides
Of course nothing's perfect; here are the main issues I've noticed so far (it's still early days). Many of these issues exist because we're still learning, and I'm sure they will be ironed out. Others may be more fundamental.
Noticeable lag on Python imports
Running the Python import command, particularly on multiple large libraries, is noticeably slower with the environment on the NFS rather than in the container's filesystem. This can result in an irksome delay of 1–5 seconds when you start working with a notebook. It goes away once the library is hot (has been loaded in this or another notebook). I've noticed no other performance issues so far, though care should be taken to ensure any swap-like activity is written to the container (such as at /tmp) rather than the NFS (such as ~/).
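As one example of the care needed, Dask's scratch space can be pointed at the container's local disk rather than the NFS-backed home directory. The configuration key below is standard Dask; the choice of /tmp is ours.

```python
# Keep Dask's spill-to-disk and other scratch files on the container's local
# filesystem rather than under the NFS-backed home directory.
import dask

dask.config.set({"temporary-directory": "/tmp"})
```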
Not building on the pangeo base image
The Pangeo base-notebook image is the lightest-weight container from the Pangeo community, but it still comes in at ~750MB. This is because it contains Dask, Jupyter and various other libraries considered essential for running Pangeo. To get our image down to a fifth of this size we have built on top of the miniconda3 image instead. However, we now have to reimplement useful features (such as the Dask config) and don't benefit from downstream improvements. Having experienced this, it would be great to think about how we can push Pangeo further towards a collection of mixins that you can include, rather than a stack you build on top of. How (or if) to do this is for another post…
Running as root
I'm sure this one is just down to ignorance, but currently the only way we've got the NFS to mount, and JupyterHub to automatically create the home directories, is to run the notebook as the root user. This has security implications and has resulted in some horrid hacks (like symlinking /root to /home/jovyan). I'm confident this can be resolved.
No auto-provisioning
Currently we do not have auto-provisioning of Persistent Volumes (PVs) from PV Claims (PVCs) for our NFS storage. JupyterHub does a good job of getting around this by using one PV and one PVC and automatically creating a new subdirectory for each user. However, I think the auto-provisioning you get with EBS, Azure Files or other storage backends is much cleaner.
There is an auto-provisioner that we could use (nfs-server-provisioner), which would be a tidier solution and one we will probably adopt. However, this provisioner just creates new directories under an existing NFS volume rather than what I would prefer: new volumes for new users. Apparently the Trident orchestration tool will do this auto-provisioning for NetApp Files, but it requires running Trident on the master nodes, something we can't do with the fully managed AKS service. I think what we want to achieve would be possible with a FlexVolume driver, so I'm optimistic this will happen in the future.
A more complicated build process
While it's probably true to say this setup increases the complexity of the build and deployment, I think it's worthwhile. I also think that as the components and the thinking mature, it should become as easy and manageable as any other Pangeo deployment.
Next steps
It’s still early days for this architecture and I’m excited to see what more we can squeeze out of it. Here are some of the exciting next steps we hope to explore.
Virtual Kubelets
Azure allows you to hook together their managed Kubernetes service (AKS) with their Container Instances (ACI) service using Virtual Kubelets. In effect you get a node in your cluster with infinite resource, as every pod is spun out to run on the Container Instances service. We are excited to use this to quickly spin up (and down) our distributed workers. Previous trials were abandoned because of issues caused by the size of our image. We also hope to use Virtual Kubelets to spin up GPU-powered containers for our machine learning workloads.
Pre-caching and NFS tuning
With our current design all our NFS mounts are either read-many/write-none or read-many/write-one. It's my belief that we might be able to tune the NFS mount options, or use on-container pre-caching, to speed up the apparent performance of the NFS system.
One-click Binder
We've now separated the Conda environment from the container, and the notebook knows which environment it's running under. My hope is that this makes it easy to offer a 'one-click' deploy to My Binder. This would create a repo complete with the necessary environment.yaml, your notebook(s) and a README.md with links to launch on Binder. This was already possible, but I think it would now be easier and cleaner.
Further shrinking
I might just be getting carried away, but I'm interested to see what happens if we install Miniconda on the NFS storage and remove it from our image too. With this setup we could use Ubuntu as our base image and get our container size down to ~25MB.
The code
I’ve left this to the end as I think the concept is more important than the implementation and because the implementation is very much a work in progress. I’ve linked to specific refs and branches because I’m hoping these repos will continue to develop and future code may no longer be in the same context as this post. You won't be able to just pick up and run with these repos due to their R&D nature, lack of documentation, etc. Hopefully in time things will stabilise but in the meantime please reach out if you have any thoughts, comments or questions.
Environment builder
The pangeo-envs repo is the builder for the different environments. It takes the env.yaml file and builds it in a container on our cluster. The resultant environment is saved to our NFS storage. This process is triggered by Travis on a push or tag. Due to some regrettable design, each branch is a different environment on our dev platform and each tag an environment on our stable (but still dev) environment. This needs doing differently!
Cluster build scripts
The repo our-kubernetes contains the build scripts for our Kubernetes clusters. It has scripts for both our AWS and Azure clusters, but the Azure section is the one relevant to this post. There are many branches reflecting the various experiments we are working on, but the above link should take you to the most relevant one.
The Container
Our container is hosted on Docker Hub at informaticslab/panzure-shared-env-notebook; the source is in the repo panzure-shared-env-notebook.
And finally…
It's very early days for us on this architecture, but I'm excited. It may prove a mistake, and it definitely needs further work and community input, but I think there is a lot of potential. Thank you kindly for reading this far; we would really appreciate hearing your thoughts, comments and questions.