Singularity Containers as an On-node Resource Manager for HPC

Geoffroy Vallee
SingularityApp
7 min read · Nov 20, 2019


This article is based on a software design published and presented at the Canopie HPC workshop, held in conjunction with SuperComputing 2019.

Buckle up, we are about to dive into the architecture of High Performance Computing (HPC) systems, such as Summit at Oak Ridge National Laboratory, the most powerful computer on Earth at the time of writing. But as you will see, I think it is quite interesting, and we have a good idea of how to better manage resources on such big systems.

How are these HPC systems architected and used?

The first thing to remember is that, even if new systems do not strictly adhere to the rule I am about to state, the fact is that almost all HPC systems are designed to run applications based on the Message Passing Interface (MPI). It is true that the addition of accelerators makes these systems not strictly MPI-centric, but the vast majority of applications are still based on MPI and evolving to become MPI+X. What does that mean? It means that both the systems and the applications are designed around tightly coupled point-to-point and collective communications.
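
If you have never written MPI code, here is a minimal sketch (purely illustrative, not taken from any real application) of what "tightly coupled" means in practice: every rank takes part in a collective operation, and two specific ranks exchange a point-to-point message. An MPI+X code would look similar, with each rank additionally spawning OpenMP threads or offloading work to an accelerator.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal illustration of the tightly coupled MPI model:
 * every rank participates in a collective (MPI_Allreduce),
 * and rank 0 exchanges a message with rank 1 (point-to-point). */
int main(int argc, char **argv)
{
    int rank, size, token = 0;
    double local = 1.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Collective communication: all ranks must call this together. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Point-to-point communication between two specific ranks. */
    if (size > 1) {
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    printf("rank %d of %d: sum=%.0f token=%d\n", rank, size, global, token);
    MPI_Finalize();
    return 0;
}
```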

Interestingly, it also means that the way we run applications is historically based on the MPI model: you submit your job, requesting a specific set of resources, and when your job is scheduled, your application can only use the resources that were allocated in the first place. You do not know how many resources your application needs? Well, make a guess and pay for what you get. Don't get me wrong, I am not arguing that this is bad: scheduling applications and accounting for the resources used at large scale are not trivial things to do, and, again, even today, most applications actually fit well with that model.

But practically, it means that HPC systems are almost always designed around the following concepts:

  • You have a set of login nodes where users land when connecting to the platform. These nodes have a set of compilers, tools and everything you need to prepare your application(s), gather input files and store your results. As a reminder, no one is supposed to run applications on these nodes (which I see way too many times, and if you do that, you actually make life extremely difficult for a lot of other users — and yes, I am one of “those” who will kill applications that are running on login nodes and using all the available resources.)
  • Once your application is ready, you can submit it. Once you have identified your input files and the command to start your application, you basically have a job. Change your input, and it is another job. Anyway, you are ready to submit a job. Most HPC systems have a job scheduler and resource manager to handle that.
  • Once your job is submitted, it is queued until the resources you requested are available. You see now why it is “convenient” to assume that you know how many resources you need for your job? It allows the entire system to be shared by multiple applications. There is also another reason you are not always aware of: so that your account/project can be charged for the resources you use. This is because tracking in real time which resources are used on very large systems would be extremely expensive and, historically, not practical.

So, what is the point to remember here? All large-scale systems have a job/resource manager to allocate, track and charge for the resources you are using, and this is why you are asked to request resources upfront.

Why does it matter? Since most systems are still heavily designed around MPI, it means that resource managers (and you now understand that they are a centerpiece of the design) are also designed around MPI, even to this day.

How are processes/threads created when my job is deployed?

You should by now be familiar with the concept of a job/resource manager. Once the job is scheduled, and let's assume here that we are running an MPI application, the MPI ranks need to be created on the appropriate cores of the compute nodes. How does that happen? Well, it depends on your system, and ultimately on its scale. On small(er) systems, the MPI runtime will most certainly initiate ssh connections to the nodes, create an entry point on each compute node (in the context of Open MPI, an actual process that behaves as a daemon, called ORTE) and from there start the MPI ranks. On larger systems, the resource manager usually already has a persistent service on compute nodes to quickly and efficiently start new processes and threads. For MPI, that service will receive data about how many ranks to create and where to create them. To be even more concrete, on Summit the IBM infrastructure provides a daemon on each node that can talk to the global resource manager and start ranks.
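
To make that last part concrete, here is a heavily simplified sketch, in plain POSIX C, of what such a node-level launcher does once it has been told how many local ranks to start. The MY_RANK_ID environment variable and the ./my_app binary are made up for illustration; real runtimes hand rank and job information to the processes through PMI/PMIx, which is exactly what the next paragraphs are about.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Very simplified sketch of a node-level launcher: the resource manager's
 * daemon is told how many local ranks to start and which binary to run,
 * then forks/execs one process per rank. MY_RANK_ID is a made-up variable;
 * real runtimes hand out rank/job information through PMI or PMIx. */
static void launch_local_ranks(int nranks, char *const app_argv[])
{
    for (int i = 0; i < nranks; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            char rank_str[16];
            snprintf(rank_str, sizeof(rank_str), "%d", i);
            setenv("MY_RANK_ID", rank_str, 1);   /* illustrative only */
            execvp(app_argv[0], app_argv);
            perror("execvp");
            _exit(1);
        }
    }
    /* Wait for all local ranks to finish before reporting back. */
    while (wait(NULL) > 0)
        ;
}

int main(void)
{
    char *const argv_app[] = { "./my_app", NULL };  /* hypothetical binary */
    launch_local_ranks(4, argv_app);
    return 0;
}
```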

Over time, the community started to work on specifications/standards (some people are pedantic about the difference between these two concepts, which I understand, but I personally do not care as long as we know we are talking about the same thing) to provide some level of interoperability. That was the first version of the Process Management Interface (PMI). I am always a big advocate of giving credit where credit is due, so here we have to give credit to Argonne National Laboratory (yay Pavan!) and also to Cray for making it a “standard” used on their systems. Apologies to the individuals I am not citing here.

Once the community had a first interface for resource management, more work was done to make it better and more scalable as HPC systems became more complex. We had PMI-2 and, a few years ago, PMIx (Process Management Interface for Exascale). Again, to give credit where credit is due, a big shout out to Ralph Castain, the initiator of and main contributor to both the PMIx specification and the OpenPMIx implementation, as well as Josh Hursey, who is doing so much work moving the community forward. Sorry to all the others who are also doing a great job and whom I am not mentioning here.
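
To give a feel for what PMIx offers a process once it has been started, here is a minimal client-side sketch based on my reading of the OpenPMIx client API (treat it as an approximation and refer to the PMIx standard for the authoritative signatures): the process connects to its local PMIx server, learns its identity, and asks for the size of its job.

```c
#include <stdio.h>
#include <string.h>
#include <pmix.h>

/* Minimal PMIx client sketch: connect to the local PMIx server,
 * discover our process identity, and ask for the job size. */
int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;
    pmix_status_t rc;

    /* Connect to the PMIx server provided by the resource manager. */
    rc = PMIx_Init(&myproc, NULL, 0);
    if (rc != PMIX_SUCCESS) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* Job-level data is stored against the wildcard rank of our namespace. */
    PMIX_PROC_CONSTRUCT(&wildcard);
    strncpy(wildcard.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    wildcard.rank = PMIX_RANK_WILDCARD;

    rc = PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &val);
    if (rc == PMIX_SUCCESS) {
        printf("rank %u of namespace %s, job size %u\n",
               myproc.rank, myproc.nspace, val->data.uint32);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```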

What about containers?

So we now have specifications for how to manage resources and deploy/create processes and threads in a scalable manner across an entire large-scale system. As mentioned before, we are slowly moving away from pure MPI workloads (which is a good thing). As a result, we see MPI+X workloads (e.g., MPI+OpenMP) but also workloads relying on containers, very often packaging machine learning applications. So how can we deploy, account for and track workloads on these systems when they rely on containers? It clearly creates another level of complexity: the resource manager needs to be aware of the running containers, and it would help to have scalable startup of containers across hundreds or thousands of compute nodes. I mean, let's be honest, who wants to redo the work the PMI community has done over the past two decades? Certainly not me.

Another interesting point is that our MPI ranks, and more generally the processes/threads that instantiate our application, will be running in the container. How do we make sure that this is all done correctly, so that you get the performance you expect and, from the system administrators' point of view, things happen as planned and you do not step on the toes of other users who may be running their jobs at the same time? And remember, the processes and threads running in your container may actually belong to different runtimes. Even worse, nothing prevents you from deploying a few MPI ranks directly on the host, creating a few threads to do a quick computation, and only then deploying a container side-by-side with what is already running. Sounds crazy? Well, application composition is becoming a reality, so it is only a matter of time before more scientific groups rely on that model.

Okay, so we know it is complicated, but do we have a path forward to precisely control and manage complex workloads on large-scale systems, including when running containers? And all that without asking the entire industry to completely change its software stack (let's face it, that would be a no-go).

The good news is that we think we have an elegant solution to address all this complexity. Carlos Eduardo Arango Gutierrez, Cedric Clerget and I started to think about the problem. For what it is worth, the beauty of the situation is that our backgrounds are so different that we came up with a solution that felt natural right away. The idea is almost trivial: what if we make the container runtime PMIx-aware? We can then receive allocation details from the global scheduler and act as a resource manager for everything happening in the container. It also creates a hierarchy of PMIx servers, and therefore the solution should be scalable.
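
To sketch what "making the container runtime PMIx-aware" could look like (this is only an illustration of the design idea, not existing Singularity code), the runtime would embed a PMIx server, register the container's namespace and processes with it, and then start the application inside the container. With OpenPMIx, the skeleton could be roughly the following; the empty callback module is only there to keep the sketch short, and a real implementation would have to provide fence, modex and spawn callbacks.

```c
#include <stdio.h>
#include <string.h>
#include <pmix_server.h>

/* Sketch only: a container runtime embedding a PMIx server so that the
 * processes it starts inside the container can use PMIx as if they were
 * talking to the system resource manager. A real implementation must fill
 * in the callback module and register the container's namespace and local
 * ranks before launching anything. */
int main(void)
{
    pmix_server_module_t module;
    pmix_status_t rc;

    memset(&module, 0, sizeof(module));  /* no callbacks in this sketch */

    rc = PMIx_server_init(&module, NULL, 0);
    if (rc != PMIX_SUCCESS) {
        fprintf(stderr, "PMIx_server_init failed: %s\n",
                PMIx_Error_string(rc));
        return 1;
    }

    /* ... register the container's namespace and local processes,
     *     set up the container, and fork/exec the ranks inside it ... */

    PMIx_server_finalize();
    return 0;
}
```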

Okay, the idea is simple and elegant, but is it really doable?

After looking at the details, we think the PMIx specification/standard is pretty much ready as it is. An interesting question is whether we could map the concept of a namespace from PMIx onto the concept of a namespace for containers. We think we could: both aim at partitioning resources and creating some level of isolation.

What about the implementation? More work is required because it would mean making a PMIx implementation aware of Singularity containers. Ultimately, it is not difficult to foresee the benefits of having a PMIx implementation that could instantiate new containers on the resources that were allocated for them. Having a fairly extensive knowledge of OpenPMIx, we believe that, yes, it requires work, but it is very doable! In other words, from a design point of view we do not see any major roadblocks; it should mainly be an engineering effort (never trivial, but manageable).
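
As a purely hypothetical illustration of what instantiating a containerized job through PMIx could look like from the caller's side, here is a sketch built around PMIx_Spawn. The image name, the use of singularity exec as the spawned command and the way the pmix_app_t structure is filled in are all assumptions on my part; the exact fields and signatures may differ between PMIx versions, and the call is only honored if the host environment supports spawning.

```c
#include <stdio.h>
#include <string.h>
#include <pmix.h>

/* Sketch: ask the host PMIx server (i.e., the resource manager) to start
 * four processes whose command happens to be a containerized application.
 * The image name and the use of "singularity exec" are illustrative only. */
int main(void)
{
    pmix_proc_t myproc;
    pmix_app_t app;
    char new_nspace[PMIX_MAX_NSLEN + 1];
    char *argv_app[] = { "singularity", "exec", "myapp.sif", "./my_app", NULL };
    pmix_status_t rc;

    if (PMIx_Init(&myproc, NULL, 0) != PMIX_SUCCESS)
        return 1;

    memset(&app, 0, sizeof(app));
    app.cmd = strdup("singularity");
    app.argv = argv_app;
    app.maxprocs = 4;          /* number of processes to start */

    rc = PMIx_Spawn(NULL, 0, &app, 1, new_nspace);
    if (rc == PMIX_SUCCESS)
        printf("spawned containerized job in namespace %s\n", new_nspace);
    else
        fprintf(stderr, "PMIx_Spawn failed: %s\n", PMIx_Error_string(rc));

    PMIx_Finalize(NULL, 0);
    return 0;
}
```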

Conclusion

We think we have a very nice design to extend our container solution for HPC and integrate it into existing infrastructures. In other words, it is novel but not pure research; it could actually be applicable within a fairly short period of time. And, to my personal surprise, it seems that our ideas have been very well received by the people to whom we presented them.
