Singularity and MPI: Compatibility Matrices

Geoffroy Vallee
SingularityApp
6 min read · Nov 19, 2019

(This is a technical article about a specific aspect of using Singularity in the context of high performance computing. As a result, the goal is not a quick read; instead, I want to share an experience, a story that I think can be valuable to others. I tried to organize the document so that readers can skip the sections that are not of interest to them.)

Singularity, as a container solution, has been developed with High Performance Computing (HPC) in mind: no daemon running on the compute nodes and little-to-no overhead, even when executing applications that use the Message Passing Interface (MPI). But, as with everything touching MPI, things are not that simple. A common way to combine MPI and containers is to use the MPI available on the host to start the containers, which ultimately host the MPI ranks. In other words, the application is started with a command such as: mpirun -np N singularity exec ./my_container.sif /opt/my_app.exe.
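To make this concrete, here is a minimal sketch of the kind of program that would play the role of /opt/my_app.exe inside the container (the file name my_app.c is purely illustrative). It would be compiled with the MPI installed in the container, e.g. with mpicc, when the image is built; the host's mpirun then starts N containers, each hosting one of these ranks.

    /* my_app.c -- minimal stand-in for /opt/my_app.exe in the container.
     * Each container started by the host's mpirun hosts one MPI rank;
     * printing the rank and hostname makes that placement visible. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        printf("Rank %d/%d running in a container on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }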

Because the MPI available on the host is used to start the containers on the compute nodes, this is called the hybrid model. In that context, a question I very often get from users is: when a specific version of an MPI implementation is available on the host, which versions can be used in the containers? So far, the answers have been something like: “well, in most cases, it should just work”. But we did not have accurate data to back up that conclusion, so we decided to develop a tool that would do exactly that: give us the required data.

The concept of compatibility matrices

To answer that question, the community came up with the concept of a compatibility matrix: based on which implementation and version of MPI is available on the host, which versions of the same implementation can be used in containers? So far, these compatibility matrices have been created by hand by a few of us. To make the process less cumbersome, and to make it deterministic and reproducible, we developed a new tool that runs all the required tests for us and creates the compatibility matrices: SyValidate, which is part of the singularity-mpi environment.

Why a new tool?

Who wants to create compatibility matrices by hand when considering a large number of releases? For instance, I am personally interested in Open MPI 3.0.4, 3.1.0, 3.1.4, 4.0.0, 4.0.1 and 4.0.2, as well as MPICH 3.0, 3.0.4, 3.1, 3.1.4, 3.2, 3.2.1, and 3.3. Creating the compatibility matrices for both Open MPI and MPICH therefore represents a significant amount of work that is repetitive and, to be honest, frustrating and boring. Think about it: every host version has to be tested against every container version, so the amount of work grows as O(N²), where N is the number of versions of a specific implementation to test; the six Open MPI and seven MPICH releases above already represent 6² + 7² = 85 host/container combinations. And this is without considering that the Linux distribution in the container might be different from the one on the host. The complexity increases even more when considering that MPI actually offers 3 different communication semantics that require different tests: point-to-point, collective and one-sided communications.

What does the tool do?

The tool considers a set of versions of a specific MPI implementation, e.g., Open MPI or MPICH, figures out all the combinations of versions on the host and in the container, and performs the following tasks:

  • automatically install MPI on the host,
  • create containers for the tests to execute,
  • run the tests,
  • compile the results,
  • and, when all the tests have been executed, create the compatibility matrix.

As mentioned before, 3 different MPI semantics are of interest to users: point-to-point, collective and one-sided communications. To capture most of these semantics, we currently run 3 different tests, with an additional one under development. The tests are:

  • A basic helloworld test that quickly exercises initializing and finalizing an MPI application; if it fails, there is no need to run any further tests.
  • NetPIPE for point-to-point communications.
  • IMB for collective communications.

We are currently adding the IMB tests for one-sided communications.
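One-sided communication is probably the least familiar of the three semantics, so here is a minimal sketch of the kind of remote-memory-access exchange those tests exercise. This is neither IMB code nor part of SyValidate, just an illustration; it assumes the application is started with at least two ranks.

    /* one_sided.c -- minimal illustration of one-sided (RMA) semantics:
     * rank 0 writes directly into a window exposed by rank 1.
     * Run with at least two ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every rank exposes one integer through an RMA window. */
        MPI_Win_create(&value, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            int payload = 42;
            /* No matching receive is posted on rank 1: that is what
             * makes the operation one-sided. */
            MPI_Put(&payload, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);

        if (rank == 1)
            printf("Rank 1 received %d via MPI_Put\n", value);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }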

What compatibility matrices did we create?

So far, the tool has been executed on two different systems using Singularity 3.5.0, which was released a couple of days ago. These two systems are, unfortunately, each composed of a single node, which I apologize for. I am the first to recognize that using a single node limits the value of the compatibility matrices, since MPI ends up working quite differently on a single node than it does across multiple nodes. However, the data is still very valuable, especially for a new release of Singularity: it lets us ensure that the MPI support does not suffer from any critical problem.

Our first system is based on CentOS 7. On that system, we found that any combination of Open MPI 3.0.4, 3.1.0, 3.1.4, 4.0.0, 4.0.1 and 4.0.2 works together. As for MPICH, any combination of MPICH 3.0, 3.0.4, 3.1, 3.1.4, 3.2, 3.2.1, and 3.3 works together.

We had the same results with our Ubuntu Disco system.

In other words, MPI implementors did a great job at ensuring that various versions of their software can work together. It may sound trivial, but I would personally argue that it is a great achievement, especially for a standard that does not define any wire protocol for communication.

What is next?

Well, our results are currently pass/fail results, meaning that we only check whether a specific configuration fails or not. But that is not quite enough: using MPI without getting the expected performance from the hardware, especially the network, is rather pointless. So we need to track performance regressions in order to identify which combinations of MPI on the host and in the container do not deliver the expected performance.

We have actually already started to work on this with NetPIPE: for each test, we capture the performance summary provided by NetPIPE and store it in addition to the simpler pass/fail result.
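To give an idea of what such a point-to-point measurement looks like, here is a minimal round-trip timing sketch. It is not NetPIPE's code; it only illustrates the kind of latency measurement that NetPIPE automates across many message sizes and whose summary we now store.

    /* pingpong.c -- minimal round-trip timing between ranks 0 and 1.
     * Works with two or more ranks; only ranks 0 and 1 exchange messages. */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERATIONS 1000

    int main(int argc, char **argv)
    {
        int rank;
        char byte = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();

        for (int i = 0; i < ITERATIONS; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double elapsed = MPI_Wtime() - start;
        if (rank == 0)
            printf("Average round-trip latency: %.2f us\n",
                   elapsed / ITERATIONS * 1e6);

        MPI_Finalize();
        return 0;
    }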

What did we discover? On our CentOS 7 system, using Open MPI 4.1.0 on the host with 3.0.4 in the container is roughly 25% slower than running Open MPI 3.0.4 on the host and 3.1.0 in the container. With MPICH, using version 3.1 on the host and version 3.3 in the container is roughly 27% slower than running 3.3.2 on the host and 3.2 in the container.

But once again, what does it mean? Well, at this point we cannot really conclude anything, except that there is a real need to track performance regressions in addition to the simpler pass/fail compatibility matrices. Unfortunately, with the current version of our tool, we do not gather enough data to draw any conclusion that is statistically relevant. For instance, some open questions are: Are these numbers outliers caused by unexpected events on the systems? For each result, we only executed NetPIPE once, so even if we use the summary result (which is based on multiple sub-tests within NetPIPE), is it fair to assume that it is statistically relevant? I would personally argue that it is not.

Conclusion

SyValidate is a very valuable tool: it gives us good preliminary results and compatibility matrices, and it already answers a lot of questions from the community. However, it is clear that we need more than compatibility matrices based on pass/fail semantics: we really need to track performance regressions.

Another interesting next step would be to create compatibility matrices for various Linux distributions in the containers. Containers are an attractive solution because they let you bring your own environment to your HPC system. But what happens when we rely on the MPI from the host? Will we face glibc incompatibility issues? And if so, what about the other MPI usage models, which do not require a potentially different MPI on the host and in the containers?

The good news is that we are actively working on all of these questions, so stay tuned and feel free to join the Singularity community to ask questions, make suggestions and follow our work.
