Portable radio astronomy data processing pipelines

Gijs Molenaar
Jan 16, 2018


We used CWL, KERN, Docker, Singularity, and Slurm to build flexible and dynamic pipelines for analysing radio astronomy data.

Wouldn't it be perfect to create a data processing pipeline on your laptop and then deploy it in a cloud environment or on a supercomputer, without any modifications? I think we have found a combination of technologies that helps us realise this geeky dream. For the last couple of months, Michael Crusoe and I have been working closely with ASTRON and SURFsara on a series of prototype data reduction pipelines for the LOFAR science demonstration, as part of the EOSC pilot (European Open Science Cloud for Research Pilot Project). Our goal was to make existing radio astronomy research software and hardware available to a broader audience. We successfully achieved these project goals and, moreover, found that these prototype pipelines are useful as a guide for creating pipelines used by real radio astronomers.

My contribution to the project was to:

  • Take three radio astronomy data reduction pipelines and make them ‘portable’
  • Run and evaluate these pipelines on various platforms, from a MacBook to cloud infrastructure and from a single server to a supercomputer

The LOFAR telescope core in the Netherlands

So, what exactly is a data reduction pipeline? Such a pipeline is nothing more than a chain (or a DAG, a Directed Acyclic Graph) of connected software programs that each take data as input, do some processing and then forward the processed data to the next step, until the final data product is computed. This final data product is often the evidence a scientist uses to support his or her work. The word ‘reduction’ refers to the volume of data passing through the pipeline, which is typically reduced. In the case of radio astronomy, the datasets are huge; a single observation can be several terabytes in size.

Why is this difficult to achieve?

First of all, this is a challenging problem since it can be difficult to install the software on an arbitrary system. Installing software on, for example, Microsoft Windows is usually very different from installing it on Debian GNU/Linux. This all became easier with the broader acceptance of containerisation technology. Containerisation is the practice of delivering your software preinstalled in an isolated environment, bundled with all its dependencies and often a complete operating system. This makes installing the software trivial: you only need to obtain the container and have the corresponding container software installed. The popularity of containerisation is arguably driven by the user growth of Docker. But Docker is a bit like blockchain: some users believe they need it but are often not sure how best to utilise it. I think Docker is not well suited for use in a High Performance Computing (HPC) environment and is just the wrong tool for this job. There are other containerisation frameworks out there which are better suited.

The pipelines

We have enhanced three pipelines. The first pipeline is Prefactor, which performs the initial calibration of LOFAR data. Our altered version of this pipeline has its own GitHub repository.

Example of a pulsar found with Presto

The second pipeline is based on the Presto beginners tutorial. Presto is a framework for searching for pulsars in radio frequency datasets. Since the Presto tutorial has some manual steps, we have split this pipeline into two parts. The project files can also be found in a GitHub repository.

Spiel, the radio simulation pipeline

The final pipeline is called Spiel. Spiel is a radio telescope simulator, which generates ‘dirty’ images from an input image. Basically, it is telescope calibration in reverse: where during calibration you normally try to remove or reduce the artefacts introduced by the telescope itself, this pipeline introduces these artefacts into a ‘clean’ image. This is very useful for generating a lot of data in a controlled way, based on a ‘ground truth’, which can then be used to develop and verify new algorithms to calibrate the telescope.

The Spiel pipeline is interesting because it demonstrates how easily CASA tasks can be called from our workflow.

Software packaging

We want to have a flexible solution based on loosely coupled and open source frameworks. Containerisation plays a very important role in this flexibility. Furthermore, we want to be agnostic about our container technology. Just compiling all requirements into a Docker container does not result in an easily maintainable solution, and you will encounter problems if your target platform (like the Cartesius supercomputer) does not allow running Docker containers.

We found it is a good idea to take one step back and rely on old and uninteresting technology that has proven itself time and time again: packaging your software. You only have to do this once, and can then populate your container type of choice with these packages. This can even be done outside a container if you happen to run a platform supported by the package.

For the project, we started by making Debian packages for all required software. We selected the Debian packaging method since Debian is bundled with a huge set of packages, comes with a great set of support tools for building packages, and has an excellent dependency management system. Alternatives are RPM and Anaconda; the latter has been gaining traction in the scientific community lately. While RPM and Debian packages ultimately don't differ that much, we believe Anaconda packages are not (yet) of the same quality as Debian packages usually are, and we do not think that the Anaconda packaging tooling is as mature as Debian's.

These packages became part of KERN, the radio astronomy software suite. KERN is nothing more than a set of Debian packages which are updated and released every six months. Some KERN packages, like Casacore, WSClean, and AOFlagger, eventually ended up in the official Debian archives. But various other astronomy packages have such a small user base, or are simply of such poor quality, that there is no place for them in Debian. KERN tries to offer a coherent, consistent and stable environment for the radio astronomer's workstation, but also for servers and cloud infrastructure. KERN-3 is based on the latest Ubuntu LTS release, which is currently Ubuntu 16.04.

The following shows an example of how to install the Prefactor software on a system:

$ sudo apt-add-repository -s ppa:kernsuite/kern-3
$ sudo apt-get update
$ sudo apt-get install prefactor

The first command adds the KERN repository to your system. From there you can download and install all packages available in KERN.

For our pipelines, we evaluated three containerisation platforms: Docker, Singularity, and uDocker (user-space Docker). Singularity can be seen as the Docker for HPC. It lacks various features Docker offers, but these are not required in an HPC environment. Additionally, it doesn't run a network service as root, making it arguably more secure. Singularity is already supported by various HPC providers around the world: for example, on Cartesius at SURFsara in the Netherlands, on the cluster at the CHPC in South Africa, and on Colonial One at George Washington University in the US. uDocker is relatively unknown, but it enables running Docker containers without a Docker daemon and doesn't require administrative privileges. This solves many of the problems one can have deploying Docker in a multi-user environment.

If you are using Docker, uDocker, or Singularity, you can use our KERN base image in a Dockerfile:

FROM kernsuite/base:3
RUN docker-apt install prefactor
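
To move between container technologies, you build and publish the Docker image once and then convert it; a rough sketch, with the image tag chosen purely for illustration:

$ docker build -t myrepo/prefactor-kern3 .            # build the image from the Dockerfile above
$ docker push myrepo/prefactor-kern3                  # publish it so other machines can pull it
$ singularity pull docker://myrepo/prefactor-kern3    # convert the published image to a Singularity image
$ udocker pull myrepo/prefactor-kern3                 # or fetch it for daemon-less execution with uDocker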

The workflow framework

Next, we need to pick a method to tie our software together. As described earlier, we want to model our workflow as a DAG. There are countless frameworks for doing this, and naively I also made my own, called Kliko. While developing Kliko I stumbled upon the Common Workflow Language (CWL) project. Although I did not compare all available frameworks, I really enjoyed working with CWL, since it shares many ideas and concepts with Kliko. I also found that CWL was much more mature, more actively developed, and had a larger user base. What I also like about CWL is that it actually isn't a framework; it is a standard. There are various (open source) workflow engines that support the CWL syntax. This makes CWL satisfy our flexibility requirement and makes vendor lock-in much less likely.

The Prefactor CWL pipeline, visualised with Rabix and coloured by hand

There are various workflow engines that support CWL. For example, there is the CWL reference implementation, which is always up to date with the latest specification but doesn't handle more complex tasks like job parallelisation or external scheduler backends. Then there is the open source, Python-based Toil, a complete but still lean implementation that does support parallelisation and scheduling. Also worth mentioning is Arvados, an all-in-one workflow management platform. There is also the Rabix executor and its companion, the Rabix Composer, which has a good-looking user interface for creating and modifying CWL projects. Other workflow engines, like Pachyderm, are discussing adding support for CWL.

For developing and deploying our three pipelines we used the reference runner and Toil.

A CWL command definition looks like this:

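A minimal sketch for the wsclean imager is shown below; the input names, defaults and image id are illustrative, but the overall structure (a baseCommand, a Docker hint, typed inputs and outputs) matches the real definitions in our repositories.

cwlVersion: v1.0
class: CommandLineTool
baseCommand: wsclean

hints:
  DockerRequirement:
    dockerPull: kernsuite/kern-3    # image id is a hint, not a strict requirement

inputs:
  niter:
    type: int
    default: 1000
    inputBinding:
      prefix: -niter
  name:
    type: string
    default: image
    inputBinding:
      prefix: -name
  ms:
    type: Directory                 # a measurement set is a directory tree
    inputBinding:
      position: 1                   # passed as the last argument

outputs:
  images:
    type: File[]
    outputBinding:
      glob: $(inputs.name)*.fits    # collect the FITS images wsclean writes
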
The simplified CWL definition for the WSClean program assumes the wsclean command is available on your target platform, or inside the specified Docker container. As the file suggests, the Docker image id is a ‘hint’, not a strict requirement. When this CWL file is executed and the user indicates that Docker should be used, the program will run inside that container.

The rest of the CWL file is quite straightforward. The arguments are listed; they can be defined in a CWL job file, or on the command line when running the pipeline. These parameters are converted at runtime into arguments for the encapsulated program. Additionally, the input and output products are defined, which are usually files or directories.
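
Running such a definition with the reference runner then looks roughly like this; file and parameter names are illustrative:

$ cwltool wsclean.cwl --ms observation.ms --niter 5000   # parameters given on the command line
$ cwltool wsclean.cwl wsclean-job.yml                    # or read from a CWL job file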

One of the useful things the CWL reference runner can do is cache intermediate results. Let's say you have a long-running pipeline, but halfway through the run you discover you made a typo in a script and the full pipeline run fails. CWL enables you to correct this error and continue your pipeline run, reusing the intermediate data products stored up to that point.
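
With the reference runner this works through its cache directory; a sketch, with illustrative file names:

$ cwltool --cachedir ./cwl-cache prefactor.cwl prefactor-job.yml
# fix the typo, then run the same command again: steps whose inputs are
# unchanged are taken from the cache instead of being recomputed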

Also, jobs in the compute graph that do not depend on each other's output products can run in parallel. If your CWL runner supports this and you have the resources available, such jobs will run in parallel without any explicit instructions. This is similar to how the ‘make’ build system speeds up compilation.

During development, we found some bugs in the CWL implementations and ambiguities in the CWL standard. For example, nested directory structures, typical for radio astronomy datasets (measurement sets), were not properly handled. This has been fixed. Another important addition to the standard, introduced at our request, is optional in-place writing. When datasets grow to the order of terabytes you do not want to copy the results around, but modify them in place. This is now also supported. Lastly, we spotted a bug where intermediate array products were not properly sorted, resulting in alignment problems when using multiple arrays. This issue has been resolved.

Running the pipelines

We ran the pipelines on a variety of platforms: a MacBook, a Linux server, the SURFsara HPC cloud environment and, lastly, Cartesius, the Dutch national supercomputer, also hosted at SURFsara.

The MacBook (Pro, late 2013, 2.4 GHz Core i5, 2 cores, 8 GB, SSD) is an interesting platform in this list because, although all of the platform architectures are Intel, the MacBook is the only one running a non-Linux operating system. In this case, Docker for Mac showed itself to be useful. Under the hood, it runs a small Linux virtual machine with an actual Docker daemon. The Docker client for Mac transparently runs the Docker container inside the virtual machine, while presenting the same command line interface to the user.

The Linux server is an Ubuntu 16.04 dual-processor (Xeon E5-2660, 2 GHz) system with 12 cores per processor and 500 GB of memory. We had administrative access to this system, and because the operating system matches the KERN target platform we could install all packages natively. We could also evaluate all containerisation types.

The SURFsara HPC cloud environment is an OpenNebula cluster where you can spawn your own cluster of arbitrary size. The service comes with a web interface, which is quite painful to use, but through trial and error we managed to get a couple of machines up and running. There is also a REST API available, which you can use to automate these tasks. There are Python bindings available, but unfortunately that project seems unmaintained and is partially broken. Some features only work in the latest master, and it seems some tasks can't be automated and require manual interaction.

For our experiments we kept it simple and set up a three-node cluster, where one of the nodes was designated as the master controller. Normally, for the sake of reliability, you would set up two or more control servers, but that was not of interest here. We automated the creation and provisioning of our mini cluster using the cloud API and Ansible. The choice of Ansible was arbitrary; there are many other tools, like Chef, Puppet or CFEngine, that solve the same or a similar problem. Since we were already going to evaluate Slurm as a scheduling backend on Cartesius, we decided to try setting up a Mesos cluster here. Our scripts set up a Mesos cluster from scratch, and we can submit a CWL job to Mesos. Unfortunately, we got stuck there: somehow the resulting job does not run. We didn't have enough time left to fully investigate this issue, so unfortunately we do not have any benchmarking results for Mesos.

Cartesius, the Dutch national supercomputer at SURFsara

Our final evaluated platform is Cartesius, the Dutch national supercomputer. This computer runs bullx Linux and has a total of about 50,000 CPU cores. Like various modern supercomputers, it is essentially a cluster of Intel nodes with a fast interconnect and attached storage. To submit jobs to Cartesius you log in to a login node, from where you use Slurm to submit your compute job to the job queue; the scheduler then takes care of starting and monitoring the execution. For this pilot, we were granted 500,000 CPU hours on this machine.
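
For reference, a bare Slurm submission (outside any workflow engine) looks roughly like this; the script name is illustrative:

$ sbatch --partition=short --time=0:30:00 run_pipeline.sh   # submit a batch script to the queue
$ squeue -u $USER                                           # monitor your queued and running jobs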

The datasets

To evaluate the pipelines we need some data. The Presto tutorial comes with a small dataset known to contain a pulsar. For Spiel we don't need a dataset; this pipeline generates simulations, so it only requires simulation parameters.

To demonstrate that the Prefactor pipeline works properly we need a bigger dataset. The ampl step in the pipeline requires at least 20 radio subbands; otherwise, it will give an error. We have composed a dataset consisting of 20 subbands and 10 time deltas. This results in about 369 MB of data (compressed). This dataset is unrealistically small; a typical radio astronomy dataset is usually at least in the order of tens of gigabytes. However, this dataset contains just enough data to demonstrate all the working parts and to compare performance on various architectures, while still having a manageable size.

Benchmarks

All tests were performed multiple times and the minimum running time is listed below. Where containers were used, they were prefetched.

Platform: MacBook
CWL runner: reference runner, Toil
Containerisation: Docker
Scheduler: none
Running time (minimum):
- Reference runner: 40 minutes
- Toil: 25 minutes

As expected, Toil runs much faster than the reference runner due to the implicit job parallelisation.

Platform: server
CWL runner: reference runner, Toil
Containerisation: vanilla, Docker, uDocker, Singularity
Scheduler: none
Running time (minimum):
- Reference runner + vanilla: 15 minutes 19 seconds
- Toil + vanilla: 2 minutes 55 seconds
- Toil + Singularity: 2 minutes 58 seconds
- Toil + Docker: 3 minutes 24 seconds
- Toil + uDocker: 6 minutes 46 seconds

It is very satisfying to see that the pipeline becomes blazingly fast on this server. Vanilla means ‘native’ here: we ran the commands directly on the server without containerisation. It is interesting to see that Singularity gets very close to vanilla performance, while Docker introduces some overhead. This overhead might become marginal when the size of the jobs grows; further investigation is needed there.

Platform: Cartesius
CWL runner: Toil
Containerisation: Singularity
Scheduler: Slurm
Extra: short partition (TOIL_SLURM_ARGS="-t 0:30:00 -p short")
Running time (minimum):
- Toil: 4 minutes 52 seconds

It is not surprising that, although we ran our pipeline on a big supercomputer, the eventual performance is worse than on a private server. This can be explained by the overhead of multi-node scheduling and the queuing of our jobs. Also, since the system is a multi-user system, performance depends greatly on utilisation by other users during benchmarking.

A technical detail is that we had to load the Python environment manually by running module load python. Complex CWL pipelines depend on Node.js, which is not available as a module on Cartesius. In the end we no longer require Node.js, but if you do, you can install it with Spack. Since Singularity is not installed on the login nodes, we had to schedule a command on a compute node to create a Singularity image from our prepared container image on Docker Hub.
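
Putting these pieces together, a run on Cartesius looks roughly like the following sketch; the image and file names are illustrative:

$ module load python                                    # load the Python environment on the login node
$ srun singularity pull docker://kernsuite/prefactor    # create the Singularity image on a compute node
$ export TOIL_SLURM_ARGS="-t 0:30:00 -p short"          # extra sbatch arguments, here the short partition
$ toil-cwl-runner --batchSystem slurm prefactor.cwl prefactor-job.yml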

Potential issues and limitations

There are of course some limitations to our solution. First of all, the CWL runners we tested are not aware of data locality. In our distributed computing case we benefited from having a unified filesystem, and thus didn't need a CWL runner with advanced data locality optimisations. Even better, the CWL standard doesn't make a unified storage assumption, so we can switch to more advanced workflow orchestrators in the future without having to change our pipelines. Secondly, we only support programs that read or write files or stdin/stdout. Third is container security: Docker is known to be insecure in a multi-user environment, and Singularity seems better suited for this purpose. Although Singularity does ship a helper with the setuid bit set, which might pose a security risk, many HPC sites around the world have already adopted this technology. The final relevant issue we see is the rigid dependency between CUDA drivers and NVIDIA kernel modules. If you run a container containing a CUDA version that does not match the host's NVIDIA kernel module version, GPU acceleration just doesn't work. This breaks the host-platform independence. For Docker there is a workaround in the form of a Docker extension, but you are out of luck if you use a different container technology.

Future Work

The CWL definition files can be reused, just like the packaged astronomy software. We have only written definitions for a small set of astronomy tools, and hopefully many more will follow. In the long run, we hope to see a central repository of CWL definitions which users can draw on to compose their own pipelines.

An often-heard complaint is that the CWL documentation is hard to get through, and that it is hard to find the details of a specific feature. We hope to improve the documentation experience.

Currently, there is no direct support for Singularity in any of the CWL runners we evaluated. This will be added in the coming months. For now, we work around this by prepending a singularity command to the command definition in the CWL file.
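
For example, a tool's base command can be wrapped like this (the image file name is illustrative):

baseCommand: [singularity, exec, kern-3.simg, wsclean]   # instead of: baseCommand: wsclean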

Another great win could be to improve the task scheduling intelligence. When the task scheduling overhead is big, for example on Cartesius, it makes much more sense to bundle tiny jobs together into a big serial job. Currently, there is no such intelligence implemented.

Conclusion

This pilot has been a success and we are very pleased with the results. We could quickly set up the three pipelines using the KERN packages, which shows that packaging scientific software plays an important role. Making CWL pipelines turned out to be quite easy and results in flexible pipelines which can run unmodified on various architectures, from MacBook to supercomputer. Toil turns out to be a good CWL implementation and has support for various job scheduling frameworks. The Slurm backend just worked flawlessly, but unfortunately we had problems getting our Mesos cluster up and running.

While Docker was very useful for getting our pipelines running on OS X, it is unsuited to a multi-user environment like a supercomputer due to the security implications: the current architecture of Docker (a daemon running as root) poses too much of a risk to be deployed in HPC environments. Singularity is better suited to this environment. The proof-of-concept uDocker shows that Docker containers can be run in a secure way, but it is more a technology demo than a robust technology ready for production.

So if you are still reading this and want to start making a new data reduction pipeline, or to modify and improve an existing one: in our case, the combination of KERN, CWL, Toil, Slurm, and Singularity made us very happy :)

That said, all the technologies described here are open source software and can only exist because companies and institutes fund them with money and/or time. If you encounter problems using this software, or would like features added, please consider allocating financial or human resources to these projects.
