Jack of all containers

Gábor Samu
IBM Data Science in Practice
8 min read · Jun 30, 2021

Using containers in high performance computing (HPC) is not a new idea. The HPC community has used container technology in production for many years for workload resource isolation, process tracking, job control and even operations such as checkpoint, restart and job migration. Spurred by the convergence of HPC and AI and by heterogeneous HPC environments including hybrid HPC clouds, containers are the ideal vehicle for providing consistency, reliability and isolation for modern workloads. But as with any technology, how can you easily leverage containers in your HPC environment in a manner that’s transparent to end users?

Here we will step through an example of running a Julia Programming Language program in a Docker container through an HPC scheduler. This highlights the importance of abstracting the job execution and application packaging details from the end user. Users of HPC should not be concerned with the underlying container technology for their jobs; rather, they should be free to focus on the results of their jobs.

Over the years, many isolation technologies have emerged. In Linux environments, OS control groups (cgroups) provide a way to enforce resource limits and to track processes for accounting. Control groups also provide the foundation for the container technologies that we see on Linux today and are used in HPC environments to provide isolation and reliability. Another example is the workload manager (WLM) on IBM AIX®, which has been used for workload memory and CPU resource enforcement, while IBM PowerVM®, along with workload partitions (WPAR) and logical partitions (LPAR), has been used for application isolation and mobility.
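As a concrete illustration, here is a minimal sketch of imposing a memory limit with control groups from the shell. It assumes a cgroup v2 system with the memory controller enabled and root privileges; the file names differ under cgroup v1, and the hpcjob group name is made up for this example.

mkdir /sys/fs/cgroup/hpcjob
echo "2G" > /sys/fs/cgroup/hpcjob/memory.max    # cap the group at 2 GiB of RAM
echo $$ > /sys/fs/cgroup/hpcjob/cgroup.procs    # move the current shell into the group

Every process started from that shell now inherits the limit; this is the same mechanism that container runtimes and HPC schedulers build upon.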

Have job, will travel

With OS control groups, applications share the same OS image. Applications that depend upon OS-supplied libraries, for example, can be impacted when updates are applied. As the use of hybrid HPC cloud grows, you may end up with on-premises and cloud server instances running differing Linux distributions with different library versions, potentially requiring different builds of the same application.

Enter containers

Modern container technologies such as Docker provide isolation while packaging an application along with its dependencies. They do so without the overhead of traditional VM technologies, where a guest OS runs to host an application. It’s no wonder that container technologies have experienced widespread adoption in HPC, where consumers are focused on results from their simulation and modelling workloads and not the underlying complexities of kernel, glibc and other library versions.

Stacking containers

In 2014, IBM Spectrum LSF (LSF) introduced support for scheduling Docker-containerized workloads. Over time, this has been extended to support NVIDIA Docker, Podman, Singularity, Shifter, Enroot, Kubernetes and OpenShift, and it enables mixing containerized and non-containerized workloads while hiding the complexity of container management from end users. LSF provides a framework for easily managing and running containerized workloads: users can run containerized workloads as batch jobs in the same way as non-containerized workloads (a short illustration follows the list below). Some key capabilities of containerized workload support in LSF include:

  • Transparent container access — users don’t need to learn complex container syntax
  • All container startup and filesystem mounting is performed by LSF — users never gain elevated privileges
  • Tracks all resource usage and enforces limits and policies
  • Administrator control of which containers are allowed to be used in cluster
  • GPU reservations passed into container
  • Administrator visibility of container use including: host, container name, tags, source repository, file path, size, install time, age, last used, last used by, with optional affinity
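To illustrate the first point, here is a sketch of what transparent access looks like from the user’s perspective. The julia profile referenced below is the one configured later in this article; my_script.jl and the docker run flags are placeholders.

# without a scheduler, the user must know the runtime and its flags
docker run --rm -v $HOME:/home julia:latest julia my_script.jl

# with LSF, the container details live in the application profile
bsub -app julia julia my_script.jl

The user submits an ordinary batch job; LSF pulls the image, starts the container and performs the filesystem mounts behind the scenes.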

Parlez-vous cloud native?

The advent of cloud native development and operating models based upon Kubernetes (also known as K8s) is being viewed with interest by the HPC community. Cloud native approaches can help open up HPC environments to a wide ecosystem of tools and middleware, which are key to the new era of converged AI and HPC environments.

K8s itself, however, has a very simple scheduler that is better suited to long-running services, but it allows this default scheduler to be extended or supplemented with other schedulers or scheduling policies. As mentioned above, LSF has an integration with K8s which establishes LSF as a single authoritative scheduler, allowing a cluster to be shared between LSF and K8s workloads. LSF acts as a pod scheduler for K8s workloads, allowing LSF scheduling and prioritization policies to be applied to pod placement, and allows these same policies to be applied consistently across K8s and LSF workloads. You can think of the integration as providing one “brain” to manage HPC and K8s workloads (a pod spec sketch follows the list below). This also means that:

  • LSF workflows can launch (persistent) containerized services on demand under K8s control
  • Services in K8s can launch LSF workloads as required
  • Direct LSF launch of containerized workloads can be mixed with K8s management
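To make the pod-scheduling handoff concrete, Kubernetes selects a scheduler per pod via the schedulerName field. The sketch below assumes the LSF connector registers under the name lsf; the actual name depends on how the integration is deployed in your cluster.

apiVersion: v1
kind: Pod
metadata:
  name: lsf-scheduled-pod
spec:
  schedulerName: lsf        # assumption: hand placement of this pod to LSF
  containers:
  - name: app
    image: julia:latest
    command: ["julia", "-e", "println(\"scheduled by LSF\")"]

With a spec like this, the default K8s scheduler ignores the pod and LSF applies its own placement policies.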

Containers in action

We’ll now take a closer look at how to enable container support in LSF and how to run a Docker containerized application. The example uses the Julia Programming Language container from Docker Hub.

  1. To start, LSF must be configured to run Docker jobs. The LSF documentation section Preparing LSF to run Docker jobs provides instructions on the configuration files that must be updated. The subsequent steps assume that this has been done.
  2. With support for Docker jobs enabled (see Step 1), we must configure the Docker image that we wish to use. LSF supports the CONTAINER keyword at both the queue level (lsb.queues) and the application profile level (lsb.applications) to specify a supported container for submitted jobs. This configuration is defined by the LSF administrator and specifies parameters such as the container image and options for container startup. For this example, a julia application profile is created in LSB_CONFDIR/<clustername>/configdir/lsb.applications (where LSB_CONFDIR is defined in lsf.conf). The julia application profile will automatically pull the julia:latest container image from Docker Hub.
Begin Application
NAME = julia
DESCRIPTION = Example Julia application
CONTAINER = docker[image(docker.io/julia:latest) \
options(--rm --net=host --ipc=host \
--cap-add=SYS_PTRACE \
-v /etc/passwd:/etc/passwd \
-v /etc/group:/etc/group \
-v /apps/repository:/apps/repository \
@/apps/repository/scripts/docker_options.sh \
) starter(root)]
EXEC_DRIVER = context[user(lsfadmin)] \
starter[/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/docker-starter.py] \
controller[/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/docker-control.py] \
monitor[/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/docker-monitor.py]
End Application

After updating lsb.applications with the above Julia application profile, the LSF administrator must run the badmin reconfig command for the changes to take effect.
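Applying and verifying the change might look like the following, run as the LSF administrator (the bapp output, which lists the profile’s parameters, is omitted here):

badmin reconfig      # reload the batch system configuration
bapp -l julia        # confirm the new application profile is active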

3. The julia application profile in Step 2 refers to a script, docker_options.sh, which mounts the user’s $HOME directory to /home inside the container. The script is created in a shared filesystem (/apps/repository/scripts) so that it is available on all servers in the LSF cluster. The contents of the script follow:

[root@rhserv1 ~]# more /apps/repository/scripts/docker_options.sh 
#!/bin/bash
echo "--volume $HOME:/home"
exit 0
[root@rhserv1 ~]# chmod 755 /apps/repository/scripts/docker_options.sh
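Because the options script is simply a program that prints extra docker run flags, it is a natural place for site-specific logic. As a purely hypothetical extension (not part of this setup), the script could also pass through the GPUs that LSF allocated to the job, relying on LSF setting CUDA_VISIBLE_DEVICES for GPU jobs:

#!/bin/bash
# mount the user's home directory into the container
echo "--volume $HOME:/home"
# hypothetical: expose only the GPUs LSF allocated to this job; note that
# multiple device IDs may need extra quoting for docker's --gpus parser
if [ -n "$CUDA_VISIBLE_DEVICES" ]; then
    echo "--gpus device=$CUDA_VISIBLE_DEVICES"
fi
exit 0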

4. Now we need an example Julia program to run in the containerized job that will be submitted to LSF. For this purpose, we will use the STREAM benchmark written in Julia from the research paper Benchmarking Julia’s Communication Performance: Is Julia HPC Ready or Full HPC?. The STREAM example is placed in the shared directory /apps/repository/julia.

[ibmuser@rhserv1 ~]$ cd /apps/repository/julia
[ibmuser@rhserv1 julia]$ git clone https://github.com/sebastian-steiner/STREAM.jl
Cloning into 'STREAM.jl'...
remote: Enumerating objects: 111, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 111 (delta 48), reused 91 (delta 32), pack-reused 0
Receiving objects: 100% (111/111), 587.91 KiB | 0 bytes/s, done.
Resolving deltas: 100% (48/48), done.

5. With the julia application profile configured and an example Julia program available in the shared filesystem, we’re now ready to submit the job to LSF. The -app julia parameter is specified on the LSF bsub submission command line. This applies the julia application profile to the submitted job, automatically pulling the Julia image from Docker Hub and starting a Julia container. Note that the output from the job will be written to $HOME/stream.<jobid>.

[ibmuser@rhserv1 ~]$ bsub -o $HOME/stream.%J -app julia julia /apps/repository/julia/STREAM.jl/stream.jl 
Job <2395> is submitted to default queue <normal>.

The job is started under the control of LSF. We can interrogate job details using the bjobs command as follows. This shows details about the job including PIDs, PGIDs, memory, and CPU utilization of the Julia container and running STREAM application.

[ibmuser@rhserv1 ~]$ bjobs -l 2395
Job <2395>, User <ibmuser>, Project <default>, Application <julia>, Status <RUN
>, Queue <normal>, Command <julia /apps/repository/julia/S
TREAM.jl/stream.jl>, Share group charged </ibmuser>
Wed Jun 2 18:07:46: Submitted from host <rhserv1.ibm.demo>, CWD <$HOME>, Outpu
t File </home/ibmuser/stream.2395>;
Wed Jun 2 18:07:46: Started 1 Task(s) on Host(s) <rhserv3.ibm.demo>, Allocated
1 Slot(s) on Host(s) <rhserv3.ibm.demo>, Execution Home <
/home/ibmuser>, Execution CWD </home/ibmuser>;
Wed Jun 2 18:08:14: Resource usage collected.
The CPU time used is 27 seconds.
MEM: 22.5 Gbytes; SWAP: 0 Mbytes; NTHREAD: 27
PGID: 24407; PIDs: 24407 24408 24409 24542
PGID: 24585; PIDs: 24585 24599
MEMORY USAGE:
MAX MEM: 22.5 Gbytes; AVG MEM: 17.3 Gbytes
GPFSIO DATA:
READ: ~0 bytes; WRITE: ~0 bytes
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
RESOURCE REQUIREMENT DETAILS:
Combined: select[(defined(docker)) && (type == any)] order[r15s:pg]
Effective: select[(defined(docker)) && (type == any)] order[r15s:pg]

We see in the code block above that the job is executing on server rhserv3.ibm.demo. We can log in to the server and use the docker ps command to confirm that the container is running.

[ibmuser@rhserv3 ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ea2d7dbf0291 julia:latest "/home/ibmuser/.lsba…" 57 seconds ago Up 56 seconds hungry_kepler

6. The Julia STREAM job runs to completion after a few minutes. The output from the job is shown below.

[ibmuser@rhserv1 ~]$ cat $HOME/stream.2395 
STREAM.jl
----------------------------------------------
Array size = 1000000000 (elements) Offset = 0 (elements)
Memory per array = 7629.395 MiB (= 7.451 GiB)
Total memory = 22888.184 MiB
Each kernel will be executed 100 times
The *best* time for each kernel (excluding the first run)
will be used to compute the reported bandwidth.
----------------------------------------------
Using 1 threads
Your clock granularity/precision appears to be 1041.000ns
Each test below will take on the order of 931384.518 microseconds
(= 894.000 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20,000 clock ticks per test.
----------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 12350.8 1.344049 1.295459 1.514592
Scale: 12412.2 1.330782 1.289059 1.512967
Add: 12873.9 1.941237 1.864232 2.186009
Triad: 12810.1 1.968008 1.873516 2.251539
Solution Validates: avg error less than 1.000000e-13 on all three arrays
------------------------------------------------------------
Sender: LSF System <lsfadmin@rhserv3.ibm.demo>
Subject: Job 2395: <julia /apps/repository/julia/STREAM.jl/stream.jl> in cluster <LSF_cluster> Done
Job <julia /apps/repository/julia/STREAM.jl/stream.jl> was submitted from host <rhserv1.ibm.demo> by user <ibmuser> in cluster <LSF_cluster> at Wed Jun 2 18:07:46 2021
Job was executed on host(s) <rhserv3.ibm.demo>, in queue <normal>, as user <ibmuser> in cluster <LSF_cluster> at Wed Jun 2 18:07:46 2021
</home/ibmuser> was used as the home directory.
</home/ibmuser> was used as the working directory.
Started at Wed Jun 2 18:07:46 2021
Terminated at Wed Jun 2 18:19:04 2021
Results reported at Wed Jun 2 18:19:04 2021
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
julia /apps/repository/julia/STREAM.jl/stream.jl
------------------------------------------------------------
Successfully completed.

Resource usage summary:

CPU time : 674.39 sec.
Max Memory : 23103 MB
Average Memory : 22665.95 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : 7
Max Threads : 37
Run time : 678 sec.
Turnaround time : 678 sec.
The output (if any) is above this job summary.

Conclusion

Containers provide significant benefits for HPC environments, including performance isolation, application encapsulation for ease of mobility (to the cloud) and application lifecycle management. We have shown how IBM Spectrum LSF can seamlessly run containerized workloads and how it supports the most commonly used container frameworks. As AI methods are increasingly adopted by users of HPC, containers offer the ideal vehicle for taking advantage of the rapidly evolving AI software ecosystem. Find out more about IBM Spectrum LSF here.
