Flexing your Arm muscle — for HPC

Gábor Samu
IBM Data Science in Practice
May 18, 2021 · 12 min read


From its humble beginnings in Cambridge, England, Arm has become a global leader in processor cores, which are at the heart of our digital world. But Arm is no longer synonymous with only mobile devices. Today, Arm processors power single board computers such as the Raspberry Pi and the latest gear from Apple, as well as the cloud (AWS Graviton) and the current world’s fastest supercomputer, Fugaku. Arm has been flexing its muscles and has shaken up the establishment. I have been working with Arm-based systems since before their rise to widespread adoption. A rarity in Canada in the 1990s, I was fortunate enough to have the opportunity to tinker with Arm-powered Acorn Archimedes computers at Olivetti Canada. Little did I know at the time how pervasive Arm would become.

Arm-powered systems in high performance computing (HPC) are no longer just a vision. They are making significant waves in HPC, tackling complex problems including COVID-19 research, and the software ecosystem around them has gained significant momentum.

An essential software component of any high performance computing cluster is a workload scheduler. Workload schedulers are akin to traffic police: they help ensure that jobs submitted to the cluster get access to the right resources at the right time. On the surface this may seem like an easy task, but when clusters contain thousands of servers, with users competing for resources against the backdrop of business priorities, workload schedulers are crucial for getting the most out of these valuable computing resources. A number of workload schedulers exist today, ranging from open source to closed source proprietary. IBM Spectrum LSF is a workload scheduler that supports Linux on x86-64, IBM Power (LE) and Arm processors. Spectrum LSF has a long pedigree in HPC workload scheduling, and IBM Spectrum LSF Community Edition is free to download (registration required) for use on up to 10 (dual-socket) servers with a maximum of 1000 active jobs.

Below, we’ll walk through the steps to install IBM Spectrum LSF Community Edition on a single, quad-core Arm Cortex-A72 based system, in this case a SolidRun MACCHIATObin running openSUSE Tumbleweed (aarch64). We’ll also cover running a few example jobs. In an earlier article, How do I love my HPC, let me count the ways, we touched on the importance of workload schedulers in high performance computing clusters. Why would I need IBM Spectrum LSF Community Edition on a single server? If you’re using a powerful deskside server to run processor- and memory-intensive workloads such as modelling and simulation, you need a way to ensure that these workloads don’t step on each other’s toes. IBM Spectrum LSF Community Edition will ensure that work is scheduled when it makes sense, and can handle job failures according to defined policies.

1. Download and extract

We’ll start by downloading the IBM Spectrum LSF Community Edition package for armv8 and the Quick start guide from the IBM Spectrum LSF Community Edition download page. Note that you are required to register for an IBMid (if you don’t have one already) in order to access the packages. Expanding the gzipped tarball reveals two tarballs: one armv8 package containing the binaries, and one installation package. Next, extract the lsfinstall tarball. This contains the installer for IBM Spectrum LSF Community Edition.

[root@flotta2 tmp]# ls
lsfsce10.2.0.11-armv8.tar.gz
[root@flotta2 tmp]# tar zxvf ./lsfsce10.2.0.11-armv8.tar.gz
lsfsce10.2.0.11-armv8/
lsfsce10.2.0.11-armv8/lsf/
lsfsce10.2.0.11-armv8/lsf/lsf10.1_lnx312-lib217-armv8.tar.Z
lsfsce10.2.0.11-armv8/lsf/lsf10.1_no_jre_lsfinstall.tar.Z
[root@flotta2 tmp]# cd lsfsce10.2.0.11-armv8/lsf/
[root@flotta2 lsf]# ls
lsf10.1_lnx312-lib217-armv8.tar.Z lsf10.1_no_jre_lsfinstall.tar.Z
[root@flotta2 lsf]# tar zxvf ./lsf10.1_no_jre_lsfinstall.tar.Z
lsf10.1_lsfinstall/
lsf10.1_lsfinstall/instlib/
lsf10.1_lsfinstall/instlib/lsflib.sh
lsf10.1_lsfinstall/instlib/lsferror.tbl
lsf10.1_lsfinstall/instlib/lsfprechkfuncs.sh
....
....

2. Configure the installer

After extracting the lsfinstall tarball, you’ll find the installation configuration file install.config. This file controls the installation settings, including the installation location, the LSF administrator account, the cluster name, the master node (where the scheduler daemons run), and the location of the binary packages. A diff is included below to show the settings used here. Note that the user account lsfadmin has been created on the system (and must exist on all servers in the LSF cluster):

  • installation location (LSF_TOP): /opt/ibm/lsfsce
  • LSF administrator account (LSF_ADMINS): lsfadmin
  • LSF cluster name (LSF_CLUSTER_NAME): Klaszter
  • scheduler node (LSF_MASTER_LIST): flotta2
  • location of LSF binary source packages (LSF_TARDIR): /tmp/lsfsce10.2.0.11-armv8/lsf (the location of lsf10.1_lnx312-lib217-armv8.tar.Z from step 1)
[root@flotta2 lsf10.1_lsfinstall]# diff -u1 ./install.config_org ./install.config
--- ./install.config_org 2021-04-08 12:08:05.677381501 -0400
+++ ./install.config 2021-04-08 12:22:54.098764027 -0400
@@ -42,3 +42,3 @@
# -----------------
-# LSF_TOP="/usr/share/lsf"
+LSF_TOP="/opt/ibm/lsfsce"
# -----------------
@@ -52,3 +52,3 @@
# -----------------
-# LSF_ADMINS="lsfadmin user1 user2"
+LSF_ADMINS="lsfadmin"
# -----------------
@@ -69,3 +69,3 @@
# -----------------
-# LSF_CLUSTER_NAME="cluster1"
+LSF_CLUSTER_NAME="Klaszter"
# -----------------
@@ -84,3 +84,3 @@
# -----------------
-# LSF_MASTER_LIST="hostm hosta hostc"
+LSF_MASTER_LIST="flotta2"
# -----------------
@@ -94,3 +94,3 @@
# -----------------
-# LSF_TARDIR="/usr/share/lsf_distrib/"
+LSF_TARDIR="/tmp/lsfsce10.2.0.11-armv8/lsf"
# -----------------
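Taken together, the active (uncommented) settings in install.config after these edits consolidate to the following five lines:

LSF_TOP="/opt/ibm/lsfsce"
LSF_ADMINS="lsfadmin"
LSF_CLUSTER_NAME="Klaszter"
LSF_MASTER_LIST="flotta2"
LSF_TARDIR="/tmp/lsfsce10.2.0.11-armv8/lsf"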

3. Running the installer

With the installation configuration file prepared, we can now invoke the IBM Spectrum LSF Community Edition installer (the installation output below has been truncated for brevity).

[root@flotta2 lsf10.1_lsfinstall]# ./lsfinstall -f ./install.config
Logging installation sequence in /tmp/lsfsce10.2.0.11-armv8/lsf/lsf10.1_lsfinstall/Install.log
[License acceptance]
LSF pre-installation check ...
Searching LSF 10.1 distribution tar files in /tmp/lsfsce10.2.0.11-armv8/lsf Please wait ...
1) linux3.12-glibc2.17-armv8
Press 1 or Enter to install this host type: 1
You have chosen the following tar file(s):
lsf10.1_lnx312-lib217-armv8
....
....
Installing linux3.12-glibc2.17-armv8 ...
Please wait, extracting lsf10.1_lnx312-lib217-armv8 may take up to a few minutes ...
...
lsfinstall is done.
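Optionally, each host can also be set up to start the LSF daemons automatically at boot using the hostsetup script that lsfinstall places under the installation directory. A minimal sketch, assuming the LSF_TOP of /opt/ibm/lsfsce configured above and the 10.1 version directory it creates:

[root@flotta2 lsf10.1_lsfinstall]# # path assumes LSF_TOP=/opt/ibm/lsfsce and LSF version 10.1
[root@flotta2 lsf10.1_lsfinstall]# /opt/ibm/lsfsce/10.1/install/hostsetup --top="/opt/ibm/lsfsce" --boot="y"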

4. Start me up

The installation typically goes quickly. With the installation completed, we are now ready to start IBM Spectrum LSF Community Edition so that it can accept and manage jobs. Before we do that, let’s cover some LSF concepts and terminology which will help to make sense of what we’re doing below.

LSF is designed to run on a cluster of hosts (or VMs), both on premises and in the cloud. An LSF cluster must contain a minimum of one management host, which you can think of as the orchestrator for the cluster. It runs the scheduler, which makes decisions on where to place work based on load information collected from all of the servers in the environment. The LSF management host runs the following daemons:

  • Management host LIM (Load Information Manager)
  • RES (Resource Execution Server)
  • SBATCHD (Server Batch Daemon)
  • MBATCHD (Management Batch Daemon)

LSF clusters can also be configured with a backup management node which is known as the management candidate host. This host can take over if the management host becomes unavailable for any reason. Failover from the management host to the management candidate host is seamless to the end users.

LSF servers are the worker bees in the cluster: these are the systems to which LSF will send jobs for processing. Note that the LSF management host can also run batch jobs if required. LSF server hosts run the following daemons:

  • Server host LIM (Load Information Manager)
  • RES (Resource Execution Server)
  • SBATCHD (Server Batch Daemon)

Our LSF cluster has only a single server in it, so it will act as both the LSF management host and an LSF server, and will be able to run jobs. There are a few commands we’ll be using below both to start LSF and to submit and manage work. A brief description of those commands follows:

  • lsadmin — Control LIM and RES daemon startup
  • badmin — Control SBATCHD and LSF batch system
  • lsid — Display basic information about the LSF cluster
  • lsload — Display load information for LSF hosts
  • bhosts — Display batch system status for LSF hosts
  • bsub — Submit a job to LSF
  • bjobs — Display information about LSF jobs

As the root user, source the environment for IBM Spectrum LSF Community Edition. This will configure the PATH and other needed environment variables. The LSF daemons are then started using the lsadmin and badmin commands.

[root@flotta2 conf]# pwd
/opt/ibm/lsfsce/conf
[root@flotta2 conf]# . ./profile.lsf
[root@flotta2 conf]# lsadmin limstartup
Starting up LIM on <flotta2> ...... done
[root@flotta2 conf]# lsadmin resstartup
Starting up RES on <flotta2> ...... done
[root@flotta2 conf]# badmin hstartup
Starting up server batch daemon on <flotta2> ...... done
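Should you ever need to stop the cluster (for example before maintenance), the corresponding shutdown subcommands of badmin and lsadmin can be used. A minimal sketch; the confirmation output is omitted:

[root@flotta2 conf]# # stop the batch daemon, RES and LIM on this host (output omitted)
[root@flotta2 conf]# badmin hshutdown flotta2
[root@flotta2 conf]# lsadmin resshutdown flotta2
[root@flotta2 conf]# lsadmin limshutdown flotta2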

As a non-root user, source the environment for IBM Spectrum LSF Community Edition as shown above and check the status of the cluster. We confirm below, using the commands lsid, lsload and bhosts, that the LSF cluster is up and running and ready to accept jobs. You can find out more about these commands in IBM Documentation.

[gsamu@flotta2 conf]$ pwd
/opt/ibm/lsfsce/conf
[gsamu@flotta2 conf]$ . ./profile.lsf
[gsamu@flotta2 conf]$ lsid
IBM Spectrum LSF Community Edition 10.1.0.11, Nov 12 2020
Copyright IBM Corp. 1992, 2016. All rights reserved.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
My cluster name is Klaszter
My master name is flotta2
[gsamu@flotta2 conf]$ lsload
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
flotta2 ok 0.0 0.5 0.2 3% 0.0 1 0 7014M 7.4G 9.7G
[gsamu@flotta2 conf]$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
flotta2 ok - 4 0 0 0 0 0

5. Submitting jobs

A sleep job is akin to running a “hello world” program in the world of HPC schedulers. So, for the first test, we submit a short sleep job as a non-root user in the cluster. A 10 second sleep job is submitted with output from the job written to the file output.<JOBID>. When the job completes, we’ll display the contents of the output file.

[gsamu@flotta2 ~]$ bsub -o output.%J /bin/sleep 10
Job <111> is submitted to default queue <normal>.
[gsamu@flotta2 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
111 gsamu RUN normal flotta2 flotta2 */sleep 10 Apr 8 13:33
[gsamu@flotta2 ~]$ bjobs
No unfinished job found
[gsamu@flotta2 ~]$ more output.111
Sender: LSF System <lsfadmin@flotta2>
Subject: Job 111: </bin/sleep 10> in cluster <Klaszter> Done
Job </bin/sleep 10> was submitted from host <flotta2> by user <gsamu> in cluster <Klaszter> at Thu Apr 8 13:33:42 2021
Job was executed on host(s) <flotta2>, in queue <normal>, as user <gsamu> in cluster <Klaszter> at Thu Apr 8 13:33:43 2021
</home/gsamu> was used as the home directory.
</home/gsamu> was used as the working directory.
Started at Thu Apr 8 13:33:43 2021
Terminated at Thu Apr 8 13:33:54 2021
Results reported at Thu Apr 8 13:33:54 2021
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
/bin/sleep 10
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 0.07 sec.
Max Memory : 10 MB
Average Memory : 8.00 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : 3
Max Threads : 4
Run time : 18 sec.
Turnaround time : 12 sec.
The output (if any) follows:
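Since /bin/sleep writes nothing to stdout, the output section above is empty. Note also that bsub options don’t have to be passed on the command line: an equivalent way to submit this job is a small script with embedded #BSUB directives, redirected into bsub. A minimal sketch (the script name sleep.lsf and the job name are arbitrary choices, not taken from the run above):

#!/bin/sh
# sleep.lsf: job name, output file and slot count supplied via embedded #BSUB directives
#BSUB -J sleep_test
#BSUB -o output.%J
#BSUB -n 1
/bin/sleep 10

The script is then submitted with input redirection:

[gsamu@flotta2 ~]$ bsub < sleep.lsf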

6. Another job submission example

I know: submitting a sleep job is enough to put you, the reader, to sleep. So let’s submit something that’s a bit more interesting: the ubiquitous High-Performance Linpack (HPL) benchmark. Because I’m running on a not-so-state-of-the-art system, the idea here is to show support for parallel workloads in LSF rather than Linpack performance.

Although OpenMPI is available as a package on openSUSE Tumbleweed, it’s not compiled with support for LSF. So we’ll first build OpenMPI with support for LSF. The latest available release, OpenMPI v4.1.1, is used; it will be compiled and installed to /opt/openmpi-4.1.1.

[gsamu@flotta2 ~]$ cd $HOME 
[gsamu@flotta2 ~]$ mkdir MPI
[gsamu@flotta2 ~]$ cd MPI
[gsamu@flotta2 MPI]$ wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
--2021-05-14 10:07:23-- https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
Resolving download.open-mpi.org (download.open-mpi.org)... 2600:9000:2000:1000:16:d6ed:ec80:93a1, 2600:9000:2000:3000:16:d6ed:ec80:93a1, 2600:9000:2000:3400:16:d6ed:ec80:93a1, ...
Connecting to download.open-mpi.org (download.open-mpi.org)|2600:9000:2000:1000:16:d6ed:ec80:93a1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17669382 (17M) [binary/octet-stream]
Saving to: ‘openmpi-4.1.1.tar.gz’
openmpi-4.1.1.tar.g 100%[===================>] 16.85M 22.9MB/s in 0.7s
2021-05-14 10:07:25 (22.9 MB/s) - ‘openmpi-4.1.1.tar.gz’ saved [17669382/17669382]
[gsamu@flotta2 MPI]$ tar zxvf openmpi-4.1.1.tar.gz
[gsamu@flotta2 MPI]$ cd openmpi-4.1.1
[gsamu@flotta2 openmpi-4.1.1]$ ./configure --prefix=/opt/openmpi-4.1.1 --enable-orterun-prefix-by-default --disable-getpwuid --with-lsf
...
...
...
...
Open MPI configuration:
-----------------------
Version: 4.1.1
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
MPI Build Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)
Miscellaneous
-----------------------
CUDA support: no
HWLOC support: internal
Libevent support: internal
PMIx support: Internal
Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: no
OpenFabrics OFI Libfabric: no
OpenFabrics Verbs: yes
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes
Resource Managers
-----------------------
Cray Alps: no
Grid Engine: no
LSF: yes
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no
OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no
Lustre: no
PVFS2/OrangeFS: no
[gsamu@flotta2 openmpi-4.1.1]$ time make -j4
...
...
...
...
real 12m2.782s
user 25m29.681s
sys 5m49.181s
[root@flotta2 openmpi-4.1.1]# make install
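Before moving on, it’s worth confirming that the LSF integration really made it into the build. One quick check (a sketch, assuming the /opt/openmpi-4.1.1 install location used here) is to ask ompi_info about LSF-related components:

[gsamu@flotta2 ~]$ # list MCA components mentioning LSF; expect ras/plm lsf entries if --with-lsf took effect
[gsamu@flotta2 ~]$ /opt/openmpi-4.1.1/bin/ompi_info | grep -i lsf

The configure summary above already reported LSF: yes under Resource Managers, so this is simply a belt-and-braces check on the installed copy.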

With OpenMPI ready, we can now move on to compiling HPL. HPL is compiled against OpenMPI v4.1.1 (built above) and the OpenBLAS libraries supplied by the OS. HPL will be compiled in /opt/HPL/hpl-2.3.

[gsamu@flotta2 ~]$ cd /opt
[gsamu@flotta2 opt]$ mkdir HPL
[gsamu@flotta2 opt]$ cd HPL
[gsamu@flotta2 HPL]$ wget http://netlib.org/benchmark/hpl/hpl-2.3.tar.gz
--2021-05-14 10:45:42-- http://netlib.org/benchmark/hpl/hpl-2.3.tar.gz
Resolving netlib.org (netlib.org)... 160.36.131.221
Connecting to netlib.org (netlib.org)|160.36.131.221|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 660871 (645K) [application/x-gzip]
Saving to: ‘hpl-2.3.tar.gz’
hpl-2.3.tar.gz 100%[===================>] 645.38K 933KB/s in 0.7s
2021-05-14 10:45:43 (933 KB/s) - ‘hpl-2.3.tar.gz’ saved [660871/660871]
[gsamu@flotta2 HPL]$ tar zxvf hpl-2.3.tar.gz

With the HPL tarball downloaded and expanded, we need to create a Makefile for Linux on aarch64. We start from the generic Makefile template provided in the hpl-2.3/setup directory.

[gsamu@flotta2 HPL]$ cd hpl-2.3/setup
[gsamu@flotta2 setup]$ source make_generic
[gsamu@flotta2 setup]$ cp Make.UNKNOWN Make.Linux_aarch64

A diff is provided below which details the changes made to the Makefile to support Linux on aarch64.

[gsamu@flotta2 setup]$ diff -u1 ./Make.UNKNOWN ./Make.Linux_aarch64
--- ./Make.UNKNOWN 2021-05-14 10:51:20.242423277 -0400
+++ ./Make.Linux_aarch64 2021-05-14 11:49:38.555280377 -0400
@@ -63,3 +63,3 @@
#
-ARCH = UNKNOWN
+ARCH = Linux_aarch64
#
@@ -69,8 +69,8 @@
#
-TOPdir = $(HOME)/hpl
-INCdir = $(TOPdir)/include
-BINdir = $(TOPdir)/bin/$(ARCH)
-LIBdir = $(TOPdir)/lib/$(ARCH)
+TOPdir = /opt/HPL/hpl-2.3
+INCdir = /opt/HPL/hpl-2.3/include
+BINdir = /opt/HPL/hpl-2.3/bin/$(ARCH)
+LIBdir = /opt/HPL/hpl-2.3/lib/$(ARCH)
#
-HPLlib = $(LIBdir)/libhpl.a
+HPLlib = /opt/HPL/hpl-2.3/lib/$(ARCH)/libhpl.a
#
@@ -96,3 +96,3 @@
LAinc =
-LAlib = -lblas
+LAlib = -lopenblas
#

With the Makefile ready, build the xhpl binary. Note that we set the PATH and LD_LIBRARY_PATH here to reference the OpenMPI v4.1.1 installation.

[gsamu@flotta2 setup]$ export PATH=/opt/openmpi-4.1.1/bin:$PATH
[gsamu@flotta2 setup]$ export LD_LIBRARY_PATH=/opt/openmpi-4.1.1/lib:$LD_LIBRARY_PATH
[gsamu@flotta2 setup]$ cd ..
[gsamu@flotta2 hpl-2.3]$ ln -s ./setup/Make.Linux_aarch64 ./Make.Linux_aarch64
[gsamu@flotta2 hpl-2.3]$ make arch=Linux_aarch64

Our xhpl binary is now ready and is located in /opt/HPL/hpl-2.3/bin/Linux_aarch64. Before submitting xhpl to LSF for execution, the parameter file HPL.dat is tuned for my system (amount of RAM, number of cores, and so on).
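For reference, the HPL.dat values that matter most here, matching the parameters echoed in the benchmark output further below, would look something like this (an illustrative excerpt, not a complete HPL.dat):

1            # of problems sizes (N)
40000        Ns
1            # of NBs
192          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs

The remaining algorithmic parameters (PFACT, NBMIN, BCAST and so on) can be seen echoed in the job output below. With HPL.dat in place, xhpl is submitted to LSF requesting 4 cores.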

[gsamu@flotta2 ~]$ cd /opt/HPL/hpl-2.3/bin/Linux_aarch64
[gsamu@flotta2 Linux_aarch64]$ bsub -n 4 mpirun -np 4 ./xhpl
Job <103> is submitted to default queue <normal>.

The xhpl job was successfully submitted above and was scheduled on 4 cores by LSF. As the job runs, we can view resource utilization and output using the bjobs and bpeek commands respectively.

[gsamu@flotta2 ~]$ bjobs -l 103
Job <103>, User <gsamu>, Project <default>, Status <RUN>, Queue <normal>, Command <mpirun -np 4 ./xhpl>, Share group charged </gsamu>
Fri May 14 12:01:05: Submitted from host <flotta2>, CWD </opt/HPL/hpl-2.3/bin/Linux_aarch64>, 4 Task(s);
Fri May 14 12:01:05: Started 4 Task(s) on Host(s) <flotta2> <flotta2> <flotta2> <flotta2>, Allocated 4 Slot(s) on Host(s) <flotta2> <flotta2> <flotta2> <flotta2>, Execution Home </home/gsamu>, Execution CWD </opt/HPL/hpl-2.3/bin/Linux_aarch64>;
Fri May 14 13:56:45: Resource usage collected.
The CPU time used is 25616 seconds.
MEM: 12.2 Gbytes; SWAP: 13.7 Gbytes; NTHREAD: 31
PGID: 2668; PIDs: 2668 2669 2681
PGID: 2686; PIDs: 2686
PGID: 2687; PIDs: 2687
PGID: 2688; PIDs: 2688
PGID: 2691; PIDs: 2691
MEMORY USAGE:
MAX MEM: 12.3 Gbytes; AVG MEM: 12.2 Gbytes
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg]
Effective: select[type == local] order[r15s:pg]
[gsamu@flotta2 ~]$ bpeek 103
<< output from stdout >>
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 40000
NB : 192
PMAP : Row-major process mapping
P : 2
Q : 2
PFACT : Left Crout Right
NBMIN : 2 4
NDIV : 2
RFACT : Left Crout Right
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR00L2L2 40000 192 2 2 2365.61 1.8037e+01
HPL_pdgesv() start time Fri May 14 12:01:28 2021
HPL_pdgesv() end time Fri May 14 12:40:53 2021
--------------------------------------------------------------------------------
...
...

Conclusion

Workload schedulers are a key ingredient in high performance computing clusters. IBM Spectrum LSF builds on over 28 years of experience in resource and workload management for HPC and can meet your scheduling needs, from a few servers to the world’s largest supercomputers. We’ve demonstrated how quickly you can get up and running with IBM Spectrum LSF Community Edition on Arm, and we’ve only scratched the surface of the powerful capabilities provided by IBM Spectrum LSF. From advanced GPU support to containerized jobs and dynamic hybrid cloud capabilities, the IBM Spectrum LSF family of products is your perfect companion on the HPC journey. Learn more about the IBM Spectrum LSF family of products here.

Gábor Samu
IBM Data Science in Practice

Senior Product Manager at IBM specializing in Spectrum Computing products. Over 20 years of experience in high performance computing technology. Retro computing fan.