Supercomputers like Oak Ridge National Lab’s Summit schedule HPC jobs at super scale to solve problems where nanoseconds matter.

Supercomputing with HashiCorp

John Boero
HashiCorp Solutions Engineering Blog
Mar 13, 2020


HashiCorp isn’t just about digital transformation and the cloud. Nomad and Consul are also great for heterogeneous workloads. We often say that, but what does it look like? HashiCorp engineering supports x86 and ARM builds across multiple variants and OS releases, but what about Power and IBM Z?

With the widespread usage of x86 processors in clouds, it’s important to remember that other platforms still exist and are still very relevant. Don’t forget that the current #1 publicly listed supercomputer in the world [Update: the top slot has since been taken by Japan with an ARM-based system] is the US DOE’s $325 million Summit, which runs RHEL on 9,216 POWER9 22-core CPUs with 27,648 NVIDIA Tesla V100 GPUs (5,120 cores each). Summit has recently been assigned to tackle protein folding for potential COVID-19 treatments.

Since some of my travel has been cancelled because of COVID-19, I’ve taken some time to experiment with alternative architectures and our HashiCorp products, including POWER9 and IBM Z (S390x). The results have been surprisingly smooth. Note that these results are not (currently) supported by HashiCorp engineering and are provided by me as-is without warranty. Just a warning: this article is highly technical and gritty. In fact I can practically feel my beard grow as I write it.

Consul build on Power9. Nomad works too!

What are the benefits of doing this? With Consul and/or Nomad running on a local PPC or S390x installation, one can schedule heavy HPC jobs at scale locally, with or without containers, and have them seamlessly connect to or from an external cloud application using automatic mTLS. Sometimes it’s helpful to securely query the on-prem mainframe data warehouse from apps on a Kubernetes cluster in the public cloud. Another possibility is running Consul Mesh Gateway on common network hardware, which would be pretty handy for meshing your cloud natively into your local environment. Notice that the PPC binary is even Cisco compatible, since a fair amount of Cisco network gear runs on PowerPC.

Building for PPC and S390X

IBM’s PowerPC architecture has been around for quite a long time. Many estates have large investments in PowerPC servers, which are usually deployed in a standard rack-mount chassis alongside traditional x86 servers. PowerPC’s legacy includes old Mac hardware, as the architecture was specified jointly by IBM, Motorola, and Apple. These RISC chips have vastly different architectures from CISC x86, including my favourite: big-endian storage order. This means that multi-byte values appear in natural order, whereas x86 stores them reversed. You can’t just throw an x86 binary onto a ppc64le system; you’ll need to build binaries from source.

Guess the order from the hex dump! PPC has since enabled dynamic switching of endian order.
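If you’d rather check than guess, a few stock commands will tell you what a system and a given binary were built for (the consul path here is just an example):

$ uname -m                   # ppc64, ppc64le, s390x, or x86_64
$ lscpu | grep 'Byte Order'  # Big Endian vs Little Endian
$ file $(which consul)       # reports LSB (little endian) or MSB (big endian) plus the target architecture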

IBM Z, or S390x, by contrast has the form factor of a mainframe. In its current iteration, the z15 spans one to four standard racks, linked up to fit into standard DC environments. These aren’t your typical legacy mainframes; these are modern beasts with plenty of horsepower. They tend to cost from a few hundred thousand to a few million USD. If you don’t have at least one spare $1M mainframe sitting around your lab, you can emulate one for free with QEMU or Hercules. In fact, all of my PPC and S390x build environments are emulated within QEMU on x86_64. ARM can also be emulated with QEMU if you’re evaluating an ARM installation. You obviously won’t see the true performance of native hardware, but for a build environment you won’t need to.

Environment

You can follow along and set up your own environment with me. I’ll be doing this on my Xeon Linux workstation with KVM, but it can also be done on a Mac or Windows PC with pure QEMU. Fedora’s standard repositories contain all the QEMU variants I need. A full list of supported platforms is here:

https://wiki.qemu.org/Documentation/Platforms
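On Fedora, the emulators used below come straight from the standard repos (package names as Fedora splits its QEMU builds; other distros package them differently):

$ sudo dnf install -y qemu-img qemu-system-ppc qemu-system-s390x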

The tricky part is getting a valid OS installed on such a VM. Fedora and Ubuntu are currently built for both PPC and S390x; CentOS is available for PPC, but on S390x the Red Hat route is a special RHEL subscription. In the end I’ve had the best luck with CentOS 7 on Power and Fedora 29 on S390x. In my experience Fedora 30 and 31 fail with RPMDB corruption during install. Ubuntu Server works great too. Ideally this would be added as part of our automated CD pipeline using QEMU for build and testing. I’ll start with Power9, which ended up being much simpler than S390x. I keep a spare drive just for things like VM and ISO images at /aux:

$ qemu-img create -f qcow2 -o size=40G /aux/qemu/power_el7.qcow2
# user-mode NAT (-netdev user) gives the NIC a backend to attach to; swap in a tap/bridge backend if you need inbound access
$ qemu-system-ppc64 -M pseries -cpu POWER9 -smp 12 -m 16G \
-nographic \
-device virtio-scsi \
-drive file=/aux/qemu/power_el7.qcow2 \
-netdev user,id=tap0 \
-device virtio-net-pci,netdev=tap0,mac=12:47:5a:72:7b:20 \
-drive file=/aux/iso/CentOS-7-ppc64-Minimal-1908.iso,format=raw,if=none,id=c1 \
-device scsi-cd,drive=c1

I have 2x6 physical CPUs with 128GB RAM, so I set -smp to 12 CPUs and -m to 16GB; adjust CPU and RAM for your own needs. You can actually leave off the -nographic flag if you want a GUI installer. Once the OS install finishes, we’ll remove the last two lines involving the install ISO and continue booting from just the qcow2 disk. While this installs we can start the S390x install in another session:

$ qemu-img create -f qcow2 -o size=40G /aux/qemu/s390x_fed.qcow2
$ qemu-system-s390x -M s390-ccw-virtio -cpu qemu,vx=off -m 16G \
-nographic \
-hda /aux/qemu/s390x_fed.qcow2 \
-device virtio-scsi \
-drive file=/aux/iso/Fedora-Server-dvd-s390x-29-1.2.iso,format=raw,if=none,id=c1 \
-device scsi-cd,drive=c1

Note that the flag “-cpu qemu,vx=off” is very important. I spent almost a day troubleshooting and inspecting packets before it turned out to be a bug when the emulated vector (vx) facility is enabled: TLS handshakes go wrong. Also, don’t specify multiple CPUs. S390x emulation doesn’t currently support multi-threaded operation, whereas ppc64le emulation does.

Don’t you hate it when you’re just trying to shake hands and you accidentally divide by zero using both integers and floating points?
Fetching https://golang.org/x/tools/cmd?go-get=1 
https fetch failed: Get https://golang.org/x/tools/cmd/cover?go-get=1: net/http: TLS handshake timeout
panic: runtime error: integer divide by zero
[signal SIGFPE: floating-point exception code=0x1 addr=0x2181c2 pc=0x2181c6]

This issue is common across a few emulated platforms, including RISC-V. Note that S390x, unlike ppc64le, is fixed big endian, which could explain the unexpected zero. Go is unable to fetch modules over TLS during the build, and in my experience the OS often won’t install either. Don’t forget the “vx=off” flag. It disables the vector facility and really slows things down, but it is required to make things work.
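A quick sanity check from inside the guest: the s390x kernel lists the vector facility as a “vx” flag in /proc/cpuinfo, so with vx=off it should be absent.

$ grep -wo vx /proc/cpuinfo || echo "vector facility disabled"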

Now, while the systems install, you can walk away for a coffee. Emulation can be sluggish. If your system installs without issue and you have an RPM-based distro, you can use my SRPM specs to automatically build and generate an RPM for the local platform. First we need to install Go and set up our Go environment. Rather than installing the older Go 1.10 from the repos, download and install Go 1.13.8 directly:

$ wget https://dl.google.com/go/go1.13.8.linux-ppc64le.tar.gz
# on the S390x guest, fetch go1.13.8.linux-s390x.tar.gz instead
$ tar -C $HOME -xzf go1.13.8.linux-ppc64le.tar.gz
$ export GOROOT=$HOME/go
$ export GOPATH=$HOME/gopath
$ export PATH=$GOROOT/bin:$GOPATH/bin:$PATH

All we need now is make and rpm-build, installed below. With that we have a build image. You can use this QCOW2 as-is or import it into virt-manager for simplicity. It may take a few tries to get all the dependencies (Go reminds me of my Gentoo days), but I tend to get a stable build.
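On CentOS or Fedora that is a one-liner (git and gcc included in case the minimal install lacks them):

$ sudo yum install -y make rpm-build git gcc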

There is a trick to Nomad though: the target architectures must be added to its GNUmakefile. A PR can be found here. Once this is added, the proper build targets are available.

$ wget https://raw.githubusercontent.com/jboero/hashicorpcopr/master/consul.srpm.spec
$ wget https://raw.githubusercontent.com/jboero/hashicorpcopr/master/nomad.srpm.spec
$ rpmbuild --undefine=_disable_source_fetch -ba consul.srpm.spec
$ rpmbuild --undefine=_disable_source_fetch -ba nomad.srpm.spec
$ sudo yum install ~/rpmbuild/RPMS/*/*.rpm

Now simply adjust the config in /etc/nomad and /etc/consul.d and start the services. For Nomad to run exec jobs, make sure the agent runs as root.
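Assuming the RPMs install systemd units named consul and nomad, bringing the agents up and checking that they joined looks roughly like this:

$ sudo systemctl enable --now consul nomad
$ consul members       # the new agent should appear in the member list
$ nomad node status    # the ppc64le or s390x client should show up as ready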

Nomad showcasing happy Power9 and Z agents among my x86 machines.
I can use arch metadata tags to start certain jobs only on ppc64le machines.
Power9 and Z happy next to my Xeons showing service discovery and Mesh in Consul.
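Speaking of arch metadata tags, here is a sketch of pinning a job to the Power clients with a constraint on Nomad’s cpu.arch attribute (job, group, and task names are made up; the value assumes Nomad reports Go’s GOARCH, i.e. ppc64le):

$ cat > power-only.nomad <<'EOF'
job "hpc-batch" {
  datacenters = ["dc1"]
  # only place this job on ppc64le clients
  constraint {
    attribute = "${attr.cpu.arch}"
    value     = "ppc64le"
  }
  group "batch" {
    task "hello" {
      driver = "exec"
      config {
        command = "/bin/echo"
        args    = ["hello from Power"]
      }
    }
  }
}
EOF
$ nomad job run power-only.nomad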

Remember that the Summit cluster has 27,648 NVIDIA GPUs? Nomad’s NVIDIA device plugin can help schedule jobs using these within containers, or raw_exec can be enabled to take control of the raw hardware.
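Enabling raw_exec is a small client config change (the plugin block is standard Nomad config; the file path just matches the /etc/nomad directory used above):

$ sudo tee /etc/nomad/raw_exec.hcl > /dev/null <<'EOF'
plugin "raw_exec" {
  config {
    enabled = true
  }
}
EOF
$ sudo systemctl restart nomad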

I will publish the builds for those who don’t want to do this from scratch, but for now it’s an interesting use case. I’ve worked with HPC clusters of thousands of mixed nodes, but now I can emulate my own flock of mainframes and Power nodes.

Conclusion

Now that we can install Consul and Nomad, we can schedule workloads and mesh them directly into the public cloud. What does that mean? Essentially, my mainframe and Power estate can connect directly and securely to and from my Consul services in the cloud. Next time I’ll demonstrate how to do that with the new HashiCorp Consul Service on Azure, and I’ll try to connect a Kubernetes pod in Azure directly to a database running on our Power environment.
