Can you schedule heterogeneous jobs and containers in the cloud without systemd? What comes after systemd?

Nomad vs systemd

John Boero
HashiCorp Solutions Engineering Blog
5 min read · Jun 25, 2020

With another HashiConf in the books, I’m feeling another wild experiment coming on. At every HashiConf I get a few questions about Terraform that tell me a lot of people have the wrong idea about it. Since it’s one of our most popular products, people tend to try to do everything with it, including configuring and deploying services via provisioners and pull requests. That is the job of schedulers or orchestrators, not infrastructure provisioning tools.

Terraform apply is an inefficient way to start/stop a service.

Even some of our own Terraform examples use nested heredocs to install binaries, configure systemd units, and start services. This is an anti-pattern, but it demonstrates legacy support for systemd. What if you didn’t need to mess with any of that? What if we cut out systemd completely and ran Nomad as init? A lot of ISVs have tried to build minimal Linux distributions for cloud and cluster nodes. CoreOS did a great job forking a minimal Gentoo for running containers and Kubernetes clusters. But Kubernetes only does containers, so systemd was still included to manage local system resources. There have even been feature requests for Nomad to support systemd units via a task driver. Nomad, by contrast, can both schedule local services and orchestrate containers. A feature comparison makes sense of it:

Scheduler comparison of Nomad and systemd at a glance.

Whether you love or hate systemd, its rollout has been controversial and problematic. Written in C, it has brought numerous memory leaks and buffer-related CVEs into critical production environments. Nomad certainly hasn’t been perfect, but because it is written in Go, a memory-safe, garbage-collected language, it is far less prone to those classes of bug. It’s also built as a static Go binary without the 41+ dynamic library dependencies of systemd. This nitty-gritty experiment is to figure out what constitutes a bare-minimum image for running a Nomad agent.

Why do we need systemd to run on every node in a cluster just to support the cluster orchestrator?

To skip kernel modules, I build a minimal kernel from the latest 5.8-rc source with the KVM guest profile and run it directly, with no initrd and a custom init:

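$ git clone --depth=1 \
    git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ make defconfig          # base .config for kvm_guest.config to merge into
$ make kvm_guest.config   # enable the KVM guest options (virtio etc.)
$ make -j"$(nproc)"       # one build job per CPU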

This means it has the virtio and e1000 network drivers built in, with a small 9MB bzImage. It also builds in lo (the loopback interface), which funnily enough still ships as a module with most kernels. If you build kernel v5.13 or later you also get the advantage of zstd compression. Nomad jobs can add kernel modules later if needed. Init is run without arguments and the Nomad client needs network access, so rather than pointing the kernel straight at the Nomad binary I wrote a separate init binary in just under 100 lines of C. It performs minimal prep work:

  1. Remount the root filesystem read/write, since it mounts read-only by default.
  2. Mount /proc, /sys, and related virtual filesystems.
  3. Use ioctl calls to bring up the lo and eth0 network interfaces.
  4. Use sdhcp to obtain an address for eth0.
  5. Fork the Nomad agent and wait indefinitely.
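
For readers who think in shell rather than C, the five steps map roughly onto the commands below. This is only an illustration of the sequence, not the author's init: the real one is a C binary, and the minimal image has no shell or `ip` tool at all. The `init_steps` function name, the `RUN` dry-run hook, the `/etc/nomad.d` config directory, and the exact `sdhcp` invocation are all assumptions made for the sketch.

```shell
#!/bin/sh
# Illustrative shell equivalent of the five init steps (the real init is C).
# RUN is a hypothetical dry-run hook: set RUN=echo to print the commands
# instead of executing them.
RUN=${RUN:-}

init_steps() {
    # 1. The kernel mounts root read-only by default; remount it read/write.
    $RUN mount -o remount,rw /
    # 2. Mount the virtual filesystems (others would follow the same pattern).
    $RUN mount -t proc proc /proc
    $RUN mount -t sysfs sysfs /sys
    # 3. Bring the interfaces up (the C init does this with ioctl calls).
    $RUN ip link set lo up
    $RUN ip link set eth0 up
    # 4. Ask DHCP for an address on eth0 (assumed sdhcp invocation).
    $RUN sdhcp eth0
    # 5. Start the Nomad agent in the background, then block until it exits.
    #    The config directory is an assumed path.
    $RUN nomad agent -config=/etc/nomad.d &
    wait
}
```

Setting `RUN=echo` before calling `init_steps` prints the sequence without touching the system, which is the only safe way to try it on a normal machine.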
Here is a VM running only the Linux kernel and Nomad agent. It boots in just over 1 second and uses just 20MB of RAM idle.
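
A direct kernel boot of this kind can be sketched with QEMU's `-kernel` and `-append` options, which load the kernel and pass its command line without any bootloader or initrd. The image file name, the root device (`/dev/vda`), the `/init` path, the memory size, and the `boot_vm` function and `RUN` hook are assumptions for illustration; the author's exact command line is not given in the article.

```shell
#!/bin/sh
# Sketch of a direct kernel boot under QEMU/KVM with no initrd.
# KERNEL, IMAGE and the root device are assumptions, not the author's values.
KERNEL=${KERNEL:-arch/x86/boot/bzImage}
IMAGE=${IMAGE:-nomad.qcow2}

boot_vm() {
    # RUN is a hypothetical dry-run hook: RUN=echo prints the command instead.
    ${RUN:-} qemu-system-x86_64 \
        -enable-kvm \
        -m 256 \
        -kernel "$KERNEL" \
        -append "root=/dev/vda console=ttyS0 init=/init" \
        -drive "file=$IMAGE,format=qcow2,if=virtio" \
        -netdev user,id=net0 \
        -device virtio-net-pci,netdev=net0 \
        -nographic
}
```

Calling `boot_vm` from the kernel source directory would start the guest on the serial console. The root filesystem is left read-only on the kernel command line because the custom init remounts it read/write itself.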

While KVM can boot the kernel in about 1.2 seconds, DHCP and Nomad startup take a bit longer, about 10 seconds in total. Static IPs could shave off about 5 seconds. Also note that at boot we don’t have Docker, cgroups, or Java available, so the corresponding task drivers are missing or unhealthy. The good news is that this is just enough for Nomad to wake up and take orders from the Nomad cluster. If the cluster’s jobs include system services such as Docker or Java, they can be deployed and started automatically without a redeploy or a Terraform apply. Nomad jobs allow artifacts to be installed as needed at runtime. It would also be wise to include the Consul agent, although the image doesn’t strictly require it.

The raw qcow2 image with dependencies comes to 177MB. With UPX binary packing and QEMU compaction, that drops to 54MB. An XZ archive without UPX packing brings it down to about 29MB. This constitutes a bootable system image without systemd, potentially smaller than the 45MB Nomad zip download.
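
The three size reductions correspond to three standard tools. The sketch below shows one plausible invocation of each; the flags the author actually used are not stated, and the function names, file names, and `RUN` dry-run hook are placeholders.

```shell
#!/bin/sh
# Sketch of the three size reductions; all file names are placeholders.
# RUN is a hypothetical dry-run hook: RUN=echo prints each command instead.

pack_binary() {
    # Compress an executable in place with UPX
    # (done before the binary is copied into the image).
    ${RUN:-} upx --best "$1"
}

compact_image() {
    # Rewrite a qcow2 image with compressed clusters, dropping unused space.
    ${RUN:-} qemu-img convert -c -O qcow2 "$1" "$2"
}

archive_image() {
    # Write an .xz archive next to the original (-k keeps the input file).
    ${RUN:-} xz -9 -k "$1"
}
```

For example, `compact_image nomad.qcow2 nomad-small.qcow2` followed by `archive_image nomad-small.qcow2` would produce both a compacted image and an archive of it.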

Note that the deployment and configuration of services running on a Nomad cluster is state managed centrally on the Nomad servers. It requires no provisioning or Terraform, and no synchronization of systemd unit files across all nodes.

[UPDATE: 08-OCT-2020] As Bill Gates once said, I finally got around to adding a bootloader. Adding extlinux and the kernel directly into the image yields a fully bootable image, with no need to specify boot options in QEMU. This is an ideal image for Nomad Autoscaler. The updates are included in the repo linked at the bottom of this article.
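
As a rough idea of what the bootloader step involves, the sketch below writes a minimal extlinux configuration into a boot directory. The label, kernel path, root device, and the `write_extlinux_conf` helper are assumptions for illustration; the author's actual configuration is in the repo.

```shell
#!/bin/sh
# Sketch: write a minimal extlinux.conf into a boot directory.
# The kernel path and root device are assumptions, not the author's values.

write_extlinux_conf() {
    boot_dir=$1
    mkdir -p "$boot_dir"
    cat > "$boot_dir/extlinux.conf" <<'EOF'
DEFAULT nomad
LABEL nomad
    KERNEL /boot/bzImage
    APPEND root=/dev/sda1 init=/init console=ttyS0
EOF
}
```

With the image's filesystem mounted at a hypothetical `/mnt`, `write_extlinux_conf /mnt/boot` followed by `extlinux --install /mnt/boot` would install the loader files alongside the config. A master boot record and an active partition are still needed for the disk to boot on its own.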

Essentially “NomadOS” booted inside GCP via custom 54MB image.

Systemd to Nomad

What about starting services? We can start by running some services that would normally be controlled by systemd. Remember there was a Nomad feature request to add systemd units as a supported task driver. That would be nice but then I need to keep my systemd units around. Wouldn’t it be better if we could convert systemd units directly to Nomad system jobs? I created a crude way to do that with a few lines of bash and templating. There are options to output JSON or HCL versions:

Instantly convert basic systemd units to Nomad system jobs. Not all attributes are supported but it’s a start.
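
To give a feel for what such a conversion involves, here is a minimal sketch that handles only the `ExecStart=` and `User=` lines of a unit and prints an HCL system job. It is not the author's script (that lives in the repo). The `unit_to_job` function, the `dc1` datacenter, and the choice of the `raw_exec` driver are assumptions, and quoted arguments, systemd specifiers, and `ExecStart=` prefixes are not handled.

```shell
#!/bin/sh
# Sketch: turn the ExecStart= and User= lines of a simple systemd unit into a
# Nomad system job in HCL. Not the author's script; names are assumptions.

unit_to_job() (
    unit=$1
    name=$(basename "$unit" .service)

    # Keep only the last ExecStart= and User= lines; everything else is ignored.
    exec_start=$(sed -n 's/^ExecStart=//p' "$unit" | tail -n 1)
    user=$(sed -n 's/^User=//p' "$unit" | tail -n 1)
    if [ -z "$exec_start" ]; then
        echo "no ExecStart= line in $unit" >&2
        exit 1
    fi

    # Naive whitespace split: first word is the command, the rest are args.
    set -f                      # no glob expansion while splitting
    set -- $exec_start
    cmd=$1
    shift
    args=""
    for arg in "$@"; do
        args="${args:+$args, }\"$arg\""
    done

    printf 'job "%s" {\n' "$name"
    printf '  datacenters = ["dc1"]   # assumed datacenter name\n'
    printf '  type        = "system"\n'
    printf '  group "%s" {\n' "$name"
    printf '    task "%s" {\n' "$name"
    printf '      driver = "raw_exec"   # assumed driver\n'
    if [ -n "$user" ]; then
        printf '      user   = "%s"\n' "$user"
    fi
    printf '      config {\n'
    printf '        command = "%s"\n' "$cmd"
    printf '        args    = [%s]\n' "$args"
    printf '      }\n    }\n  }\n}\n'
)
```

Running `unit_to_job consul.service > consul.nomad` on a unit containing `User=consul` and an `ExecStart=` line for the Consul agent would emit a job named `consul` that carries the same command, arguments, and user.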

This system job will attempt to run on all applicable nodes in the cluster. I would need to add an artifact to it for Nomad to download the Consul binary and config. I also have the option of configuring things directly in the job spec. If I run this as-is on a fresh minimal Nomad client I have two problems: there is no Consul binary and no “consul” user/group to run the process. It would take some work or packaging to get services running properly.

Conclusion

This has been an interesting experiment and proof of concept. Nomad isn’t ready to take over init from systemd everywhere just yet, but it presents an interesting future for disposable infrastructure. I don’t think systemd will be the preferred init forever, and it’s interesting to think about what might come next.

In the meantime, if you’re still struggling with over-complicated service deployments via provisioning tools, definitely give Nomad a try. The repo for everything above can be found here: https://github.com/jboero/nomadinit

John Boero
I'm not here for popular opinion. I'm here for hard facts and future inevitability. Field CTO for Terasky. American expat in London with 20 years experience.