What would your ideal cloud OS look like, brainstorming from scratch?

Rethinking Cloud Operating Systems with Rust

John Boero
TeraSky
Dec 19, 2023 · 19 min read

If you could create your dream OS from scratch, what would it look like? This is a thought experiment I tried for myself, focusing on cloud simplification, security, and minimalism. After comparing many existing products and techniques, I built my own PoC to try it out.

Is it just me or is the cloud journey over-complicated with tools and workflows? Limitations in today’s operating systems have caused a plethora of tooling standards and complex automation pipelines to fester in the ecosystem. Suddenly you don’t just need a VM for a lift-and-shift monolithic migration. You need a Terraform expert, an Ansible expert, and a Python expert, and it all requires a team of CI/CD and GitOps experts to build workflows just to push basic configuration to each VM. That doesn’t even include the container runtime or orchestrator that will be used to configure and containerize clustered applications, so an expert Docker and Kubernetes team is required as well. This doesn’t apply just to cloud environments either. Physical clusters, IoT networks, and HPC installations have their own configuration requirements and limitations, such as low latency (HPC), low energy consumption (solar/IoT), or tiny RAM budgets (microcontrollers).

What would your ideal cloud OS look like? There are different conclusions you can jump to.

TL;DR:

It’s time to rethink the OS for modern cluster and cloud native computing. With C/C++ being phased out in favor of Rust in security-critical components of the OS, there’s an opportunity to rebuild init fresh for clusters and cloud estates on memory-safe languages like Rust.

While moving the OS to Rust, take the opportunity to also move on from historic configuration scattered in mixed formats across random files and folders in /etc, and from the traditional dependencies on glibc and various C libraries kept in the name of legacy UNIX compatibility. Service applications that need them run in self-contained containers anyway.

We need a cloud OS so simple to configure and cluster that we can run our own “serverless” environments where the host OS itself is the least of our worries. The goal is not to force a replacement onto the ecosystem but to offer an alternative that’s so easy to configure and automate that anyone can do it. First I’ll cover how a few existing solutions approach cloud and clustering challenges. Then I’ll propose how an OS could combine the best features of them all for the ultimate solution. The result is a base and a framework for distribution builders who want to move on from the deprecated traditions of operating systems built for individual machines to an OS built from the ground up for a modern clustered and cloud native world.

Problem

The problem comes down to one critical OS component that has been a political battle for at least two decades: PID 1, also known as init. When a Linux machine loads the kernel and mounts all local filesystems, it simply runs whatever program is stored at /sbin/init as parent process 1 to take care of all processes in a machine. Historical inits and service managers include xinetd, OpenRC, SysV init (System V and its predecessors), tinyinit, and Upstart. For a Mac it’s launchd. For most modern Linux distributions, init is the controversial systemd from Red Hat. Systemd rewrote quite a bit of how a Linux machine stores configuration and boots, which is handy for a single server, desktop, or laptop with complex local hardware and device configurations. Since all configuration is local, a mixed ecosystem of tools emerged around how to manage those local files on individual nodes. Config management tools include Puppet, Chef, Ansible, Salt, etc. Now configuration is stored and managed via GitOps workflows, and nobody can correct so much as a simple typo in a README without complex approvals and redeployments. Fixing simple mistakes may take hours of redeployment in complex environments, even using automation or public cloud.

This is why clustering and orchestration were created: to keep cluster configuration centralized on a set of servers. In most cases this config is actually still stored in Git for approval workflows, but then pushed out to orchestration clusters once the local systemd configuration on the nodes has been established. Call me a whiner, but this mix of local and orchestrated configuration all feels tedious in 2023. It also means the OS comes with all kinds of bloat, introducing unnecessary attack vectors and often requiring host-wide SSH for things like Ansible.

Speaking of bloat, have you ever checked the garbage that’s actually included in most cloud VM system images? I was at Red Hat during RHEL 6, when VMware and early cloud customers still ran VMs with hardware features like multipathd, SMART monitoring, and fan speed controls. Hardware is monitored by the host, which makes these nothing but waste inside a VM. Things like I/O schedulers and tuning optimizers also shipped enabled by default, which only slow down a VM since the host is bound to repeat any such optimizations after the VM finally decides when to read or write data. Storage and network failover are also handled by the host, so most VMs never need to worry about them. In fact RHEL 6 is still available in AWS under Extended Support, but its supported SSH ciphers are so old I can’t even connect to it without a workaround.

It’s amazing how little a VM kernel actually needs to worry about since most of the hard work and hardware is handled by the hypervisor/host.

Modern RHEL/Fedora derivatives and Amazon Linux are far more stripped down with less waste. Currently, Amazon Linux 2 is a 1.2GB download that compresses down to 261MB with zstd. That’s a much leaner operating system where most of the bloat is only locales and language translations, but it still relies on the local configuration of systemd. At least current RHEL, Ubuntu, and Amazon Linux cloud images now ship minimal builds with kernels fine-tuned for VMs.

Existing Examples

If one were to tear down today’s operating system to the foundations (Linux kernel) and rebuild for a modern server, cloud, or cluster experience, what would that look like? Consider a few recent industry projects:

CoreOS

CoreOS (also known as Container Linux) did a great job recognizing the local configuration problem. It shipped a minimal system image for VMs to purely run Kubernetes-orchestrated workloads. It followed a stateless, immutable design pattern where all local configuration relied on systemd and should be baked into the image before boot or passed in at first boot via cloud-init (or CoreOS’s own Ignition). SSH was not enabled, which eliminated a major attack vector in the name of simplicity and security. All configuration after boot was meant to be passed through Kubernetes, which was ideal but required all applications to be containerized. Red Hat had created its own version of this called Fedora Atomic, but was wise to see CoreOS as a threat and acquired it, eventually merging the approaches into RHEL CoreOS and the upstream Fedora CoreOS. These stateless images aren’t catching on as much as they should in a world where the config ecosystem is addicted to things like Ansible, which is both incompatible and unnecessary in a stateless image. Coincidentally Red Hat also acquired Ansible, thus covering both stateful and stateless options.

Boot2Docker

Not many remember the original Boot2Docker, but it was a true work of art and a great exercise in minimalism. This 40MB Linux distribution contained a compact kernel for VMs plus Docker and a custom docker-init that only ran Docker and a few VM guest services for hypervisors. This enabled a tiny, lightweight Docker VM that booted in seconds and gave an instant container VM to those who couldn’t run Docker natively. It was quietly buried around the time CoreOS and Rancher came out and Docker realized they needed to stay competitive. Docker (post Mirantis acquisition) took the idea and turned it into Docker Desktop, a major success story and revenue source thanks largely to MacBook plebs who don’t want to maintain a full Linux VM for containers. The tiny ~40MB image size was mostly Docker itself, a large statically linked Go binary. This is perfect for most servers or VMs but slightly too big for some IoT or microcontroller cases where RAM might be measured in kilobytes.

RancherOS

RancherOS is another Docker-based Linux distribution. Interestingly, Rancher elected to run everything as Docker. This includes init, which is run as a separate Docker instance called “System Docker.” This innovative meta approach was admirable, and SUSE was quick to acquire Rancher and join the Docker OS race. As this article was being written, RancherOS had reached end of life and no further development is taking place.

MOSIX

Going back in cluster history before containers, Kubernetes, cgroups, or jails, there was (and is) an underrated product called MOSIX (and its openMosix fork) which used a custom Linux kernel to migrate processes between cluster nodes, in a simpler time when most machines didn’t have multiple cores or sockets. Getting a cluster running with it was actually very impressive back in the day, and fairly simple to configure beyond the hardware setup. Migrating a live Linux process can be delicate but is beautiful when done smoothly. Not many people think of MOSIX as an operating system, but I would argue it was in fact a really innovative clustered Linux distribution that inspired cluster features of the future, including live VM migration and control groups.

Unikernel

Unikernels are an innovative idea: since a VM doesn’t need to worry about hardware, why not build your application into a tiny dedicated kernel and skip the general-purpose OS underneath? The result is minimal size and overhead, with dedicated VMs that boot specially compiled applications in milliseconds and run with incredible performance. The idea is very novel but very difficult for most coders. Developing and debugging custom kernels for every application is a tricky ordeal beyond the skills of most Python developers. While Docker was still at the peak of its unicorn status in 2016, it acquired Unikernel Systems, which was a shrewd move but unfortunately never monetized, and the approach never caught on.

Nomad

HashiCorp’s Nomad is a Go-based cluster orchestrator which combines many components of Kubernetes’ architecture in one binary instead of separate microservices like the controller manager, scheduler, kube-apiserver, kube-proxy, and kubelet. Even the CLI uses the same Nomad binary instead of a separate kubectl. The simplicity is its main advantage, though it also natively orchestrates workloads besides Docker containers, including Podman containers, VMs, and native or chroot services with optional artifact download. This combining of functionality also eliminates redundant Go code which is statically compiled into each of the Kubernetes microservices. In my time at HashiCorp I actually wrote a custom init in 100 lines of C for a Nomad Linux distribution similar to Boot2Docker. It was a minimal ~45MB and booted a Nomad VM in 2 seconds to start local services and containers, plus whatever system services the cluster presented to download and run, such as Docker and Podman daemons. The idea had a positive community response, including interest from teams at Google and AWS. Individuals from AWS teams working on Firecracker reached out, as did an IoT project that inquired about building it for microcontrollers in the field. HashiCorp leadership had no interest in pursuing Nomad as an operating system and declined development of an OS as too difficult. A few customers and partners in the ecosystem explored it on their own, including a European telco and a partner called ProtoCloud that tested it for rapidly autoscaling Nomad nodes. This experiment lacked some basic OS features such as ACPI power event monitoring, on the theory that stateless ephemeral nodes need only be drained of jobs and mounted volumes and then forced to power off. Still, it would need a bit more work to be production ready and was only appropriate for VMs or physical servers, not IoT or microcontrollers.

Photon

Photon is VMware’s stable Linux distribution for simplifying VMware adoption and cloud use cases. It’s a good fit for existing VMware users as well as greenfield users looking to get started with Tanzu. Security hardening is a focus of the product, but since it is purely open source, some enterprise users might struggle to justify production use without paid support. That may be holding back Photon, since the feature set is based on widely available components like systemd. Other offerings in the market offer paid support, but this is a good effort at making a specialized OS for minimal VM usage in the VMware ecosystem.

Bottlerocket

Amazon’s Bottlerocket was created to be a minimal container OS for cloud VMs. The smart difference is that the project put security first, using Go and Rust instead of C and C++. If you inspect a Bottlerocket image you will find it’s pretty lean, but still not the smallest VM image discussed here. The root partition is currently around 575MB, due largely to static Go binaries. The Kubernetes agent (kubelet) alone is 120MB, or about 20% of the partition. If I were to use Bottlerocket I would pass kubelet through upx compression, which currently brings kubelet down to 30MB. It’s worth noting that the OS still utilizes systemd for booting and has many extra services included. It has stripped out a lot of hardware monitoring but also does not include a userspace OOM killer to protect against out-of-memory errors by killing processes that hog all the system memory. This may not be necessary if the orchestrator properly limits processes with cgroups and reaps any missed zombie processes. You can easily customize Bottlerocket with another orchestrator like HashiCorp Nomad, which is handy and also a good reason to include extra system fundamentals. Bottlerocket is a good option for VMs or physical environments but still too bloated for IoT or microcontrollers.

NixOS

Nix was created as both a package manager and an OS (NixOS). The great thing Nix did was build a declarative, idempotent configuration standard from the ground up. You can simply create or consume a *.nix file to configure or build your ideal system image. This innovative approach is gaining momentum in the industry, but not many people trust it in production yet. For myself it has a few downsides. First, it creates a whole new domain-specific language (Nix) for configuring the OS. While the language has powerful constructs, it doesn’t support schemas and leaves a lot of room for error. Also, the end result is still an image based on local configuration of systemd, something the community has raised before. In fact Michael Bishop has a minimal fork of NixOS called NotOS which ships as a 47MB squashfs and uses runit as init instead of systemd for embedded use cases, which really rounds out NixOS well. I think Nix has the right idea for configuration but doesn’t yet go far enough to be the perfect solution I need. It will be interesting to see how Nix evolves, and it’s certainly no newcomer to the industry.

Composite Wish List

Combining the best features of each of these options could produce the ultimate operating system for cluster and cloud environments. There are two fundamental aspects to this problem:

  1. Standardized configuration with schema verification.
  2. The runtime(s) that interprets the configuration.

The industry has changed a lot about runtime over the years but hasn’t done much to modernize configuration. Every runtime maintains backward compatibility for a patchwork of local legacy configuration files. Step one should be to modernize and unify basic system configuration. Instead of each distro storing arbitrary config data in arbitrary directories I would love to have a simplified config file following a standardized schema. Here I’ll build my own ideal configuration schema and then in a later post I’ll build a PoC init to interpret it.

Zero Rust

If I were to create a proof-of-concept init for this idea, I would be tempted to throw it together in C++ and statically link it. I’ve been writing C++ for 26 years and it’s still my go-to language, and systemd itself was written in C. In this case I think that would be wrong. One should approach a critical component like init with a Zero Trust philosophy, which means I can’t even trust myself or a development team to write secure C++. Go has memory safety, but we need a much smaller resource footprint than Go can provide. This leads me to Rust as the best option even if it’s not my strongest language. Luckily, in the age of AI it’s not hard to whip up what we need guided by a team of LLMs.

Why are all of these orchestration solutions still basing themselves on large Go binaries like Docker, Podman, Nomad, and Kubernetes? A truly universal orchestrating operating system that covers minimal environments like IoT and edge devices can’t mandate a chunky Go binary. If you could write a lightweight init that simply starts networking and fetches its config within a 256K memory footprint, then you could write an OS that efficiently covers any use case across cloud, IoT, edge, HPC, or anything you can supply a kernel for. This includes the possibility of expanding beyond Linux kernels to BSD, Mac, Windows, or ReactOS. If your cloud configuration specifies that every node in a cluster should start a container runtime and an orchestrator, that’s fine. If your IoT fleet configuration says that every tiny weather sensor around the globe should just update certificates and run monitoring services, that works via the same method. If you want your configuration to stay local, fine. If you want it to come from a Git repo, there is no need for complex deployment pipelines.
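
To make this concrete, here is a minimal sketch of my own (not the PoC from the repo) of what such a supervising init loop could look like in Rust using only the standard library. The service paths are hypothetical placeholders standing in for whatever the cluster config specifies, and a real PID 1 would also need to mount filesystems, bring up networking, handle signals, and reap re-parented zombies via waitpid, which requires the libc or nix crate rather than plain std.

use std::process::{Child, Command};
use std::thread::sleep;
use std::time::Duration;

// Hypothetical service list; in the real design this would come from the
// unified cluster configuration file described later in this article.
const SERVICES: &[(&str, &[&str])] = &[
    ("/usr/sbin/dhcpcd", &["--nobackground"]),
    ("/usr/bin/podman", &["system", "service", "-t", "0"]),
];

fn spawn(cmd: &str, args: &[&str]) -> std::io::Result<Child> {
    Command::new(cmd).args(args).spawn()
}

fn main() {
    // Launch every configured service once at boot.
    let mut children: Vec<(usize, Child)> = SERVICES
        .iter()
        .enumerate()
        .filter_map(|(i, &(cmd, args))| spawn(cmd, args).ok().map(|c| (i, c)))
        .collect();

    // Supervise: restart any direct child that exits. A real PID 1 must also
    // reap re-parented orphans with waitpid(-1, ...), which needs libc or nix.
    loop {
        for (i, child) in children.iter_mut() {
            if let Ok(Some(status)) = child.try_wait() {
                eprintln!("service {} exited with {status}; restarting", SERVICES[*i].0);
                if let Ok(new_child) = spawn(SERVICES[*i].0, SERVICES[*i].1) {
                    *child = new_child;
                }
            }
        }
        sleep(Duration::from_secs(1));
    }
}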

In an HPC environment with more than 100,000 servers, getting config to every node without DDoSing your own cluster means you’ll also need to support multicast or a UFTP service, which would give an edge over Slurm or Mesos as the top HPC schedulers. Specify your remote config location in your init config if the OS needs to know where to monitor configuration changes. This also presents an opportunity to replace some legacy traditions in Linux that have become downright ugly while maintaining backward compatibility with legacy C libraries:

  1. Fstab: libc+libmount. The /etc/fstab file lists the local filesystems that should be mounted by init during boot. It still uses an ugly, schemaless whitespace-delimited format and is prone to errors. When systemd rewrote its mounting subsystem it didn’t replace fstab with something more modern. It stayed backward compatible with the existing space-delimited format, even though the last fields of each line (backup flag and fsck order) have been deprecated and unused for years.
  2. Network: libc. In a server-oriented OS for cloud and IoT, networking should be a primary function at boot rather than an afterthought of local configuration. Linux network config services have been rewritten almost as many times as init. Iterations include network, NetworkManager, and systemd-networkd. Most of these have maintained backward compatibility with arbitrary INI environment files stored in /etc/sysconfig/network-scripts/ or equivalent locations. Different distributions store them in different places, which complicates automation. There is a good reason the default behaviour of cloud-init is to start simple networking with DHCP on the default network interface: this is the first function most cloud VMs will care about. VMs usually don’t need to worry about physical network configurations like bonds, teams, 802.3ad, LACP, etc. Time to simplify networking and treat IPv4 and IPv6 equally from the ground up.
  3. DNS: libresolv. Don’t forget DNS. If you can freehand type a valid /etc/resolvconf.conf to generate a proper /etc/resolv.conf on the first try, you are a miracle worker. This also depends on whichever rewrite of the network subsystem is trying to interpret the messy previous generations of config files and folders. These files also still use an arbitrary INI/CSV format with ambiguous documentation and no clear versioning. NSS is another service that has been loosely versioned over the years and uses yet another proprietary INI config format.
  4. Sysctl: libc. Kernel tuning still relies on backward-compatible workarounds for ugly INI files. Viewing /etc/sysctl.conf on a modern system reveals that this has become more complicated, spread across several directories, making it unclear what is set where and which files take precedence:
    `# sysctl settings are defined through files in
    # /usr/lib/sysctl.d/, /run/sysctl.d/, and /etc/sysctl.d/.`
  5. Logging: libc. Logging is even more contentious than init. Mixed generations of apps store their log files arbitrarily in /var/log/, in systemd-journald, or in various other locations, all local to the system but inevitably needing to be copied to external log aggregators or rotation schemes. Logging format is also subject to the logging subsystem, where systemd throws everything into a single massive journal full of bloated prefix details on each line rather than organizing logs into folders and files, which are much easier to search.
  6. Services: libsystemd-shared. Service configuration syntax is entirely up to init. Various generations have used simple shell scripts in /etc/init.d/ or INI files like systemd units. Init services are usually only local, hence the need for orchestrators.
  7. Firewall: libxtables. Firewall config varies on the firewall service used. Currently firewalld is managed with Python but still uses XML files to store rules. The Linux backend in modern solutions has largely replaced iptables with nftables, which is mostly provided by the kernel itself.
  8. SSSD/NSS: libnss3. Local and remote users, whether LDAP/KRB/SSO, are all configured via proprietary INI formats. It’s up to the user to find the right documentation and examples for how to configure these. I’ve been deployed on quite a few engagements costing $15,000 or more to fly onsite just to sort out single sign-on issues for Red Hat customers.

That’s the tip of the iceberg, but see how many different standards there are for handwritten config files? Talk about error prone. CSV? INI? XML? Why isn’t there a standard config format with a strong schema to guide users and correct errors before it’s too late?
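
To see why a strong schema matters, consider what every tool that reads /etc/fstab has to do today. The following is a simplified Rust sketch of my own, not any particular implementation: split positional whitespace-delimited fields and hope the count and order are right, because nothing validates the values until mount fails at boot.

#[derive(Debug)]
struct FstabEntry {
    device: String,
    mountpoint: String,
    fstype: String,
    options: String,
}

// Simplified parse of classic /etc/fstab lines. Fields are positional and
// whitespace-delimited with no schema:
//   <device> <mountpoint> <fstype> <options> <dump> <pass>
fn parse_fstab(contents: &str) -> Vec<FstabEntry> {
    contents
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .filter_map(|line| {
            let fields: Vec<&str> = line.split_whitespace().collect();
            // A typo or missing field silently drops the mount; the values
            // themselves are never checked until mount(2) fails at boot.
            if fields.len() < 4 {
                return None;
            }
            Some(FstabEntry {
                device: fields[0].to_string(),
                mountpoint: fields[1].to_string(),
                fstype: fields[2].to_string(),
                options: fields[3].to_string(),
            })
        })
        .collect()
}

fn main() {
    let sample = "UUID=5b38cd4d-3c52-4aed-962f-88e9a14bc535 / ext4 noatime,discard 0 1\n\
                  10.10.62.163:/logs /var/log nfs defaults 0 0";
    for entry in parse_fstab(sample) {
        println!("{entry:?}");
    }
}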

JSON DNA

Why not include all of the above in a versioned JSON file with a proper schema? Support environment variables and templating. Include optional details about remote configuration sources and where to curl or rsync the base configuration for your entire cluster. Much like every cell in an organism has a full copy of the organism’s DNA, every node in your cluster can have a copy of the entire cluster’s configuration. If you need every machine in your estate to run a service in a certain way, update the cluster config in a central location, with no need for complex pipelines or tooling. Treat configuration of your cluster like gene therapy and suddenly config becomes hereditary.

Configuration

The config should specify the following objects, which I’ve drafted in a formal JSON schema. These are some examples that helped formalize the schema. Keeping the entire config in one file may be a security risk, as multiple components would have read access to the global config, so these sections may be kept in a single file or split into separate files. Secrets and keys must be managed by an external secret manager or a local mechanism like a TPM. There is no guesswork here, as the format is guided and verified with the schema (a sketch of how an init might consume these sections follows the examples):

  • Source: Optionally specify where and how to sync config updates instead of deploying via pipelines and tooling. You may also elect to have your config directly mounted remotely (/etc/sysconfig).
"source": {
"source": "https://raw.githubusercontent.com/yourorg/config/samplecluster.json",
"synctype": "http",
"interval": "1h",
"randomwait": false
}
  • Mounts: List static local mounts replacing /etc/fstab.
  "mounts": [
{
"device": "UUID=5b38cd4d-3c52-4aed-962f-88e9a14bc535",
"fstype": "ext4",
"path": "/",
"options": "noatime,discard",
"cryptkey": "$LUKS_KEY_FROM_SECRETS"
},{
"device": "10.10.62.163:/logs/$HOSTNAME",
"fstype": "nfs",
"path": "/var/log",
"options": "defaults"
}
]
  • Networks: Configure local networks, replacing /etc/sysconfig/network*. One thing that would really help IoT devices is the addition of a schedule option, such that a network interface or wireless radio can shut down on a schedule to conserve power or reduce attack vectors.
"networks": [{
"comment": "Enable dhcp6 on any wifi interface starting with wl.",
"name": "wifi",
"interface": "wl*",
"ipv4": {
"enabled": false
},
"ipv6": {
"config": "dhcp"
},
"schedule": {
"type": "timer",
"on": "5m",
"off": "55m"
}
}, {
"comment": "Disable any other interfaces besides wifi.",
"interface": "any",
"enabled": false
}]
  • Services: List basic local services and artifacts. Alternatively specify that services should be included via a directory or wildcard match. Use artifacts for configuration also. Artifacts can and should be signed and verified with a simple signature stored relative to the artifact.
"services": [{
"name": "podman-server",
"command": "/usr/bin/podman system service -t 0 --cgroup-manager=cgroupfs"
"user": "${RANDOM}",
"env": [
"VAR1": "test",
"VAR2": "$TAGS[whatevertagyouwant]"
]
},{
"include": "/etc/sysconfig/*.json"
}]
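
As a sketch of the consuming side, assuming the serde and serde_json crates, the sections above could map onto typed Rust structures that the PoC init deserializes at boot. The field and file names simply mirror the hypothetical examples above; none of this is a final schema.

use serde::Deserialize;

// Hypothetical types mirroring the example config sections above.
#[derive(Debug, Deserialize)]
struct Config {
    source: Option<Source>,
    #[serde(default)]
    mounts: Vec<Mount>,
    #[serde(default)]
    networks: Vec<serde_json::Value>, // left loose until the schema settles
    #[serde(default)]
    services: Vec<serde_json::Value>,
}

#[derive(Debug, Deserialize)]
struct Source {
    source: String,
    synctype: String,
    interval: Option<String>,
    #[serde(default)]
    randomwait: bool,
}

#[derive(Debug, Deserialize)]
struct Mount {
    device: String,
    fstype: String,
    path: String,
    options: Option<String>,
    cryptkey: Option<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // An init would read the cluster config (path hypothetical) at boot,
    // fail loudly on schema violations, and then apply each section.
    let raw = std::fs::read_to_string("/etc/config.json")?;
    let cfg: Config = serde_json::from_str(&raw)?;
    for m in &cfg.mounts {
        println!("would mount {} ({}) at {}", m.device, m.fstype, m.path);
    }
    Ok(())
}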

Rust can also have size issues. A simple 30-line Rust app that validates a JSON file against a schema compiles to a minimum of 3.7MB in my experience, which isn’t quite as bad as Go but still significant. That can be brought down a bit with upx. In an embedded device with 128KB of RAM this might not be practical. Microcontrollers (MCUs) with limited memory are usually not compatible with bulky but memory-safe Rust or Go binaries. An AWS team asked me if it were possible to write a specialized MCU version of the Nomad agent. That would be a great addition, but it would probably require a complete rewrite in C++ so that the agent doesn’t consume all the memory in the device, or worse. It may make sense to write and maintain parallel init binaries: C++ for IoT and Rust for cloud and larger environments.
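
For reference, here is roughly what that 30-line validator looks like, assuming the serde_json and jsonschema crates (the API shown matches jsonschema around version 0.17; the file names are just examples).

use jsonschema::JSONSchema;

// Minimal JSON-vs-schema validator along the lines described above.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema: serde_json::Value =
        serde_json::from_str(&std::fs::read_to_string("schema.json")?)?;
    let instance: serde_json::Value =
        serde_json::from_str(&std::fs::read_to_string("config.json")?)?;

    let compiled = JSONSchema::compile(&schema)
        .map_err(|e| format!("invalid schema: {e}"))?;

    if let Err(errors) = compiled.validate(&instance) {
        for error in errors {
            eprintln!("validation error at {}: {}", error.instance_path, error);
        }
        std::process::exit(1);
    }
    println!("config.json is valid");
    Ok(())
}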

Luckily in the AI age, LLMs can make just about anybody a Rust coder. Some tinkering with ChatGPT and Bard and voila — the first Linux init I know of written in Rust guided by GenAI.

RFC

There is already a draft JSON schema and repo coming together. Comments both positive and negative are welcome. This spec does not include service discovery or standard inter-node communication and monitoring, which may be added later or provided by an external service if required. A PoC init to actually interpret the config will be built in Rust to test the viability of a unified configuration.

What Happens Next

Modern Linux operating systems power most of the cloud today, but they are largely built for standalone desktop or server use cases rather than clusters or cloud VMs. This monolithic approach to init has resulted in overly complex deployment pipelines and tool ecosystems that could be much simpler if a new OS were designed from the ground up with cloud in mind. The previous examples from the industry show what has and hasn’t worked well in earlier experiments. Now it’s time to put that all together and build something from scratch to pair with any kernel and use case the user requires. This idea aspires to provide a common, standardized OS for any cluster use case, from local and cloud VMs to physical servers to IoT, on anything from a fleet of weather balloons to a continent full of cable set-top boxes and even thousands of low-latency HPC servers. Whatever is built should also serve as an extensible framework for distributions to be built and customized. The next step is building a PoC init to interpret the config standard established here. If anybody is interested or would like to contribute missing specs to the schema (there are admittedly many), please contribute feedback to the repo.
