Why fix Kubernetes and Systemd?
Recently I was finally able to muster enough courage to open source a new project.
I have been working on the project very casually in my free time. One of the questions I keep getting as more folks join Aurae is: why are you doing this? Why does Aurae exist?
I think most folks are interested in understanding what exactly is the motivation for the project. Additionally what makes Aurae unique? Where is it different — and of course why does it even exist in the first place?
So let’s start with some problem statements. You can also check out the original whitepaper on Aurae.
Looking at Systemd
There is a lifetime of arguments both for and against systemd. I wouldn’t be foolish enough to try to pack all of those arguments into this article. The point I am trying to make here is that for the most part systemd is great.
However from the enterprise platform perspective there are some things to call out. I am approaching these topics with cloud enterprise infrastructure in mind. These gripes may not be relevant to most desktop/hobbyists.
My high level concerns with systemd in cloud/enterprise are:
- Monolithic architecture.
- Assumption that systemd owns “the world”. No controls for higher order multi tenant layers.
- It assumes there is a user in userspace (e.g. exposing D-Bus over SSH/TCP).
- Bespoke/esoteric mechanics with the toolchain. There is a lot to learn, with bespoke client D-Bus stacks.
- IPC mechanisms with D-Bus aren’t built for multi tenancy interaction.
- Some of the assumptions systemd makes about an environment break down inside a container, e.g. PID namespaces and running as PID 1.
Edit: After some illustrious comments on Hacker News I returned and added some more detail here. I want to be clear: for the most part systemd is fine. I don’t believe systemd is the problem. However, if my aim is to tackle a multi-tenant node, I do believe things could be simplified for the use case at hand. In my opinion a lot of the features that systemd provides could be offered up to a control plane in a multi-tenant and standardized way with a better set of controls and APIs. These features (and more) are effectively what would make up the Aurae standard library.
Now, a problem I do feel safe calling out is where Kubernetes started duplicating functionality of systemd. Kubernetes also approaches scheduling, logging, security controls, rudimentary node management, and process management at a cluster level. The outcome is yet another fragmented system where the true control plane is unknown. In my opinion Kubernetes has re-created many of the same semantics of systemd, only at the distributed/cluster level. This is problematic for operators as well as for engineers building on top of the system.
Thus, as an operator, engineer, and end user, I end up needing to learn and manage both systemd and Kubernetes at the same time.
The duplication of scope is one of the main motivating reasons behind Aurae. I believe that distributed systems should have more ownership of what is running on a node. I am not convinced that systemd is the way forward to accomplish the goal of exposing a multi tenant API to a higher order control plane. This is the first reason why I started Aurae.
Kubernetes has a node daemon known as the kubelet that runs on each node in a cluster. The kubelet is the agent that manages a single node within a broader cluster context.
The kubelet runs as an HTTPS server, with a mostly unknown and mostly undocumented API. Most of the Venn diagram above pertains to the kubelet more so than “Kubernetes” itself. In other words, the kubelet is the Kubernetes-aware systemd alternative that runs on a node.
So what schedules the kubelet and keeps it running?
That would be systemd.
So what schedules the node services that the kubelet depends on?
That would be systemd.
For example in the most simple Kubernetes cluster topology systemd is responsible for managing the kubelet, system logging, the container runtime, any security tools, and more.
However any platform engineer will tell you that the architecture above is far from enough to operate a production workload.
In many cases platform teams will also need to manage services such as Cilium or HAProxy for networking, security tools such as Falco, and storage tools such as Rook/Ceph. Most of these require privileged access to the node’s kernel, and will mutate node level configuration.
While these services can be scheduled from Kubernetes, there are some noteworthy concerns that raise the question: should they be scheduled from Kubernetes?
Do we really want to privilege escalate our core services into place from a DaemonSet? What happens if a node can no longer communicate with the control plane? What happens if the container runtime goes down? What else is running on a node that is managed outside of Kubernetes? What else is possible to run on a node that is currently not supported by Kubernetes?
The node should be simple. Managing node services should be simple. We should manage node services the same way we manage cluster services.
All of these thoughts kept me up at night for years. In my opinion there is a tremendous amount of untapped opportunity that the industry is prevented from innovating around, simply because it is hidden away from the scope of Kubernetes APIs.
So why do I want to tackle systemd, the kubelet, and the node? Why did I decide to start Aurae?
To be candid — I want to simplify systems that run on a node — and I want the Node to be managed in the same place the rest of my infrastructure is managed.
I don’t think a platform engineer should have to manage both a systemd unit, as well as a Kubernetes manifest for node resources — and if we do — they should work well together and be aware of each other.
Looking at the stack it became obvious to me that there was an opportunity to simplify the runtime mechanics on a single node. The more I explored the architecture, the more I realized that the sidecar pattern was evidence that something else was wrong with my systems.
While there isn’t necessarily anything intrinsically wrong with running sidecars themselves, I do wonder if the uptick in sidecar usage is a symptom of an anti-pattern. Do we really need to inject logic alongside an application in order to accomplish some lower level basics such as authentication, service discovery, and proxy/routing? Or do we just need better controls for managing node services from the control plane?
I wonder if having a more flexible and extensible node runtime mechanism could start to check the boxes for these types of lower level services?
Sidecars should be Node Controllers
Aurae calls out a simple standard library specifically for each node in a cluster. Each node will be reimagined such that it is autonomous and designed to work during a connectivity outage on the edge. Each node gets a database, and will be able to be managed independently of the state of the cluster.
The Aurae standard library will follow suit with the Kubernetes API in that it will be modular. Components should be able to be swapped out depending on the desired outcome.
The various subsystems at the node level will be implementable and flexible. Scheduling a service such as HAProxy, a security tool like Falco, or networking tools such as Envoy or Cilium will follow a familiar pattern of bringing a controller to a cluster. However, the state of the node will persist regardless of the status of the control plane running on top.
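To make the "node state persists" idea a little more concrete, here is a minimal sketch in Rust. It is not taken from the Aurae codebase: `ServiceSpec`, `NodeState`, and `sync` are hypothetical names, and an in-memory map stands in for the per-node database. The only point it illustrates is that a missing update from the control plane leaves the last known state untouched.

```rust
use std::collections::BTreeMap;

/// Hypothetical desired state for one node service (fields are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct ServiceSpec {
    pub image: String,
    pub privileged: bool,
}

/// In-memory stand-in for the per-node database.
#[derive(Debug, Default)]
pub struct NodeState {
    services: BTreeMap<String, ServiceSpec>,
}

impl NodeState {
    /// Apply an update from the control plane, if one arrived.
    /// `None` models a connectivity outage: the last known state is kept.
    pub fn sync(&mut self, update: Option<BTreeMap<String, ServiceSpec>>) {
        if let Some(desired) = update {
            self.services = desired;
        }
    }

    /// Names of the services the node should keep running, in sorted order.
    pub fn running(&self) -> Vec<&str> {
        self.services.keys().map(String::as_str).collect()
    }
}
```

Calling `sync(Some(desired))` followed by `sync(None)` leaves the node with the services from the first call, which is the behavior we would want on the edge when the control plane disappears.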
By simplifying the node mechanics and bringing container runtime and virtual machine management into scope, we can also knock out a few other heavy hitters that have been ailing the Kubernetes networking ecosystem for some time.
- IPv6 by default.
- We can support network devices as the primary interface between a guest and the world.
- We can support multiple network devices for each guest.
- We can bring NAT traversal, proxy semantics, and routing into scope for the core runtime.
- We can bring service discovery to the node level.
- We can bring authentication and authorization to the socket level between services.
There is a lot to unpack here. However, by starting with some of the basics, such as giving a network device to every guest, we should be able to iterate over the coming years.
I promised myself I wasn’t going to put the word “Security” in a box and say that was going to be enough. I want to explain how this system will potentially be safer.
We can standardize the way security tools are managed and how they operate. Fundamentally, most modern runtime security tools leverage some mechanism for instrumenting the kernel (such as eBPF, kernel modules, netlink, or Linux audit). We can bring generic kernel instrumentation into the scope of Aurae and provide controls on how higher order services can leverage the stream. This is a win for both security and observability.
Additionally, we can create further levels of isolation by bringing virtual machines to the party as well.
Aurae intends to schedule namespaced workloads in their own virtual machine isolation sandbox following the patterns laid out in Firecracker with the jailer. In other words, each namespace gets a VM.
This feature would push multi tenancy a step forward, while also addressing many of the other concerns listed above such as the sidecar antipattern.
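The "each namespace gets a VM" rule can be sketched as a small piece of bookkeeping. This is only an illustration under my own assumptions: `VmId`, `Sandboxes`, and `schedule` are hypothetical names, and no real VM is started. It shows the placement rule only: the first workload in a namespace creates a sandbox, and later workloads in the same namespace reuse it.

```rust
use std::collections::HashMap;

/// Opaque handle for a hypothetical VM sandbox.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct VmId(pub u32);

/// Tracks which sandbox belongs to which namespace.
#[derive(Debug, Default)]
pub struct Sandboxes {
    next_id: u32,
    by_namespace: HashMap<String, VmId>,
    workloads: HashMap<VmId, Vec<String>>,
}

impl Sandboxes {
    /// Place a workload in the sandbox for its namespace,
    /// allocating a new sandbox the first time a namespace is seen.
    pub fn schedule(&mut self, namespace: &str, workload: &str) -> VmId {
        let existing = self.by_namespace.get(namespace).copied();
        let vm = match existing {
            Some(vm) => vm,
            None => {
                let vm = VmId(self.next_id);
                self.next_id += 1;
                self.by_namespace.insert(namespace.to_string(), vm);
                vm
            }
        };
        self.workloads.entry(vm).or_default().push(workload.to_string());
        vm
    }

    /// Number of sandboxes currently allocated (one per namespace).
    pub fn vm_count(&self) -> usize {
        self.by_namespace.len()
    }
}
```

Two workloads scheduled into the same namespace get the same `VmId`, while a workload in a different namespace gets its own. That is the isolation boundary described above.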
Imagining a “cluster aware” or “API centric” process scheduling mechanism in a cloud environment is exciting.
Pausing/Debugging with ptrace(2)
For example, systemd integrates well with other low levels of the stack such as ptrace(2). Having a cloud-centric process manager like Aurae means we could explore paradigms such as pausing and stepping through processes at runtime with ptrace(2).
Cleaner Integrations with eBPF
We can explore eBPF features at the host level, and namespace virtualization level. All of this could be exposed at the cluster level and managed with native authn and authz policies in the enterprise.
Kernel Hot Swapping with kexec(8)
Even mechanisms like Linux’s kexec(8) and the ability to hot-swap a kernel on a machine could be exposed to a control plane such as Kubernetes.
SSH tunnels are a reliable, safe, and effective way of managing one-off network connections between nodes. The ability to manage these tunnels via a metadata service to enable point-to-point traffic is yet another feature that could be exposed to higher order control plane mechanisms like Kubernetes.
More than just Kubernetes
Having a set of node-level features exposed over gRPC would potentially enable more than just a Kubernetes control plane.
Lightweight scheduling mechanisms would be possible to run against a pool of Aurae nodes. Kubernetes is just a single example of how this could be enabled.
Written in Rust
So we started coding the mechanics of these systems out, and we decided to write Aurae in Rust. I believe that the node systems will be close enough to the kernel that having a memory safe language like Rust will make Aurae as extensible as it needs to be to win the Node.
Aurae is composed of a few fundamental Rust projects, all hosted on GitHub.
Auraed (The Daemon)
Auraed is the main runtime daemon and gRPC server. The intention is for this to replace pid 1 on modern Linux systems, and ultimately replace systemd once and for all.
Aurae (The Library)
Aurae is a Turing complete scripting language resembling TypeScript that executes against the daemon. The interpreter is written in Rust and leverages the gRPC Rust client generated from the shared protobuf spec. We plan on building an LSP, syntax highlighting, and more as the project grows.
Client Libraries (gRPC)
The client libraries are auto-generated and will be supported as the project grows. For now, the only client-specific logic, such as the convention on where TLS material is stored, lives in the Aurae repository itself.
We will need to build a Kubelet and Kubernetes shim at some point that will be the first step in bringing Aurae to a Kubernetes cluster. We will likely follow the work in the virtual kubelet project. Eventually all of the functionality that Aurae encapsulates will be exposed over the gRPC API such that either the Kubernetes control plane, or a simplified control plane can sit on top.
Aurae is liable to turn into yet another monolith with esoteric controls just like systemd. Moreover, Aurae is also liable to turn into another junk drawer like Linux system calls and Linux capabilities. I want to approach the scope cautiously.
I have considered many of the lowest level computer science concerns, and the highest level of product needs to form an opinion on scope.
- Authentication and Authorization using SPIFFE/SPIRE identity mechanisms down to the Aurae socket.
- Certificate management.
- Virtual Machines (with metadata APIs) leveraging Firecracker/QEMU.
- Lightweight Container Runtime (simplified cgroup and namespace execution similar to runc, podman, or the Firecracker jailer).
- Host level execution using Systemd style security controls (Seccomp filters, system call filtering).
- Network device management.
- Block device management.
- stdout/stderr bus management (pipes) (logging).
- Node filesystem management (configuration files).
- Secrets management.
There will be higher level subsystems of Aurae such as scheduling that will be stateful wrappers around the core. However the core remains fundamental to the rest of the system.
For example a Kubernetes deployment may reason about where to schedule a pod based on the status from Aurae. The decision is made and the pod is started using the Aurae runtime API directly.
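A rough sketch of that flow in Rust, with the caveat that none of this is Aurae's actual API: `NodeStatus`, `pick_node`, and `start_pod` are hypothetical, the status fields are made up for illustration, and `start_pod` only returns a description of the call a real control plane would make to the runtime.

```rust
/// Hypothetical status a node might report (fields are illustrative).
#[derive(Debug, Clone)]
pub struct NodeStatus {
    pub name: String,
    pub free_memory_mb: u64,
    pub ready: bool,
}

/// Pick the ready node with the most free memory that can fit the request.
pub fn pick_node(nodes: &[NodeStatus], needed_mb: u64) -> Option<&NodeStatus> {
    nodes
        .iter()
        .filter(|n| n.ready && n.free_memory_mb >= needed_mb)
        .max_by_key(|n| n.free_memory_mb)
}

/// Stand-in for the runtime call; returns a description of the action taken.
pub fn start_pod(node: &NodeStatus, pod: &str) -> String {
    format!("start {} on {}", pod, node.name)
}
```

The decision (`pick_node`) lives in the higher level scheduler, while the action (`start_pod`) is a direct call against the node. Keeping those two steps separate is what lets a different, simpler control plane reuse the same node API.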
The project is small, however the project is free and open. We haven’t established a formal set of project governance yet, however I am sure we will get there in time — especially as folks show interest in the work. For now the best way to get involved is to follow the project on GitHub and read the community docs.
We are literally just now beginning to draft up our first APIs. If you are interested in throwing down your unbiased technical opinion on Systemd we have a GitHub issue tracker waiting for your contribution today.
The project carries an Apache 2.0 license and a CLA in case the project moves to a higher level governing organization such as the Linux Foundation in the future.
Prior art and inspiration for my work include the 9P protocol from Plan 9, as well as my previous work with COSI as a cloud alternative to POSIX. Perhaps the most influential inspiration for my work has been around a decade of my life managing systemd and the kubelet at scale.
Influenced by others before me: