You Could Have Invented Container Runtimes: An Explanatory Fantasy

Trevor Jay
Aug 1, 2016

--

Disclaimer: This fable is a personal work of fiction by Trevor Jay. Blame should be assigned to no other person or entity for its content. Zac Dover contributed purely editorial and grammatical oversight.

Imagine that you’re an old-school working programmer. That means that you write code for its intrinsic value to a business, not for resale or as an investment vehicle.

Now let’s imagine that you work for a large clothing company based in one of the flyover states. The company has already modernized: it has a large working ecommerce and online-business-to-business presence. Further, let’s imagine that the company has decided to dabble in fitness tracking for its senior activewear line. The hardware is being handled by a partner company, but the actual online services will be up to you.

As you look into the issues, you realize that your existing infrastructure can handle most of this task. The sticking point is the realtime tracking aspect. You talk to your devops team, whom you (being an old-school programmer) annoy by referring to them as system administrators. You decide together that the most straightforward solution is a simple high-availability load balancer in front of microservices. Those microservices will run on your company’s private cloud and call out most of the work to your existing infrastructure (databases and so forth). Despite some confusion as you keep referring to the private cloud as a “cluster” and the microservices as “request brokers”, the plan is approved and you write a set of simple microservices as traditional C daemons. That is: they start as root, are limited by seccomp, do their business (like low port binding) and then drop almost all of their kernel capabilities before becoming “normal” user processes. Your devops team writes some init files for each of the cloud nodes and everything works very well.
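The "start as root, bind, then become a normal user" pattern might look roughly like the sketch below. It is an illustration only: the names `listen_on` and `drop_privileges` are invented for this example, and the seccomp filter and fine-grained capability handling from the story are left out because they would normally pull in libseccomp and libcap.

```c
#include <assert.h>
#include <errno.h>
#include <grp.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open a TCP listener on all interfaces.
 * Binding a port below 1024 is the step that needs root
 * (or CAP_NET_BIND_SERVICE). Returns the socket, or -1. */
int listen_on(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0 ||
        listen(fd, SOMAXCONN) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: permanently become an ordinary user.
 * Order matters: supplementary groups first, then gid, then uid,
 * because once the uid is gone the groups can no longer be changed. */
int drop_privileges(uid_t uid, gid_t gid)
{
    if (uid == 0) {             /* "dropping" to root is a caller bug */
        errno = EINVAL;
        return -1;
    }
    if (geteuid() == 0 && setgroups(0, NULL) != 0)
        return -1;
    if (setgid(gid) != 0 || setuid(uid) != 0)
        return -1;
    if (setuid(0) == 0) {       /* must NOT be able to get root back */
        errno = EPERM;
        return -1;
    }
    assert(getuid() == uid && geteuid() == uid);
    return 0;
}
```

A daemon in this style would call `listen_on(80)` while still root and only afterwards call `drop_privileges()` with the UID and GID of its service account. On Linux, moving from UID 0 to a nonzero UID also clears the permitted and effective capability sets by default, which is part of what makes the process "normal" again.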

Everything continues working well. As a matter of fact, things go so well that everyone realizes that the company has a hit on its hands. Your CTO offers funds for public cloud resources. However, there’s a catch. The cloud vendor only supports a Linux distribution you’ve never heard of. Luckily, the init system is the same. However, not much else is. In fact, the distro has a weird and incompatible glibc. Unfazed, you send your devops team new statically compiled versions of the daemon binaries and you’re in business with a hybrid cloud solution.

Time passes, and the backend begins to change. The ecommerce team has decided to abandon XML for JSON. It’s all just serialization to you, but there’s an issue: the JSON library that you’d prefer for the microservices to use is difficult to compile statically and isn’t the same across the two distributions that your public and private cloud resources use. Your solution is simple. You ship your devops team both (1) the binaries and (2) a tarball containing the library that they unpack to a set directory “for on-cloud”: ‘/var/lib/foc’. When launching your services, a quick definition of ‘LD_LIBRARY_PATH’ in the init file puts everything right again. Your otherwise static services use the untarred versions of the library that you simply copied from your favorite distro.

More time passes, and people begin to suggest features that would best be implemented by moving some of the data processing currently handled by the back end to the microservices themselves. Ideally, you’d like to be able to reuse the existing backend code, but this presents a larger version of the dynamic library challenge. These components were written in the modern “just import everything” style so popular with developers born after the fall of the Berlin Wall. Not only do those components depend on many different operating-system and language-specific packages, they also depend on distro-specific directory layouts and UIDs. A ‘chroot’ is the traditionally portable solution (between the various *Nixes), but you know from experience that the portable UID switching options are limited. Reluctantly, you decide to bet entirely on Linux and use the Linux-specific ‘pivot_root’ and ‘unshare’ calls. Surprisingly for technology that isn’t even as old as your eldest son (having been introduced in kernel 2.6.16), this approach works well. In the tarball that you give to your devops team, you now include “kernel external resources” “for on-cloud” that they install in the set directory ‘/var/lib/focker’. These files basically amount to a subtree of the company’s preferred distro. Now instead of using the ‘LD_LIBRARY_PATH’ hack, your static binaries simply use a combination of ‘pivot_root’ and ‘unshare’ to switch users and change roots. This supplies the dynamic library that your subprocesses need for JSON and ensures that the previously-written components work just as they always have. Now that everything works “up in the cloud”, you are able to extend the older components to add the new features that you need.
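The ‘unshare’ plus ‘pivot_root’ combination could be sketched along these lines. The function name `focker_enter` is hypothetical, and the sketch makes some assumptions the story does not state: it uses raw `syscall()` for both calls, and it omits the writes to `/proc/self/uid_map` and `/proc/self/gid_map` that a real launcher would need in order to map users inside the new user namespace.

```c
#include <assert.h>
#include <linux/sched.h>    /* CLONE_NEWUSER, CLONE_NEWNS */
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: move this process into fresh user and mount
 * namespaces, then make new_root its "/". Returns 0, or -1 with errno
 * set. Raw syscall() is used because glibc has no pivot_root() wrapper;
 * glibc's unshare() would serve equally well for the first call. */
int focker_enter(const char *new_root)
{
    assert(new_root != NULL);

    if (syscall(SYS_unshare, CLONE_NEWUSER | CLONE_NEWNS) != 0)
        return -1;
    /* Keep our mount changes from propagating back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
        return -1;
    /* pivot_root() wants the new root to be a mount point, so bind
     * the directory onto itself. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) != 0)
        return -1;
    if (chdir(new_root) != 0)
        return -1;
    /* pivot_root(".", ".") stacks the old root on top of the new one;
     * the lazy unmount that follows detaches the old root. */
    if (syscall(SYS_pivot_root, ".", ".") != 0)
        return -1;
    if (umount2(".", MNT_DETACH) != 0)
        return -1;
    return chdir("/");
}
```

After a successful `focker_enter("/var/lib/focker")`, the process sees the unpacked distro subtree as its whole filesystem, so the reused backend components find the directory layout they expect. Unlike a plain ‘chroot’, nothing of the old root remains reachable once it has been detached.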

The cloud solution works well and devops are able to add and remove nodes as usage waxes and wanes. However, your CFO still isn’t too happy with the datacenter bills and would like you to find a way to make better use of cloud resources. You decide that an obvious solution is to run multiple copies of the service per VM, but you’d rather not recode anything and your current services aren’t very configurable. The services basically assume that they can bind to specific ports on all interfaces. Since you’re already making ‘unshare’ calls, you decide that instead of just running your microservices in their own mount and user namespaces, it would be easy enough to give them their own network namespaces as well. You extend the binaries that launch your microservices so that they take two IPs as arguments, which they assign to the two ends of a virtual Ethernet (veth) pair. Now each instance of a microservice gets its own IP and can bind to whichever ports it wants just as it always has. The plan is to give each of the N service instances an IP on a private network (such as 172.17.0.0/24). The IP will correspond to an external port, so whereas before the load balancer was forwarding only to, say, port 8000 of a machine, it now knows it can try 8000, 8001, and so on up to 8000+N-1. You give your devops team a few iptables rules and sundry to incorporate into their unit files to forward traffic to these IPs based on external port and now you can easily run dozens of instances per cloud VM.
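To make the port-to-IP correspondence concrete, here is a small sketch that formats one forwarding rule per instance. The addressing scheme is an assumption made for this example (instance i listens at 172.17.0.(2+i) and is reached on external port 8000+i); the story only says that each IP corresponds to an external port. The function name and the constants are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define FOCKER_BASE_PORT    8000u  /* first external port, as in the text */
#define FOCKER_SERVICE_PORT 8000u  /* port each instance binds in its own namespace */
#define FOCKER_MAX_INSTANCE 252u   /* keeps the last octet at or below 254 */

/* Hypothetical helper: write the iptables DNAT rule for one instance
 * into buf. Returns 0 on success, -1 if the instance number is out of
 * range or the buffer is too small. */
int focker_dnat_rule(char *buf, size_t len, unsigned instance)
{
    int n;

    assert(buf != NULL);
    if (instance > FOCKER_MAX_INSTANCE)
        return -1;
    n = snprintf(buf, len,
                 "iptables -t nat -A PREROUTING -p tcp --dport %u "
                 "-j DNAT --to-destination 172.17.0.%u:%u",
                 FOCKER_BASE_PORT + instance, 2u + instance,
                 FOCKER_SERVICE_PORT);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Under these assumptions, instance 1 yields a rule that forwards external port 8001 to 172.17.0.3:8000. Every instance keeps binding to the same internal port, which is the point of giving each one its own network namespace.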

This scheme works so well that the other developers in your company begin using it to run other services. However, because each of these “fockers” requires almost a full distro in a tarball, devops is unhappy with the disk usage. Realizing that most of the “wasted” space is just a copy of the same distribution-specific files, you suggest that instead of a straight tarball with everything in it, devops just copies a full install of the OS to each node and developers instead ship the “difference” (the files they add) as a tarball. You change your binaries so that they combine these two “layers” via mounting them on a copy-on-write filesystem before the crucial ‘pivot_root’ call. After a few iterations on this concept, you strike upon a system wherein the tarball includes a “layer manifest” of hashes. Each hash represents a layer in the form of a tarball and they are expected to be stored in ‘/var/lib/focker/hash/’. Now devs can reuse any layer, not just the base OS one. You even add a nice little feature where if a particular ‘/var/lib/focker/hash/’ is missing, you try to ‘curl’ it from a specific central location on the company’s internal web-server.
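The story does not name its copy-on-write filesystem. Assuming it is overlayfs, and assuming each layer tarball has already been unpacked into a directory named after its hash, stacking the layers comes down to building one mount option string. The helper name `focker_overlay_opts` and the "base layer first" manifest order are inventions for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define FOCKER_LAYER_DIR "/var/lib/focker/hash"

/* Hypothetical helper: turn a manifest (layer hashes, base layer first)
 * into overlayfs mount options. overlayfs lists lowerdir entries from
 * topmost to bottommost, so the manifest is walked in reverse.
 * Returns 0 on success, -1 on bad input or a too-small buffer. */
int focker_overlay_opts(char *buf, size_t len,
                        const char *const *hashes, size_t n,
                        const char *upper, const char *work)
{
    size_t used;
    int w;

    assert(buf != NULL && hashes != NULL && upper != NULL && work != NULL);
    if (n == 0)
        return -1;
    w = snprintf(buf, len, "lowerdir=");
    if (w < 0 || (size_t)w >= len)
        return -1;
    used = (size_t)w;
    for (size_t i = n; i-- > 0; ) {
        /* '/', ':' and ',' would corrupt the path or the option list. */
        if (hashes[i][0] == '\0' || strpbrk(hashes[i], "/:,") != NULL)
            return -1;
        w = snprintf(buf + used, len - used, "%s%s/%s",
                     i == n - 1 ? "" : ":", FOCKER_LAYER_DIR, hashes[i]);
        if (w < 0 || (size_t)w >= len - used)
            return -1;
        used += (size_t)w;
    }
    w = snprintf(buf + used, len - used, ",upperdir=%s,workdir=%s",
                 upper, work);
    return (w < 0 || (size_t)w >= len - used) ? -1 : 0;
}
```

The resulting string would be passed as the data argument of `mount("overlay", target, "overlay", 0, opts)` before the ‘pivot_root’ call. The lower layers stay read-only and shared between containers, while each container's writes land in its own upper directory, which overlayfs requires to be on the same filesystem as the work directory.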

This internal ‘focker’ tool is now robust enough that you could have written your original microservice as a process to be run in the containers it creates. So you do. Your set of binaries is replaced by a single launcher that executes stripped-down versions from within the container. Your launcher is now a completely independent, statically-linked C binary to which you merely pass container runtime information (this includes IPs, which tarball to pull in order to retrieve the manifest, etc.). For maximum flexibility, you also make the seccomp and capability-dropping behavior runtime-configurable, so that other developers can use them. Your real development is now solely in these “focker” tarballs that, since they are identified via hash, you can store and pass around freely.

A few months after this system has been adopted more widely, you get a call from devops. It turns out that not everyone writing a service is as disciplined as you. A few errant services were running amok, using all of a given node’s resources. To solve this problem, devops started assigning each container to a unique random cgroup at init time. However, they feel this would be better handled in the launcher itself. You would normally be reluctant to use a feature as young as cgroups, but namespaces worked out and devops hasn’t run into any issues, so you make the change. Since it seems parsimonious, you add not only the ability to pick a cgroup with an argument flag, but an SELinux type as well.
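Placing a process in a cgroup is, at bottom, a write of its PID into a file that the kernel exposes in the cgroup's directory. A minimal sketch follows, with the function name `focker_join_cgroup` invented for the example; the cgroup directory is assumed to have been created already, for instance by devops.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move process `pid` into the cgroup whose
 * directory is `cgroup_dir` by writing the PID to its cgroup.procs
 * file. The file is opened without O_CREAT because the kernel provides
 * it in every real cgroup directory. Returns 0 on success, -1 on error. */
int focker_join_cgroup(const char *cgroup_dir, pid_t pid)
{
    char path[4096], line[32];
    int n, len, fd, rc;

    assert(cgroup_dir != NULL);
    n = snprintf(path, sizeof path, "%s/cgroup.procs", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    len = snprintf(line, sizeof line, "%ld\n", (long)pid);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    rc = (write(fd, line, (size_t)len) == (ssize_t)len) ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A launcher could call this with its own PID just before starting the containerized service, so that the service and all of its children inherit the group's limits. The directory passed in would be whatever the cgroup flag resolves to under the node's cgroup mount.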

Another lingering issue is that every developer seems to need another special device or file (‘/proc’, ‘/dev/random’, etc.). Rather than add them all, you take the lazy way out and allow for arbitrary files and directories to be bind-mounted at launch time. This feature ends up being useful for more than just special device access. It allows for a whole new class of “semi-stateless” apps.
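Arbitrary bind mounts at launch time need very little code. The sketch below assumes the launcher has already collected the requested pairs into an array; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical description of one bind mount requested at launch. */
struct focker_bind {
    const char *source;  /* path on the host, e.g. "/dev/random" */
    const char *target;  /* existing path under the container's root */
};

/* Apply each bind mount in order, stopping at the first failure.
 * Returns 0 on success, or -1 with errno set by mount(2). */
int focker_apply_binds(const struct focker_bind *binds, size_t n)
{
    assert(binds != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (mount(binds[i].source, binds[i].target, NULL, MS_BIND, NULL) != 0)
            return -1;
    }
    return 0;
}
```

In a launcher like the one in this story, these mounts would most naturally happen after the layers are stacked but before the ‘pivot_root’ call, while host paths are still visible. Because a bind-mounted directory outlives any single container, it is also what makes the “semi-stateless” apps possible.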

A little less than two years later, focker is such a hit that your junior dev (the one with all the ponies on his desk) keeps pressuring you to open source it and turn it into a business. You dismiss the idea out of hand. You tell him that it wouldn’t be worth it. “It’s just a thin shim on Linux kernel features,” you tell him. “Also,” you add, “I’m a married woman in her forties…” He is about to speak, when you cut him off. “And it’s written in C.” He stops as his facial expression changes completely. “Good point,” he says, looking back to his screen.

--

Trevor Jay is a security engineer and an enthusiast of retro-computing and pinball.