The Vertical Immutable Infrastructure Pattern

Jörg Domaschka
omi-uulm
Dec 22, 2020

This is the first of a series of blog entries describing our vertical immutable IT installation as well as the migration of a CoreOS/Rancher 1.6 environment to a Fedora CoreOS/Kubernetes installation. The unique selling point of our set-up is that we do not use this tooling to manage applications on top of virtual infrastructure; instead, we use it to manage the physical infrastructure itself as well as the virtual environments on top of it.

In this blog entry, we describe the state of our set-up before the migration. We further detail our concept of running the system, our motivation for running it this way, and how we have set it up. This enables one person to operate our 100+ servers in a fraction of their time. The entry concludes with a discussion of the lessons learnt.

Overview

We are a relatively small research institute at a German university (15 PhD students). Our core research activity centres around orchestration, automation, and performance engineering of IT installations (think software-defined infrastructures).

Our hardware infrastructure

Despite our small group size, we run around 100 servers and manage two student labs. The servers support our teaching and research work. We run multiple file servers, multiple GPU-backed servers for data analytics and AI, and multiple runners for the CI/CD activities in our GitLab. These servers are located on one side of the university campus. The other servers are co-located with the HPC compute resources on a different part of the campus and operate a cloud infrastructure. All servers at one location are connected by a 50 Gbit/s and 100 Gbit/s Ethernet fabric, and each location is connected with 10 Gbit/s to the university backbone.

The labs are used by students working on their theses and for hands-on tutorials during the lecture period. In addition, we run around 200 IoT devices for which we have established a similar operational approach as detailed in the following, but this is the subject of a later article.

Our Software Services

On the infrastructure level, we operate a myriad of different applications and tools. These range from basic services such as NFS, Samba, DHCP, and DNS, over backup, GitLab, Mattermost, and Elasticsearch, to rather complex installations such as Apache Spark. In particular, the second server cluster runs a full-fledged OpenStack installation.

In the labs, we manage Linux-based desktop machines that the students use for their work.

Requirements

In contrast to the number of servers, a single employee is formally responsible for running the entire infrastructure, besides the other tasks that person needs to fulfil. PhD students can support this, but are usually more interested in pushing their research than in administrating IT infrastructure. Hence, the team agreed that any time not spent on infrastructure administration, but on research instead, is time well spent.

A few years ago, immutable applications became a trending concept for applications operated on virtual infrastructure. Here, immutability means that applications are deployed as a whole into virtual infrastructure. While there are different ways this can be realised in practice, the takeaway message is that to implement changes to your application, the virtual machine running the application is replaced with a new virtual machine running the updated software version. Users should never log in to any of these VMs to perform manual changes, be they software updates or configuration changes. This concept is heavily inspired by Martin Fowler's PhoenixServer concept and does not rely on containers.

Even though the immutability concept was introduced for applications running on virtual infrastructure, we push it down one layer and apply it to our physical infrastructure. This means the approach needs to deal with physical servers, storage, and switches: elements that cannot simply be destroyed and re-created.

Hence, the solution has to fulfil the following set of requirements.

  1. In order to take into account the limited amount of human resources, the approach should require very little maintenance effort. In particular, this means that in case of failures, it needs to be possible to very quickly bring a server back into a known working state.
  2. Considering the different types of IT environments we are facing (servers and labs), the approach needs to work for headless servers as well as for desktop PCs.
  3. The approach needs to provide reasonable flexibility. In particular, the servers are not single purpose machines. Their assignment may change depending on the researchers’ needs, the time of year (semester or not) and the time of day.
  4. This also means that the approach should not only tackle servers as such, but also be capable of taking into account the network equipment interconnecting the servers as well as connecting our clusters to the university campus.

Even though not strictly necessary, we further introduced the requirement that workload (applications) should be decoupled from operating system management. This has a few practical implications that make the system easier and more flexible to manage.

Technology Wrap-up

At its core, our infrastructure is operated using network boot enhanced by a stateless, containerized operation following the immutable infrastructure paradigm.

On the one hand, this way of operation guarantees that nodes with identical functionality are configured identically and configuration drift is ruled out. On the other hand, it ensures that after things have gone wrong (we are researchers!) or failures have occurred, nodes can be brought back into a known working state. This capability is also important for achieving reproducibility in experiments.

Overall, the immutability in our system is built on three pillars: (i) an infrastructure description and PXE boot; (ii) containerization of software components and a container orchestrator; (iii) a persistence layer that persists the minimum amount of state across server reboots.

PXE Boot and iPXE Boot

Preboot Execution Environment (PXE) is an interface for configuring a server remotely at boot time. In particular, it allows loading operating system images over the network. Traditionally, this approach has been used to load the image of an installer and to run the software installation. An extension of this is to serve the installation image as well as an installation configuration remotely, so that the installation can run automatically without any manual intervention.

Later boot cycles would then access the hard drive and load the installed operating system. In our case, all non-desktop machines take this approach one step further: installation instructions are executed at each machine boot, and the actual installation step does not install to a physical disk, but to a volatile RAM disk. As installations are performed much more often in this scenario, it is important that the installation process is quick and that the installed operating system does not consume too much space on the RAM disk. What is more, injecting the configuration into the installation process should be supported. As described below, in our case it is also crucial that Docker (or any container runtime) be natively supported.

Container Linux

Building on previous experiences and the requirements above, we use the (by now outdated) CoreOS Container Linux operating system to run the physical installation of the testbed. Container Linux is small, has native support for Docker as well as file-based installation, and achieves short installation times.

For Container Linux, configurations to be applied during installation are specified as .yml files. These files describe the set-up of the system, including aspects such as network devices, IP addresses, daemon services to be started (based on systemd), files to be created, etc. During bootstrapping, the installer applies this configuration and creates the specified files and systemd services. As PXE per se does not allow the shipment of multiple files (including the configuration files), we make use of the iPXE extension.
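
To give an impression of what such a file looks like, here is a minimal sketch of a Container Linux Config. All names, addresses, and the registration URL are placeholders rather than our actual configuration; in our set-up, these files are generated per server as described below.

```yaml
# Minimal Container Linux Config sketch; all values are illustrative.
passwd:
  users:
    - name: admin
      ssh_authorized_keys:
        - ssh-rsa AAAA... admin@example.org

storage:
  files:
    - path: /etc/hostname
      filesystem: root
      mode: 0644
      contents:
        inline: compute-21

systemd:
  units:
    - name: rancher-agent.service
      enabled: true
      contents: |
        [Unit]
        Description=Register this node with the Rancher server
        After=docker.service
        Requires=docker.service

        [Service]
        # The Rancher server URL and registration token are placeholders.
        ExecStart=/usr/bin/docker run --rm --privileged \
          -v /var/run/docker.sock:/var/run/docker.sock \
          -v /var/lib/rancher:/var/lib/rancher \
          rancher/agent:v1.2.11 http://rancher.example.org:8080/v1/scripts/TOKEN

        [Install]
        WantedBy=multi-user.target
```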

Bootstrap Sequence

In total, the entire bootstrap and orchestration process relies on a sophisticated orchestration of protocols and tools, as shown in the following figure.

Figure: Orchestration of the bootstrap process using DHCP, TFTP, PXE, and iPXE; containerisation and orchestration

In the first step, once the server has finished its self-check, the BIOS PXE loader starts and initialises the NIC, using DHCP to request an IP address. The response from the DHCP server contains boot information in addition to the IP address: in particular, the fields next-server and filename, which specify where to fetch the kernel image to boot the server. The image loaded in this step is an iPXE boot loader. Once this boot loader is running, it again initialises the NIC and requests an IP address via DHCP. This time, however, the DHCP server recognises that the request comes from iPXE and responds with an HTTP-based filename to load.
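
The following dnsmasq snippet sketches this two-round hand-off, following the pattern documented for Matchbox; host names and file names are placeholders, not our actual configuration:

```
# First DHCP round: legacy PXE firmware receives the iPXE bootloader via TFTP.
enable-tftp
tftp-root=/var/lib/tftpboot
dhcp-boot=tag:!ipxe,undionly.kpxe

# Second round: iPXE identifies itself via its user class, so the request is
# tagged and answered with an HTTP URL instead of a TFTP file name.
dhcp-userclass=set:ipxe,iPXE
dhcp-boot=tag:ipxe,http://matchbox.example.org:8080/boot.ipxe
```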

For our set-up, this file (loaded through HTTP) is an iPXE script that uses iPXE commands to fetch the kernel, the CoreOS configuration file, and the init ramdisk, and then starts the boot process. Which configuration file is used depends on the MAC address of the server's NIC.
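
Such an iPXE script can be as small as the following sketch; the Matchbox host name, the Container Linux version, and the exact kernel arguments are illustrative assumptions:

```
#!ipxe
# Fetch kernel and init ramdisk over HTTP and boot into the RAM-disk
# installation; the MAC address is passed on so that the per-server
# configuration can be selected.
set base http://matchbox.example.org:8080
kernel ${base}/assets/coreos/2512.3.0/coreos_production_pxe.vmlinuz coreos.first_boot=1 coreos.config.url=${base}/ignition?mac=${mac:hexhyp}
initrd ${base}/assets/coreos/2512.3.0/coreos_production_pxe_image.cpio.gz
boot
```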

Matchbox

We use the Matchbox toolset to realise this step of the operation chain. Matchbox consists of a simple web server and a thin wrapper layer that serves ramdisks, configuration files, and kernels based on the requesting client; i.e. different servers can get different configuration files, and also different kernels if required.
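
Internally, Matchbox expresses this mapping with group and profile definitions. The following is a hedged sketch of such a pair; the MAC address, version number, and identifiers are invented for illustration. The group matches a specific server and points to a profile, which in turn names the kernel, init ramdisk, and the Container Linux configuration to serve:

```
groups/compute-21.json:
{
  "id": "compute-21",
  "profile": "coreos-worker",
  "selector": { "mac": "52:54:00:a1:9c:ae" }
}

profiles/coreos-worker.json:
{
  "id": "coreos-worker",
  "boot": {
    "kernel": "/assets/coreos/2512.3.0/coreos_production_pxe.vmlinuz",
    "initrd": ["/assets/coreos/2512.3.0/coreos_production_pxe_image.cpio.gz"],
    "args": [
      "coreos.first_boot=1",
      "coreos.config.url=http://matchbox.example.org:8080/ignition?mac=${mac:hexhyp}"
    ]
  },
  "ignition_id": "compute-21.yml"
}
```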

This remote booting of fresh operating system installations based on iPXE and Matchbox in principle allows the operation of complex software architectures, provided that the software components are encoded in the configuration file for the installation process. This process, however, is very invasive when it comes to applying software updates or small configuration changes such as moving a specific service to a different physical node. In particular, the latter would require that a total of two servers be taken offline and rebooted.

Orchestration

Hence, a different approach is needed that allows more flexibility, but still supports reproducibility and consistency of software and configurations across servers and locations. Containers are able to provide the necessary stability: specialised container images are built at a dedicated point in time (usually from a base image) and then stored in an image registry. From there, they can be accessed and installed on a server. As an individual image remains unchanged over time, deploying that image always yields the same software artefact, provided that the following prerequisites are fulfilled:

  1. The image itself is static and self-contained, and no further software is explicitly or implicitly installed into the image once it is deployed. In particular, this requires that the image either contains the full configuration or that the entire configuration is provided immutably at the start of a container.
  2. The naming schema for a logical image provides meaningful versioning. In particular, adding a new server to an installation should not install the latest version of an image that may currently be used for testing, but the very same stable version all other servers are using.

As with individual software packages, the iPXE/Matchbox bootstrap mechanism can be used to deploy images on the servers. Yet, this still leaves us with the problems described above. Instead, container orchestration provides a centralised management of the entire container-based software landscape. It further provides a centralised access point for software upgrades (replacing an image with an updated image of the same logical component). Finally, container orchestrators provide additional services such as DNS that ease the wiring between containers and hence the creation of container images. While multiple orchestrators exist that are equally suited for our demands, Rancher with its Cattle orchestration engine provides a sweet spot between complexity and features, compared to more complex environments such as Kubernetes/OpenShift and simpler environments such as Docker Compose and Docker Swarm.

As shown in the lower part of the previous figure, once the operating system is booted, the server contacts the Rancher server and receives the set of Docker containers to run. The Docker images needed for this purpose are pulled from one or more Docker registries. Rancher itself is hosted on a dedicated additional node, the Cloud Deployment Orchestration (CDO) node, to be described in a separate article. This node also runs the DHCP server and the Matchbox instance required for PXE booting. Using Rancher, services and service stacks are deployed from Docker Compose files that describe the individual services, their respective configuration, and their interdependencies.
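
As a hedged sketch of what such a stack definition might look like (the image name, tag, and labels are invented, not our actual files), the following Compose file pins a service to an exact image version, in line with the versioning prerequisite above, and uses a Cattle scheduling label to place it on designated hosts:

```yaml
# docker-compose.yml for a Cattle stack: one service, pinned to an exact
# image tag so every host runs the identical artefact (never :latest).
version: '2'
services:
  dnsmasq:
    image: registry.example.org/infra/dnsmasq:1.4.2
    network_mode: host
    labels:
      # Cattle scheduling: only place this container on hosts labelled role=infra
      io.rancher.scheduler.affinity:host_label: role=infra
    restart: always
```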

Infrastructure Description

The mechanism based on PXE, iPXE, and Matchbox described above enables us to boot fresh and tailored operating system installations whenever a server is restarted. A prerequisite for this, however, is that configuration files are available for each and every server (while Matchbox supports more general configurations, these have not proven useful in the past). While a manual creation of these files is possible, it is an error-prone, time-consuming, and highly repetitive task.

For that purpose, we use an additional generation step that creates the files from an infrastructure description. This description comprises the servers, including their names and NICs, as well as the wiring of the NICs in the network topology. This information is fed into a parser that generates an installation file per server from the server-specific information and adds general information such as user accounts, baseline services, and files. This process is sketched in the following figure, together with the other artefacts used in the Matchbox and TFTP servers.

Figure: Configuration files and artefacts used in the Matchbox and TFTP servers
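
The exact format of this description is specific to our tooling and not shown here in full, but a hypothetical excerpt might look as follows (all values invented):

```yaml
# Hypothetical excerpt of the infrastructure description. The parser turns
# each server entry into a per-server Container Linux configuration file.
servers:
  - name: compute-21
    role: compute
    nics:
      - mac: "52:54:00:a1:9c:ae"
        address: 10.0.0.21/24
        switch: sw-rack2
        port: 14
defaults:
  gateway: 10.0.0.1
  dns: [10.0.0.2]
```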

Persistence Layer and State

The entire infrastructure set-up has been designed to be immutable and hence to keep no state outside its declarative configuration. In consequence, all state we have discussed so far is either contained in the configuration files for the installation of Container Linux or in the Docker Compose files stored in Rancher. To a very large extent, this information is sufficient and self-contained. However, there are smaller pieces of state at several locations that either must be persisted to ensure stability or are worth persisting to improve deployment speed and recovery time. We briefly skim these aspects in the following; they are detailed in a separate article.

Being one of the two central elements of the set-up, the state of the CDO node is crucial to the functioning of the entire approach: for (i) PXE boot, it holds the DHCP entries and the server configurations in Matchbox, as well as the operating system images loaded during boot-up. For (ii) deployment, it holds the credentials for the Docker image registry and stores deployment configurations (services and service stacks), as well as the information about which containers run on which nodes.

To be able to recognise nodes when they come back online after a reboot or a failure, Rancher assigns a UUID per node. We persist this UUID on the nodes to speed up re-introducing them after a restart. Similarly, we persist the images as well as the state of the containers currently running on a node. This avoids having to download all images from the registry again and thereby lowers the bootstrap time.
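
One way to realise this, sketched here under the assumption of a local disk partition at /dev/sda1, is to mount persistent storage into the otherwise volatile node via a systemd mount unit in the Container Linux Config:

```yaml
# Sketch: mount a local disk partition at /var/lib/docker so that pulled
# images and container state survive reboots of the otherwise stateless node.
systemd:
  units:
    - name: var-lib-docker.mount
      enabled: true
      contents: |
        [Unit]
        Before=docker.service

        [Mount]
        What=/dev/sda1
        Where=/var/lib/docker
        Type=ext4

        [Install]
        WantedBy=local-fs.target
```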

OpenStack Deployment

On top of the hardware, we operate a private OpenStack cloud managed by the Rancher orchestrator. OpenStack is a multi-tenant management software for Infrastructure-as-a-Service (IaaS). It provides access through a Web-based dashboard or via a REST interface. At its core, OpenStack provides users with the capabilities to operate a virtualised infrastructure. The OpenStack project features a few dozen sub-projects of varying status and maturity that realise its features. In our installation, we run the following highly mature services:

  • nova (compute service): Provides the mechanism to start/stop/migrate virtual machines. Requires the operation of a hypervisor on the compute nodes.
  • glance (image service): Provides the capabilities to store virtual machine images; and to snapshot running virtual machines.
  • cinder (block storage): Provides iSCSI-based block storage that can be mounted to virtual machines.
  • neutron (networking): Provides virtual networks and shields the data of different tenants from each other.
  • keystone (identity service): Provides the central integration point for user and project management. Allows to set quotas and add new users.
  • horizon (dashboard): Provides a user-friendly Web-based GUI for accessing OpenStack.

While we are currently running OpenStack Victoria, we have been operating OpenStack in an immutable way since 2017 and have also used the containerized approach to seamlessly update between different versions. The installation process can be seen in this video.

In order to host OpenStack, all physical nodes in the set-up run CoreOS as a minimal operating system. On top of this, the compute nodes run KVM as a hypervisor. Further, each compute node runs an instance of the nova agent that orchestrates KVM (e.g. triggers the start of a virtual machine) as well as an instance of the neutron agent that orchestrates the network configuration (e.g. creates a new virtual network). The remaining nodes do not run hypervisors: the storage node runs Glance and Cinder; the control node runs nova's and neutron's orchestration services as well as Keystone; the management node operates the Horizon dashboard and a reverse proxy for SSL termination. Following our deployment schema, all of these applications run inside Docker containers.
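
To illustrate the deployment schema rather than our actual stack files, a containerized hypervisor agent might be described roughly as follows; the image name, tag, and mounts are assumptions:

```yaml
# Sketch of a nova-compute agent as a Cattle service: host networking,
# access to /dev/kvm, and libvirt/nova state mounted from the host.
version: '2'
services:
  nova-compute:
    image: registry.example.org/openstack/nova-compute:victoria
    network_mode: host
    privileged: true
    devices:
      - /dev/kvm:/dev/kvm
    volumes:
      - /etc/nova:/etc/nova:ro
      - /var/lib/nova:/var/lib/nova
      - /var/run/libvirt:/var/run/libvirt
    restart: always
```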

Summary and Outlook

In summary, running our infrastructure following the vertically immutable pattern has helped us to increase the productivity of our whole IT infrastructure despite the limited manpower available for managing it. Using containers and versioning has further improved the overall stability and resilience of the set-up.

Further articles will cover the bootstrapping of the CDO node and describe the OpenStack deployment in more detail, including the management of persistent state.


Jörg Domaschka, omi-uulm
researcher, automation and performance engineer: distributed systems, distributed databases, automation, containers, resilience