Immutability is Your Friend

John Easton
AI+ Enterprise Engineering
11 min read · May 19, 2021

How building your cloud on an unchanging foundation is key to cloud success

Photo by Simon Hattinga Verschure on Unsplash

The Cloud Engagement Hub works with clients, helping them exploit cloud. We frequently see organisations struggle as they start to embrace it. They try to use their existing IT management processes, and these are often not fit for purpose. That doesn’t mean the processes are wrong; they just struggle to cope with the speed and scale that working in the cloud demands. One of the key concepts to come to grips with is immutability.

If something is mutable, it is capable of being changed. Most traditional IT environments are mutable: changes are made to servers in place, whether for upgrades, patches, reconfiguration, or many other common tasks. It’s worked for decades, so why should we change it now? This article looks at immutability and how it helps simplify cloud operations.

We should start with the Pets vs Cattle analogy. I first came across this in the early twenty-tens when working with a team at CERN. Let’s consider the “pet” approach to server configuration first. Servers get installed with an initial operating system and software stack. If you need to do this for many servers, your effort grows. These servers are not static, of course: updates, fixes, and the like need deploying over time. We could automate this process using tools like Ansible to reduce the effort. The challenge comes when the install or update doesn’t quite work, or something else goes wrong. With multiple reasons for things to fail, servers can end up in many different states. It may only happen infrequently, but when it does, it can be hard to troubleshoot and fix. With a server in an uncertain state, the next upgrade or fix could leave it even more divergent from its peers. Over time, our supposedly homogeneous pool of ‘pet’ servers ends up in many different states. Many organisations have thousands or tens of thousands of servers, so even if updates work pretty much all the time, this problem is both significant and growing. The technical debt problem suddenly got a lot worse.

With immutable infrastructure, there is no patching. Instead, a new clean version replaces what was there before, and if something goes wrong with the replacement, we start over. This reduces both risk and complexity. Risk, because we know what state any resource is in at any moment: the provisioning either worked or it didn’t. Complexity, because there are far fewer potential states that the total pool of servers can be in.

So why don’t we do this? Simply because it requires people to work in different ways. Systems need designing differently to how they might have been before, and managing in new ways too. Many of us have logged into servers, edited config files, and read logs when problems occur; to us, this approach may well be anathema. If we are used to installing a database on the server, or defining storage inside the server to hold application data, we need to change that way of working as well. Think about it: if I replace the server, I’m going to lose the data. So the data my application uses has to sit outside the server. Not a major difference, and one that many companies have already made, but a change nonetheless. Often, cloud makes this easier because I consume a database service, or I write data to object storage. I’m externalising my data because that is just how the cloud works.

Immutable infrastructure

Let’s start at the bottom of our stack with immutable infrastructure. As we have seen, immutability improves reliability and agility. Using automation, you deploy new systems to a defined, consistent standard. This frees up operations staff to work on more valuable things. If something breaks, you can rapidly replace it by deploying a new clean image. This helps you keep application workloads running with less effort.

When talking about immutable infrastructure we should separate “technology” from “ways of working”. That said, using an operating system designed to support immutability certainly helps. There are several immutable operating systems to choose from. Most are Linux variants. These include Fedora Silverblue, openSUSE MicroOS, Red Hat CoreOS and Talos OS. Most of these immutable operating systems exist to run containerised workloads.

Immutable operating systems have key filesystems such as / and /usr mounted read-only. This is not a new idea: Unix systems have used the same approach to support diskless workstations for many years. The key thing is to keep the bulk of the operating system separate from configuration data, since the latter varies between systems. An OS like CoreOS has relatively few modifiable system settings. To install it, the OS image is downloaded to the target platform; then, an “Ignition” file provisions and configures the system. You need to generate an Ignition file for each configuration you deploy, and the automation the Ignition file provides simplifies deployment of many systems.
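
To make this concrete, here is a minimal sketch of a Butane config (Butane compiles human-friendly YAML into the JSON Ignition format that Fedora CoreOS consumes at first boot). The hostname and SSH key are placeholders:

```yaml
variant: fcos
version: 1.3.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 AAAA...   # placeholder public key
storage:
  files:
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: worker-01       # placeholder hostname
```

Running butane config.bu > config.ign produces the Ignition file you feed to the installer; everything else on the node stays exactly as shipped in the image.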

If your infrastructure is consistent, testing becomes simpler, automation becomes simpler, and the chance of introducing errors when moving from dev to test to production reduces. So why isn’t everyone doing this? Often, it is just because it’s different from what we did before. It is true that some of the tooling requires growing new skills. The key issue, though, is often an unwillingness to change.

You could potentially use any off-the-shelf OS and automation to achieve similar benefits to those above. However, using an immutable operating system will help force the new way of working. What is true for an OS is also true for a VM: rather than trying to scale up existing VMs, it is often easier and faster to just add more. This is especially true for a platform that consumes immutable VMs, like OpenShift.

Below is a side-by-side comparison of mutable vs immutable infrastructure.

Mutable vs Immutable Infrastructure

* This is a change to existing ways of working. However, it acts as a forcing function to a better future so it’s a plus for immutability.

Immutability and containers

Building on our immutable infrastructure, the next ‘layer’ is our containers. By definition, immutable containers don’t change during their lifetime: every deployed container running a given image in your environment is identical. The same image can be used in different deployments, each likely needing a different configuration. Whether you use ConfigMaps, Secrets, or some combination, this data must live outside the image itself.
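
As a sketch of what this looks like in Kubernetes (the names and values here are hypothetical), the image stays generic while the deployment injects its own configuration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config            # hypothetical name
data:
  DATABASE_HOST: db.internal.example.com
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0   # the immutable image
          envFrom:
            - configMapRef:
                name: app-config                     # config injected at deploy time
```

The same image could be deployed to test and production with different ConfigMaps; nothing environment-specific is baked in.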

You can, of course, update images to use a newer software version or add new capabilities. This creates a new image, with the previous image unchanged. You update your running environment by deploying the new image; if you need to back out the changes, you redeploy the old one. While we are talking about patching, consider how Java application runtimes have changed. WebSphere Application Server Network Deployment (WAS ND) was very much a “pet”: you would add apps, configure the runtime, patch WAS every release, and patch the OS to match WAS. Compare that with the newer Open Liberty, where the runtime is part of the container image. There is no patching of the OS nor the runtime. You just code, build, deploy and run.
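
With the hypothetical my-app Deployment above, that update-and-back-out cycle is just a pair of commands, since kubectl keeps the rollout history for you:

```
# roll forward to a newly built image
kubectl set image deployment/my-app my-app=registry.example.com/my-app:1.0.1
kubectl rollout status deployment/my-app

# if the new image misbehaves, redeploy the previous one
kubectl rollout undo deployment/my-app
```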

As with immutable operating systems, immutable containers improve security. A secured image never changes, so it is easy to verify its correctness. And as we have seen, the image contains no secrets or state that could be stolen, so even if an image were compromised, the impact would be limited.

Containers often only run for a short time, so you might think they need less management than traditional systems. This is untrue. A typical system likely has many more containers than its VM equivalent, which can increase the administrative load if you don’t use immutable images. Keeping many container images up to date becomes an overhead if you are not careful. Sourcing your images from a private container registry helps avoid introducing ‘bad’ images, but it is no substitute for verifying image integrity. All the images in your registry need updating to ensure they remain secure and safe to use. And if you have containerised an older application, the container could well run for a long time, much longer than its cloud-native peers, which makes all of this even more important. Oh, and remember that your container management layer (e.g. OpenShift) needs managing too!

On that subject, your Kubernetes cluster is a candidate for immutability too. Managed Kubernetes services from the cloud providers let you deploy new clusters easily, so it is now practical to treat the cluster itself as immutable and simply run content in it. If something breaks, throw the cluster away and get a new one.
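
As one illustration (eksctl here, but every major provider has an equivalent CLI), a throwaway cluster really is a couple of commands; the name and node count are arbitrary:

```
# stand up a fresh cluster...
eksctl create cluster --name demo --nodes 3

# ...and throw it away when it has served its purpose
eksctl delete cluster --name demo
```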

Note: many cloud environments use both containers and VMs, and this can give rise to challenges. VMs often come with a ‘pet’ mindset or heritage, especially when they have been lifted-and-shifted to the cloud. If so, implementing them on an immutable layer may give rise to unforeseen issues. You may need to re-platform applications onto an immutable VM image, or make other changes, to get this to work well.

Developers and Immutability

So, by now you may be asking yourself how you develop on a platform like this when you can’t change anything. Developers often install specific tools, libraries, and components to do their job, especially when the base operating system image doesn’t include them.

On Silverblue or CoreOS, I use “toolbox” to create a development environment. I can install my favourite tools and any other components I need for a given task, keeping my ‘messing about’ separate from the underlying operating system. I can even create a toolbox for a different operating system or version from that of the underlying system. If I do mess up, or when I’m finished with my work, I throw the toolbox environment away. Operations on a toolbox are simple: I ‘create’ one to provide my base environment, ‘enter’ it to do my work, ‘exit’ it when finished (ctrl-d), and finally ‘remove’ it if it’s no longer needed.
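
A typical session looks something like this (exact flags vary a little between toolbox versions; the release number is just an example):

```
$ toolbox create --release 34     # a Fedora 34 environment, named fedora-toolbox-34
$ toolbox enter --release 34      # shell inside the container; install tools with dnf as usual
$ exit                            # or ctrl-d: back on the (unchanged) host
$ toolbox rm fedora-toolbox-34    # discard the environment when done with it
```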

For much of the IT age, variables in programming languages have been mutable by default; immutability was the realm of functional programming and academic research. Now we see several modern languages, such as Rust and Swift, turning this on its head: mutable variables must be explicitly declared. Once again, this paradigm shift requires the developer to work in different ways, and if you are modernising existing code, it might force you to redesign or rewrite large portions of it.
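
A minimal Rust illustration of that default:

```rust
fn main() {
    let x = 5;       // immutable by default
    // x += 1;       // would not compile: cannot assign twice to immutable variable
    let mut y = 5;   // mutability has to be declared explicitly
    y += 1;
    println!("x = {}, y = {}", x, y);
}
```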

Developers can benefit from immutability in many ways. You avoid nasty surprises when your code changes data in unexpected ways, and in parallelised or multithreaded code you don’t need to synchronise access to immutable objects. This is not to say that there aren’t also costs. To change an immutable object, you need to copy it, and if the object is large, that copy might take more time than updating a mutable object in place. You need to balance the performance impacts against the coding benefits, for example by caching immutable objects to improve performance.
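
As a sketch of that trade-off in Rust: sharing an immutable value is cheap, while ‘changing’ it means paying for a copy:

```rust
use std::sync::Arc;

fn main() {
    let big = Arc::new(vec![0u8; 10_000_000]); // a large immutable buffer
    let shared = Arc::clone(&big);             // cheap: bumps a reference count, safe to share across threads
    let mut copy = (*big).clone();             // expensive: duplicates all ten million bytes...
    copy[0] = 1;                               // ...but only the copy can be modified
    println!("{} {} {}", big.len(), shared.len(), copy.len());
}
```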

Choose your programming language for the task in hand and use immutability wisely. It is not the answer for every coding problem.

Code obviously needs testing. How many of us, over the years, have spent hours debugging a strange problem, only to find that the test system was configured in some odd way, so what passed in test failed in production? Rapid deployment of clean, correct, immutable test environments eliminates this issue.

Immutability and Operations

One of the SRE mantras I particularly like is that if you’re required to log in to a server, you have failed! For those of us who have spent much of our careers doing just that, this is somewhat thought-provoking. Another way to look at it is to consider the FTE:server ratios of the large cloud providers, often cited as being in the region of 1:10,000. If you have to log in to servers, you’re never going to get remotely close to that sort of ratio. You may not have 10,000 servers to manage, but that shouldn’t make it less of an aspiration. Everything has to be automated, and there is zero tolerance for something that isn’t working: get it out of the way and replace it with a new one. Cattle vs. pets made real. And what are these FTEs deploying? Immutable systems, pre-configured and ready to go. Anything else won’t cut it. Immutability is our friend when trying to scale operational tasks.

Does that mean we don’t patch systems? Of course not, but the way we patch is to create a new image using automated build tooling, and that image becomes our new deployment. You then replace your running systems with it, via blue-green deployment or a rolling release, as you choose. Updating all your instances in one go is risky; rather, you need a controlled deployment strategy that lets you test changes early and roll back if required. Remember that this is not just about patching containers: each layer in the stack (OS, container management, etc.) has to be patched and maintained. This is more complex than the traditional server approach, but it is needed to give you the benefits of an immutable infrastructure.
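
In Kubernetes, for example, the controlled rollout is declarative. This is the sort of strategy stanza you might add to the hypothetical my-app Deployment shown earlier:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # never take more than one instance down at a time
      maxSurge: 1         # allow one extra instance to come up during the rollout
```

Combined with kubectl rollout undo from earlier, this gives you the test-early, roll-back-if-needed behaviour without ever patching a running container.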

Security means ensuring the things we deploy are secure, and stay secure through use. If our containers are immutable, then once they are running they never change: an image that passes security verification tests at build time can be trusted once deployed, and if a container is compromised at runtime, a fresh deployment takes it back to a known, clean state. Note that this doesn’t mean we’ve “fixed” security once and for all. Containers should only contain functionality you actually use; any OS components you don’t need should be removed to reduce the image’s attack surface. These reduced images also help when monitoring for unexpected filesystem changes.

What about backup and recovery? Stateless containers don’t need a backup strategy for their instances: you back up the images in the registry instead, and if an instance fails, you deploy a new, clean one. We do, though, need to back up the external storage and database services that hold your data. You can use immutability here too, writing your backups to immutable storage such as a WORM drive.
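
Object storage gives you a software equivalent of WORM. As a sketch using the AWS CLI (the bucket and object names are hypothetical), S3 Object Lock prevents a backup being modified or deleted until its retention date passes:

```
# create a bucket with Object Lock enabled (WORM-style protection)
aws s3api create-bucket --bucket my-backups --object-lock-enabled-for-bucket

# upload a backup, then lock it until a retention date
aws s3 cp db-backup.tar.gz s3://my-backups/db-backup.tar.gz
aws s3api put-object-retention --bucket my-backups --key db-backup.tar.gz \
    --retention '{"Mode":"COMPLIANCE","RetainUntilDate":"2021-11-19T00:00:00Z"}'
```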

Using immutable containers requires log files to be written to external storage, typically via the native logging capabilities of the platform. Monitoring those log files can then happen as usual, and you can use dashboards or other tools to help visualise the system. Immutability also makes automation easier, whether responding to events or simplifying ongoing management, because there are fewer variable factors to take into account. And our cattle vs pets approach tells us that replacement, rather than fixing in place, is the right response. That covers containers and the underlying infrastructure; what about the applications? Monitoring them works much as it does for non-containerised applications, so no changes are needed here, or you can use modern tools like Instana.

Conclusions

As we have seen, we can use immutable concepts at all levels of the “stack”. The benefits are most obvious in the infrastructure layer, but they extend up through applications and on into security and operations. They come at a potential cost, though: introducing immutability requires people to change the way they work, and there are likely changes in tooling and processes too. That said, the benefits clearly outweigh the potential costs and challenges. Immutability is a powerful tool in your arsenal when moving to an automated and agile future. It may not provide all the answers, but it’s an important piece of the puzzle.
