A Spectrum of Mutability
I noted in my previous post that 100% read-only infrastructure is a nice aspiration to keep in mind, and I stand by that. Practically, however, it may not be sensible in all parts of a system. Consider adding a rule to a Security Group in AWS: a test needs ping, say, so ICMP needs to be opened. That’s a mutation, and it could break something else through some subtle interaction. So what’s the solution? Spin up a whole new AWS account with another copy of the entire system in it? Deploy all my services and throw the Big Switch on an ELB to move the traffic across? Well, I should have a single source of truth in git which declares everything necessary to rebuild my system, so it’s easy, right?
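To make the Security Group example concrete, here is a minimal Python sketch of that mutation. The group ID and CIDR are made-up examples; the dict is the shape boto3’s `authorize_security_group_ingress` call expects, but I only build the payload here rather than making the API call.

```python
# Sketch: the IpPermissions entry that opens ICMP on a Security Group.
# The CIDR and group ID below are hypothetical examples.

def icmp_ingress_rule(cidr: str) -> dict:
    """Build an ingress rule allowing all ICMP types from `cidr`."""
    return {
        "IpProtocol": "icmp",
        "FromPort": -1,   # -1/-1 means all ICMP types and codes
        "ToPort": -1,
        "IpRanges": [{"CidrIp": cidr, "Description": "allow ping for tests"}],
    }

rule = icmp_ingress_rule("10.0.0.0/16")
# With boto3 this would be applied as:
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(GroupId="sg-0123abcd",
#                                        IpPermissions=[rule])
```

A one-line payload, and yet it changes the behaviour of everything behind that group, which is exactly why it counts as a mutation.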
To me there are a few problems with this approach. Even if the process is completely automated, it’s very complicated, and that brings risk. Even if it seems quick in human time (quick compared to doing it manually), it’s orders of magnitude slower than starting a container. Even if you think everything that’s going to come up has worked together before, there are so many moving parts that I wouldn’t want to bet my production systems on every last detail being captured by my descriptions in git. There are second-order effects too: the caches in my new account will be cold, the moving averages for my monitoring and alerting will start from zero, and so on. And (promise you won’t tell anyone) stateless systems are a bit of a lie. Maybe a user is going through the checkout process, which isn’t serialised anywhere permanent until it’s done. While I might accept the risk that they’ll have to restart that transaction if I have a catastrophic failure (and blame their internet connection), I don’t want to make that happen unnecessarily.
What I suggest is a more pragmatic approach, based on a “spectrum of mutability”. I think of my stack from top to bottom as Containers, Orchestrator, OS, Hardware, and I accept more mutation the lower down I get. While I replace a container for any change, I do minor host OS patching in place. Think of RAID: if well managed, an array will continue to function and present the same functionality as before, with each layer noticing less and less as we go back up the spectrum (the kernel has to do actual work, service instances may notice degraded performance as the array rebuilds, and users should be load-balanced off slow instances and not notice at all). You wouldn’t build a new data center to replace a blown disk.
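One way to picture the spectrum is as an ordered policy table, top of the stack first. This is purely illustrative: the layer names come from the post, but the intermediate policy labels are my own guesses at where each layer sits.

```python
# Spectrum of mutability, top of the stack first: replace freely at the top,
# accept more in-place mutation further down. Policy labels are illustrative.
SPECTRUM = [
    ("Containers",   "replace on any change"),
    ("Orchestrator", "replace for all but trivial changes"),
    ("OS",           "patch minor changes in place"),
    ("Hardware",     "mutate in place (e.g. swap a blown disk)"),
]

def policy(layer: str) -> str:
    """Look up the mutation policy for a named layer."""
    return dict(SPECTRUM)[layer]
```

The ordering is the point: the further down the table you go, the less it costs to mutate in place relative to rebuilding from scratch.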
Building on my previous post, I still see the standard configuration management tools as a great fit here: Chef, Ansible, et al., as well as newer tools like terradiff, ansiblediff, etc. At some level your systems are always mutable: you should have a break-glass mechanism, and you had better believe a cracker can mutate things if they want to. These tools can spot and revert changes you didn’t intend. What changes as you move down the stack is intent. At the “immutable” layers you promise yourself that you’ll never change anything, which is to say never change the targets you declare to those tools. At the lower, “converging” layers, some movement of the targets is OK.
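The “spot and revert” behaviour those tools share can be sketched as a simple converge loop over declared versus observed state. Everything here is a stand-in, not any real tool’s API; the real tools diff against providers or a terraform plan, but the shape of the loop is the same.

```python
# Illustrative converge loop: compare the declared state (from git) with the
# observed state, and drive the observed state back to the declaration.
# The flat dicts and in-place update are stand-ins for a real tool's machinery.

def diff(declared: dict, observed: dict) -> dict:
    """Return keys that drifted, mapped to (observed, declared) pairs."""
    return {k: (observed.get(k), v)
            for k, v in declared.items() if observed.get(k) != v}

def converge(declared: dict, observed: dict) -> dict:
    """Revert any drift so that observed matches declared again."""
    for key, (_actual, wanted) in diff(declared, observed).items():
        observed[key] = wanted  # a real tool would call its provider API here
    return observed

declared = {"icmp_open": True, "ssh_port": 22}
observed = {"icmp_open": False, "ssh_port": 22}  # someone closed ICMP by hand
assert diff(declared, observed) == {"icmp_open": (False, True)}
converge(declared, observed)
assert observed == declared
```

In these terms, the spectrum is about the `declared` dict: on the immutable layers it never moves, so any diff is a violation; on the converging layers the declaration itself is allowed to drift forward over time.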
In my next post, I’ll discuss some strategies for managing this.