Stories by Matt Turner on Medium

Latching Mutations with GitOps

Matt Turner — Sun, 11 Feb 2018 14:35:08 GMT

Immutable Infrastructure has become a hot topic recently. I’ve written a couple of posts about it, and I think the term should be more strict than how it’s usually used. In my opinion, total immutability of infrastructure is a good aspiration, but not very practical.

The definition of “infrastructure” itself is blurred. Your app devs are now operators; they operate their own code, on top of a platform you provide. They specify the version and size and number of the containers running their product code. That’s their infrastructure, and no-one would argue that a new cluster should be stood up every time they want to push a new version. The raison d’être of the cloud-native movement is to enable them to do that more and faster.

No system can really be fully immutable (are you writing everything in Haskell?). Cindy Sridharan notes that entropy will always build up, and one major source of that is the churn of the apps running atop your platform. It makes sense to let these apps change in place. (Immutable Architecture is a different beast: never changing the existing set of services provided by those apps, e.g. by using protobuf to make sure the total API set only grows.)

In response to a new build of an app, or adding one to its replica count, its cluster can be completely replaced with one containing the new version/scale, or it can be mutated in place (i.e. Pods replaced or added). While the latter might seem eminently more sensible, whichever you choose is kind of irrelevant to the argument I’m about to make. That said, I think it’s important to talk about the following in the context of the current conversation around immutable infrastructure.

Alexis Richardson has been posting a phenomenal series about “GitOps”*, providing great processes for controllable changes to infrastructure. Kelsey Hightower has spoken about applying the same principles to app deployment — a separate “infrastructure” repo for the Kubernetes definitions behind your apps, and deployments thereof by Pull Request.

*(In short: his thesis is that everything you run should be declared in git. Automated tooling keeps your clusters in sync with that single declaration of truth. All changes are mediated and discussed through Pull Requests coming in from dev branches.)

If a cluster catches fire, so be it. A new one is started, and Weave Flux re-deploys everything that was previously running, because it’s all declared in git. Right? Well, should everything about the system be declared in git? My first reaction was “yes” — declare everything in git, bring it all under control. But what about something like application scale? We can guess at this a priori, but it’s ultimately a function of the environment — of actual user traffic rates — not of some engineering best-practice. And we certainly don’t want it done ad-hoc, with a dev watching CPU loads in grafana and raising a PR every minute.

Let’s consider the opposite: what if scale isn’t declared at all? Kelsey Hightower has said it shouldn’t be, so that an HPA can be used. But what if a system has traffic necessitating 10,000 Pods? If that cluster needs recovering, the result will be a Deployment of one Pod. That will be totally overwhelmed by the traffic, probably compound the problem by failing its healthcheck, and certainly offer no useful service to its users.

So I assert that we do want the scale declared in git. And, although the required scale is a function of the environment and can only be known empirically, that loop should be automated too; this is the essence of DevOps. Consider a tool that watches the Deployment and auto-commits each new scale (like a reverse Weave Flux). Committing every single change, though, would be so noisy that actual version upgrades wouldn’t be easily spotted, even with a separate (app) infrastructure repo.

With dynamic properties like scale, being roughly right is good enough. The CPU target is usually 70 or 80%, so there’s headroom. It’s sufficient just to declare a nearby round number: a multiple of 10, or an order of magnitude. This is what I suggest: auto-committing the closest round number to your current scale. This will get the system back to a place where it can cope. It might be a bit slow, or a bit wasteful, but it won’t die. Declare enough to get the system back up with one click, and let the HPA take the fine-tuning from there.
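The latching rule can be sketched in a few lines of Python. This is only an illustration, assuming that a round number means one significant figure; the names `latch` and `needs_commit` are hypothetical and do not belong to any existing tool.

```python
def latch(replicas: int) -> int:
    """Round a replica count to one significant figure, e.g. 7342 -> 7000."""
    if replicas <= 0:
        return 0
    # Largest power of ten not exceeding the count (1, 10, 100, ...).
    magnitude = 10 ** (len(str(replicas)) - 1)
    # Integer round-half-up to the nearest multiple of that magnitude.
    return (replicas + magnitude // 2) // magnitude * magnitude


def needs_commit(declared: int, live: int) -> bool:
    """True only when the live scale has moved to a different latch point."""
    return latch(live) != declared


if __name__ == "__main__":
    declared = 7000  # the value currently committed to git
    for live in (7100, 7342, 7499, 7600):
        print(live, latch(live), needs_commit(declared, live))
```

Rounding to the nearest value can declare slightly less than the live scale. Rounding up instead would trade a little waste for more headroom on recovery.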

From a manageability point-of-view, this “latching” behaviour keeps systems declared well enough in git, whilst not overloading operators with commits so numerous that they cease to have any value. This way, for example, they still function as audit logs — 3 users but a replica count of 10k probably means a computational complexity problem (or DoS attack) deserving attention. The automated tool could even PR each latch so it can be eyeballed to decide if its intentions are pure.

In GitOps terms, the “desired state”, i.e. that declared in git, is a rollback checkpoint; some things are meant to change, but if those changes go wrong, git will always describe the last, good, consistent state that you should go back to. All I’m saying is that a scale from 1 to 10,000 is something that’s material and should be checkpointed along the way. Think of it as a write-back cache maybe.

Clearly tools like kubediff either need to ignore this field, or understand the round-numbers latching policy.
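As a rough sketch of the first option, a diff can simply skip the latched field. This is not kubediff's actual interface; `flatten`, `drift`, and the simplified resource dictionaries are assumptions made for illustration.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {"a.b.c": value} pairs."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def drift(declared: dict, live: dict, ignored=frozenset({"spec.replicas"})) -> dict:
    """Return {path: (declared, live)} for every differing, non-ignored field."""
    want, have = flatten(declared), flatten(live)
    return {
        path: (want.get(path), have.get(path))
        for path in sorted(set(want) | set(have))
        if path not in ignored and want.get(path) != have.get(path)
    }


if __name__ == "__main__":
    declared = {"spec": {"replicas": 7000, "image": "app:v2"}}
    live = {"spec": {"replicas": 7342, "image": "app:v1"}}
    print(drift(declared, live))  # only the image difference is reported
```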

Minimum scale should still be specified (it’s a function of your users’ SLAs, though it lives in the infra repo rather than the code repo, as it’s the empirical result of that SLA married to a load test). Similarly, max scale can and should also be specified, again as a result of load testing (the point at which you’ve determined that second-order effects and the Universal Scalability Law kill you). These bounds are a function of the users’ requirements and the codebase, whereas run-time scale results from the environment.
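One way to picture how the bounds and the latched value interact on recovery is a simple clamp. The function name `recovery_replicas` and its parameters are hypothetical; this is a sketch of the idea, not a description of how any autoscaler applies its limits.

```python
def recovery_replicas(latched: int, min_replicas: int, max_replicas: int) -> int:
    """Clamp the latched scale from git into the load-tested bounds."""
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    return max(min_replicas, min(latched, max_replicas))


if __name__ == "__main__":
    # A latched value above the tested maximum is capped on recovery.
    print(recovery_replicas(latched=7000, min_replicas=3, max_replicas=5000))
```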

As a further example, take blue-green rollouts. If a cluster that was an unknown way through a roll-out is recovered from git, what state should be recreated? It’s wasteful to go back to 100% v1, if it was 90% through upgrading to v2. Conversely, it’s unsafe to go all-out with v2 if the scant 1% that had been rolled out had failed their health-checks. I posit that the in-flight ReplicaSets should be watched and their major progress milestones latched in git.
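Milestone latching for a rollout might look like the sketch below. The milestone percentages and the name `latched_milestone` are assumptions for illustration; the only idea taken from the text is that coarse, healthy progress gets recorded rather than every Pod change.

```python
# Hypothetical milestones, as percentages of the rollout completed.
MILESTONES = (0, 10, 25, 50, 75, 100)


def latched_milestone(healthy_new: int, total: int) -> int:
    """Return the last milestone passed, counting only healthy new-version Pods."""
    if total <= 0 or not 0 <= healthy_new <= total:
        raise ValueError("need 0 <= healthy_new <= total and total > 0")
    percent = 100 * healthy_new // total
    # Round down, so git never claims more progress than was actually healthy.
    return max(m for m in MILESTONES if m <= percent)


if __name__ == "__main__":
    print(latched_milestone(healthy_new=90, total=100))
    print(latched_milestone(healthy_new=0, total=100))
```

Counting only Pods that pass their health-checks means a failing 1% rollout stays latched at the 0% milestone, so a recovery would return to v1.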

In conclusion, changes are inevitable. Whether you scale apps by adding more Pods to an existing cluster, or even if you do make a whole new cluster of n Pods every time, the problem is the same: some changes have to happen in response to the environment, rather than by operator diktat. Even with a mutating cluster, for purposes of recovery, audit, and easy roll-forwards, you still want an up-to-date description of every material aspect of it in git, but without overwhelming your tooling or operators. By latching, you capture the important details, while being pragmatic about the amount of incidental churn you want to be bothered by.

matt. @mt165pro

Many thanks to Hannah Morris for editing and Alexis Richardson for reviewing.

A Spectrum of Mutability

Matt Turner — Wed, 11 Oct 2017 09:10:40 GMT

There’s been a lot of discussion recently of “immutable infrastructure”, roll-forwards, etc, by myself and others, including Alexis’ seminal post on GitOps.

I noted in my previous post that 100% read-only infrastructure is a nice aspiration to keep in mind, and I stick by that. Practically, however, it may not be so sensible in all parts of a system. Consider the example of adding a rule to a Security Group in AWS — a test needs Ping, say, so ICMP needs opening. That’s a mutation, and it could break something else through some subtle interaction. So what’s the solution? Make a whole new AWS account with another copy of the whole system in? Deploy all my services and throw the Big Switch on an ELB to move the traffic across? Well, I should have a single source of truth in git which declares everything necessary to rebuild my system. So it’s easy, right?

To me there are a few problems with this approach. Even if this process is completely automated, it’s very complicated and that brings risk. Even if it seems quick in human time, quick compared to doing it manually, it’s orders of magnitude slower than starting a container. Even if you think everything that’s going to come up has worked together before, there are so many moving parts I wouldn’t want to bet my production systems on there being nothing not captured by my descriptions in git. There are second-order effects too — the caches in my new account will be cold, the moving averages for my monitoring and alerting will be zero, etc. And (promise you won’t tell anyone) stateless systems are a bit of a lie. Maybe a user is going through the checkout process, which isn’t serialised anywhere permanent until it’s done. While I might accept the risk that they’ll have to restart that transaction if I have a catastrophic failure (and blame their internet connection), I don’t want to make it happen unnecessarily.

What I suggest is a more pragmatic approach, based on a “spectrum of mutability”. I think about my stack from top-to-bottom as Containers, Orchestrator, OS, Hardware, and I accept more mutation the lower down I get. While I replace a container for any change, I do minor host OS patching in place. Think of RAID: if well-managed, an array that loses a disk will continue to function and present the same functionality as before, with each layer noticing less and less as we go back up the spectrum (the kernel has to do actual work, service instances may notice degraded performance as the array rebuilds, users should be load-balanced off slow instances and not notice at all). You wouldn’t build a new data center to replace a blown disk.

Building on my previous post, I still see the standard configuration management tools as a great fit here: Chef, Ansible, et al, as well as newer tools like terradiff, ansiblediff, etc. At some level your systems are always mutable; you should have a break-glass mechanism, and you’d better believe a cracker can mutate things if they want to. These tools can spot and revert changes whenever you don’t want them to occur. What changes down the stack is intent. At the “immutable layers” you promise yourself that you’ll never change anything, that is, never change the targets you declare to those tools. On the lower, “converging layers”, some movement of the targets is ok.

In my next post, I’ll discuss some strategies for managing this.

Immutable definitions

Matt Turner — Fri, 22 Sep 2017 10:34:25 GMT

I’d like to talk quickly about my definition of immutable infrastructure. I’ve heard the term used recently to mean a few different things, and it’s caused people to talk at cross purposes. I’m not saying I’m right, but here’s one definition for discussion.

Let’s recap on why we want immutability. Why not update software in place? Partly because it’s very hard to do that without downtime. Partly because in-place file and database schema updates are hard, especially testing the combinatorial explosion of all the versions that might need to transition into each other. But also because state is a pain. The more state you build up, the more chance of it being erroneously structured, or pathologically big. There’s a reason systems like Erlang and Akka just flush state as their first attempt to fix a crash. State also doesn’t have to be deliberate: maybe you write logs up to the rotation maximum somewhere, then update to a different app version with a different log location. You’ve just accidentally doubled your disk usage; cue a weird bug due to a full disk in three months.

To me, immutability of a container or VM doesn’t just mean I don’t intend to update the software in place. It means I want to ensure nothing about that environment changes. All state should be off-box in a hosted DB. All logs go straight to Elasticsearch without touching the filesystem. All config comes through the environment, not a file. Immutable also means not hacking up fixes in place — it means having everything that describes an app and its environment in version control, baking and pushing new images for every change, à la gitops.
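Taking config from the environment can be as small as the sketch below. The variable names `DATABASE_URL` and `LOG_ENDPOINT`, and the `Config` class itself, are hypothetical; the point is only that nothing is read from a file and nothing can be changed after startup.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after startup
class Config:
    database_url: str
    log_endpoint: str


def load_config(env: Optional[Mapping[str, str]] = None) -> Config:
    """Build the config from environment variables only, never from a file."""
    env = os.environ if env is None else env
    # A missing variable raises KeyError, so a misconfigured app fails fast.
    return Config(
        database_url=env["DATABASE_URL"],
        log_endpoint=env["LOG_ENDPOINT"],
    )


if __name__ == "__main__":
    example = {"DATABASE_URL": "postgres://db/app", "LOG_ENDPOINT": "http://logs:9200"}
    print(load_config(example))
```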

Why not take steps to prevent mutation? An app like the one I described above should be able to run in a container with its filesystem mounted read-only. Do that, and let apps with unwanted side-effects fail fast. I can also try to extend this to my AWS estate by having very restrictive IAM roles and only allowing deployments via my CD system. I know it’s practically difficult to make everything 100% read-only, but I think it’s a good thing to aspire to.
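On Kubernetes, a read-only root filesystem is requested per container through the `readOnlyRootFilesystem` field of its `securityContext`. The sketch below builds such a Pod manifest as a Python dictionary and prints it as JSON; the helper `read_only_pod`, the Pod name, and the image are made up for illustration.

```python
import json


def read_only_pod(name: str, image: str) -> dict:
    """Build a minimal Pod manifest whose container cannot write to its root filesystem."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # An app with unwanted filesystem side-effects fails fast.
                    "securityContext": {"readOnlyRootFilesystem": True},
                }
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image; the JSON output can be saved and applied.
    print(json.dumps(read_only_pod("checkout", "example/checkout:1.0.0"), indent=2))
```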

As a final backstop I can undo mutation. Ironically, some of the best tools for this are the “old fashioned” converging infra tools — Puppet etc — which can detect and optionally proactively revert any changes. On the infra side we have terradiff, kubediff, etc. With such tools, we can continue to work the same way as with “immutable infrastructure” — for example: images can be pre-built with Packer, meaning we can boot and scale quickly and during build dependency outages. Repeatability is the same too; we still have a version-controlled declaration of what the infrastructure should look like at all levels, and with tools like Packer we can get it there and freeze it before anyone can use it, so it never changes under our users’ feet. By pulling in tools from the converging toolbox, we also have the added advantage of a “watchdog” putting any remaining mutable parts of the system back to where we want them, should they change.
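The watchdog idea reduces to comparing a declared state with the observed one and working out what to undo. The sketch below is a toy model of that loop, not how Puppet or any of the diff tools is implemented; `revert_actions` and the flat key/value state are assumptions.

```python
def revert_actions(desired: dict, actual: dict) -> list:
    """List the actions that put `actual` back to the declared `desired` state."""
    actions = []
    for key in sorted(set(desired) | set(actual)):
        if key not in desired:
            # Something appeared that was never declared: remove it.
            actions.append(("remove", key, None))
        elif key not in actual or actual[key] != desired[key]:
            # Something declared is missing or has been changed: restore it.
            actions.append(("set", key, desired[key]))
    return actions


if __name__ == "__main__":
    desired = {"ssh": "10.0.0.0/8"}
    actual = {"ssh": "0.0.0.0/0", "icmp": "0.0.0.0/0"}
    for action in revert_actions(desired, actual):
        print(action)
```

Whether the actions are applied automatically or only reported is a policy choice, matching the "detect and optionally proactively revert" distinction above.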

To summarise: I’m not advocating upgrading anything in place. I’m saying we should use any appropriate tools to roll changes back, not forward, to ensure things don’t mutate, at least not for long. Yes, this means tools that enable in-place changes are installed, so immutability becomes a function of process and culture (I strongly recommend you read about gitops). All I’m saying is that, given the intention to stay immutable, we have the technology to help.

I don’t yet have a great name for this. Although it uses “converging infrastructure” tools, it’s not that, and I’m hearing “immutable infrastructure” used very loosely, to describe much less than I’m talking about here. My best name is “actively immutable infrastructure”, but I’d love to hear better suggestions in the comments or on Twitter.