<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Matt Turner on Medium]]></title>
        <description><![CDATA[Stories by Matt Turner on Medium]]></description>
        <link>https://medium.com/@mt165?source=rss-58df10956df9------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*I5FwuGbYmI1BO7kcIeXVzQ.jpeg</url>
            <title>Stories by Matt Turner on Medium</title>
            <link>https://medium.com/@mt165?source=rss-58df10956df9------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 24 May 2026 02:28:12 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@mt165/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Latching Mutations with GitOps]]></title>
            <link>https://medium.com/@mt165/latching-mutations-with-gitops-92155e84a404?source=rss-58df10956df9------2</link>
            <guid isPermaLink="false">https://medium.com/p/92155e84a404</guid>
            <category><![CDATA[linux]]></category>
            <category><![CDATA[cloud]]></category>
            <category><![CDATA[infrastructure]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[docker]]></category>
            <dc:creator><![CDATA[Matt Turner]]></dc:creator>
            <pubDate>Sun, 11 Feb 2018 14:35:08 GMT</pubDate>
            <atom:updated>2018-02-11T14:35:08.339Z</atom:updated>
            <content:encoded><![CDATA[<p>Immutable Infrastructure has become a hot topic recently. I’ve written a <a href="https://medium.com/@mt165/immutable-definitions-f7e61593e3b0">couple</a> of <a href="https://medium.com/@mt165/a-spectrum-of-mutability-3f527268a146">posts</a> about it, and I think the term should be more strict than how it’s usually used. In my opinion, total immutability of infrastructure is a good aspiration, but not very practical.</p><p>The definition of “infrastructure” itself is blurred. Your app devs are now operators; <a href="https://www.youtube.com/watch?v=nMLyr8q5AWE">they operate their own code, on top of a platform you provide</a>. They specify the version and size and number of the containers running their product code. That’s <em>their</em> infrastructure, and no-one would argue that a new cluster should be stood up every time they want to push a new version. The raison d’être of the cloud-native movement is to enable them to do that <em>more</em> and <em>faster</em>.</p><p>No system really can be fully immutable (are <em>you</em> writing everything in Haskell?). <a href="https://twitter.com/copyconstruct/status/954133874002477056">Cindy Sridharan notes</a> that entropy will always build up, and one major source of that is the churn of the apps running atop your platform. It makes sense to let these apps change in place. (Immutable Architecture is a different beast — never changing the set of <em>services</em> provided by those apps by e.g. using protobuf to make sure the total API set only grows.)</p><p>In response to a new build of an app, or adding one to its replica count, its cluster can be completely replaced with one containing the new version/scale, or it can be mutated in place (i.e. Pods replaced or added). While the latter might seem eminently more sensible, whichever you choose is kind of irrelevant to the argument I’m about to make. 
That said, I think it’s important to talk about the following in the context of the current conversation around immutable infrastructure.</p><p><a href="https://twitter.com/monadic">Alexis Richardson</a> has been posting a phenomenal series about “<a href="https://www.weave.works/blog/gitops-operations-by-pull-request">GitOps</a>”*, providing great processes for controllable changes to infrastructure. Kelsey Hightower <a href="https://youtu.be/07jq-5VbBVQ?t=900">has spoken about</a> applying the same principles to app deployment — a separate “infrastructure” repo for the Kubernetes definitions behind your apps, and deployments thereof by Pull Request.</p><p><em>*(In short: his thesis is that everything you run should be declared in git. Automated tooling keeps your clusters in sync with that single declaration of truth. All changes are mediated and discussed through Pull Requests coming in from dev branches.)</em></p><p>If a cluster catches fire, so be it. A new one is started, and Weave Flux re-deploys everything that was previously running, because it’s all declared in git. Right? Well, should <em>everything </em>about the system be declared in git? My first reaction was “yes” — declare everything in git, bring it all under control. But what about something like application scale? We can <em>guess</em> at this a priori, but it’s ultimately a function of the environment — of actual user traffic rates — not of some engineering best-practice. And we certainly don’t want it done ad-hoc, with a dev watching CPU loads in grafana and raising a PR every minute.</p><p>Let’s consider the opposite: what if scale isn’t declared at all? Kelsey Hightower has said it shouldn’t be, so that an HPA can be used. But what if a system has traffic necessitating 10,000 Pods? If that cluster needs recovering, the result will be a Deployment of <em>one</em> Pod. 
That will be totally overwhelmed by the traffic, probably compound the problem by failing its healthcheck, and certainly offer no useful service to its users.</p><p>So I assert that we do want the scale declared in git. And, although the required scale is a function of the environment and can only be known empirically, that loop should be automated too; this is the essence of DevOps. Consider a tool that watches the Deployment and auto-commits each new scale (like a reverse Weave Flux). Even with a separate (app) infrastructure repo, that would be so noisy that actual version upgrades wouldn’t be easily spotted.</p><p>With dynamic properties like scale, being roughly right is good enough. The CPU target is usually 70 or 80%, so there’s headroom. It’s sufficient just to declare a nearby round number: a multiple of 10, or an order of magnitude. This is what I suggest: <strong>auto-committing the closest round number of your current scale.</strong> This will get the system back to a place where it can <em>cope</em>. It might be a bit slow, or a bit wasteful, but it won’t die. Declare enough to get the system back up with one click, and let the HPA take the fine-tuning from there.</p><p>From a manageability point-of-view, this “latching” behaviour keeps systems declared <em>well enough</em> in git, whilst not overloading operators with commits so numerous that they cease to have any value. This way, for example, they still function as audit logs — 3 users but a replica count of 10k probably means a computational complexity problem (or DoS attack) deserving attention. The automated tool could even PR each latch so it can be eyeballed to decide if its intentions are pure.</p><p>In GitOps terms, the “desired state”, i.e. that declared in git, is a rollback checkpoint; some things <em>are</em> meant to change, but if those changes go wrong, git will always describe the last, good, consistent state that you should go back to. 
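</p><p>The round-number latching suggested above can be sketched as follows. This is a minimal illustration in Python rather than a real tool: the names <code>latch</code> and <code>next_commit</code> are hypothetical, and rounding <em>up</em> to one significant figure (rather than to the nearest round number) is an assumption of the sketch, chosen so the declared scale errs towards wasteful rather than overwhelmed.</p>

```python
from typing import Optional


def latch(replicas: int) -> int:
    """Round a live replica count up to one significant figure.

    Examples: 7 -> 7, 42 -> 50, 8300 -> 9000, 10000 -> 10000.
    """
    if not replicas >= 0:
        raise ValueError("replica count cannot be negative")
    # Power of ten matching the count's digits (1 for single digits, including zero).
    magnitude = 10 ** (len(str(replicas)) - 1)
    # Integer ceiling division, then scale back up.
    return (replicas + magnitude - 1) // magnitude * magnitude


def next_commit(declared: int, live: int) -> Optional[int]:
    """Return the scale to commit to git, or None if the latch has not moved."""
    latched = latch(live)
    return latched if latched != declared else None
```

<p>With a declared scale of 50, a live count of 47 produces no commit, while 51 latches to 60. A real tool would probably also want some hysteresis, so that a Deployment hovering around a boundary doesn’t flap between two commits.</p><p>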
All I’m saying is that a scale from 1 to 10,000 is something that’s material and should be checkpointed along the way. Think of it as a write-back cache maybe.</p><p>Clearly tools like kubediff either need to ignore this field, or understand the round-numbers latching policy.</p><p>Minimum scale should still be specified (it’s a function of your users’ SLAs, though it lives in the infra repo not the code repo, as it’s the empirical result of that SLA married to a load test). Similarly, max scale can and should also be specified, again as a result of load testing (the point at which you’ve determined that second-order effects and the Universal Scalability Law kill you). These bounds are a function of the users’ requirements and the codebase, whereas run-time scale results from the environment.</p><p>As a further example, take blue-green rollouts. If a cluster that was an unknown way through a roll-out is recovered from git, what state should be recreated? It’s wasteful to go back to 100% v1, if it was 90% through upgrading to v2. Conversely, it’s unsafe to go all-out with v2 if the scant 1% that had been rolled out had failed their health-checks. I posit that the in-flight ReplicaSets should be watched and their major progress milestones latched in git.</p><p>In conclusion, changes are inevitable. Whether you scale apps by adding more Pods to an existing cluster, or even if you do make a whole new cluster of <em>n</em> Pods every time, the problem is the same: some changes have to happen in response to the environment, rather than by operator diktat. Even with a mutating cluster, for purposes of recovery, audit, and easy roll-forwards, you still want an up-to-date description of every material aspect of it in git, but without overwhelming your tooling or operators. By <em>latching</em>, you capture the important details, while being pragmatic about the amount of incidental churn you want to be bothered by.</p><p>matt. 
@<a href="https://twitter.com/mt165pro">mt165pro</a></p><p><em>Many thanks to Hannah Morris for editing and Alexis Richardson for reviewing.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Spectrum of Mutability]]></title>
            <link>https://medium.com/@mt165/a-spectrum-of-mutability-3f527268a146?source=rss-58df10956df9------2</link>
            <guid isPermaLink="false">https://medium.com/p/3f527268a146</guid>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Matt Turner]]></dc:creator>
            <pubDate>Wed, 11 Oct 2017 09:10:40 GMT</pubDate>
            <atom:updated>2017-10-11T09:10:40.779Z</atom:updated>
            <content:encoded><![CDATA[<p>There’s been a lot of discussion recently of “immutable infrastructure”, roll-forwards, etc, by <a href="https://medium.com/@mt165/immutable-definitions-f7e61593e3b0">myself</a> and others, including Alexis’ seminal post on <a href="https://www.weave.works/blog/gitops-operations-by-pull-request">GitOps</a>.</p><p>I noted in my previous post that 100% read-only infrastructure is a nice aspiration to keep in mind, and I stick by that. Practically, however, it may not be so sensible in all parts of a system. Consider the example of adding a rule to a Security Group in AWS — a test needs Ping, say, so ICMP needs opening. That’s a mutation, and it <em>could</em> break something else through some subtle interaction. So what’s the solution? Make a whole new AWS account with another copy of the whole system in? Deploy all my services and throw the Big Switch on an ELB to move the traffic across? Well, I <em>should</em> have a single source of truth in git which declares everything necessary to rebuild my system. So it’s easy, right?</p><p>To me there are a few problems with this approach. Even if this process is completely automated, it’s very complicated and that brings risk. Even if it seems quick in human time, quick compared to doing it manually, it’s orders of magnitude slower than starting a container. Even if you <em>think</em> everything that’s going to come up has worked together before, there are so many moving parts I wouldn’t want to bet my production systems on there being nothing not captured by my descriptions in git. There are second-order effects too — the caches in my new account will be cold, the moving averages for my monitoring and alerting will be zero, etc. And (promise you won’t tell anyone) stateless systems are a bit of a lie. Maybe a user is going through the checkout process, which isn’t serialised anywhere permanent until it’s done. 
While I might accept the risk that they’ll have to restart that transaction if I have a catastrophic failure (and blame their internet connection), I don’t <em>want</em> to make it happen unnecessarily.</p><p>What I suggest is a more pragmatic approach, based on a “spectrum of mutability”. I think about my stack from top-to-bottom as <a href="http://slides.eightypercent.net/linuxcon-ops-dividend/index.html#p1">Containers, Orchestrator, OS, Hardware</a>, and I accept more mutation the lower down I get. While I replace a container for any change, I do minor host OS patching in place. Think of RAID — if well-managed, an array will continue to function after a disk fails and present the same functionality as before, with each layer noticing less and less as we go back up the spectrum (the kernel has to do actual work, service instances may notice degraded performance as the array rebuilds, users should be load-balanced off slow instances and not notice at all). You wouldn’t build a new data center to replace a blown disk.</p><p>Building on <a href="https://medium.com/@mt165/immutable-definitions-f7e61593e3b0">my previous post</a>, I still see the standard configuration management tools as a great fit here — Chef, Ansible, et al, as well as newer tools like terradiff, ansiblediff, etc. At some level your systems are always mutable — you should have a break-glass mechanism, and you’d better believe a cracker can mutate things if they want to. These tools can spot and revert changes whenever you don’t want them to occur. What changes down the stack is <em>intent</em>. At the “immutable layers” you promise <em>yourself</em> that you’ll never change anything — never change the targets you declare to those tools. 
On the lower, “converging layers”, some movement of the targets is ok.</p><p>In my next post, I’ll discuss some strategies for managing this.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Immutable definitions]]></title>
            <link>https://medium.com/@mt165/immutable-definitions-f7e61593e3b0?source=rss-58df10956df9------2</link>
            <guid isPermaLink="false">https://medium.com/p/f7e61593e3b0</guid>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Matt Turner]]></dc:creator>
            <pubDate>Fri, 22 Sep 2017 10:34:25 GMT</pubDate>
            <atom:updated>2017-10-04T15:59:26.562Z</atom:updated>
            <content:encoded><![CDATA[<p>I’d like to talk quickly about my definition of <em>immutable infrastructure</em>. I’ve heard the term used recently to mean a few different things, and it’s caused people to talk at cross purposes. I’m not saying I’m right, but here’s one definition for discussion.</p><p>Let’s recap on why we want immutability. Why not update software in place? Partly because it’s very hard to do that without downtime. Partly because in-place file and database schema updates are hard, especially testing the combinatorial explosion of all the versions that might need to transition into each other. But also because state is a pain. The more state you build up, the more chance of it being erroneously structured, or pathologically big. There’s a reason systems like Erlang and Akka just flush state as their first attempt to fix a crash. State also doesn’t have to be deliberate — maybe you write logs up to the rotation maximum somewhere, then update to a different app version with a different log location. You’ve just accidentally doubled your quota; cue a weird bug due to a full disk in three months.</p><p>To me, immutability of a container or VM doesn’t just mean I don’t <em>intend</em> to update the software in place. It means I want to <strong>ensure</strong> nothing about that environment changes. All state should be off-box in a hosted DB. All logs go straight to Elasticsearch without touching the filesystem. <a href="https://12factor.net/config">All config comes through the environment, not a file</a>. Immutable also means not hacking up fixes in place — it means having everything that describes an app and its environment in version control, baking and pushing new images for every change, à la <a href="https://www.weave.works/blog/gitops-operations-by-pull-request">gitops</a>.</p><p>Why not take steps to <strong>prevent</strong> mutation? 
An app like the one I described above should be able to run in a container with its filesystem mounted read-only. Do that, and let apps with unwanted side-effects fail fast. I can also try to extend this to my AWS estate by having very restrictive IAM roles and only allowing deployments via my CD system. I know it’s practically difficult to make everything 100% read-only, but I think it’s a good thing to aspire to.</p><p>As a final backstop I can <strong>undo</strong> mutation. Ironically, some of the best tools for this are the “old fashioned” converging infra tools — Puppet etc — which can detect and optionally proactively revert any changes. On the infra side we have <a href="https://www.weave.works/blog/monitoring-kubernetes-infrastructure/">terradiff, kubediff</a>, etc. With such tools, we can continue to work the same way as with “immutable infrastructure” — for example, images can be pre-built with Packer, meaning we can boot and scale quickly, even during build-dependency outages. Repeatability is the same too; we still have a version-controlled declaration of what the infrastructure should look like at all levels, and with tools like Packer we can get it there and freeze it before anyone can use it, so it never changes under our users’ feet. By pulling in tools from the converging toolbox, we also have the added advantage of a “watchdog” putting any remaining mutable parts of the system back to where we want them, should they change.</p><p>To summarise: I’m not advocating upgrading anything in place. I’m saying we should use any appropriate tools to roll changes <em>back</em>, not forward. To ensure things <strong>don’t</strong> mutate, at least not for long. Yes, this means tools that enable in-place changes are installed, so immutability becomes a function of process and culture (I strongly recommend you read about <a href="https://www.weave.works/blog/gitops-operations-by-pull-request">gitops</a>). 
All I’m saying is that given the <em>intention</em> to do that, we have the technology to help.</p><p>I don’t yet have a great name for this. Although it uses “converging infrastructure” tools, it’s not that, and I’m hearing “immutable infrastructure” used very loosely, to describe much less than I’m talking about here. My best name is “actively immutable infrastructure”, but I’d love to hear better suggestions in the comments or <a href="https://twitter.com/mt165pro">on twitter</a>.</p>]]></content:encoded>
        </item>
    </channel>
</rss>