What Version is Your Infrastructure?


I've often assumed that everyone knows the importance of application versioning, and I am surprised when someone thinks differently. On the rare occasion that I'm confronted with someone who insists that they shouldn't version anything, I stare at them and die a little inside. Luckily for my mental health, most technologists agree that versioning is important, yet they disagree on the method. Application versioning schemes are one of those debates like tabs vs. spaces or which way to turn the toilet paper roll. I've spent hours debating the advantages of one type of versioning over another. I'm here to say that I don't care about the versioning style. I care about what is not getting versioned.

Despite years of collective pain trying to obtain knowability (also known as what the hell is going on) around what code is running in production, our industry is only in the infancy of ascribing the same type of knowability to the infrastructure. Think about it. If you know the version of your software artifact but don't have certainty about the underlying platform on which it is running, do you really have any certainty about your application? It is like versioning your application as 3.something or 4.maybe, because you are missing a key portion (the minor version) of the information needed to know exactly what is running. With the first version number you have a foggy idea about what is going on, but that's it. This leads us to the question:

Can you tell me what version your infrastructure is?

Being able to answer this question is at least as important as being able to answer it for your application.

If you still aren't convinced that you want to know the exact footprint of your application's infrastructure, think about the following questions:

  • What do you consider infrastructure? What is meaningful to you about the platform in which your application is running?
  • What is the first question you're asked after you report a bug?
  • How do you know what shared libraries an application is using at runtime?
  • What is the version of the application framework or server running your application?
  • What version of the OS is running your application? When was it last patched?
  • What deviations from your OS's default configuration have been made since it was installed? Better yet, what install options were used when it was set up?
  • What is the version of the database that your application is connecting to?
  • What are the firewall rules needed to secure your application?
  • What scheduled jobs need to be run to make your application function? Are they invoked from another system?
  • Are you selling a (metaphorically or literally) boxed software product and have zero control over where it is run? In that case, this article may not be very helpful because you are out of luck unless you can run it in a container. If you can do that, then keep reading.

If your answers involved sshing into the box or checking a wiki, you should keep reading.
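To make a few of those questions concrete, here is a minimal sketch of a program reporting part of its own footprint instead of someone sshing in to look. It uses only Python's standard library; the function name collect_footprint and the dictionary keys are my own illustrative choices, and a real footprint would need far more than this (shared libraries, patch levels, database versions and so on).

```python
import json
import platform


def collect_footprint() -> dict:
    """Gather a few facts about the host that the questions above ask for."""
    return {
        "os": platform.system(),            # e.g. "Linux"
        "os_release": platform.release(),   # kernel or OS release string
        "architecture": platform.machine(),
        "runtime": "python-" + platform.python_version(),
    }


if __name__ == "__main__":
    # Emit the footprint as JSON so it can be attached to a bug report.
    print(json.dumps(collect_footprint(), indent=2))
```

Attaching output like this to every bug report answers the "what is it running on?" question before anyone has to ask it.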

Converging on Immutable Infrastructure

Two decades before the term Immutable Infrastructure was ever used, Software Configuration Management (SCM) was an area of study for academics and professionals alike. This discipline is somewhat generalized because it focuses on all aspects of the state in which a software system could be configured. Yet, infrastructure state changes were very much within its domain, so in fairness, we can't begin to talk about Immutable Infrastructure without acknowledging the efforts of the SCM community. Moreover, just the creation of this discipline alone was a jump forward in the knowability of production applications because it brought some modicum of self-awareness to the act of software deployment and hence infrastructure state changes.

I couldn't find out definitively who coined the term Immutable Infrastructure, but I did find references from Chad Fowler in 2013, and it wasn't until January of 2014 that searches for the term started to take off. I find the term a good approximation but ultimately an inaccurate description of reality when interpreted literally. I don't want to say that completely unchanging infrastructure abstractions aren't possible, but there are always going to be state changes on the underlying hardware, so I don't see it coming anytime in my lifetime. You can think of immutability as one end of a spectrum. On one end you have an unknown, possibly changing infrastructure, and on the other end you have a completely known, unchanging infrastructure. In this sense, immutability is tied to knowability. If you don't know the state, how can you claim that it is unchanged compared to a previous state?

All infrastructure is mutable to some extent; otherwise it wouldn't be useful. What would a computer be without state changes? What would it compute if there were no inputs? With the current state of software system complexity, practically speaking no one can predict with absolute certainty what instructions will be running on a system at any given time when something like a web service call is made. Sure, you can make a million technical arguments about profilers or the instructions executed by your application, but at the end of the day, with a typical multi-threaded operating system full of junk, I don't believe that any non-academic operator has any form of knowability about the actual instructions run as a whole on a system. There are so many things going on at once and interacting with each other, before we even start to go deeper and look at the influence of cosmic rays on the variability of system state. The halting problem already tells us that, in general, we cannot decide whether an arbitrary program will ever finish. With the typical systems of our era, you don't even have certainty about what software is actually running, let alone when that unknown application will complete! That said, I'm sure that I'm technically wrong in some theoretical sense, but I'm convinced that I'm pragmatically correct.

Now, if we move away from absolute definitions of immutability towards more practical forms of immutability, what does this mean? For one, this becomes less of a discussion about the science of computing and more a discussion about the craft of creating knowable systems for human agents. Thus, we arrive at a practical definition of Immutable Infrastructure as a type of infrastructure that minimizes unknowable state changes.

Before the era of virtual machines, technically mature organizations took great care to standardize the hardware and operating systems for application server clusters. This was a manual process that became automated with the advent of disk imaging utilities or network booting technologies. These were both huge strides in getting us toward a more immutable definition of an application’s host system because they provided predictability about the initial state of the systems running an application.

At this time, setup scripts for building machines were created on an ad-hoc basis with no standardization. Each company had its own way of building systems, and as time went on, systems that had started in the same state would slowly diverge and become less predictable. There may have been common patterns, but there was no clear versionable artifact defining infrastructure, with the arguable exception of disk images. However, we did start to see the precursors of a more immutable infrastructure with tools that would reboot systems between user logins and restore the OS to a predefined state.

Although VMware was founded in 1998, it wasn't until after 2001 that we started to see the wide-scale adoption of virtual machines in data centers. Once they came on the scene, the operational efficiencies started a quiet revolution. While improving the cost-to-performance ratio by increasing the utilization of the underlying hardware (in terms of power consumption and density), virtual machines also improved the knowability of infrastructure. They did this by allowing the free exchange of disk images as a common artifact of an operating system's definition. Not only did VM images provide a known starting place, like physical disk copies and network boot configurations, but they also provided snapshots that would allow you to revert to known states. Suddenly, you could efficiently develop standardized OS images that you could share within an organization. However, the sheer size of virtual machine disk images limited the utility of the solution in terms of reusability between developers. It was the operators that benefitted primarily.

Later, tools for building and configuring virtual machines started to mature, and we started to see tools that would configure a virtual machine base disk image using a script or recipe. Tools like Puppet, Chef and Vagrant work on this principle. By storing an identifier for the base machine and the steps needed to set up the machine for an application in a versionable artifact, we were able to get one step closer to the notion of immutable infrastructure. However, there was a problem with this model. There was no way to guarantee that the base machine disk image that the VM was running was consistent across different VMs unless you had defined it with the same toolset. This led to setup scripts failing when switching from an in-house integration server running VMware to a production server running on the public cloud. There were often minor differences in the configuration of the operating system that would lead to major headaches due to the lack of portability, yet the net benefit of these solutions was such that there was widespread adoption.

An Old Solution Re-emerges

At the same time that script-driven configurations were taking off, platform as a service (PaaS) models on the cloud started to gain market share. Compared to script-driven configurations and virtual disk images, PaaS promised to simplify the process of getting to knowable infrastructure. You would just write an application and it would run on someone else's infrastructure. As long as your application conformed to the limitations of the platform, you didn't need to worry about the underlying infrastructure. You could just move a slider to create more running copies of your program that got auto-added to a load balancer. This was an amazing promise: now we could just outsource our infrastructure. For simple applications, this pattern worked well and still works well. That's why many startups deploy their applications to Heroku. It saves you a lot of time in the simple use cases. However, once your application starts to demand more from the underlying infrastructure, this model can become untenable. Most PaaS providers offered a means to customize their base deployment images, but it was often poorly documented, cumbersome and specific to a single vendor.

Recently, containerization has gained traction as a middle-of-the-road solution between IaaS and PaaS. There is an important lesson emerging from the Docker implementation of containers. It was a PaaS provider that kickstarted the container revolution by creating Docker. Containerization as a technology has been around for a long time in technology years. From BSD jails (March 2000) to Solaris zones (February 2004) and now to LXC (version 1.0 in February 2014), we see an odd degradation of feature sets. With jails and zones being more mature technologies, why did LXC (and thus Docker) make such an impact? One could say it was because Docker is native to Linux, or because Docker cared about the developer experience. I would say that the success lies in the developer experience but also in the knowability that it brought to the creation of OS platforms.

With Docker, you have absolute certainty that the underlying image is constant. This is also true for virtual machines, zones and jails. However, the key difference is that every single change from an immutable base abstraction is easily versioned and discoverable. You get a community standardized artifact in the form of a Dockerfile that you can use to trace the build steps to make the platform image. Each build step can be its own image and shared with other developers. This enables another stream to enter into your software development lifecycle (SDLC). Now the infrastructure for your application’s OS can be versioned independently of your application. It can have its own SDLC. It can have its own inheritance model. It can have platform image specialists modify it outside of the scope of the application and all of those modifications are known and recorded.

Separating the infrastructure SDLC and the application SDLC is now easy

What Good is a Container if You Don’t Know the Ship it is on?

With containers you can shorten the time of unknowable mutability in infrastructure, but they don't address the problem of the mutability of the underlying host platform. We just have faith that it works and do whatever incantations we can to ward off the evil spirits from the host running the container. Even though containers are a great step forward towards Immutable Infrastructure, they only address a single computational unit (i.e., a machine), and they don't address the complex interaction of all the dependent systems that are interacting with the application over the network or other IO channels. In other words, versioning containers doesn't address the versions of everything else that you connect them to, but it is a step in the right direction.

To the extent that other systems (not the versioned application) are knowable, we should aim to shepherd them towards an immutable model as well*. This means that we need visibility into the versions of each module of a system and into how the modules are linked together. In this model, you would have insight into the state of your data stores, networking settings and any coupled services. If you could get a coherent version identifier from each of them, then you would be able to build a version identifier for your entire virtual data center. It would be derived purely from the individual components' versions combined, whether as a concatenated string or as a hash of one. Recording the identifiers and correlating them to versioned artifacts would typically be the job of a continuous integration and/or continuous delivery system.
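As a rough sketch of how such an identifier could be built, assume each component reports a name and a version string. The component names, the hyphen-separated layout and the helper names compose_version and digest are illustrative assumptions of mine, not an established convention.

```python
import hashlib


def compose_version(components: dict) -> str:
    """Join component versions into one readable identifier."""
    # Sort by component name so the same set of versions always yields the same id.
    return "-".join(f"{name}-{version}" for name, version in sorted(components.items()))


def digest(version_id: str, length: int = 12) -> str:
    """Short SHA-256 digest of the readable identifier, usable as a compact label."""
    return hashlib.sha256(version_id.encode("utf-8")).hexdigest()[:length]


datacenter = {"micro": "2.1", "os": "0.9"}  # hypothetical components and versions
version_id = compose_version(datacenter)
print(version_id)          # micro-2.1-os-0.9
print(digest(version_id))  # 12 hex characters derived from the id above
```

The readable form is the one you can look things up with; the digest is only a compact label, since you cannot work backwards from a hash to the component versions without a record of what produced it.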

An example of how a version could represent the knowable state of a virtual data center.

You can then trace each of the version identifiers through the build system and ultimately to source control. You could also embed the versions directly in your application so that error reporting tools could correlate errors directly with the deployed versions of the entire infrastructure.

With a central version identifier, you have a central authority that provides knowability about the state of every single component within an abstract boundary (the virtual data center). With the version id above, if I wanted to know what version of a REST framework was running and what port was serving requests for my microservice, it would be as simple as parsing the version id (micro-2.1-os-0.9) and looking up the corresponding artifact in source control. The key point is that your master version identifier can be parsed such that you can find out the exact version of each component in production as it is stored in source control. Moreover, it should be composable so that it can be part of the version identifier of another component. For example, the large string represented in the diagram above could be embedded in the version identifier of a larger software system.
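Parsing an identifier like micro-2.1-os-0.9 can be sketched in a few lines. This assumes the id is a flat sequence of name-version pairs separated by hyphens, and that neither names nor versions contain hyphens themselves; both are simplifying assumptions, and parse_version_id is a hypothetical helper name.

```python
def parse_version_id(version_id: str) -> dict:
    """Split 'name-version-name-version...' into a {name: version} mapping."""
    parts = version_id.split("-")
    if len(parts) % 2 != 0:
        raise ValueError(f"expected name-version pairs, got {version_id!r}")
    # Even positions are component names, odd positions are their versions.
    return dict(zip(parts[0::2], parts[1::2]))


versions = parse_version_id("micro-2.1-os-0.9")
print(versions["micro"])  # 2.1
print(versions["os"])     # 0.9
```

Each name and version pair is then something you can look up as a tag or artifact in source control. A real scheme would need an escaping rule or a different separator before it could embed one identifier inside another.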

An Exercise for the Reader

As I stated in the beginning, I’m worried about what is not being versioned more than what is being versioned. In that spirit, I've composed a questionnaire for your enjoyment. Check each practice that you are doing.

What did I forget to version?

☐ All software built from source code deployed to production is stored in source control. Are we sure?

☐ All software not built from source code (third-party binaries) has version identifiers.

☐ We know and can track the version of all of the libraries that our components depend upon.

☐ We have systems set up to detect convergent dependencies so that they are never auto-upgraded without our knowledge.

☐ We proxy all external component systems (e.g., Maven, NuGet, npm, CPAN, gems, etc.) so that the version of every library added is always discoverable and forensically analyzable.

☐ We never curl -s http://dodgysite.com/useful-utility.sh | sudo bash anything onto our infrastructure, right? We understand that we are downloading an unknown version that could p0wn our entire data center.

☐ We understand what is configuration in our infrastructure and what isn’t. Do we really?

☐ Our build and compilation environment, options, linked libraries and optimization flags can be derived from our version identifiers.

☐ We know the version of all of the APIs that we are expecting when communicating as a client.

☐ We know the potential versions of all of the APIs that we are hosting as a server.

☐ The versions of our load balancers, reverse proxies, application delivery controllers, SSL termination devices and any other network infrastructure are known and relatable to other versioned artifacts.

☐ We maintain a clearly versioned schema for our data store. We know the runtime structure of the data store and can relate it to the version of any of its accessors.

☐ We have all of our firewall rules well documented and versioned, so that no one can sneak changes in without us knowing what should be there.

☐ All of our network linkages are versioned. We know what components talk to each other and what components don't. We can track changes in the coupling between network components.

☐ We know the settings for all recurring jobs and we feel confident that those are the correct settings.

☐ The OS of each component is clearly identified and the version is discoverable.

☐ The configuration of the OS of each component is versioned and we have absolute certainty that we are starting software from a known state.

☐ All version identifiers can be related to source code or to identifiers that the vendor understands.

☐ All version identifiers are stored somewhere useful to developers, operations and product managers such that everyone knows what version of what is in production, staging or test.

☐ All purpose-specific hardware is documented and the underlying hardware for systems can be rebuilt in a consistent state, or this doesn't matter because we are using containers or virtualization.

☐ If we are using containers or virtualization, the underlying virtualization or containerization framework’s settings are clearly documented and referenceable.

☐ When building our OS images, we use frameworks that allow for versioning and we store the versions in a way that is relatable to what is deployed.

☐ We check the hashes for binaries that we have downloaded and installed on our OS images.

☐ We have systems set up that allow us to do production debugging, analysis and testing so that we don't have to violate our core versioning principles (i.e., devs don't have an ssh terminal open editing source in prod to diagnose a problem). These systems may be things like canary deployment tools (like Distelli), production-safe profiling tools (like DTrace), or application performance monitoring utilities (like New Relic).

☐ When our software core dumps, we know the versions of whatever caused the core dump and we can pinpoint exactly where the cause was in our source code or libraries.

☐ We know who pushed what when to where.

I think I had some occupational flashbacks while writing that list because it involved some painful self-reflection (please don't judge my GitHub account!).

Take a good long look at the boxes that you didn’t check. Are these things important? By not having versioning associated with them, what is the risk? What is the benefit of not versioning it?

Once you have a good handle on the versions that you are collecting, it is important to store the version identifiers in a place that is useful to your organization. A text file in git that tells you what got pushed to staging or production may be sufficient, but it is important that such a system exists.
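Here is a minimal sketch of what that text file could look like in practice: one JSON line per deployment, appended by whatever does the pushing. The file name, field names and function names are all hypothetical, and the example writes to a temporary directory only so that it can be run safely; in real use the file would live in a git repository.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_deployment(log_path: Path, environment: str, version_id: str, pushed_by: str) -> dict:
    """Append one JSON line describing who pushed what, when, and to where."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "environment": environment,
        "version": version_id,
        "pushed_by": pushed_by,
    }
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def current_version(log_path: Path, environment: str):
    """Return the most recently recorded version for an environment, or None."""
    latest = None
    for line in log_path.read_text(encoding="utf-8").splitlines():
        entry = json.loads(line)
        if entry["environment"] == environment:
            latest = entry["version"]
    return latest


with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "deployments.jsonl"  # hypothetical file name
    record_deployment(log_file, "staging", "micro-2.1-os-0.9", "alice")
    record_deployment(log_file, "production", "micro-2.0-os-0.9", "bob")
    print(current_version(log_file, "staging"))  # micro-2.1-os-0.9
```

An append-only file like this keeps the history of every push, and git adds a second record of who changed the file and when.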

Next Steps

I've laid out above a conceptual framework for how it is now possible to design systems such that they always have a knowable state. Naturally, the next questions are: How do we build it? How much work does it take to set up? Are there trade-offs between knowability and productivity? Can we use an existing project to set this up?

In the second part of this series about immutable infrastructure, I’m going to be answering these questions and diving into what you can do today to make this work in the cloud.