Delivering Infrastructure As Code

Codifying your configuration is only half the battle

Michael Stewart
The Tele2 Technology Blog
10 min read · Aug 13, 2020

--

Photo by Abraham Barrera on Unsplash

Infrastructure as code is not a new concept, and like a lot of teams that manage centralized services, we have been keeping our service and infrastructure configuration codified and source controlled for some time.

We run our services mostly on virtual machines provided by another part of the organization, and while the base infrastructure and configuration are provided to us, everything on top of this is up to us — these VMs are basically a blank canvas where we manage the server and service configuration necessary for our PaaS service catalogue. As our infrastructure needs grew, so did our codebase, and everything was kept together, in our configuration repository. Before we knew it, we had quite a large monorepo. Or, was it a monolith?

We started to feel the pains of a sprawling codebase, with service configuration code leaking between modules and a myriad of Ansible tags controlling the execution of small sections of code. Soon, running a full playbook for any of our services became too risky for anyone to feel comfortable with, forcing some of us to make changes manually on our servers just to be sure nothing else was affected. In some cases, our IaC codebase became something to update after a manual change had already been made. Disaster!

To gain the full benefits of using IaC, we needed to make a change: to take things to the next level and bring the way our infrastructure changes are realized into a true continuous delivery context.

In order to accomplish this, we focused on three main areas:

  1. Standardize our process for delivering change
    IaC is next to useless if it competes with manual changes being applied to infrastructure, so to increase our confidence in the change process, our test environments have been standardized against production and are completely ‘hands-off’. The only way for changes to be delivered is through our new IaC delivery pipelines, which themselves are generated from code, ensuring we have a standardized flow for every piece of our infrastructure.
  2. Introduce a high level of automation: everywhere
    We have been using Ansible for provisioning infrastructure change already, but we have now introduced automated syntax validation and containerized testing as part of the delivery process — before anything even reaches our test environments. Full idempotency of our IaC is also checked, so we know our code is safe to repeatably execute. Should anything fail in any stage of the process, targeted notifications are sent immediately.
  3. Increase our visibility and traceability of change
    Traceability through auto-generated release notes in Jira has been built into our pipelines, showing when and by whom code changes were made. Auto-linked-and-closed Jira stories keep our Kanban board up to date, and the full history of changes made to any environment is recorded in reports that are available as soon as provisioning completes.

But how did we get there? First, we needed to think about our delivery process: what should the flow look like from checked-in code to production release?

What we came up with was the delivery model we would use for the entirety of our infrastructure provisioning: each code push triggers a pipeline that clones the code repository, imports its dependencies, and validates the code syntactically using Ansible Lint. Next, a disposable test container representing an empty infrastructure server is spun up and provisioned using the assembled code. Assuming everything is fine up until this point, we can then run Ansible against any test environments we have specified in our pipeline configuration, visualize the changes in generated release notes, and finally, provision production:

The Anatomy of our IaC Pipelines
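
To give a feel for how these pipelines are driven, below is a rough, hypothetical sketch of the kind of per-repository pipeline definition that could feed such a generator; the exact format we use is not shown in this post, so the keys, names, and hostnames are illustrative only.

    # Hypothetical pipeline definition for one repository (illustrative only)
    project: gitlab-server                # repository this pipeline is generated for
    notify: "#platform-alerts"            # where failure notifications are sent
    stages:
      - lint                              # Ansible Lint syntax validation
      - container_test                    # provision a disposable Docker container, then re-run to check idempotency
      - test_environment                  # run the playbook against the prod-like test environment
      - release_notes                     # generate Jira release notes from the version diff
      - production                        # final provisioning
    test_environments:
      - test-gitlab-01.internal.example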

The Big Split: The Repository Restructure

Before we could really get started with the implementation, however, we needed to do something about our codebase. All of our infrastructure code lived together in a single Ansible monorepo — great for convenience, but not going to work for just about anything else we wanted to accomplish.

Since we had quite a lot of roles, splitting into one repository per role would potentially have created a lot of overhead, so we squashed or grouped roles where it made sense to do so and eliminated anything that was no longer being used. We ended up with a nicely pruned set of individual shared roles and larger services, each broken out into its own git repository, and we introduced dependency management to bring the two together.

These two ‘types’ of repositories would form the basis for our new strategy:

Shared Roles

These are not provisioned independently, but provide a supporting function to our larger services: monitoring and logging agents, common SSL certificates, and any other shared or common installs amongst multiple parts of our infrastructure.

Each time one of our service IaC pipelines is triggered, these shared roles are imported and tested together with the service, but each shared role also has its own pipeline and is tested separately. Nothing is actually provisioned anywhere permanent, but this setup allows us to speed up development and testing at quite a granular level and to automate validation and testing for each part of our infrastructure.

A Shared Role Pipeline in Jenkins — auto-triggered and tagged with a version, but no target host
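
As an illustration, a shared-role pipeline only needs a minimal test playbook to exercise the role inside the disposable container. The sketch below is an assumption about how such a playbook could look; the group name test_container and the role name monitoring_agent are placeholders, not taken from our actual repositories.

    ---
    # Hypothetical test playbook for a single shared role.
    # 'test_container' and 'monitoring_agent' are placeholder names.
    - name: Apply the shared role to the disposable test container
      hosts: test_container
      become: true
      roles:
        - role: monitoring_agent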

Ansible Projects

These repositories contain the playbooks that provision and manage one or many servers for our larger applications: Jenkins, GitLab, Nexus, etc. With these separated from the monorepo, we gain:

  • automated triggers for building and testing each role as changes are pushed to its codebase
  • the ability to test each role in isolation, in a controlled way
  • the ability to isolate failure notifications to a particular change in an individual role
  • significant improvements in change visualization
  • mitigation of the risk of hidden changes that are pushed but never tested

An Ansible Project Pipeline in Jenkins

Bringing It Together: Dependency Management

The repository restructure solves another problem for us as well — how to automatically trigger server provisioning pipelines when a dependency changes. The answer is that we don’t. Instead, we treat them like any other application dependency and version them. This is where Ansible Galaxy comes in — an excellent tool for managing remote or local Ansible dependencies.

When a shared role is successfully built and tested in a pipeline, we tag its repository with a semantic version. That tagged version can then be pulled in by any other shared role or Ansible project via Ansible Galaxy at the start of pipeline execution. In this way, every playbook still has access to all the dependent roles it needs, using versions that we know have been tested.
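
In practice this is just a standard Ansible Galaxy requirements file that pins each shared role to one of those tags. The example below is a sketch; the repository URLs, role names, and version numbers are invented:

    ---
    # roles/requirements.yml — hypothetical example; URLs, names, and versions are invented
    - name: monitoring_agent
      src: https://git.example.internal/iac/monitoring-agent.git
      scm: git
      version: "1.4.2"    # tag created when this role last passed its own pipeline

    - name: common_ssl_certs
      src: https://git.example.internal/iac/common-ssl-certs.git
      scm: git
      version: "2.0.1"

Running ansible-galaxy install -r roles/requirements.yml as the first step of a pipeline then pulls in exactly the versions that have already been tested.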

Any update to a dependent shared role will then require a bump of the required version in the relevant Ansible projects, which will automatically trigger their server IaC pipelines to execute. These server repos are also tagged with a version, both for consistency and to provide a handle from which to generate our release notes¹.

How dependency updates propagate through Ansible Projects

Walking Before You Run: Docker Testing

This is really where the rubber meets the road in our pipelining. We knew from experience that our playbooks tended to drift from fully describing our infrastructure to being tools for modifying something very isolated, via Ansible tags. We were losing the spirit of IaC.

To rectify this, our new pipelines start up a Docker container from a base image that is representative of our target servers², and we run our playbooks against it. This not only tests a fresh install of our shared role or service, but also lets us easily check the idempotency of our playbooks by running them a second time and ensuring nothing changes. This is essentially our continuous integration server, built from scratch, on every change.
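
As a rough sketch of what that container stage can look like, assuming the community.docker collection is available (the image name, container name, and group name are placeholders, not our actual setup):

    ---
    # Hypothetical 'spin up a disposable test server' play; names are placeholders.
    - name: Create a disposable test container
      hosts: localhost
      gather_facts: false
      tasks:
        - name: Start a systemd-enabled base image resembling our servers
          community.docker.docker_container:
            name: iac-test-target
            image: registry.example.internal/iac/base-test-image:latest
            command: /usr/sbin/init
            privileged: true
            state: started

        - name: Add the container to the in-memory inventory
          ansible.builtin.add_host:
            name: iac-test-target
            groups: test_container
            ansible_connection: docker

The service playbook is then run against the test_container group and run once more; a second pass that reports any changed tasks is enough to fail the stage.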


It took a little trial and error to implement a solution generic enough to be used out of the box by any playbook in our generated pipelines, but in the end, it was worth it: it adds a bit of time to the running pipeline, yet the confidence boost we gain to push the change further is huge, especially when running inbuilt integration tests with a test framework like goss or testinfra as part of the stage.
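
To give a flavour of those inbuilt checks, here is a minimal, hypothetical goss file; the service name, port, and file path are placeholders rather than anything from our actual services:

    # goss.yaml — hypothetical checks run inside the freshly provisioned container
    service:
      nexus:
        enabled: true
        running: true
    port:
      tcp:8081:
        listening: true
    file:
      /etc/nexus/nexus.properties:
        exists: true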

Taking The Leap: Test And Production

At this point, we know that our playbook works and provides a running service which passes any tests that we have defined. So, let’s go to production!

Well, not quite yet. Since the Docker test container is ephemeral, we still need a prod-like test environment to run any verification and manual exploratory testing that is needed. The test environment also fits nicely into supporting a fully integrated E2E test environment for our services (but that’s a whole other blog post).

Running our Ansible playbooks automatically against our test environment is the last test of our code before we reach production, and as such, this quality gate should be treated with similar care as we give to production. If our test environment is kept healthy and prod-like, it will never be a blocker for the pipeline, helping keep our codebase in a releasable state whenever we need it to be.


Before we get to production, however, we want to guarantee visibility and traceability of our changes, and we do that by auto-generating release notes in Jira. Any Jira IDs we use in our commit messages are auto-linked and closed once we reach production, keeping our Kanban board up to date with a full record of when each story was released. At this point, our production push should be a non-issue, and the most drama-free part of the pipeline, as it should be.

So, the end result of all of this? We almost immediately started seeing the benefits: markedly greater confidence in pushing changes to our infrastructure, enabling smaller, more incremental releases. We know that all code is being executed, every time the pipeline runs, so there are no more hidden issues waiting to spring on us. And perhaps most importantly, we are getting higher quality releases — delivered continuously.

Some Things We Noticed

It wasn’t all smooth sailing, mind you, so here are some other decisions we made and things we noticed along the way.

Docker doesn’t solve everything

Our services live behind a proxy and the servers are registered against our internal satellite. Our Docker test containers, however, are not, which leads to issues when packages we define in our playbooks are not available to both the test containers and the target hosts.

We got around this with a combination of some creative repository configuration and internally hosted packages, but it didn’t always feel like the most elegant solution.
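
One shape such a workaround can take is to give the test container its own repository configuration, conditioned on the connection type; the task below is only a sketch, and the repository name and URL are invented:

    # Hypothetical sketch: point the unregistered test container at an internal mirror.
    - name: Configure an internal package mirror inside the test container only
      ansible.builtin.yum_repository:
        name: internal-mirror
        description: Internally hosted packages for container testing
        baseurl: https://mirror.example.internal/rhel/$releasever/$basearch/
        gpgcheck: false
        enabled: true
      when: ansible_connection == 'docker'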

Disable anything that is incompatible with running in a container

With containerized testing as an integral part of these pipelines, we want to avoid failures or issues caused by the very nature of the platform. For example, consider how reliance on SELinux or proxies will behave in a containerized environment. Disable monitoring, log shipping, and any other service that sends data to an external system when it is not needed for testing³.
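
As footnote ³ mentions, we key this off the Ansible connection type. A minimal sketch of what that can look like, with the agent name purely as a placeholder:

    # Sketch: only start services that ship data off the host when we are
    # not running against the Docker test container.
    - name: Enable and start the log shipping agent
      ansible.builtin.service:
        name: filebeat        # placeholder agent name
        state: started
        enabled: true
      when: ansible_connection != 'docker'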

We stopped using Ansible tags

Why? Pipelines should produce a reliably repeatable delivery flow, and tags work counter to this goal. They are designed to run isolated parts of a playbook, which makes it possible for pushed changes to remain untested and hide issues until they pop up days, weeks, or months later. We discovered multiple examples of this while migrating our Ansible code, and it is especially prevalent when upgrading Ansible itself. It may add time to the pipeline, but what we lose in expediency, we more than make up for in security. Tags have their place, but a continuous delivery pipeline is not one of them.

A better way to control what a playbook changes is to use Ansible facts to isolate changes based on the target host's state, not user input. Design playbooks with this in mind: everything will run, all the time — and the playbook should be just as runnable for a fresh install as it is for managing already provisioned servers.
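
A small sketch of that idea, letting gathered facts rather than tags decide what runs (the migration script and service name are placeholders):

    # Sketch: host state, not user-supplied tags, decides what runs.
    - name: Gather service state from the target host
      ansible.builtin.service_facts:

    - name: Migrate configuration only where the legacy agent is still present
      ansible.builtin.command: /usr/local/bin/migrate-legacy-agent.sh   # placeholder script
      when: "'legacy-agent.service' in ansible_facts.services"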

Doesn’t Molecule already do something like this?

Yes, and Molecule is great. There were a few reasons we decided not to use it, but the biggest was probably the simplest. Instead of introducing something new for our team (and the developers this model could ultimately serve) to learn, we decided to keep things simple and build on the tool stack everyone was already using: plain Ansible and Docker⁴.

Perhaps in the future, we will reconsider, but for now, we are content with the choice.

¹ By recording the version deployed to production, we can use this to generate a git diff between it and the new versioned tag being deployed — a strong start for automating release notes.

² Massive inspiration was taken from Jeff Geerling’s test container, which we modified to suit our needs.

³ We solve this by detecting the Ansible connection type from inside the playbook and making exceptions based on it.

⁴ We still have the ability to use the same test container we use in the pipeline for local testing, without Molecule having to be installed on the many different development environments being used.
