Cattle — not Pets: The Automation Journey with our Build System Infrastructure

Te-jé Rodgers
Published in Parkside · Feb 4, 2019

Don’t treat your servers like pets; treat them like cattle instead.

I was sitting in the audience at a #DockerconEU talk last year when I heard that. The speaker explained that we shouldn’t coddle our servers and systems as we would with pets. Instead, when something goes wrong with one of them, we should simply replace it with another one in good condition as a farmer would with cattle. Nursing failing systems back to health wastes valuable time, and more than that, it’s time wasted anew the next time they fail. Virtualisation and containerisation technology allow us to dispose of sick systems and substitute a clone of a healthy one in a matter of seconds. It’s a pretty common practice these days, for good reason.

It makes all the sense in the world.

Yet when I heard it that day, something clicked: the build machines and servers that I was responsible for were coddled, and I was often wasting valuable time diagnosing and treating their ailments. I left with a determination to change things; for my last few weeks at Parkside, I was going to switch our continuous delivery infrastructure from pets to cattle.

I switched to the DevOps team within Parkside less than a year ago, at a time when our company’s accelerating growth made the need for smooth, automated builds clear. I took over responsibility for our build system, which is built on Buildkite’s platform. The cool thing about Buildkite is that we can bring our own machines, which we host ourselves as virtual machines in our local datacenter.

Buildkite lets you Bring Your Own Computers.

The downside to this, of course, is that we have to bring our own machines. In other words, the responsibility for maintaining the build system infrastructure falls to us. We believe this is a small price to pay for the ability to customise our build machines precisely to our needs, but it means that we have to be able to address issues with our build machines when they arise and provision new ones as our needs grow.

The only problem was that somebody was doing all of that manually: me.

I pitched my intention to automate creating our builders to my team lead. At the very least, I wanted a command that I could type that would spin up a new virtual machine, install all of our build tools on it, and register it with Buildkite — a process that took me about twenty minutes to do manually at the time. He responded with enthusiastic approval.

And so, I set out on my automation journey.

Initial Ideas and First Steps

I knew from the start that vSphere (the software that we use to manage our datacenter) had an API that we could use to create and destroy virtual machines from templates that we created and stored in the datacenter. This gave me some fanciful ideas of what could be accomplished; for instance, I could create a script that would start from a virtual machine template and customise it using different rulesets, each for a different purpose. The rulesets would just be a subset of the Dockerfile syntax:

FROM vsphere/vcenter.orbit:centos-7.6-docker-18.09
USER --ssh-key="..." ci-provisioner
ADD ...
RUN ...
CMD [ "sudo", "systemctl", "start", "buildkite-agent" ]

The idea almost seemed elegant. Dockerfile automated container image builds; Herderfile (yes, I named it) would just push the concept up to the next layer and automate virtual machine image builds. I could use this not only to customise a generic virtual machine into a build machine, but for any other type of virtual machine that we would want to automate in the future. It made perfect sense.

[…] what I was trying to accomplish was so obvious that there must have been some tools already out in the wild capable of doing it.

Not long after, I decided that not only was writing this kind of build system an enormous undertaking for a single developer, but what I was trying to accomplish was so obvious that there must have been some tools already out in the wild capable of doing it.

In fact, after a brief conversation about what I was trying to accomplish, one of my colleagues sent me a link to Ansible’s vSphere module documentation. In case you don’t know (I sure didn’t at the time), Ansible is a tool that lets you orchestrate remote machines using a pre-written set of actions called a playbook. Lots of people use it, and for good reason; it’s powerful and lets you accomplish quite a lot in minute detail.

There was only one small snag for me. Ansible needs an inventory of machines to work from; it applies the commands from the playbook to every machine in the inventory. The problem was that my application needed to modify the inventory itself, i.e. I wanted to add new machines. This isn’t really a dead-end per se; Ansible has dynamic inventories, but they require writing a plugin. I saw two problems with that:

  • I wanted all of my configuration and scripts to reside in a single repository. Ansible’s plugin discovery architecture made this a bit more difficult than I cared for.
  • Laziness being a quality of every decent programmer, I wondered if there was an easier way.

And there was.

Terraform

I found Terraform after looking for Ansible alternatives. It should be said that Terraform’s goal is rather different from what Ansible sets out to accomplish. While Ansible focuses on orchestration, Terraform’s concern is infrastructure management: you tell it what you need, and it adds or subtracts units within your infrastructure to make it happen.

Once again, something clicked; instead of approaching the problem as creating a new build machine, why shouldn’t I approach it as describing what a build machine is and how many I need? This shift in thinking was the spark I needed to finish the project.

Now, my goal was to run a single command to update our build systems to match our desired configuration. With the tools that Terraform provided, I knew that it was entirely possible.

How it works

In Terraform, you describe resources. Resources are the units that make up your infrastructure — think of things like virtual machines, network addresses, etc. Sometimes, you may assign a count to a resource. Then, when you run terraform apply from the command line, Terraform compares your configuration’s resources and counts to its record of your infrastructure. Should they differ, it creates or destroys resources until your infrastructure matches the configuration files. Finally, it records the current state of your infrastructure.

It also allows you to assign provisioners to resources. These are actions that are run whenever a resource is created. For instance, we could use provisioners to install packages on new virtual machine resources.
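To make that concrete, here is a minimal sketch of a resource with a provisioner attached. It is not our actual configuration: the names, sizes, and install commands are illustrative, the cloning and hardware settings (and the data sources they would reference) are omitted, and the syntax follows the pre-0.12 Terraform style that was current at the time.

# Illustrative sketch only; not our real configuration.
resource "vsphere_virtual_machine" "builder" {
  count = 2                                   # Terraform keeps exactly this many
  name  = "buildkite-builder-${count.index}"

  # ... cloning, CPU, memory, disk and network settings omitted ...

  # Provisioners run once, right after the machine is created.
  provisioner "remote-exec" {
    inline = [
      "sudo yum install -y buildkite-agent",
      "sudo systemctl enable buildkite-agent",
      "sudo systemctl start buildkite-agent"
    ]
  }
}

Running terraform apply with this in place creates the two machines and runs the provisioner on each of them; lower the count, and the next apply destroys the extras.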

For our setup, I described the build machines as a resource and gave it a dynamic count based on the length of a list variable containing metadata for Buildkite. It looks like this:

variable "buildkite-tags" {
default = [
"group=avengers,agent=tchalla,queue=builder",
"group=avengers,agent=fury,queue=builder",
"group=avengers,agent=stark,queue=builder",
"group=avengers,agent=romanova,queue=builder",
"group=spider,agent=hyde,queue=builder",
"group=spider,agent=jekyll,queue=builder"
"group=spider,agent=vanjee,queue=builder"
]
}

This way, should someone want to add a new build machine, they only need to add its metadata to the list and run terraform apply again. This time, Terraform would recognize that the count of our builder resource has changed and create an additional build machine. Similarly, they could remove a build machine by removing its metadata from this list.
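Wired up to the builder resource, that looks roughly like the sketch below. How the tag string actually reaches the agent is an assumption for illustration (here it is appended to the agent’s configuration file); the point is that the count and the per-machine metadata both come from the same list.

resource "vsphere_virtual_machine" "builder" {
  # One builder per entry in the list; editing the list changes the
  # count on the next terraform apply.
  count = "${length(var.buildkite-tags)}"
  name  = "buildkite-builder-${count.index}"

  # ... cloning and hardware settings as before ...

  provisioner "remote-exec" {
    inline = [
      # Hand this machine's entry from the list to the Buildkite agent.
      "echo 'tags=\"${element(var.buildkite-tags, count.index)}\"' | sudo tee -a /etc/buildkite-agent/buildkite-agent.cfg",
      "sudo systemctl restart buildkite-agent"
    ]
  }
}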

I believe this approach is a firm foundation. However, I did encounter some problems which I needed to work around.

Command elevation with sudo

Sometimes, in a provisioner, I needed to run a command as the superuser. Normally, this can be accomplished by running sudo cmd args. However, whenever sudo prompted for a password inside a provisioner, Terraform would not relay the prompt back to my terminal, and the provisioner would simply hang until Terraform timed out and gave up. In any case, when running an automation script, I felt that the less input I was required to provide, the better.

I solved this by creating a generic virtual machine template in vSphere, within which I placed the public key for a special terraform user. This user is configured to log in with the matching private key and to run elevated sudo commands without a password prompt. For added security, the template also rejects remote logins that use a password, requiring that anything connecting knows the private key.
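On the Terraform side, that key pair is what the provisioners connect with. The sketch below shows the shape of it; the key path and host attribute are assumptions, and the connection block lives inside the builder resource (or an individual provisioner) rather than standing on its own.

# Inside the builder resource: how provisioners reach the new machine.
connection {
  type        = "ssh"
  user        = "terraform"
  private_key = "${file("${path.module}/keys/terraform_id_rsa")}"
  host        = "${self.default_ip_address}"
}

# On the template itself, the matching pieces look something like:
#   /etc/sudoers.d/terraform   terraform ALL=(ALL) NOPASSWD:ALL
#   /etc/ssh/sshd_config       PasswordAuthentication no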

Why not just make a virtual machine template that’s already customised for Buildkite? That would surely eliminate the need to run provisioners in the first place. However, doing so would hide a lot of information; by looking at the provisioners, everyone can see just what steps are taken to transform a fresh install into a build machine. More importantly, lazy as I am, I cringed at the thought of having to create a new template every time we decided to enhance our build machines.

Monitoring our Build Machines


We use a Prometheus instance to periodically collect metrics from all of our machines and store them. In order to do that, Prometheus needs to be configured to know the IP addresses of all of our machines so that it can poll the metrics endpoint that they expose.

I toyed with the idea of automatically updating the Prometheus configuration by inserting the IP addresses of new build machines. However, detecting which machines were new and which were old seemed nearly, if not completely, impossible within Terraform’s declarative data model. To put it another way, the data model makes it easy to check the current state of things, but not how things changed to get there.

It occurred to me that the correct way to approach this would be to declare the Prometheus configuration in full rather than trying to mutate it. Yet the idea of having to template the full Prometheus configuration just to update the section about the build machines made me feel icky. Prometheus doesn’t support include-files in its configuration either, so that killed any hope I had of a quick workaround.

Luckily, Prometheus supports federation. I could have multiple Prometheus instances running, each collecting different metrics, and set the main one to copy the metrics from all the others. So I created a separate Prometheus instance for the build machines, whose configuration is regenerated in full whenever Terraform is applied.
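One way that regeneration can be expressed is in Terraform itself, using the template and local providers. The template path, output location, and the idea of passing the builder IPs in as a variable are assumptions for illustration rather than a description of our exact files.

# Illustrative: render the scrape configuration for the builder-only
# Prometheus instance from whatever builders currently exist.
data "template_file" "builder_prometheus" {
  template = "${file("${path.module}/templates/prometheus-builders.yml.tpl")}"

  vars {
    # JSON-encoded list of builder IPs for the template to iterate over.
    builder_targets = "${jsonencode(vsphere_virtual_machine.builder.*.default_ip_address)}"
  }
}

resource "local_file" "builder_prometheus" {
  content  = "${data.template_file.builder_prometheus.rendered}"
  filename = "${path.module}/rendered/prometheus-builders.yml"
}

The main Prometheus instance then only needs a static scrape job pointed at the builder instance’s /federate endpoint to pull those metrics back into the global view.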

The Terraform way of describing resources and counts turned out to be a very natural way of describing our build system infrastructure. It’s fast too; the process of creating a builder, which took me twenty minutes to perform manually, now gets done in just under five minutes.

What’s more, we can now very quickly replace sick builders using only Terraform, bringing the “cattle, not pets” mantra fully into use. For example, we could replace a sick builder with only the following commands:

terraform taint -module=builders vsphere_virtual_machine.web.4
terraform apply

We are pleased with the configuration that we came up with and are already thinking of other ways in which Terraform can help us automate other parts of our infrastructure.

For instance, I described our Rancher 2.0 server deployment using Terraform, while a colleague considered using it to automate deploying batch files to production servers.

Over the Christmas break, I decommissioned all of our old build machines, then created new ones by running only terraform apply on my terminal.

It was the most satisfied I’d felt in a very long time.
