Five things learned using terraform to manage cloud infrastructure
HashiCorp’s terraform is a powerful and extensible tool for defining and creating cloud infrastructure in a repeatable way. At Olark we use it to manage a number of different environments on Google Cloud Platform. On the journey from imperative to declarative infrastructure we’ve learned a few things. Here are five that I feel are particularly important. What follows are entirely my own opinions.
Use remote state
Terraform works by processing a set of declarations in manifest files to compute the desired state of your environment, which it records as JSON in a state file with the .tfstate extension. It then generates a plan to drive the actual state of your cloud resources toward the desired state. Both the manifests and the resulting state persist between runs of the tool, so at any given time, if no changes are intended, running the tool against your manifests should produce no difference in the state file and no plan to alter resources. By default terraform stores the .tfstate file locally in your project directory. This behavior is fine for experimenting, and might serve for a single engineer working on a project alone, but as soon as you have multiple team members working in an environment, local state just doesn’t cut it anymore.
Most problems that can occur fall under the general category of race conditions. Engineer A adds a thing and runs terraform; the thing is created and the local state updated. Then Engineer B comes along to do something else and removes the thing, because her local state says it shouldn’t be there. One way to solve this would be to manage the state file in source control: if the latest state is always pulled before terraform is run, and the results are always committed afterward, you could sort of make this work. The problem is the “ifs”. An inside joke in our group is that terraform can be “terrafying”. I’ve personally made a mistake in a .tf manifest and had to sit numbly watching as terraform removed over 100 VMs that we were prepping for a new environment. This is not pleasant.
A better way is to configure terraform to store the state file in a central remote location. Several storage backends are supported, and we’ve chosen to use a Google Cloud Storage bucket. Once you have configured this by adding the appropriate declarations to your main manifest, everyone working off of that file will be reading and writing the same state, and most of the potential race conditions are effectively prevented. One thing that remote state does not solve, however, is contention: engineers can still try to do conflicting things to the same resource at the same time. One solution to that is locking, which is supported on backends like etcd and consul, but not on GCS. We could adopt a locking backend, but for now we’ve chosen instead to adopt a simple rule: nobody writes to state from their local machine. That brings me to…
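As a sketch, pointing terraform at a GCS bucket looks something like this (the bucket name and prefix are placeholders, and the exact syntax depends on your terraform version):

```hcl
# main.tf -- store state centrally in a GCS bucket instead of on local disk
terraform {
  backend "gcs" {
    bucket = "example-terraform-state"  # hypothetical bucket name
    prefix = "production"               # one prefix per environment
  }
}
```

After changing the backend configuration you re-run `terraform init` so the tool can migrate or re-read the state from the new location.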
Automate your terraform runs
The most important characteristic of a terraform toolchain is predictability: you have to know what the results of a given operation will be before you apply it to your environment. Terraform’s default operating mode encourages this. When you run the tool it produces a human-readable plan without making any actual changes. If the plan is acceptable, you run the tool again with the apply command to update resources. If the changes succeed, the remote state file is updated and the next run of the tool will see the new state as the baseline. That’s all fine, but in practice, if you’re running the tool locally and writing to the remote state, you still run the risk of resource contention, and I can’t even pretend that we’ve exercised all the possible edge cases for overlapping changes.
To avoid all that we never run the tool locally. Instead we treat the manifest file like all the rest of the source code we manage: it is stored in a git repo and we run our builds automatically when changes are committed. Our git repos are self-hosted in gitlab, and gitlab pipelines are a nice abstraction that let us run our terraform builds in a way that is consistent with the protocol outlined above. We use two-stage pipelines to process commits to manifests. The first stage happens automatically on commit and runs the tool against the current manifest, producing a plan. If that stage succeeds then the plan output can be reviewed, and the second stage can be initiated manually by clicking a deploy button in the gitlab UI. If the commit was to any branch other than master then only the staging deployment can be run.
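A minimal sketch of that kind of two-stage pipeline in `.gitlab-ci.yml` might look like this (stage names and details are illustrative, not our actual config):

```yaml
stages:
  - plan
  - deploy

plan:
  stage: plan
  script:
    - terraform init
    - terraform plan

deploy:
  stage: deploy
  script:
    - terraform apply
  when: manual   # the deploy button in the gitlab UI
  only:
    - master     # restrict production applies to the master branch
```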
The deploy step applies the changes in the plan and updates the remote state file. In this way we get a lot of control over the execution of changes using terraform, and we have been able to avoid all of the issues mentioned above. Obviously this involves more overhead for individual engineers, but from my perspective as a guy who has accidentally deleted over 100 VMs, that feeling of security in knowing what will happen when I punch the button is more than worth the additional tooling and process around deployments. This tooling has been working very well for us, but there is one little hole that it doesn’t close: what happens when something changes between the plan and apply stages? Avoiding that issue is the subject of the next tip.
Serialize the plan before applying
Ok, so you set all this up and it’s very cool. You make some changes to your manifest and commit, review the plan, and everything looks good. Just then you get a ping in slack and get threadsucked into a fifteen-minute side thing. You come back, glance over the plan again, hit the deploy button, and arghhh, wall of red text. What the hell happened? What might have happened is that while you were dealing with the side thing, someone else executed a terraform operation that changed the state of the resources, and now your changes can no longer be applied as you wrote them. There are as many possible reasons for errors like this as there are constraints to be violated in your cloud environment.
You can’t avoid your changes failing in this case: if the resources have been changed, they’ve been changed. But you can find out about the issue without spamming operations at the environment to see if they work. The way to do this is to include the -out argument when you run terraform in your first build step. This causes the tool to serialize the current state, as well as the planned changes, to an output file. Your second stage then executes this serialized plan rather than reprocessing your manifests to produce a new one. If the state of the environment has changed in a way that invalidates the plan, terraform will fail in a meaningful way without attempting to make changes in the environment.
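In CLI terms the two stages look something like this (the plan filename is arbitrary):

```shell
# Stage 1: serialize the plan (and the state it was computed against)
terraform plan -out=plan.tfplan

# Stage 2: apply exactly that plan; if the environment has drifted since
# the plan was written, terraform errors out instead of improvising
terraform apply plan.tfplan
```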
Use modules
The default organization for a terraform project is a single directory containing a manifest file that equates to some set of resources to be managed. There isn’t necessarily a strong mapping from that single directory to a single cloud environment: you could have multiple terraform projects managing different sets of resources in the same Google Cloud project, for example. To keep things straightforward, we maintain one project for each of our GCP environments: production, staging, etc. We could populate each of these environments with a long list of resources in the main manifest file, but there are a couple of reasons not to do this.
The first is the amount of repetitive boilerplate stuff you can end up with. If you need to do common things to every vm you provision, maybe installing a startup script into metadata, or altering the default access permissions, then you’ll need to repeat these things in every manifest declaration that creates that kind of resource. Another issue can arise when you need a staging or test environment that replicates the architecture in production, but at a smaller scale: now you’ll find yourself copying and pasting large chunks of stuff from one manifest to another and then editing the properties to get what you want. There’s lots of room for error.
Fortunately terraform allows you to decompose an environment into modular sets of resources that can be created and managed as a group. Modules can take input variables with defaults, and they can produce output variables that can be consumed by other modules or in the main manifest. Modules can also depend on other modules, which has allowed us to create a low-level module that defines a base VM type with all startup scripts and provisioning steps captured, and then to use this base module as a building block for higher-level modules that define services, such as a group of web servers. By defining variables to control parameters such as the number of servers of a type and their CPU, disk, and RAM, you can easily reuse the modules in different environments with varied resource requirements.
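As a hypothetical sketch, a base VM module and its reuse at staging scale might look like this (module paths, variable names, and values are all invented for illustration):

```hcl
# modules/base-vm/variables.tf -- the low-level building block,
# with startup scripts and provisioning captured inside the module
variable "name_prefix"    {}
variable "machine_type"   { default = "n1-standard-1" }
variable "instance_count" { default = 1 }

# environments/staging/main.tf -- same module, smaller scale
module "web_servers" {
  source         = "../../modules/base-vm"
  name_prefix    = "web"
  machine_type   = "n1-standard-1"
  instance_count = 2   # production might set 20 here
}
```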
One last thought about module input variables: if you’re going to give them defaults, give some thought to setting those defaults appropriately. It’s not fun to deploy a few dozen servers and realize that the default access permissions won’t work, or that the default disk size is too small (or maybe worse, too large and costly). For variables that control resources like disk size and core count, it is probably better to set the defaults rather low and require workloads whose requirements exceed what is available to make explicit overrides at the point of declaration.
Keep related things together
Modules are great for creating reusable components that you can deploy into different environments but as with any structured syntax you have to make good decisions about how you compose your solution. My advice is to keep things that serve a common goal in the same module. To give a concrete example consider the list of resources that would need to be created to deploy a publicly accessible web service on Google Cloud: some instances, an instance group, a health check, a backend service, a global IP, a url map, http and https proxies, two global forwarding rules, and at least one firewall rule.
You could have a “firewall rules” module and a “load balancers” module, and these might seem like reasonable lines along which to slice things: if you have a problem with a firewall rule, go look at the firewall rules module! But in practice, when actual reports of issues come in, or when there is a request for a change, it is probably going to present in the context of a specific service. Maybe an integration with a new partner requires the web service to accept a hook on a non-standard port, or a new virtual host or path needs to be supported. In my experience it’s a lot easier to locate and modify the resources affected by requests like these when all the things necessary to provide a given service are colocated in the same terraform module.
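To make that concrete, here is a hypothetical fragment of such a service module, with the firewall rule living next to the rest of the serving path (names and values are illustrative):

```hcl
# modules/web-frontend/main.tf -- the whole serving path for one service,
# firewall rule included, lives together in one module

resource "google_compute_global_address" "web" {
  name = "web-ip"
}

resource "google_compute_firewall" "web" {
  name    = "allow-web"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["80", "443"]
  }
}

# ...the instance group, health check, backend service, url map, proxies,
# and forwarding rules follow in the same module
```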