Photo by Pok Rie: https://www.pexels.com/photo/seaport-during-daytime-132037/

Terraform: 10 tips to retain your sanity

Olivier Pichon
Published in dzangolab
Apr 13, 2022


Devops is hard. Despite Terraform being an excellent product, it still has a steep learning curve and can induce a lot of monitor-thumping frustration. This is made worse by the fact that I am a software developer, not a full-time devops person, so I only do devops as and when needed.

While not claiming to be an expert on Terraform, I’ve accumulated several years of experience using it. Here are the rules that I go by in order to get stuff done and get the most out of the product. I hope that they will help you get started faster, or help you avoid some mistakes I made.

Know why you are doing this

Despite their claims to be simple, devops tools are complex to learn and use. Manually setting up a server is always going to be faster than using Terraform. You can set up a decent infrastructure on DigitalOcean with a few clicks. So you need to have a good reason to do devops. In my case, the reasons are as follows:

  • Reproducibility: Every time I run the same script, I expect the same outcome.
  • Speed: I need to feel confident I can reinstall an infrastructure stack in less than 1 hour. I want to make it so that in case a stack crashes, it’s not even worth my time to try and figure out a fix. I can just reinstall a new stack. (Then I can try and figure out what went wrong).
  • Documentation via code: The whole concept of IaC means that the code is the documentation for my infrastructure (or most of it).

Document what you are doing

That’s always a good thing to do, of course, but if, like me, you can go weeks or even months between bouts of devops work, you want to have everything documented.

The order of the various steps matters. This is not always easy to record in the code alone, as folders and files are listed alphabetically (using a step number as a prefix to file and folder names turns out to be cumbersome: if you have to change the order, you may have to rename a lot of files or folders).

In addition, there are often a few steps that don’t get automated. Sometimes, making a small change to a resource would require destroying and reprovisioning it (eg adding swap to a server, or increasing the root volume size), and that effort may be too burdensome, or I don’t want to go through the reinstallation part (ie all the stuff that happens after the servers are provisioned). Or I simply don’t know how to do a particular operation, or the corresponding module is not available, and I don’t have the time to learn it or get it done. I’d rather get the infrastructure set up and running, and add a note to the docs, than delay everything because of a dogmatic belief in IaC.

I normally keep a simple README file in each relevant directory (at stack level). It includes a section on Provisioning and another on Destroying the infrastructure.

Use a remote backend

Terraform has 2 very powerful features. First, it maintains the state of your infrastructure, so that you can run your Terraform scripts repeatedly: if nothing needs to change, nothing will happen. Second, Terraform keeps locks, so that different users can’t both modify the infrastructure at the same time.

Both features, however, are really only fully available if you use a remote backend.

Terraform does support a local backend, which is just a file inside your terraform directory. This can then be version-controlled.

However, if you are not careful, this can expose sensitive values (passwords, API keys, etc.), which would then be visible to anyone with access to your repo. In addition, you have no lock to protect you against another person trying to make changes at the same time.

Even if the actual provisioning by 2 people at the same time went well, if the second person pushed their updated state before you, you could end up having to manually resolve conflicts within the state file. That is not a happy prospect.

So: Use a remote backend.

Terraform remote backend: the chicken and egg problem

A popular remote backend provider is AWS: an S3 bucket for the state, and a DynamoDB table for the locks.
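To make this concrete, here is a minimal sketch of such a backend configuration, assuming the bucket and DynamoDB table already exist (the bucket name, table name, region and key are placeholders to replace with your own):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"                # hypothetical bucket holding the state files
    key            = "staging/droplet/terraform.tfstate" # one key (path) per stack/resource
    region         = "ap-southeast-1"
    dynamodb_table = "terraform-locks"                   # hypothetical table used for state locking
    encrypt        = true
  }
}
```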

It is of course tempting to provision these same resources with Terraform. But clearly they can’t have their state/locks stored in the resources they are tasked to create. Previously, I would simply create these resources by hand.

Thankfully, there is a helpful module to take care of this: https://github.com/stavxyz/terraform-aws-backend.

Use a repo; 1 folder per stack

Since Terraform is an IaC tool (infrastructure as code), it seems pretty obvious that your devops code should be in a repo. Ideally, that repo should not contain any sensitive information (see section above about using a remote backend). Of course, just publishing your stack is a security risk, so the repo should not be public. But there are benefits to making it widely shared among your team. We believe that full-stack devs should also do devops.

We use a repo named infrastructure for Dzango’s own infrastructure. For customers, I usually create a devops repo, with a folder named infrastructure for the IaC code. Other folders would hold Terraform modules, Ansible roles, ssh keys, etc. For Dzango, we have separate repos for these because we have a lot of them, which we make public.

I like to use a single repo for all our infrastructure code (or a folder in a single repo for customers), because it’s simpler. All resources are close at hand.

Within that repo or folder, I like to have 1 folder per infrastructure stack, named after the stack: for example, our production stack has its own folder named production; the staging stack has its own staging folder. Our ELK stack is shared by both production and staging, so we have a separate elk folder for it.
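As an illustration, the top of such a repo might look like this (the folder names are just the ones mentioned above):

```
infrastructure/
├── production/
├── staging/
└── elk/
```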

You could of course keep 1 repo per stack, but I find that there is a lot of copy/pasting involved (eg after defining your staging stack, you can easily copy/paste it to the production folder) and it’s just easier with everything in one repo.

Keep your resources separate: Handling different resource lifecycles

Warning: This is a very controversial tip! Terraform purists may disagree.

Apparently, the preferred way to use Terraform is to define all your resources in a single configuration. The state will be saved to a single file. I have several reasons to disagree with this approach.

A somewhat weak reason is historical. Back at the time of Terraform 0.11, the state file was brittle, so if anything went wrong for 1 resource, the whole state file would get corrupted and become unusable. Also, because Terraform inspects your existing infrastructure before doing anything, the larger the stack, the longer it takes. I hear from YouTube videos and blogs that since Terraform 0.12 (the current release is 0.15) the brittleness of the state file has been resolved, and performance has been improved.

The main reason, however, is an aspect that I have never seen discussed in the Terraform literature: how do you handle resources that have very different lifecycles? Let’s assume you want to provision a DigitalOcean droplet with a floating IP address. Over the course of the coming years, you may destroy and reprovision the droplet several times, either because it crashed, or to upgrade it to a new release of your chosen distribution, or to increase its size, etc. But during the same period you would not change the floating IP address. This IP address is likely set up in your DNS records for your apps and websites. You really don’t want to have to modify this. For one thing, if you destroy and reprovision a DO floating IP, you have no guarantee you would get the same one; so you would have to update your DNS records, with a potential interruption of service.

You could of course use Terraform to provision your DNS records, but I’ve never felt the benefits justified the extra effort.

Admittedly, Terraform manages the state of your infrastructure, so if a resource’s specs have not changed, Terraform will not do anything to it. If the only change to your config is related to your DO droplet, then Terraform will only make changes to that droplet. (Depending on the change you want to make, it may have to destroy and reprovision the droplet; if it can make the change without destroying the droplet, it will do so.) So in theory it should be OK to define everything in a single config/state.

Also, to be fair, Terraform does allow you to define various parts of your infrastructure in separate files. All the .tf files in a directory will be read by Terraform. So you can keep different files for different resources, and only modify the file related to the resource you want to update.

This is all well and good, but I still feel very uncomfortable with this approach. I very much prefer to keep resources separate, so that when I work on one, I simply have zero risk of damaging another.

To that end, I keep each resource under a separate directory. Under staging, for example, I have the following folders (here using DigitalOcean); a sketch of how the droplet and floating-ip folders relate follows the list:

  • droplet
  • firewall-db
  • firewall-public
  • floating-ip
  • project
  • tags
  • volumes
  • vpc
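To make the lifecycle split concrete, here is a minimal sketch (resource names, variables and placeholder values are illustrative, not our actual code). The floating-ip folder only ever defines the long-lived IP:

```hcl
# floating-ip/main.tf: long-lived, almost never touched (provider config omitted)
resource "digitalocean_floating_ip" "main" {
  region = "sgp1" # placeholder region
}

output "ip_address" {
  value = digitalocean_floating_ip.main.ip_address
}
```

while the droplet folder defines the short-lived droplet, plus the assignment that binds the two, since the assignment shares the droplet’s lifecycle:

```hcl
# droplet/main.tf: destroyed and reprovisioned as needed (provider config omitted)
variable "floating_ip_address" {} # fed from the floating-ip state, eg via Terragrunt inputs

resource "digitalocean_droplet" "main" {
  name   = "staging-1"        # placeholder values
  image  = "ubuntu-22-04-x64"
  size   = "s-1vcpu-1gb"
  region = "sgp1"
}

resource "digitalocean_floating_ip_assignment" "main" {
  ip_address = var.floating_ip_address
  droplet_id = digitalocean_droplet.main.id
}
```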

Use Terragrunt

Unfortunately, using 1 directory per resource with Terraform is cumbersome, with a lot of the configuration settings having to be duplicated. Enter Terragrunt, a “thin wrapper that provides extra tools for keeping your configurations DRY, working with multiple Terraform modules, and managing remote state”.

Terragrunt allows you to share config settings and variables across sub-directories, thereby making this “1 directory per resource” setup feasible. Terragrunt also allows you to define your backend (state and locks) once.

I like to use a single AWS S3 bucket, with folders for each stack, and sub-folders for each resource. The S3 bucket structure thereby mirrors the directory structure of your devops/infrastructure repo.
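As a sketch of what this looks like in practice (the bucket, table, region and paths are placeholder values), a single terragrunt.hcl at the root of the stack defines the backend once:

```hcl
# staging/terragrunt.hcl: the backend is defined once per stack
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "my-terraform-state" # hypothetical bucket
    key            = "staging/${path_relative_to_include()}/terraform.tfstate"
    region         = "ap-southeast-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

and each resource folder just includes it:

```hcl
# staging/droplet/terragrunt.hcl
include {
  path = find_in_parent_folders()
}
```

Because path_relative_to_include() resolves to the resource folder’s path, each resource automatically gets its own state file under the stack’s folder in the bucket.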

Use Terraform modules

Terraform introduced modules in an earlier release, and these provide a great way to group together related resources. With one command, you can provision (or destroy) several related resources.

Again, defining your whole infrastructure in a single module makes no sense to me. But because of the granularity of the resources, some definitely belong together.

Say you want to provision an AWS EC2 instance with an EBS volume and an EIP. Because the EBS volume and the EIP are much longer-lived than the EC2 instance, I would provision these 2 separately. But provisioning the EC2 instance requires various other resources, such as an “ebs association” resource and an “eip association” resource. These hold the link between your instance and the EBS volume and EIP, so they definitely belong with the instance. This is where I would use a module that allows me to provision the EC2 instance together with these associated resources. The actual EBS and EIP ids would be passed as input variables to this module at runtime.
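Here is a minimal sketch of what such a module might look like (the module layout, variable names and values are hypothetical, not one of our published modules):

```hcl
# modules/ec2-instance/main.tf
variable "ami_id" {}
variable "instance_type" { default = "t3.micro" }
variable "subnet_id" {}
variable "eip_allocation_id" {} # the long-lived EIP, provisioned separately
variable "ebs_volume_id" {}     # the long-lived EBS volume, provisioned separately

resource "aws_instance" "this" {
  ami           = var.ami_id
  instance_type = var.instance_type
  subnet_id     = var.subnet_id
}

# the associations share the instance's lifecycle, so they belong in the module
resource "aws_eip_association" "this" {
  instance_id   = aws_instance.this.id
  allocation_id = var.eip_allocation_id
}

resource "aws_volume_attachment" "this" {
  device_name = "/dev/sdf"
  volume_id   = var.ebs_volume_id
  instance_id = aws_instance.this.id
}
```

A stack would then call this module, passing in the ids of the EIP and EBS volume it provisioned elsewhere.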

Modules are also a great way to keep your IaC code DRY. The Terraform Registry is of course the first place to look for modules. When we can’t find an appropriate module, we write our own, but always with a view towards reusability in different stacks. We publish all our Terraform modules, and while we provide them “as is” with no implied or explicit warranty, we do use them in production for Dzango and our customers.

So far, Dzango has published 2 libraries of Terraform modules.

If you can’t find a module that suits your requirements, they are easy enough to create. Most likely you will end up building your own library of modules, which you can reuse across your various infrastructure stacks.

Test

Terraform provides a syntax to write automated tests for modules. Of course, writing tests for your modules would be the best practice. But, if devops is not your full-time job, and you do not plan to turn your modules into a commercial product, you may skip these automated tests (we unashamedly do).

However, you should not skimp on manually testing your IaC code.

It’s somewhat like backups: backing up is easy; what matters is being able to restore from a backup.

What you really want here is to be able to run your IaC from scratch at any time, in case you need it (eg your stack is down and you are under pressure to bring it back online).

To achieve that level of confidence, I always do at least 1 destroy/provision cycle for each resource. Once a stack is ready, I like to do a complete destroy/provision cycle for the whole stack. I use that exercise to check my documentation: are all the steps recorded?

Copy/paste

Your different stacks are going to share a lot of common elements. Most likely, you will be using a single cloud provider; most likely, you will always use the same Linux distribution for your servers. Your staging stack is going to look very much like your production stack (that’s the whole point of staging!).

You can, and you should, copy/paste from one stack to another. Typically, you’ll build your staging stack first, and once it works, copy/paste its code to the production folder and make the necessary adjustments. If you later make changes to one stack, you’ll likely want to copy/paste the same changes to the other.

Other stacks (GitLab runners, ELK, etc.) may be different, but they will still share a large common base. Once you have something working, re-use it, and make incremental changes to achieve the result you want. This cuts down the learning curve quite a lot.

Terraform is a fantastic tool which provides great value. Devops is complex, and even a tool like Terraform cannot eliminate all of that complexity. I hope these tips will work for you as well as they do for me, or at least encourage you to think about your own best practices, so that you can enjoy the benefits of devops without too many of its headaches.

Olivier Pichon
dzangolab

Tech entrepreneur. CTO at dzango tech accelerator.