Terraform in Real Life: Lessons Learned

Shane Mitchell
Published in Version 1
Sep 21, 2021

As an AWS Consultant with Version 1, I rely heavily on Infrastructure as Code in my role. As we are a multi-cloud partner, our IaC tool of choice is Terraform. In addition to supporting many providers (including all of the major clouds), Terraform offers many benefits for platform engineers, including centralised state management, version-controllable modules, and some advanced language functions.

I’ve been using the HashiCorp tool for over 4 years now, for both AWS and Azure, and am a certified Terraform Associate. I wanted to use this post to outline some of the mistakes, gotchas, and best practices I have experienced along the way.

The below isn’t a comprehensive best-practice guide. HashiCorp provides great material on how to use Terraform and their recommended practices; it’s worth checking that out if you haven’t already. The following are some of my tips based on my experiences.

Be really familiar with provider docs (and changelog)

If you’re new to Terraform and plan to be using it more, you absolutely need to know how to navigate the Terraform provider docs. For example, if you’re going to be building in AWS, you’ll be spending a lot of time here. At the time of writing, the AWS provider covers 140 AWS services, each with one or many resources, which in turn have many attributes. And that’s just for AWS.

While nobody will be using every single resource, you will undoubtedly need to know what options the resources you do use have. The provider docs are your bible for this. Any time a provider is updated, the docs are updated as part of the release, so they’ll also reflect the latest available resources and attributes.

Another handy source of information is the provider changelogs. Again, these are updated with each provider release. They are a useful way to check recent changes to resources, which can sometimes lead to unwanted changes in your code, or even break your Terraform plans. This can be mitigated by pinning your provider versions (see below), but sometimes you need to use the latest provider to leverage new releases from your favourite cloud. For example, here’s the AWS provider changelog; I find it handy to search this page for a particular resource (e.g. “aws_ec2_transit_gateway”) to look for any recent changes.

Version your Terraform (and providers)

As recommended by HashiCorp, it is a best practice to add constraints to both Terraform and provider versions. While the versions of both Terraform and each provider used in your configuration should be pinned (constrained), they affect your code in different ways:

Terraform Version: terraform commands (init | plan | apply etc.) use the version of Terraform installed on the host they run on. Therefore the version doesn’t change unless you reinstall Terraform (or use tfswitch, see below). As you start working on larger projects with other engineers, however, the requirement for consistency across workstations becomes clear. By setting the required version for Terraform in the configuration, all developers using that configuration must have a matching version of Terraform installed on their machine.

terraform {
  required_version = "~> 1.0.5" # >= 1.0.5, < 1.1.0
}

In the above example, you can see that I used ~> (tilde) as part of the version constraint. The tilde operator allows new patch releases within a specific minor release, so in this case we can use any version from 1.0.5 up to, but not including, the next minor release (1.1.0).

Providers: unlike Terraform, which is already installed on the host, providers are pulled during terraform init. The default behaviour is to pull the latest version of each provider in your code. As mentioned above, the latest isn’t always what you want, as it can introduce changes to existing resources you have built.

Terraform provides two easy ways to manage your provider versions. The traditional way is to use constraints in your terraform block, the same place you set the Terraform version (the second, the dependency lock file, is covered after the example below):

terraform {
  required_version = "~> 1.0.5" # >= 1.0.5, < 1.1.0
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0" # >= 3.0, < 4.0
    }
  }
}
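
The other way is the dependency lock file (.terraform.lock.hcl, introduced in Terraform 0.14), which terraform init generates and which records the exact provider versions selected. Committing it to version control means every engineer and pipeline installs the same versions, and terraform init -upgrade is needed to move to newer ones. As a rough sketch of what an entry looks like (the version and hashes below are placeholders, not real values):

provider "registry.terraform.io/hashicorp/aws" {
  # Exact version selected by terraform init (placeholder value)
  version     = "3.58.0"
  # Constraint copied from the configuration
  constraints = "~> 3.0"
  # Checksums used to verify the downloaded provider (placeholders)
  hashes = [
    "h1:...",
    "zh:...",
  ]
}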

Use for_each instead of count

Before for_each was introduced in Terraform 0.12, count was the only way to create multiple similar objects from a single resource block. The downside to using count is that it creates a list of resources that is difficult to modify. Take the following example:

variable "subnets" {
default = ["10.1.11.0/24", "10.1.12.0/24"]
}
resource "aws_subnet" "this" {
count = length(var.subnets)
cidr_block = var.subnets[count.index]
vpc_id = var.vpc_id
}

The above code will create two subnets, one for each CIDR in our list:

Terraform will perform the following actions:

  # aws_subnet.main[0] will be created
  + resource "aws_subnet" "main" {
      ...
      + cidr_block = "10.1.11.0/24"
      + tags       = {
          + "Name" = "subnet-10.1.11.0/24"
        }
      ...
    }

  # aws_subnet.main[1] will be created
  + resource "aws_subnet" "main" {
      ...
      + cidr_block = "10.1.12.0/24"
      + tags       = {
          + "Name" = "subnet-10.1.12.0/24"
        }
      ...
    }

Plan: 2 to add, 0 to change, 0 to destroy.

Now if we want to add a third subnet, with a CIDR block at the start of the list, the subnets variable looks like this:

variable "subnets" {
default = ["10.1.10.0/24","10.1.11.0/24", "10.1.12.0/24"]
}

This will result in the following plan:

Plan: 3 to add, 0 to change, 2 to destroy.

As we can see from this plan, our two original subnets will be destroyed and recreated. This is the limitation of using count to create a list of resources. The solution is to use Terraform’s for_each instead:

variable "subnets" {
default = ["10.1.11.0/24", "10.1.12.0/24"]
}
resource "aws_subnet" "main" {
for_each = toset(var.subnets) # toset converts var.subnets to a set, which is required when using for_each on a list
vpc_id = var.vpc_id
cidr_block = each.value
tags = {
Name = "subnet-${each.value}"
}
}

When using for_each, adding an item to the start or end of the list results in just one change in our plan, which is what we want:

variable "subnets" {
default = ["10.1.10.0/24","10.1.11.0/24", "10.1.12.0/24"]
}
...Plan: 1 to add, 0 to change, 0 to destroy.

The above example highlights that using count can introduce complications down the line when updating resources. Therefore, it is good practice to use for_each for lists of resources and keep count for simple groups:

If your instances are almost identical, count is appropriate. If some of their arguments need distinct values that can't be directly derived from an integer, it's safer to use for_each.
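
To illustrate that last point, here is a minimal sketch of the same subnets keyed by name in a map (the variable structure and the names app/data are my own, not from the example above). Each instance then gets a stable, human-readable address in state rather than a list position:

variable "subnets" {
  type = map(string) # subnet name => CIDR block
  default = {
    app  = "10.1.11.0/24"
    data = "10.1.12.0/24"
  }
}

resource "aws_subnet" "main" {
  # for_each over a map: each.key is the name, each.value is the CIDR
  for_each   = var.subnets
  vpc_id     = var.vpc_id
  cidr_block = each.value

  tags = {
    Name = "subnet-${each.key}"
  }
}

With this approach each subnet is tracked as aws_subnet.main["app"] rather than aws_subnet.main[0], so adding or removing an entry from the map only ever touches that one subnet.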

Use (and secure) Remote States

One of the key features of Terraform over other IaC tools is its state management. Terraform uses state files to remember the expected state of resources it manages. This allows us both to maintain and update environments.

This state is used by Terraform to map real world resources to your configuration, keep track of metadata, and to improve performance for large infrastructures.

By default, Terraform stores state in a local file called terraform.tfstate . While this is fine for quick demo/POC deployments, it is not recommended for production builds, particularly those managed by teams. Each developer needs to have access to the latest version of the statefile in order to avoid unplanned updates to existing resources.

Terraform provides remote backends for storing centralised statefiles to allow collaboration and versioning. For example, in AWS remote state can be stored in S3 or in a Postgres database running on RDS (the pg backend). Both can support state locking, which prevents concurrent updates of the statefile (e.g. from multiple developers or pipelines): the pg backend locks natively, while the S3 backend uses a DynamoDB table for locking.

An important point to note on statefiles is that they contain all attributes of your resources, and can include secrets and other sensitive data. Therefore, it is good practice to secure your remote backend like you would a database or any other sensitive data store. For example, for S3 we recommend encrypting data at rest using SSE, and limiting user and network access using IAM and bucket policies. In addition, I would recommend enabling bucket versioning to allow recovery in case of statefile corruption or deletion.
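
As a sketch of what that looks like in practice (the bucket, key, and table names below are placeholders, not real resources):

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"             # placeholder bucket name
    key            = "prod/network/terraform.tfstate" # placeholder state path
    region         = "eu-west-1"
    encrypt        = true                              # server-side encryption for the state object
    dynamodb_table = "terraform-state-lock"            # placeholder DynamoDB table used for state locking
  }
}

Bucket versioning, IAM policies, and bucket policies are configured on the bucket itself rather than in this block.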

Go all-in on Terraform

As described in the previous section, Terraform is stateful: it keeps track of the expected state of the resources it manages, whether subsequent changes are made through Terraform or not.

For example, if Terraform creates a security group with some tags, and a sysadmin then manually changes those tags through the console… the next time Terraform applies, it will revert the tags to their original values. While this scenario doesn’t sound disastrous, it highlights the problem that arises when resources managed by Terraform are changed elsewhere.

If, instead of tags, the sysadmin changed something with more consequence, such as instance_type to a larger instance, Terraform reverting it on the next apply could cause major issues for your application. Therefore it is recommended to go all-in on Terraform and avoid combining IaC with manual changes.

Terraform does provide commands for modifying its state (state rm | import | taint) which can be leveraged to make Terraform “forget” or “discover” resources. However, from experience, I have found that these processes can be complicated and messy, and they should be used as a last resort. My tip is: if you create it with Terraform, manage it with Terraform.
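
For reference, this is roughly how those commands are used; the resource address and instance ID below are hypothetical:

# Make Terraform "forget" a resource without destroying it
terraform state rm aws_instance.app

# Bring an existing resource under Terraform management ("discover" it)
terraform import aws_instance.app i-0abc1234def567890

# Mark a resource for destruction and recreation on the next apply
terraform taint aws_instance.app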

Terraform can’t do everything… or at least it shouldn’t

While Terraform is a great tool for building infrastructure, it is not ideal for every situation. It can be tempting to put everything in your .tf files and use resources like aws_cloudformation_stack, provisioners such as local-exec and remote-exec, and other ways of running non-native scripts. Going down this route, you can easily find yourself in “square peg, round hole” territory.

Generally, when you stretch the boundaries of Terraform you start to compromise the benefits it offers, such as updating existing resources. I’m not saying never trigger other scripts with Terraform; it makes sense in some scenarios, such as when you already have a full stack defined in CFN but everything else is deployed with Terraform. As a rule of thumb, I recommend using native Terraform resources where possible and considering other tools for what they’re good at, such as Ansible for configuring instances and pipelines for deploying applications.

Use Tools

My last tip for this post is to make use of the Terraform tools that are out there. As Terraform has gained popularity, more and more open-source helper tools have become available. As your Terraform footprint grows, it is essential to enforce standardisation, and many of these tools will help with implementing best practices, especially when included in your CI process (see the example commands after the list). Below are some of my favourites; it is in no way an exhaustive list:

  • terraform fmt: built-in Terraform command to apply recommended coding conventions
  • tfswitch: handy tool for managing and switching between Terraform versions on your workstation
  • tflint: pluggable linter used to enforce best practices (such as naming conventions)
  • tfsec: static analysis of your Terraform templates to spot potential security issues (such as hardcoded passwords)
  • terratest: automated testing library to validate that infrastructure works correctly by making HTTP requests, API calls, SSH connections, etc.
  • terraform-docs: package used to generate documentation for Terraform modules
  • checkov: static code analysis for IaC tools, which comes with a well-defined collection of best-practice checks for AWS and Azure, as well as support for custom rules
  • pre-commit: framework for managing git hooks that force/remind the developer to run certain actions (e.g. fmt, lint, docs) before committing Terraform code
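
As an illustration, a local check or CI step might run several of these tools like so (exact flags vary between tool versions, so treat this as a sketch):

terraform fmt -check -recursive         # fail if any files are not formatted
tflint                                  # lint the configuration in the current directory
tfsec .                                 # scan for potential security issues
checkov -d .                            # run best-practice policy checks
terraform-docs markdown . > README.md   # generate module documentation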

Hopefully you found some of these tips helpful for your own adventure with Terraform. There were many topics that I didn’t cover in this blog, including modules and CI/CD. I hope to cover those in future posts. If you have any questions, tips of your own, or tool recommendations to share, please post them in the comments.
