Upgrading to Terraform 0.12 and Terragrunt 0.19+ at Peloton

Jaron Summers
Peloton-Engineering


We rely on Terraform and Terragrunt to manage all of our AWS infrastructure as code. This infrastructure directly impacts many areas of the Peloton experience that our members know and love, like our leaderboard that enables social interaction and competition between members, and even our CMS tools that help us deliver our world-class fitness content to more than two million people.

In total, we have about seven separate “environments,” which are discrete sets of Terraform code that are meant to generate a complete environment, such as the build/management infrastructure, production, or staging. Each environment has tens or hundreds of Terragrunt configuration files that instantiate one or more Terraform modules. We rely on Terragrunt to manage some hooks that run before the “plan” stage and render environment-specific values into more general modules. Here is a simple example: we might have a module that creates an AWS Auto Scaling Group with predefined settings that we could then instantiate with one small instance for staging and 10 large instances for production.

Until recently, we were using version 0.11 of Terraform and version 0.17 of Terragrunt. The upgrade to Terraform 0.12 — and with it, the upgrade to Terragrunt 0.19+ — is a substantial one, because it comes with a new version of the configuration language. In the long run, this upgrade has enabled the team to work faster, with fewer steps in the process, so we can continue developing new features and, ultimately, continue improving the experience for members.

Method

One of the most important bits of prep work is making sure that Terraform and all required providers are as up to date as possible before starting the upgrade, because the earlier releases in the 0.11.x series lack many of the upgrade helpers that ensure remote state can be safely upgraded.

Because we use Terragrunt, there are actually two components of the upgrade: upgrading the Terraform modules and upgrading the Terragrunt configs. Both sets of changes must be made as a matching set, so we made the choice to block all changes to our Terraform repositories for a few days so the environment would not drift while all of the code changes were being made.

Terraform supports tfvars files, which are read in as part of the plan and set values for the variables they contain. Terragrunt previously relied on that functionality to configure itself, but has since switched to a separate HCL2 (HashiCorp Configuration Language, version 2) file. Conceptually, the Terragrunt upgrade is a conversion from a tfvars file to an HCL2 file. Our Terragrunt files were laid out pretty consistently, so we were able to hack together this quick parser to make the vast majority of the required changes:

import os

tf_dir = "."  # placeholder: path to the root of the repository to convert

files_to_update = list()
for r, d, files_in_dir in os.walk(tf_dir):
    for i in files_in_dir:
        if i == "terraform.tfvars":
            # this naming scheme is arbitrary, it’s what we chose to name our terragrunt configuration files, it just has to be a *.tfvars file
            files_to_update.append(r)
            # if we find a tfvars file, we store the root directory for processing

for f in files_to_update:
    with open(f + "/terraform.tfvars", 'r') as i:
        lines = i.readlines()

    lbrace = 0  # this is a counter to keep track of whether we have matched brackets right now or not
    closing_brace_idx = None
    done = False
    for idx in range(len(lines)):
        if "{" in lines[idx]:
            lbrace += 1
        if "}" in lines[idx]:
            lbrace -= 1
            if lbrace == 0:
                closing_brace_idx = idx
        if "terragrunt = {" in lines[idx]:
            # terragrunt configurations no longer need their block to be demarcated as such, so we remove this attribute definition
            tg_idx = idx
        if closing_brace_idx and idx > closing_brace_idx + 1 and not done:
            # writelines() does not add newlines, so each new line carries its own
            lines[closing_brace_idx] = "inputs = {\n"
            if not lines[-1].endswith("\n"):
                lines[-1] += "\n"
            lines.append("}\n")
            done = True
    if not done:
        lines.pop(closing_brace_idx)
    lines.pop(tg_idx)
    # we don’t delete the terragrunt attribute declaration until the end to avoid mutating the list while we’re iterating over it
    with open(f + "/terragrunt.hcl", "w") as x:
        x.writelines(lines)  # once changes are done, we just write out a brand new file with the new desired name...
    os.remove(f + "/terraform.tfvars")
    # ...and delete the old one

This reads in each file, removes the now-obsolete terragrunt attribute declaration, and wraps any input variables in their own inputs block. There are plenty of use cases that it does not cover, but it can be easily customized. One potentially noteworthy problem is that this approach eliminates your Git history for these files, because Git interprets the change as a deletion of all of the terraform.tfvars files and a creation of seemingly unrelated terragrunt.hcl files. Moving the rename and the modification into separate commits would potentially retain commit history, but we didn’t consider this critical.
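The two-commit variant mentioned above could look something like this. It is a minimal sketch, assuming the code lives in a Git checkout and that committing the pure rename first is enough for Git's rename detection to connect the old and new files. The function names and the commit message are hypothetical.

```python
import os
import subprocess


def rename_commands(tf_dir):
    """Build one `git mv` command per terraform.tfvars found under tf_dir."""
    commands = []
    for root, _dirs, files in os.walk(tf_dir):
        if "terraform.tfvars" in files:
            old = os.path.join(root, "terraform.tfvars")
            new = os.path.join(root, "terragrunt.hcl")
            commands.append(["git", "mv", old, new])
    return sorted(commands)


def rename_in_own_commit(tf_dir):
    """Commit the untouched renames first; content edits go in a later commit."""
    for command in rename_commands(tf_dir):
        subprocess.run(command, check=True)
    subprocess.run(
        ["git", "commit", "-m", "Rename terraform.tfvars to terragrunt.hcl"],
        check=True,
    )
```

With the rename committed on its own, the parser would then read and rewrite terragrunt.hcl in place instead of creating a new file and deleting the old one.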

If your workflow does not include Terragrunt, the process is effectively the same, minus the Terragrunt-specific steps.

The Terraform upgrade is much more in-depth because some core functionality of the configuration language fundamentally changed. Fortunately, the Terraform binary can do a lot of the work for you, even if your use case is fairly complex. Begin with a complete cache of all required plugins, then create a quick script that iterates over each module and runs `terraform init` followed by `terraform 0.12upgrade`. Once that finishes running, all of the code will be nominally upgraded to be 0.12 compliant, though in practice a lot of things still need some massaging.
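A script along those lines might look like the following. This is only a sketch, not the script we actually ran: it assumes every directory containing .tf files is a module to upgrade, and the function names are hypothetical. It defaults to a dry run so that nothing is executed until you opt in.

```python
import os
import subprocess


def find_module_dirs(root):
    """Return every directory under root that directly contains .tf files."""
    module_dirs = []
    for current, dirs, files in os.walk(root):
        # Skip Terraform's own working directory so cached modules are left alone.
        dirs[:] = [d for d in dirs if d != ".terraform"]
        if any(name.endswith(".tf") for name in files):
            module_dirs.append(current)
    return sorted(module_dirs)


def upgrade_module(module_dir, dry_run=True):
    """Run (or, in dry-run mode, only return) the upgrade commands for one module."""
    commands = [
        ["terraform", "init"],
        # -yes skips the interactive confirmation prompt
        ["terraform", "0.12upgrade", "-yes"],
    ]
    if not dry_run:
        for command in commands:
            subprocess.run(command, cwd=module_dir, check=True)
    return commands
```

Looping over find_module_dirs() and calling upgrade_module(path, dry_run=False) for each result performs the actual upgrade. Because check=True stops at the first failure, a broken module does not get silently skipped.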

Once these code changes were complete, we duplicated all of our Terraform state buckets, in case we needed to roll back, and laid out a plan for how to plan and apply all of our environments. We began with individual, low-impact configurations in lesser-used environments and worked up towards completely applying the plans to production and management services. This approach gave us maximum confidence that the changes were safe by the time we reached environments that would impact the member experience.
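One possible way to duplicate the state buckets, assuming the AWS CLI is available and the backup buckets already exist, is to run `aws s3 sync` for each one. The backup suffix and any bucket names passed in are hypothetical, and this sketch only builds the commands rather than running them.

```python
def backup_commands(state_buckets, suffix="-pre-0-12-backup"):
    """Build one `aws s3 sync` command per Terraform state bucket.

    Assumes each destination bucket (source name plus suffix) already exists.
    """
    return [
        ["aws", "s3", "sync", f"s3://{bucket}", f"s3://{bucket}{suffix}"]
        for bucket in state_buckets
    ]
```

Each returned command can be passed to subprocess.run(command, check=True). Note that a sync copies only the current objects, so if you depend on S3 object versioning for rollbacks, that history would need separate handling.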

It’s worth noting that you need to apply all plans, even those that seemingly make no changes, because Terraform updates the state data for 0.12 in the background.

Gotchas

There were a couple of things that the upgrade tool fairly reliably missed that required minimal time to correct.

List type hinting was a big one. In Terraform 0.11, there are many situations where it is necessary to hint to the parser that a variable will be a list by wrapping it in square brackets. Terraform 0.12 handles this situation like any other programming language would, so the same expression yields a list of lists, which is usually not what you were trying to create.

With a variable called “my_list” with a value of [1, 2]:

# Old way:
some_param = ["${var.my_list}"]
# New way:
some_param = var.my_list
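The same behavior is easy to see in a general-purpose language. The snippet below is a Python analogy rather than Terraform code, and the variable names are only illustrative.

```python
my_list = [1, 2]

# Wrapping a value that is already a list produces a list of lists,
# which is how Terraform 0.12 reads ["${var.my_list}"].
wrapped = [my_list]

# Passing the value through unchanged keeps it a flat list,
# which corresponds to some_param = var.my_list.
direct = my_list

print(wrapped)  # [[1, 2]]
print(direct)   # [1, 2]
```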

Related to the above, the introduction of a more robust type system created a lot of situations where the “old way” no longer worked. The type system in Terraform 0.12 is a huge benefit, but getting there required a lot of preliminary work. We recommend taking the time to determine the most narrowly scoped types you can, as it can prevent several issues later on. For example, if a module won’t work as expected unless a variable has the structure list(object({thing = string, other_thing = number})), then it’s better to state that explicitly than to fall back to list(any), because Terraform will then give an easy-to-understand error when a data structure mismatch occurs.
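To find variables that are still loosely typed, a rough line-based scan can help. The sketch below is a heuristic, not a real HCL parser: it assumes each type constraint sits on its own line inside a variable block, and the function name is hypothetical.

```python
import re

# Matches `type = any`, `type = list(any)`, `type = map(any)`, `type = set(any)`.
LOOSE_TYPE = re.compile(r'^\s*type\s*=\s*(any|(?:list|map|set)\(any\))\s*$')
VARIABLE = re.compile(r'^\s*variable\s+"([^"]+)"')


def find_loose_types(tf_source):
    """Return (variable_name, type_expression) pairs for loosely typed variables."""
    findings = []
    current = None
    for line in tf_source.splitlines():
        declared = VARIABLE.match(line)
        if declared:
            # Remember which variable block we are currently inside.
            current = declared.group(1)
            continue
        loose = LOOSE_TYPE.match(line)
        if loose and current:
            findings.append((current, loose.group(1)))
    return findings
```

Anything it reports is a candidate for a narrower constraint such as the list(object({...})) form described above.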

We also temporarily lost the use of some minor integrations and tools, including terraform-docs, which we relied on to auto-render basic readme files listing each module’s inputs and outputs along with their descriptions and defaults. It has recently been updated to support Terraform 0.12.

Terraform 0.11 does not have a boolean type; it just does its best to cast certain magic strings into the numbers 0 and 1. Terraform 0.12 adds true boolean support, which did not burn us but easily could have. With the old “truthy” evaluation gone, quoted values need care: per the 0.12 upgrade guide, the strings “true” and “false” are still converted to booleans where one is expected, but “0” and “1” are no longer accepted, so anything that relied on those values has to be switched to a real true or false.
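Quoted values like these are strings rather than booleans, so one way to audit for the problem is to search the configuration for them. The following is a rough sketch under that assumption. The function name is hypothetical, and the pattern only catches a quoted value at the end of a line.

```python
import re

# Quoted values that Terraform 0.11 configurations commonly used as booleans.
QUOTED_BOOL = re.compile(r'=\s*"(true|false|0|1)"\s*$')


def find_quoted_booleans(tf_source):
    """Return (line_number, stripped_line) for each suspicious assignment."""
    findings = []
    for number, line in enumerate(tf_source.splitlines(), start=1):
        if QUOTED_BOOL.search(line):
            findings.append((number, line.strip()))
    return findings
```

Each hit still needs a human decision, since a quoted "1" may be a legitimate string in some contexts.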

Lessons Learned

If we had to do it again, the one thing we would do differently is aim for the 0.12 upgrade to result in zero changes to our AWS infrastructure, and simply note every instance of “cruft” to be revisited and corrected either before or after the upgrade.

Fixing things as we went ultimately worked out, but as we delved deep into some Terraform modules that had not been changed in quite a while, we spent time correcting many little things (e.g. outdated email addresses, inconsistent variable names, and formatting issues). Those fixes were a net gain for the maintainability of the repo, but they ended up clogging the git diff and Terraform plans in a way that was not necessary.

Benefits

Having a true type system is great, particularly for people who don’t regularly interact with Terraform or AWS’ APIs. It’s much more developer friendly to be able to fail early in the plan with a clear error message indicating the type of data that was expected, which is partly a result of the expanded type system — and partly a result of Terraform 0.12’s greatly improved error handling in general. Where Terraform 0.11 would have said “I didn’t work, lol!”, Terraform 0.12 will tell you that you provided a list(list(number)) when it was expecting a list(number) for the variable called “list_of_numbers”, which makes it hugely more approachable for software engineers outside of our platform team.

Beyond making Terraform more accessible to the rest of the organization, a big thing that we gained is dynamic blocks, which let us replace a few really gnarly pre-rendering scripts that were actually reading in tfvars and writing out terraform files before the plan would run. Dynamic blocks are hugely powerful. A simple example is something like this:

dynamic "filter" {
  for_each = lookup(rules.value, "filter", [])
  content {
    prefix = lookup(filter.value, "prefix", null)
    tags   = lookup(filter.value, "tags", null)
  }
}

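This will generate a filter{} block for each entry in the “filter” list of rules.value, which is itself one item in a list of maps. For each entry it sets prefix and tags from the matching keys, falling back to null (which Terraform treats as unset) when a key is missing. If a rule has no “filter” key at all, no block is generated. These can even be nested, allowing for some fairly complex things to be done dynamically.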
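For readers more comfortable with a general-purpose language, the expansion behaves roughly like the loop below. This is a Python analogy, not Terraform code: the function name is hypothetical, and None stands in for Terraform's null.

```python
def expand_filters(rule):
    """Mimic the dynamic "filter" block for one rule, given as a plain dict."""
    blocks = []
    # lookup(rules.value, "filter", []) corresponds to rule.get("filter", [])
    for entry in rule.get("filter", []):
        blocks.append({
            # lookup(filter.value, "prefix", null) corresponds to entry.get("prefix")
            "prefix": entry.get("prefix"),
            "tags": entry.get("tags"),
        })
    return blocks


print(expand_filters({"filter": [{"prefix": "logs/"}]}))
# [{'prefix': 'logs/', 'tags': None}]
print(expand_filters({}))
# []
```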

Finishing Up

Overall, there was a lot of toil in preparing for this upgrade. We spent about 25 hours hand-massaging modules into shape after the upgrade, but the rollout itself was almost completely painless. It gave us some immediate benefits by solving a couple of specific problems we had that needed dynamic blocks, and it will allow us to maintain a cleaner code base that is more accessible to product developers.

Our next big Terraform projects relate to our push to achieve true continuous delivery:

  • Combining our separate Terraform and Terragrunt repos
  • Experimenting with only merging code that is already applied, thus validating that it is error-free, idempotent Terraform
  • Testing with tflint and/or terratest to get to a point where we’re comfortable with auto-applying any changes that are merged.

More Information

HashiCorp docs on upgrading to Terraform 0.12: https://www.terraform.io/upgrade-guides/0-12.html

Gruntwork docs on upgrading to Terragrunt 0.19+: https://github.com/gruntwork-io/terragrunt/blob/master/_docs/migration_guides/upgrading_to_terragrunt_0.19.x.md

Thanks

Jonathan Cooper and Tipene Moss for all of the help with the upgrade process, both during planning and during the full day of huddling around my desk planning and applying.

Tyler Gass, Devon Mizelle, Laura Barber, Shawn Tolidano, Alec Booker, and Jonathan Cooper (again) for help editing this post.
