Murdering monoliths: using Terragrunt to split monolithic Terraform state up into multiple stacks

Chris Cunningham
Qodea Google Cloud Tech Blog
5 min read · Aug 3, 2022

Terraform is a fairly mature tool at this point, and for small to medium-sized estates can be used by itself to maintain your entire cloud footprint. Features like workspaces mean that you can even, with a little forward planning, have separate and isolated environments that can share infrastructure code but use different variables. But say that you’ve built out a popular platform, and the rest of your business is rushing to get on board. Using a single stack to hold all of your infrastructure logic begins to show scaling issues. Plans begin to take several minutes. Eventually, if unchecked, you can end up in a situation where you spend more time waiting for plans to complete than you do developing:

Two programmers having a toy sword fight instead of working. Their boss yells “get back to work!” and gets the response “compiling!”, to which the boss replies “oh, carry on”.
With apologies to Randall Munroe

Say that you have a large number of modules, all called at the top level by a root module. Let’s say our code looks like this:

main.tf  # root module, calls everything else
variables.tf
terraform.tfvars
team1/
    main.tf
    variables.tf
team2/
    main.tf
    variables.tf
team3/
    main.tf
    variables.tf
team4/
    main.tf
    variables.tf
external_team/
    main.tf
    variables.tf
databases/
    main.tf
    variables.tf
compute/
    main.tf
    variables.tf
kubernetes/
    main.tf
    variables.tf
...

main.tf is just a list of module calls, passing any global variables into the subdirectories. (If this seems contrived, this layout is very easy to arrive at if your company’s cloud footprint is growing rapidly and “just one more module” fits the day’s business needs!) You could just take each of the subdirectories and put it in a new repository. Problem solved, right?
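Such a root module might look like this (the module names and global variables here are illustrative, not from the original repo):

```hcl
# main.tf — root module: nothing but module calls wiring in each subdirectory
module "team1" {
  source     = "./team1"
  project_id = var.project_id # hypothetical global variables, passed down
  region     = var.region
}

module "team2" {
  source     = "./team2"
  project_id = var.project_id
  region     = var.region
}

# ...and so on, one block per subdirectory
```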

Note that there are no outputs; because the state is monolithic, any resource can directly refer to any other. To split this up into multiple repositories would require outputs to be defined, and any module calls referring to those outputs to be rewritten. You also need to copy all of the tooling that you had in place for the original repo (such as CI/CD, pre-commit checks, etc), and all of those have to be maintained separately. Furthermore, the more new modules get added, the more repos you have to maintain, and changes which impact more than one module can be difficult to apply and coordinate.

What if there was an easy way to win back performance, reduce the time that it takes to apply small changes, and do it without radically overhauling your codebase?

Enter Terragrunt.

The Terragrunt logo, consisting of a shaggy monster with a friendly expression, wearing a construction helmet
What a friendly monster

Terragrunt is a thin wrapper for Terraform which aims to reduce duplicated code across large estates. While it has lots of additional features to help deal with large codebases, we’re only going to look at a basic implementation which splits our monolithic state into separate stacks without (for now) overhauling the repository layout.

We’re going to remove the root module and replace it with some Terragrunt configuration. Going forward, instead of running terraform plan (and apply), we’re going to use terragrunt run-all plan (and apply), which will look for any directories containing a terragrunt.hcl file and construct its own, higher-level graph of their resources. It can then process these in parallel, over and above Terraform’s own module-level parallelism. Furthermore, because modules aren’t explicitly dependent on one another any more, plans that only impact one directory can be run without having to inspect every resource. Finally, each directory gets its own state file, reducing the blast radius if something goes wrong with the state.

Let’s see what needs to be changed:

~main.tf~  # can be deleted
~variables.tf~ # can be deleted
~terraform.tfvars~ # can be deleted
terragrunt.hcl # replaces main.tf
root.yaml # replaces terraform.tfvars
team1/
    main.tf
    terragrunt.hcl # boilerplate
    variables.tf
team2/
    main.tf
    terragrunt.hcl # boilerplate
    variables.tf
...
Our global variables now live in root.yaml. This replaces the terraform.tfvars file previously used to hold global values. The terragrunt.hcl file in the project root has some magic in it:

locals {
  vars = yamldecode(file(find_in_parent_folders("root.yaml", "root.yaml")))
}

inputs = local.vars

remote_state {
  backend = "gcs"
  config = {
    bucket   = "${local.vars.root_project}-tfstate"
    prefix   = path_relative_to_include()
    project  = local.vars.root_project
    location = local.vars.region
  }
}

inputs passes variables on to whichever sub-modules declare them. No more errors about providing unused variables! remote_state replaces the equivalent backend block in Terraform; using path_relative_to_include() here magically separates the different directories’ states out into different files. The find_in_parent_folders call is what lets an inherited configuration locate the root.yaml file from within a child directory, as below.
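A minimal root.yaml to match the configuration above might look like this (the project ID and region are placeholders):

```yaml
# root.yaml — global values shared by every stack
root_project: my-platform-prod # hypothetical GCP project ID
region: europe-west2
```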

In each directory, only one extra file is needed: a terragrunt.hcl file to tell Terragrunt to walk through this directory. Here’s the contents:

include {
path = find_in_parent_folders()
}

That’s it! This inherits the parent terragrunt.hcl and with it, the inputs and state configuration it defined.
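Each child module keeps declaring only the variables it actually consumes; because Terragrunt passes inputs as environment variables rather than -var flags, any extra keys in the inherited map are simply ignored. A sketch, reusing the hypothetical global names from above:

```hcl
# team1/variables.tf — declare only what this stack uses;
# unmatched keys in the inherited inputs map do not cause errors
variable "root_project" {
  type = string
}

variable "region" {
  type = string
}
```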

Finally, we want to migrate our existing state over to the new multi-stack world, programmatically if possible. Fortunately, Terraform state can be retrieved in JSON format, manipulated with the (awesome) jq tool, and then pushed to the new location by Terragrunt automatically. Here’s all that’s needed:

#!/bin/bash
set -euo pipefail

terraform state pull > state.json
dirs=$(ls -d */ | sed 's:/*$::')  # all directories in root
for dir in $dirs; do
  (
    cd "$dir"
    # keep only this directory's resources, then strip the root-level module prefix
    jq "del(.resources[] | select(.module | startswith(\"module.$dir.module\") | not))" ../state.json |
      sed "s/module.$dir.module/module/g" > "$dir.tfstate"
    rm -rf .terraform
    terragrunt state push "$dir.tfstate"
  )
done

Let’s walk through this short but clever script. First, the existing state is downloaded to a local JSON file. Then, for each directory, the state is rewritten to remove any resources that don’t start with that directory’s name as the module being called. Then sed is used to chop off the level of module nesting that is no longer used to tie each module to the root. Finally, Terragrunt is run: it parses its configuration to get the new, directory-enhanced state path, and pushes the modified state to that location.
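To see the jq/sed rewrite in isolation, here it is run against a stripped-down fake state file (real state files carry many more fields, and the resource names are invented):

```shell
# Toy demonstration of the filter step from the migration script
cat > state.json <<'EOF'
{
  "resources": [
    {"module": "module.team1.module.db", "type": "google_sql_database"},
    {"module": "module.team2.module.vm", "type": "google_compute_instance"}
  ]
}
EOF

dir=team1
# drops team2's resource, then renames module.team1.module.db to module.db
jq "del(.resources[] | select(.module | startswith(\"module.$dir.module\") | not))" state.json |
  sed "s/module.$dir.module/module/g"
```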

Voilà: each directory now has its own state. The root state file can be deleted, after ensuring that a full plan (running terragrunt run-all plan from the root directory) shows no resources being changed.

Terragrunt can be used with Atlantis, per the documentation on custom workflows, and in general acts just like Terraform except for its enhanced configuration.

There are lots of extra tricks that can now be used to reduce duplication, support parallel environments, and deal with dependencies intelligently (including mocking during bootstrap), but with a few tweaks to an existing codebase it’s already been made more scalable, more robust, and far quicker to edit.

Thanks Terragrunt!

About CTS

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry specific solutions for; Retail, Financial Services, Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!
