Terraform Code Layout and Using Terragrunt

Aaron Kalair
14 min read · Oct 10, 2021


Terraform lets you organise your code however you want.

This gives you lots of flexibility and makes it easy to get started by just throwing a few resources into a file and running terraform apply

But as your environment grows, you’ll want to be more disciplined about how you structure your code.

This post talks about how we started with a simple freeform layout of files in a single folder, and then moved to Terragrunt to address some of the issues that approach created.

The first thing you’ll need to sort out when working with Terraform is managing the state file.

State Management

Terraform state is where Terraform stores information about the resources you’ve created with Terraform.

If you’re a single developer you could just keep the terraform.tfstate file locally on your machine (not checked into Git as it can contain potentially sensitive information) but when working on a team you need two things:

  1. A way to share this state file securely
  2. A way to prevent multiple developers modifying the state file at the same time

Thankfully the s3 backend type solves both of these problems — https://www.terraform.io/docs/language/settings/backends/s3.html

This stores the terraform.tfstate file in an S3 bucket, and uses DynamoDB to obtain an exclusive lock before performing actions that need the state file.
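
For reference, here’s a minimal sketch of that backend configuration in pure Terraform (the bucket, table, and region values are placeholders you’d replace with your own):

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"  # placeholder bucket name
    key            = "terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                  # encrypt the state at rest
    dynamodb_table = "my-terraform-locks"  # placeholder lock table name
  }
}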

This does raise a new problem though: how do you initially create the S3 bucket and DynamoDB table?

There are numerous solutions: you could do it with CloudFormation, or manually.

But we wanted to keep it all in Terraform, so we have a module that creates an S3 bucket and a DynamoDB table.

This is instantiated once in each environment, in a bootstrap folder, to give us environment specific S3 buckets and DynamoDB tables for the “main” Terraform state. The Terraform state for those 2 bootstrap resources is stored in Git.

These 2 initial resources are never modified after the initial creation, so we’re not worried about having a lock to protect this state from concurrent modification, and it doesn’t contain any sensitive data, so we’re happy to commit it to Git.
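
As a rough sketch, that bootstrap module looks something like this (trimmed to the essentials, using the inline versioning syntax from the pre-v4 AWS provider that was current at the time; the variable names are illustrative):

variable "state_bucket_name" {
  type = string
}

variable "lock_table_name" {
  type = string
}

# Bucket that holds the terraform.tfstate files
resource "aws_s3_bucket" "terraform_state" {
  bucket = var.state_bucket_name

  # Versioning lets you roll back a corrupted state file
  versioning {
    enabled = true
  }
}

# Lock table: the S3 backend requires a string hash key named LockID
resource "aws_dynamodb_table" "terraform_lock" {
  name         = var.lock_table_name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

The LockID hash key is the exact name the S3 backend expects for its lock table.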

How to Lay Out the Terraform Code

With the state file management sorted, you can move onto defining some resources you need to create.

One place to start is here:

terraform/
-> dev/
   -> bootstrap/
      -> bootstrap.tf
      -> terraform.tfstate
-> staging/
   -> bootstrap/
      -> bootstrap.tf
      -> terraform.tfstate
-> prod/
   -> bootstrap/
      -> bootstrap.tf
      -> terraform.tfstate

Each of these folders is a separate Terraform state, so if you make a mistake running terraform apply in one environment it won’t affect the others.

Inside each folder you’ll then need to decide how to split files containing module and resource definitions.

You could just have one giant terraform.tf file and put every single definition in that.

But you’re going to get a lot of merge conflicts and find it nearly impossible to locate where specific resources are defined.

So a sensible layout is something like:

terraform/
-> dev/
   -> application_a.tf (resources / modules for app A)
   -> application_b.tf (resources / modules for app B)
   -> vpc.tf (VPC / Subnets / NAT Gateways etc.)
   -> eks.tf (K8s cluster)

Tweak for your exact set of technologies but you get the idea.

Variables

The next thing you’ll want is a way to define variables. The modules for app A probably need to take the name of app A so they can tag resources / add identifiers, so you’ll put a locals block in application_a.tf

locals {
  app_a_name = "A"
}

And then in the module definition reference it

module "app_a_resources" {
  source   = "./modules/app_a" # wherever the module lives
  app_name = local.app_a_name
}

Note that every file in the dev/ directory will have access to this locals block, so you can’t just call the variable app_name

You’ll probably also have some variables that need to be the same across all your apps, for example maybe your company has agreements on when maintenance windows are for RDS instances.

So you either copy paste the window into the module definition for every database across all the apps, or define it as a local somewhere and reference the local in every module that needs it.
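
A minimal sketch of that second option, assuming a shared file such as dev/shared.tf (the file and module names here are hypothetical):

# dev/shared.tf
locals {
  maintainence_window = "mon12:13"
}

# dev/application_a.tf
module "app_a_database" {
  source              = "./modules/rds" # hypothetical module path
  maintainence_window = local.maintainence_window
}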

Issues with this Approach

This will probably work well for some time, but you’ll notice some issues start to creep in.

  1. As your entire environment is a single Terraform state, operations such as plan and apply will start taking longer and longer as they process every resource in the environment.
  2. As your terraform repo grows the layout is going to become less consistent.

People are going to name files e_resources.tf rather than application_e.tf

They’re going to forget / not be able to find the local definition for maintainence_window = "mon12:13" and create new local definitions such as maint_window, or just hardcode the string directly.

People are going to create Terraform modules with subtle variations on variable names: app_name will be app in some places, subnet_id will be subnet in others.

This isn’t a massive pain; you can just remap the local name to match in the module definition.

module "a_resources" {
  app = local.app_name
}

But as people try to read, modify, and extend the code, it’s going to cause mistakes as people don’t catch the subtle variable name changes.

Eventually you’ll have another AWS region, and now you need the maintainence_window to be different to account for time zones and your customers being active at different times.

No worries: you create eu_maintainence_window and in all the EU module definitions write

module "eu_a_resources" {
  maintainence_window = local.eu_maintainence_window
}

One day someone will copy paste the maintainence_window definition when creating an EU module and forget to change it, so the database reboots during peak time for your EU customers.

Ideally you’d rename maintainence_window to us_maintainence_window everywhere, but in reality time pressures normally mean these refactors aren’t fully completed.

3. Your list of .tf files in a single directory is now getting quite large; you have to grep to find what you’re looking for quite often.

4. One day someone quickly defines an extra resource for app_c in dev, outside of the app_c_resources module and directly in the dev/app_c.tf file, so they can quickly test something.

They forget to ever move the resource into the module, and so when app_c goes live it’s missing a resource, or changes haven’t been copy pasted across from dev/app_c.tf to prod/app_c.tf

Most of these problems can be addressed: with rigorous code reviews and a commitment to refactoring when variable names drift, you could keep your Terraform repo neat and tidy.

The ever-increasing size of your Terraform state is going to slow things down though.

Also remember that only one person can obtain the lock to modify the state file, so you may end up blocked waiting for a turn to run terraform apply

Finally, if you accidentally corrupt the state, that’s a lot of resources potentially affected (make sure you have S3 versioning enabled!)

To address these issues we started using Terragrunt which has worked out really nicely.

Terragrunt

For a long time I struggled to understand what the point of Terragrunt was or what exactly it did.

The best way I can explain it is that Terragrunt places constraints on how you can organise your Terraform code, forcing you to use directory hierarchies and shared variable definition files.

These constraints force your code to be more consistent and make it harder to make mistakes, whilst reducing the amount of flexibility you have.

I don’t think I could have ever truly appreciated Terragrunt without first going through the above and seeing the issues with just being able to go freestyle.

Logically Organising Your Infrastructure

To use Terragrunt you must decide how your infrastructure can be logically broken down into smaller and smaller groups.

E.g. at the top level you could have an environment (e.g. dev), then a region (e.g. us-east-1), then an application (e.g. app-a), and then finally pieces of infrastructure that application needs (e.g. database, cache, s3 bucket).

And so you’ll have a folder structure something like

dev/
-> us-east-1/
   -> applications/
      -> app-a/
         -> database/
            -> terragrunt.hcl
         -> cache/
            -> terragrunt.hcl
         -> s3-bucket/
            -> terragrunt.hcl

Your directory structure represents how your infrastructure is organised.

The terragrunt.hcl files are what Terragrunt reads to understand what Terraform module to apply; more on those later.

Now then what if we add another app that just needs an IAM Role?

dev/
-> us-east-1/
   -> applications/
      -> app-a/
         -> database/
            -> terragrunt.hcl
         -> cache/
            -> terragrunt.hcl
         -> s3-bucket/
            -> terragrunt.hcl
      -> app-b/
         -> iam-role/
            -> terragrunt.hcl

This is great, but what about resources shared across a region like an EKS Cluster?

You can put folders wherever you like, so let’s make one at the region level.

dev/
-> us-east-1/
   -> eks-cluster/
      -> terragrunt.hcl
   -> applications/
      -> app-a/
         -> database/
            -> terragrunt.hcl
         -> cache/
            -> terragrunt.hcl
         -> s3-bucket/
            -> terragrunt.hcl
      -> app-b/
         -> iam-role/
            -> terragrunt.hcl

Now why did we place app-a and app-b in a folder called applications ?

Couldn’t they have just been under us-east-1 ?

They could have, but if you have a lot of applications in your company, being able to group them all in one subfolder is easier to reason about, and it gives you the option of a shared values file that applies only to your applications.

Values Files

Let's talk about shared values files, another big feature of Terragrunt.

Every Terraform module will take input variables that control the exact details of the resources it creates.

Some of these input variables are likely to be the same across all resources deployed in an environment, e.g. environment_name for things like tagging.

Rather than writing environment_name=dev in every single terragrunt.hcl file, let’s define all those environment level variables in a file called environment.yaml

dev/
-> environment.yaml
-> us-east-1/
   -> eks-cluster/
      -> terragrunt.hcl
   -> applications/
      -> app-a/
         -> database/
            -> terragrunt.hcl
         -> cache/
            -> terragrunt.hcl
         -> s3-bucket/
            -> terragrunt.hcl
      -> app-b/
         -> iam-role/
            -> terragrunt.hcl

And environment.yaml will look like:

environment_name: dev

Next we have some region level settings; for example in us-east-1 we might want our maintainence_window for RDS instances to be mon05:00-07:00, a quiet time for our US customers.

dev/
-> environment.yaml
-> us-east-1/
   -> region.yaml
   -> eks-cluster/
      -> terragrunt.hcl
   -> applications/
      -> app-a/
         -> database/
            -> terragrunt.hcl
         -> cache/
            -> terragrunt.hcl
         -> s3-bucket/
            -> terragrunt.hcl
      -> app-b/
         -> iam-role/
            -> terragrunt.hcl

And region.yaml will look like

maintainence_window: mon05:00-07:00

Now if we expanded into the EU and the quiet time for our EU customers was tue21:00-22:00 we could do something like:

dev/
-> environment.yaml
-> us-east-1/
   -> region.yaml
   <SNIP>
-> eu-west-1/
   -> region.yaml
   -> applications/
      -> app-c/
         -> database/
            -> terragrunt.hcl

And eu-west-1/region.yaml would look like

maintainence_window: tue21:00-22:00

I’m sure you’re starting to get the idea now: you can propagate shared variables down the subtrees to customise your infrastructure without having to copy paste values everywhere.

The final shared variable we’ll create will be per application; the name of the app is likely to appear somewhere in its resources.

dev/
-> environment.yaml
-> us-east-1/
   -> region.yaml
   -> eks-cluster/
      -> terragrunt.hcl
   -> applications/
      -> app-a/
         -> app.yaml
         -> database/
            -> terragrunt.hcl
         -> cache/
            -> terragrunt.hcl
         -> s3-bucket/
            -> terragrunt.hcl
      -> app-b/
         -> app.yaml
         -> iam-role/
            -> terragrunt.hcl

And each app.yaml will look something like:

app_name: <NAME>

Now then, not every input variable will be generic; there are going to be some specific to individual modules.

For those you can supply them directly to the module.

Let’s look at what terragrunt.hcl actually is:

terraform {
  source = "Link to Terraform Module on Github"
}

include {
  path = find_in_parent_folders()
}

inputs = {
  module_specific_variable = "amazing"
}

It’s a fairly simple file.

You’ll need a link to the module you want to apply; it only works with Github links, so no Terraform Module Registry references here.

Ignore include for now; it’s related to finding those shared variable files, and we’ll come back to it later.

inputs is where you can pass any inputs to modules.

If an input variable has the same name as a variable defined in one of our shared variable .yaml files then it will be automatically picked up.

If the variable doesn’t come from a shared definition file you can enter it manually here.

There’s a few more things to mention about input variables whilst we’re here.

We’ll configure Terragrunt to use the first definition of a variable that it finds so that we can override generic values supplied by the .yaml files with more specific ones further down the tree.

You’ll probably also find situations where you’ve called the generic variable something like environment_name but some modules expect it to be called environment or env_name.

This is a nice feature of Terragrunt because it quickly becomes clear where your modules aren’t consistent, and you can work to bring things in line over time.

In the short term you can remap the generic variable names to the more specific module ones.

locals {
  env_vars = yamldecode(
    file(find_in_parent_folders("environment.yaml"))
  )
}

inputs = {
  env_name = local.env_vars["environment_name"]
}

Dependencies

Finally, it’s likely that the input to one module is going to be an output from another.

For example let's say our eks-cluster module outputs worker-sg-id, the ID of the Security Group the K8s workers use.

Then our database module takes an input parameter sg_to_allow_access_from, the ID of a Security Group it will create an ingress rule for.

You can use the output of one module as an input to another like this …

dependency "k8s-worker-sg" {
  config_path = "../../eks-cluster"
}

inputs = {
  sg_to_allow_access_from = dependency.k8s-worker-sg.outputs.worker-sg-id
}
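
One gotcha worth knowing (not covered above): if the dependency hasn’t been applied yet, its outputs don’t exist and the plan will fail. Terragrunt’s dependency blocks support mock_outputs to supply placeholder values in that situation; a minimal sketch, assuming the same worker-sg-id output:

dependency "k8s-worker-sg" {
  config_path = "../../eks-cluster"

  # Placeholder used only while the dependency has no real outputs yet
  mock_outputs = {
    worker-sg-id = "sg-00000000"
  }
}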

That covers pretty much everything you’ll need to get started with laying out the Terragrunt files and not copy pasting your variables everywhere.

Terragrunt Configuration File

But we still can’t run Terragrunt yet; there’s some configuration needed to tie it all together.

Previously we’d seen that the Terraform state file encompassed every resource in our environment.

This made Terraform commands take ever increasing amounts of time to run as our environment grew, and we risked affecting every resource if we corrupted the state somehow.

In this Terragrunt setup we can create one state file per “leaf node” of the directory tree; essentially, wherever there’s a terragrunt.hcl file defining a module to be applied, we create a new state file.

This makes Terraform operations super fast, and reduces the consequences of corrupting a state file.

We need to create a terragrunt.hcl file at dev/us-east-1/terragrunt.hcl. Rather than defining a Terraform module to apply, it defines all our Terragrunt configuration, which every other terragrunt.hcl file imports with the include statement we saw earlier:

include {
  path = find_in_parent_folders()
}

find_in_parent_folders is a built-in Terragrunt function that returns the path of the first terragrunt.hcl file it finds in the parent folders.

So let's start our dev/us-east-1/terragrunt.hcl file and define our state configuration

remote_state {
  backend = "s3"
  config = {
    bucket         = "S3 BUCKET NAME"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "AWS REGION"
    encrypt        = true
    dynamodb_table = "DYNAMO DB TABLE"
  }
}

This uses S3 and DynamoDB to store state / obtain exclusive locks on the state file like we saw before without Terragrunt.

But the key = ... line means that inside the S3 bucket there’ll be a directory structure of state files mimicking your Terragrunt folder structure.
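
For example, with the directory layout from earlier, the bucket would end up containing keys along the lines of:

eks-cluster/terraform.tfstate
applications/app-a/database/terraform.tfstate
applications/app-a/cache/terraform.tfstate
applications/app-b/iam-role/terraform.tfstate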

Also, remember how in pure Terraform we had to do slightly hacky things to bootstrap the S3 Bucket and DynamoDB table?

Terragrunt will automatically create them if they don’t exist, solving that entire problem.

There is one downside to this though: all the Terraform modules you apply need to define

terraform {
  backend "s3" {}
}

in the module, so that Terragrunt can fill in the details when it’s run.

There’s a Github Issue thread from back in 2017 discussing this — https://github.com/gruntwork-io/terragrunt/issues/230

We now need to tell Terragrunt where to find all those shared variable files we defined.

inputs = merge(
  yamldecode(
    file(find_in_parent_folders("environment.yaml", find_in_parent_folders("environment.yaml")))
  ),
  yamldecode(
    file(find_in_parent_folders("region.yaml", find_in_parent_folders("environment.yaml")))
  ),
  yamldecode(
    file(find_in_parent_folders("app.yaml", find_in_parent_folders("environment.yaml")))
  ),
)

As before, find_in_parent_folders causes Terragrunt to search up the tree from where the module is defined to find the first occurrence of the file.

The second parameter to find_in_parent_folders is a fallback to use if it can’t find the file. Here we fall back to environment.yaml, which we will make sure always exists; this means that if a module isn’t nested deeply enough to have all of app.yaml, region.yaml, and environment.yaml in its parent directories, it doesn’t explode.

merge is the standard Terraform function — https://www.terraform.io/docs/language/functions/merge.html

It means that any variables defined further down the tree override those further up.

One final configuration option and we’re done.

The AWS provider we use needs configuring. We can use Terragrunt’s ability to generate files to place an identical configuration in the working directory before Terraform runs, rather than copy pasting it hundreds of times.

generate "aws_provider" {
  path      = "aws_provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}

This will generate a file called aws_provider.tf, containing our required options, in the working directory every time a command is run.

We can now run Terragrunt commands and build our infrastructure.

There are two options for us to run Terragrunt commands:

  1. Across multiple folders, allowing us to build multiple modules at the same time
  2. In a single module for faster / more targeted applies.

run-all commands

If you want to run commands across multiple modules at once the run-all commands can do that.

For example, if you run terragrunt run-all plan in the dev directory it will run terraform plan in every subdirectory and present you with the plan.

The first time you run commands in a folder it will check if the Terraform state files and lock tables exist and prompt you to create them if not.

Single Module Applies

If you just want to apply changes to a single Terraform module you can cd into that directory and run terragrunt <COMMAND> and it will only apply to the current working directory.

Module Caching

One annoying quirk of Terragrunt is that once a module has been downloaded it won’t refetch it if the source changes.

For example if you have source = "github.com/module?ref=my-branch" and you push new changes to my-branch after having run terragrunt apply, it won’t notice the source has changed and fetch the new changes.

You’ll need to clear the Terragrunt cache with find . -type d -name ".terragrunt-cache" -prune -exec rm -rf {} \;
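
Depending on your Terragrunt version, passing the --terragrunt-source-update flag to your command should also force the source to be re-downloaded, saving you from deleting the cache by hand.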

Summary

We’ve now got a fully working Terragrunt setup. To recap what we’ve gained:

  • Small per module state files that make it faster to run terraform commands and reduce the impact of losing / corrupting a single state file
  • A well defined structure for laying out our infrastructure based on directory structures that mirror how the infrastructure is logically reasoned about
  • The ability to use shared variable files that allows variables to be defined at multiple levels and propagated down to the Terraform modules automatically to easily keep settings consistent at multiple levels, with the ability to override more generic values if required
  • The ability to declare that one module depends on another, creating the modules in the appropriate order and passing output variables between them
  • The inability to create random ad hoc resources outside of a versioned Terraform module, so we reduce the chances of resources / changes not being propagated to other environments
  • The ability to automatically retry Terraform commands if certain errors occur, reducing the impact of flaky / eventually consistent APIs
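
That last point refers to Terragrunt’s retry support. As a minimal sketch (the regex is just an illustrative example), you list error patterns in the root terragrunt.hcl and matching failures are retried:

retryable_errors = [
  # Retry any command whose output matches this regex
  "(?s).*connection reset by peer.*",
]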

It’s not perfect though; some limitations / issues:

  • Every Terraform module you use needs to define a blank backend block, meaning you may have to go modify every module you have, and you can’t use community modules
  • You can’t use Terraform module registry references, losing the ability to specify loosely locked versions and potentially meaning you need to provide Git authentication credentials so Terragrunt can pull modules from Github
  • You and everyone else on the team need to learn about Terragrunt
  • You now have another tool to keep up to date along side Terraform
  • If you haven’t been particularly disciplined about keeping your module input variables named consistently you’ll not get a lot of the benefits of the shared variable files without some refactoring

Example Repo

There’s a lot of code blocks in this post, you can see them all together in an example repo here — https://github.com/AaronKalair/example-terragrunt-repo
