Terraform at LumApps: Part 2

Managing terragrunt.hcl files at scale

Published in

LumApps Experts

9 min read · Dec 19, 2023

In the first article, we talked about who we are, our goal for the LumApps platform, and how we organized and managed the infrastructure.

Timeline from 2018 to 2020 of the self-service deployment at LumApps

In the summer of ’20, we were very satisfied with the state of LumApps’ infrastructure. However, we noticed we were spending more and more time doing repetitive tasks on behalf of the developers. As a consequence, we were slowing down every team each time they needed a change.

We started thinking about how we could provide the developers with self-service on their service’s infrastructure.

In LumApps’ context, “self-service” means the ability for developers to deploy, configure and maintain their services and their infrastructure by themselves using an abstraction layer.

In the words of Platform’s Director Jan Villeminot: “automation alone allows the Platform team to handle high-value tasks, but self-service will increase developers’ velocity and agency on the infrastructure”.

History

GitOps, and then self-service, date back to 2018 at LumApps.

At first it was a mix of self-service and GitOps. Our first guinea pig was the DNS repository. This repository was using Terraform and CircleCI to manage our DNS zones and records. Making a change meant opening a PR and checking CircleCI’s output. Upon approval and merge, CircleCI applied the change.

In 2019, we went on to GitOps two other repositories, our api-gateway and our ArgoCD repositories, because we (the Platform domain) were slowing down the development teams. Both repositories had huge amounts of repetitive and complex code. Our goal was then to provide some kind of abstraction layer (in the form of easier YAML code blocks) so that developers could configure their applications’ api-gateway and deployments themselves.

Then we tackled GitHub management: one of our repositories hosts all the resources needed for Terraform to manage all our GitHub objects (organization, teams and sub-teams, repositories and their configuration, branch protections, etc.). Every developer was able to open pull requests, but the Platform domain was responsible for applying them. To keep track of what Terraform did, we started applying the requested changes with CircleCI instead of our laptops. We were still impacting velocity (more on that later) but at least we gained some safety.

In 2021, we found out about Atlantis. Atlantis can be plugged into a GitHub repository using webhooks. It listens to comments on GitHub PRs, plans and applies Terraform resources, and writes its output back as a PR comment.

We used the repository holding our GitHub resources to build our first proof of concept with Atlantis. It was a huge success for both the Platform domain and the development teams!

A developer is now able to open a PR and handle its lifecycle (except for the approval, of course) all by themselves:

I have a new teammate? I can add them to our organization.
I need my team to access another repository? I can grant additional privileges to my team.
I want to make some CI checks required? (You guessed it)
…

The Platform domain can easily review proposed changes along with Terraform’s plan output directly inside the PR. We don’t have to switch context to apply a change and we don’t impact velocity anymore.

What did we learn?

Through this GitOps/self-service journey, we learnt a few things that proved essential to the success of our project of providing infrastructure as a self-service function:

Making developers write Terraform code isn’t possible. Learning a new syntax for the sole purpose of a small and infrequent change isn’t a good thing;
Building opinionated infrastructure and tooling works well;
We can provide an abstraction layer over infrastructure-related code but it must be both simple and future-proof;
Relying on modified files in a PR isn’t enough to understand the entirety of the proposed change. Terraform’s output must appear in the pull request.

Aligning the stars

Early on, we had defined our requirements for this project. We needed (in no particular order):

1. A tool to automate the infrastructure deployment
2. A way to handle secrets securely
3. A way to hide sensitive information in the plan
4. To simplify our Terraform code
5. A tool to ease the configuration
6. A prototype to use as a sandbox

Requirements 1 and 6 have been covered earlier in this article.

Requirement 4 was the hardest to meet. We did that by being tremendously smart 😜.

The Infra team finding solutions

Let’s focus on the other points.

A way to handle secrets securely

We don’t want to store our secrets in Git (even encrypted) but we need a way to have them stored safely, versioned, and audited. The best way for us to do this is with HashiCorp Vault.

A schema showing Terraform reading secrets from Hashicorp Vault and writing them into Kubernetes Secrets.

Terraform reads the necessary secrets from one of our Vault clusters using a generic Vault data source and puts them inside each service’s Kubernetes secret.

Any developer can manipulate a map named additional_secret_entries to select which keys should be injected from the Vault cluster into the service’s secret:

additional_secret_entries = {
  "foo/bar" = [
    "baz",
  ]
}

In this case, it means the key named baz and its value will be pulled from the Vault cluster at the foo/bar path and stored inside the service’s Kubernetes secret.

The corresponding code for this behavior looks like this:

data "vault_generic_secret" "additional_secrets" {
  for_each = toset(keys(var.additional_secret_entries))
  path     = each.value
}

locals {
  # - We iterate on the path provided by the devs in the inputs
  # { path = [secret_key, secret_key2], path2 = [secret_key3, secret_key4] }
  # - We generate a list of maps with secrets contained in secret_keyN
  # - `...` is the spread operator. It transforms func([ a, b, c ]...) into func(a,b,c)
  additional_secrets = merge([
    for path in toset(keys(var.additional_secret_entries)) : {
      for key in var.additional_secret_entries[path] :
      key => data.vault_generic_secret.additional_secrets[path].data[key]
    }
  ]...)
}

resource "kubernetes_secret" "service_secret" {
  metadata {
    name      = "${local.service_name}-secrets"
    namespace = module.namespace.namespace_name
  }

  data = merge([local.additional_secrets]...)
}
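
To make the `merge([...]...)` expression easier to follow, here is a minimal Python sketch of the same flattening, outside Terraform. The function name `flatten_secrets` and the `vault_data` dictionary (a stand-in for what the Vault data source would return) are hypothetical and only serve the illustration.

```python
def flatten_secrets(entries, vault_data):
    """Mimic the Terraform `merge([...]...)` expression in plain Python.

    entries    -- {vault_path: [key, ...]}, like additional_secret_entries
    vault_data -- {vault_path: {key: value}}, a stand-in for the Vault reads
    """
    flat = {}
    # Terraform iterates a set of strings in lexical order, hence sorted().
    for path in sorted(entries):
        # One small map per path, like the inner `for key in ...` expression.
        per_path = {key: vault_data[path][key] for key in entries[path]}
        # merge() lets later maps win when the same key appears twice.
        flat.update(per_path)
    return flat


entries = {"foo/bar": ["baz"], "some/secret/dev": ["key"]}
vault_data = {
    "foo/bar": {"baz": "s3cr3t", "unused": "ignored"},
    "some/secret/dev": {"key": "value"},
}
print(flatten_secrets(entries, vault_data))
```

Only the requested keys end up in the flat map. If two paths expose the same key name, the one from the path that sorts last wins, so key names are best kept unique across paths.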

A way to hide sensitive information in the plan

In earlier 0.x versions of Terraform, there was no way to explicitly tell Terraform that a given string was a secret. It was entirely up to the provider to decide whether a resource’s attribute was always sensitive or never sensitive.

Jake didn’t like old Terraform versions

Fortunately, during our quest for a solution, the IaC gods heard our prayers (maybe sacrificing a few clusters made them listen) and provided us with the 0.15 release of Terraform!

This release brings the sensitive() function (and its friend nonsensitive()). That way, we can be sure that no element of a secret will be shown in a plan!

There is a little issue though… Since Terraform won’t show the content of a secret, there is no way to determine which key will be modified when reading the plan. This is intended behavior, but it is quite scary.

To work around this, we used a little trick from our book: each value in the Kubernetes secret is hashed and stored in a ConfigMap. Terraform then explicitly marks the ConfigMap’s content as non-sensitive.

This allows us to understand which key from the secret will be modified without showing the sensitive value in the plan!
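
The idea behind the trick can be sketched in a few lines of Python: comparing per-key digests is enough to tell which keys change, without ever displaying a value. This is an illustration only. The article does not say which hash function is used, so SHA-256 is an assumption here, and the names `hash_values` and `changed_keys` are hypothetical.

```python
import hashlib


def hash_values(secret):
    """Return {key: SHA-256 hex digest of the value} for a secret's entries."""
    return {
        key: hashlib.sha256(value.encode("utf-8")).hexdigest()
        for key, value in secret.items()
    }


def changed_keys(old_secret, new_secret):
    """List keys added, removed or modified, comparing digests only."""
    old, new = hash_values(old_secret), hash_values(new_secret)
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


before = {"baz": "s3cr3t", "key": "value"}
after = {"baz": "rotated", "key": "value"}
print(changed_keys(before, after))  # ['baz']
```

In a plan, the same effect comes from the diff on the non-sensitive digests: a reviewer sees that the digest for baz changes while the value itself stays hidden.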

A tool to ease the configuration

In our first article, Barney didn’t like editing 750+ terragrunt.hcl files and neither did we.

As a reminder, our infrastructure (and its Terraform representation) is organized around cells and subfolders. Every cell contains the same set of subfolders as the others: one subfolder for our Kubernetes cluster, one subfolder for our Elasticsearch cluster, one subfolder for the in-house service X, etc.

It means a few things regarding the scale of our repository:

A huge part of every terragrunt.hcl is identical to the 750+ others
A developer must edit around 15 (one per cell) terragrunt.hcl files with almost the same content when they want to change something for their service
We have no easy way to manage all these files

We found no way to resolve all these pain points.

So we built the way.

Introducing… infracli!

The Infra team after building their dream CLI

Infracli

Infracli is both a library and a CLI written in Python.

Its purpose is to ease the management of common (or repetitive) tasks on our repository, for both developers and the Platform domain.

Features

Plan and apply a subfolder on all cells at once
Create a new cell, update an existing cell’s file
Connect to a GCP/Azure-managed PostgreSQL database
Manipulate secrets inside our Vault clusters
Create a new service, update an existing service
Bump the Terragrunt module’s version of a service
Run migrations on a service’s Terraform’s state or on its inputs

Manipulating secrets inside our Vault clusters will likely be the subject of another article and will not be addressed now. Instead, we will focus on service management.

The three main subcommands for this usage are:

infracli service create
infracli service update
infracli service bump

Our requirements

Here is what almost all terragrunt.hcl files look like:

locals {
  ...
}

include {
  path = find_in_parent_folders()
}

dependency "kubernetes" {
  config_path = "${get_parent_terragrunt_dir()}/kubernetes"
}

dependency "network" {
  config_path = "${get_parent_terragrunt_dir()}/network"
}

...

terraform {
  source = "vcs.com:repo//svc-go?ref=v1.18.0"
}

inputs = {
  ...
  network_name = dependency.network.outputs.vpc.network_name
  additional_secret_entries = {
    "foo/bar" = [
      "baz",
    ]
  }
  ...
}

Each file is made of:

A set of locals
An include block
Optionally some dependency blocks
A terraform block
A set of inputs

These elements may be identical or may vary a little (depending on the environment or the cell, for example), but differences are minimal.

Knowing this, we need a way to handle a generic file content and the ability to override some parts of it.

How does infracli work?

Besides infracli itself, we added two important concepts: a per-Terragrunt-module template and a hierarchy of inputs.

Terragrunt.hcl template

Every Terragrunt module contains a file named terragrunt.hcl.tpl, which holds all the static code common to every subfolder instantiating said module. Basically, it contains everything the final terragrunt.hcl will have except the inputs.

Overridable inputs

We needed a solution to append per-service, per-cell, per-environment inputs to this static file, without repeating every input.

At the root of our repository, we added an inputs/ directory. Each service has its inputs defined in a dedicated folder. We require at least a file for Google and a file for Microsoft cells.

inputs/
  ...
  svc_foo/
  ├── go/
  │   └── default.hcl
  └── ms/
      ├── default.hcl
      └── dev/
          ├── default.hcl
          └── dev-ms-cell-001.hcl
  ...

If needed, one can add or override inputs per-environment (for example dev/default.hcl) or even per-cell (dev/dev-ms-cell-001.hcl).
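
As a rough illustration of this hierarchy, the following Python sketch lists the input files that could apply to one cell, from the most general to the most specific. The function name `candidate_input_files` and the exact lookup order are assumptions based on the directory layout above, not infracli's actual code.

```python
from pathlib import PurePosixPath


def candidate_input_files(service, provider, environment, cell):
    """Return the input files for one cell, from least to most specific."""
    base = PurePosixPath("inputs") / service / provider
    return [
        base / "default.hcl",                # all cells of this cloud provider
        base / environment / "default.hcl",  # all cells of this environment
        base / environment / f"{cell}.hcl",  # this cell only
    ]


for path in candidate_input_files("svc_foo", "ms", "dev", "dev-ms-cell-001"):
    print(path)
```

Under these assumptions, a real implementation would skip the files that do not exist, since only the provider-level default.hcl is required.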

With these two concepts (and infracli gluing them together) it became easy to modify every terragrunt.hcl. Whether the user wants to edit 750+ files at once or only a specific one for their service, the process is the same!

Now, let’s demonstrate infracli in action with a fake service.

Captain Holt is happy to see an example

Example

Let’s say we have a service named foo relying on the svc-go Terragrunt module.

The template

We define the template inside the Terragrunt module (vcs.com:repo/svc-go/infracli/terragrunt.hcl.tpl). It will be used by every service instantiating the module:

locals {
  ...
}

include {
  path = find_in_parent_folders()
}

dependency "kubernetes" {
  config_path = "${get_parent_terragrunt_dir()}/kubernetes"
}

dependency "network" {
  config_path = "${get_parent_terragrunt_dir()}/network"
}

...

The inputs

Then we need to define the inputs. As LumApps is multi-cloud, we need to provide at least one file for Google and one file for Microsoft cells (see the directory structure above).

To keep our example short, we are going to define two files: one for all Google cells, and one for a specific Google cell, which extends and overrides the first file.

The first one applies to all Google cells: inputs/svc_foo/go/default.hcl

terraform {
  source = "vcs.com:repo//svc-go?ref=v1.18.0"
}

inputs = {
  ...
  network_name = dependency.network.outputs.vpc.network_name
  additional_secret_entries = {
    "foo/bar" = [
      "baz",
    ]
  }
  ...
}

Then we define a file specifically for a given cell: inputs/svc_foo/go/dev/dev-go-cell-001.hcl

inputs = {
  foo = "bar"
  additional_secret_entries = {
    "some/secret/${var.environment}" = [
      "key",
    ]
  }
}

The run

Finally, we run the command infracli service update svc_foo. For every cell, it merges all the inputs (more specific files overwrite more general ones) and appends the resulting content to the content of the template. It then writes the result to the cell’s terragrunt.hcl file.
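
The merge-and-append step can be sketched in Python as follows. This is a simplified illustration, not infracli's implementation: the names `merge_inputs`, `to_hcl` and `render` are hypothetical, the inputs are assumed to be already parsed into dictionaries, the merge is assumed to be shallow (a whole top-level key is replaced), and only string, list and map literals are rendered (no expressions such as dependency references, and no terraform block).

```python
def merge_inputs(layers):
    """Merge input layers; later (more specific) layers overwrite earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)  # shallow: a whole top-level key is replaced
    return merged


def to_hcl(value, indent=1):
    """Render a str, list or dict as an HCL literal (strings at the leaves)."""
    pad, close = "  " * indent, "  " * (indent - 1)
    if isinstance(value, dict):
        body = "".join(
            f'{pad}"{k}" = {to_hcl(v, indent + 1)}\n' for k, v in value.items()
        )
        return "{\n" + body + close + "}"
    if isinstance(value, list):
        body = "".join(f"{pad}{to_hcl(v, indent + 1)},\n" for v in value)
        return "[\n" + body + close + "]"
    return f'"{value}"'


def render(template, inputs):
    """Append the merged inputs block to the static template content."""
    return template.rstrip("\n") + "\n\ninputs = " + to_hcl(inputs) + "\n"


default = {"foo": "default", "additional_secret_entries": {"foo/bar": ["baz"]}}
cell = {"foo": "bar"}
template = "include {\n  path = find_in_parent_folders()\n}\n"
print(render(template, merge_inputs([default, cell])))
```

Whether nested maps such as additional_secret_entries should be replaced as a whole or merged key by key is a design choice the article does not detail. The shallow variant shown here is the simpler of the two.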

We are closing the second part of this series on this. As of now, every LumApps developer can alter their service’s infrastructure by opening a PR which modifies a few files. This saves a tremendous amount of time for both the developers and the Infra team.