Scaling Terraform at ThousandEyes

--

by Ricard Bejarano, Lead Site Reliability Engineer, Infrastructure at Cisco ThousandEyes

At ThousandEyes, we manage all our cloud infrastructure using Terraform.

Terraform is a tool by HashiCorp that lets you define your infrastructure declaratively as code. It’s the de facto standard for this use case, with broad adoption across cloud providers and a great community behind it.

Over the years, we have pushed Terraform to its limits in all possible directions. And though we’ve had many successes with the tool, here’s an overview of the challenges we’ve faced when scaling Terraform to our needs and how we overcame them.

Episode I: The Plan Time Menace

Every change in Terraform follows the same workflow (a minimal example follows this list):

  1. You change the Terraform code that states “what you want”;
  2. You run `terraform plan`, which does two things:
    a. Performs a “refresh” of the state, calling the providers’ APIs to fetch the latest status of every resource managed by Terraform and writing any updates to the state; and
    b. Compares the freshly refreshed state with the code you fed Terraform and computes the difference between them; the resulting set of changes is stored as the “plan”.
  3. If the plan is coherent with your changes to the Terraform code, you then run `terraform apply` on said plan, which makes the corresponding write calls against the providers’ APIs to create, update, or destroy the resources you requested through code.
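As a minimal illustration (the resource, names, and provider here are hypothetical, not our actual infrastructure), the code below is the “what you want” half, and the commands in the comments drive the refresh/plan/apply loop:

```hcl
# main.tf, the "what you want" half: a single, hypothetical S3 bucket.
resource "aws_s3_bucket" "example" {
  bucket = "example-logs-bucket"
}

# From the project directory:
#   terraform init    # install providers, configure the state backend
#   terraform plan    # refresh the state against the provider, diff it with the code above
#   terraform apply   # perform the resulting create/update/destroy API calls
```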

When done at a low enough scale, this process works just fine. However, when your state contains thousands of resources, plan times start going through the roof. That is because you make thousands of API requests to your providers every time you plan. Add rate limiting to that equation, and you get what we got: plan times of up to 2 hours!

There are two ways out of this: disabling refreshes or fragmentation.

Disabling refreshes is risky because you can give Terraform an outdated picture of reality when computing the plan, which could lead to disastrous consequences. We do not recommend you do that. (As a side note, we are currently exploring how to do this without breaking consistency guarantees, but we haven’t tested it yet.)

That leaves only fragmentation: breaking your Terraform scope down by some dimension (e.g. account, region, environment, tenant… or any combination of them!).

This results in more, smaller Terraform states, which reduces the number of resources refreshed on each plan and, in turn, plan times.
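As an illustration (every name below is hypothetical), fragmentation can be as simple as giving each environment/region combination its own root module and its own state, for example by pointing each one at a different backend key:

```hcl
# prod/us-east-1/backend.tf: one small root module per environment and region.
terraform {
  backend "s3" {
    bucket = "example-terraform-states"
    key    = "prod/us-east-1/terraform.tfstate"
    region = "us-east-1"
  }
}

# qa/eu-west-1/backend.tf: same modules, separate state, refreshed independently.
terraform {
  backend "s3" {
    bucket = "example-terraform-states"
    key    = "qa/eu-west-1/terraform.tfstate"
    region = "eu-west-1"
  }
}
```

Each root module now only refreshes its own slice of resources on every plan.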

We call this “horizontally scaling” Terraform projects, code or state.

However, over time, this leads to problem number two.

Episode II: Attack of the Clones

Horizontally scaling your Terraform code implies code duplication.

Sure, you can abstract as much logic as possible away into modules, but that still leaves things like `terraform` configuration blocks, provider requirements, variable definitions, and the module calls themselves, which you need to manually copy-paste across all your analogous Terraform projects (once for prod, another for QA, one for dev…).

This process is mostly boilerplate: code that’s both duplicate and non-functional.

We call it non-functional because the code doesn’t map to anything in real life. It’s scaffolding for Terraform to work, but it’s got nothing to do with the infrastructure we’re provisioning.
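To make that concrete, this is roughly the scaffolding (hypothetical names and versions) that every analogous project carries, even when all the real logic lives in a shared module:

```hcl
# Repeated verbatim in prod/, qa/, dev/… none of it describes real infrastructure.
terraform {
  required_version = ">= 1.3.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

variable "environment" {
  type        = string
  description = "Environment this project deploys into"
}

# The only part that maps to reality is the module call itself.
module "service" {
  source      = "../modules/service"   # hypothetical shared module
  environment = var.environment
}
```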

Ideally, all your Terraform code should map 1:1 with reality, and Terraform users should only push resources and modules. Everything else should be computed for them.

Episode III: Revenge of the Drift

Furthermore, code duplication leads to an even worse problem: drift.

As we figured out over time, when two or more files need to have the same contents but there’s no mechanism to enforce that, they’ll eventually differ — the law of eventual (in)consistency. An instance of this phenomenon is called “drift.”

Drift is the source of many issues, mainly because it creeps into your code; it doesn’t announce itself, so it’s hard to keep it away.

Drift is particularly awful in Terraform, compared to other tools, because Terraform keeps state. And there are only two ways to fix Terraform state, both of them awful.

Fix number one is to perform what we named “state surgery”: the arduous process of manually reconciling the state with real life through `terraform import` and `terraform state rm` commands. On a thousand-resource Terraform project, this takes a lot of work.
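As a side note, newer Terraform versions can express parts of that surgery declaratively rather than through the CLI; the sketch below uses hypothetical resource addresses and identifiers:

```hcl
# Adopt a real-world resource that is missing from state
# (declarative counterpart of `terraform import`, Terraform >= 1.5).
import {
  to = aws_s3_bucket.logs        # hypothetical resource address
  id = "example-logs-bucket"     # hypothetical real-world identifier
}

# Drop a resource from state without destroying the real thing
# (declarative counterpart of `terraform state rm`, Terraform >= 1.7).
removed {
  from = aws_s3_bucket.old_logs  # hypothetical resource address
  lifecycle {
    destroy = false
  }
}
```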

Fix number two is simply to take whatever plan Terraform produces and run with it. But that’s not an option either when the infrastructure you’re managing supports critical services that must not go offline, as the plan may end up recreating all your resources.

Ultimately, you don’t want to be fixing Terraform drift. You want to prevent drift from ever getting in.

Episode IV: A New Hope

So, you might ask, how did ThousandEyes fix all these problems?

We needed a solution that allowed us to scale our Terraform code horizontally (more and smaller projects to keep plan times low) while preventing drift between them and keeping boilerplate low. And all while keeping the users’ learning curve low so we don’t add more friction to the process of managing infrastructure programmatically.

We first looked at the available solutions out there. Terragrunt, Terraform Workspaces, and CDK for Terraform were at the top of that list.

Terragrunt would’ve required a rewrite of our ~1500 already existing Terraform projects. This was not ideal: we wanted something transparent to our users, as raising the skill requirements could’ve hurt our Terraform adoption.

On top of that, Terragrunt adds even more boilerplate, not less, through its infamous `terragrunt.hcl` files.

Terraform Workspaces did not have those hurdles. However, since it was not designed with our purpose in mind, it would have prevented us from diverging where we needed to. See, not all divergence between analogous environments is drift. Most of it is, but sometimes you want deployments to differ on purpose, and Workspaces’ interpolation mechanism for expressing that could have been more user-friendly.

Finally, CDK for Terraform (CDKTF) seemed interesting, though not for replacing our entire codebase. As I previously noted, we didn’t want to rewrite ThousandEyes’ ~1500 projects in some other syntax/language. What seemed more logical was to write our own interpretation of Terragrunt, one that kept the same promises we had given our users, and to solve our issues using CDK for Terraform. And so we did: our initial proof of concept used CDKTF as its fundamental building block.

It was only a short time until we replaced it, though.

See, CDKTF is simply a way of imperatively defining declarative Terraform code. It allows you to use language-specific objects to define and manage resources that will end up in a JSON file for Terraform to consume declaratively.

The problem is that CDKTF is written in JavaScript, so every time we planned, we’d need Node.js to spin up, transpile our code to JS, execute it, synthesize the JSON, report back to Python, and return. Not only was this process a tax on performance (~15s extra per plan), it also made debugging impossible whenever an exception was thrown: all we’d get were references to lines of code in .js files we didn’t write. (I don’t know if this is still the case or if they’ve added mappings; we haven’t used CDKTF since.)

So, since all CDKTF does is output JSON, we decided to do that ourselves. And we called it Stacks for Terraform!

Stacks is a Terraform code pre-processor. Its operation is remarkably simple, with the initial implementation being less than 200 LOC:

  • Terraform code is written following an opinionated file and directory structure;
  • That code is picked up by Stacks, which transforms it and injects additional code;
  • The result is then picked up by Terraform to plan and apply.

This “opinionated directory structure” includes features like a “base” directory, where you put code that will be instantiated N times, once per “layer.” Users can also define layer-specific code, including overrides of the base wherever needed. Stacks also supports variable overrides at several scopes: global (for all stacks), stack-scoped (for all layers in a single stack), and layer-local.
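The exact conventions are best seen in the example stack linked below, but conceptually a stack’s layout looks something like this (directory and file names here are illustrative, not necessarily Stacks’ real ones):

```
my-stack/
├── base/            # code instantiated once per layer
│   ├── main.tf
│   └── variables.tf
├── prod/            # a "layer": prod-specific code and overrides of the base
│   └── overrides.tf
├── qa/
└── dev/
```

Roughly speaking, Stacks expands base/ into each layer, applies the layer’s overrides, and hands the resulting per-layer projects to Terraform to plan and apply.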

It turned out to be very similar to tools like Kustomize for Kubernetes, which is funny since I hadn’t used Kustomize before designing Stacks!

You can find an example stack here.

Stacks for Terraform enables us to horizontally scale our Terraform projects into smaller ones to keep plan times low, without restricting our ability to deploy a set of resources N times across all our environments. It removes the need to copy-paste that code over and over, and it prevents drift without cutting back on per-layer customizability.

Watch my talk at SREcon for the complete picture.
