Terraform Patterns, Observed

Part 4: State Management

Published in DevOops Discourse · 10 min read · Sep 17, 2023

Note — This is purely my perspective as a practitioner with firsthand visibility into several working solutions in my career as a software consultant. Much of the vocabulary used in this series is of my own imagination and will surely cede to better nomenclature from the community. Moreover, many implementations I have seen in practice include multiple types of patterns discussed in this series.
Note — This commentary is neither definitive nor endorsed and is in no way representative of the views of my employer, Accenture.

As presented in the previous post, Infrastructure as Code has 3 major components (which I refer to as the state triplet) that comprise a known state: the codebase, the live, instantiated resources (or state domain), and the recorded state object (stored in a Terraform state file).

In this post, we will begin by establishing the common characteristics and constraints regarding state management, including how to approach the shape of the state tree. Then, we will consider how different approaches to module arrangement (see Part 2) affect the complexity or cardinality of the state. Finally, I will present some additional thoughts for consideration.

State Management

Some of these insights may be quite intuitive, whereas others may seem subtle and perhaps uninteresting. Unfortunately, it tends to be the latter that lie dormant until it is far too late to address them economically.

State and Input

A Terraform state tree is constrained by the active module that it was built from: the active module, and the local and remote modules that it references, determine the shape of the state tree up to, e.g., the branch cardinality of blocks that use the for_each or count meta-arguments, which is set by the inputs. For better or worse, these latter cases are common, and when they are randomly dispersed throughout a state tree, they can make things unwieldy both conceptually and technically when issues arise.
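As a minimal sketch of how input determines branch cardinality, consider the following. The variable name, its keys, and the use of the built-in terraform_data resource (Terraform 1.4 or later) are my own assumptions for illustration, not something taken from a real codebase.

```hcl
# Hypothetical active module: all names and values are illustrative only.
variable "services" {
  description = "Map of service names to settings; its size sets the branch cardinality."
  type        = map(object({ port = number }))
  default = {
    api = { port = 8080 }
    web = { port = 8443 }
  }
}

# One block in the code, but one state entry per key in var.services:
#   terraform_data.service["api"]
#   terraform_data.service["web"]
resource "terraform_data" "service" {
  for_each = var.services
  input    = each.value.port
}

# Echo the stored values back, keyed the same way as the state entries.
output "service_ports" {
  value = { for name, svc in terraform_data.service : name => svc.output }
}
```

Adding a third key to the map adds a third branch to the state tree without any change to the resource block, which is exactly why such blocks deserve deliberate placement.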

In this way, the power of Terraform can be double-edged: abstraction allows for consistency and efficiency, while also inviting complexity, over-engineering, and spaghetti code. And this manifests in the state tree: maybe it’s okay to iterate over 100 configuration objects in a leaf module if one does it one time, each terraform apply, but what if that leaf node is being fed the exact same inputs in some other, non-leaf node? This creates costly redundancy and while it sounds contrived, it is based on a true story, so to speak[1].

Once an appropriate level of abstraction is reached and proven useful for a particular use case, it could be leveraged as an accelerator (sometimes directly) for other such instances of the use case. Although different inputs against the same backend will result in updates to that backend and state domain, a distinct set of inputs (e.g. in a separate .tfvars file) targeting a unique state file backend location will result in a new, full replica of the state domain[2], allowing for solutions to scale out to additional teams and products. Be sure to avoid the dangers of both overfitting and over-engineering. It may be better to fork and shape as necessary (again, targeting a unique state file backend location).
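One way to serve several state files from a single codebase is partial backend configuration, sketched below. The local backend, the file names, and the team variable are assumptions chosen to keep the example free of cloud credentials; a real setup would more likely use a remote backend.

```hcl
# main.tf (hypothetical): one codebase, backend location supplied at init time.
terraform {
  # Partial backend configuration: the "path" argument is intentionally omitted.
  backend "local" {}
}

variable "team" {
  description = "Hypothetical input that distinguishes one replica from another."
  type        = string
}

resource "terraform_data" "example" {
  input = "workload-for-${var.team}"
}

# team-a.tfvars (hypothetical) would contain:
#   team = "team-a"
#
# Commands for one replica, assuming the files above:
#   terraform init -reconfigure -backend-config="path=team-a.tfstate"
#   terraform apply -var-file=team-a.tfvars
#
# Repeating both commands with team-b values creates a second, independent state.
```

The -reconfigure flag tells Terraform to ignore the previously initialized backend instead of offering to migrate the existing state into the new location, which is what keeps the replicas separate.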

State Size and Shape

The blast radius of a broken or locked state is the entire state tree. All resources managed by a single state share a single lifecycle. Consider using a domain-driven approach to chunk states to a reasonable size and level of atomicity.

Big states certainly take longer to traverse than small states[3]. Logically, we also want a high ratio of leaf nodes to state nodes (or e.g. resource blocks to module blocks as they are proportional) so we’re spending memory and state data efficiently on running resources rather than on organizational metadata.

Perhaps unintuitively, this balance can also affect code volume as separate states may introduce data blocks in the case where one state’s resources reference a resource maintained by the other state. These references are very useful to ensure the dependencies exist, but they also duplicate (at least) every relevant code change. When the data references are hot, or change often, this can be cumbersome.
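A cross-state reference might look like the sketch below, which uses the terraform_remote_state data source as one possible form of such a data block. The directory layout, the output name subnet_id, and the terraform_data stand-in for a compute resource are all hypothetical.

```hcl
# compute/main.tf (hypothetical): this state depends on the network state.
data "terraform_remote_state" "network" {
  backend = "local"
  config = {
    # Assumed location of the other state's file.
    path = "../network/terraform.tfstate"
  }
}

# Stand-in for a compute resource that needs a value owned by the network state.
# The network configuration must declare: output "subnet_id" { ... }
resource "terraform_data" "vm" {
  input = {
    subnet_id = data.terraform_remote_state.network.outputs.subnet_id
  }
}
```

Note the duplication: renaming or reshaping the subnet_id output in the network codebase also requires a change here, and that cost grows with how hot the reference is.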

On the other hand, the over-presence of DAGs in multiple places along a state hierarchy implies either a bloated state or too deep a nesting; such a state would be better served by separating the different module references into distinct state trees. The balance between many hot data references and a bloated state is certainly a key (albeit more qualitative) indicator of IaC maintainability.

State Relocation and Propagation

When it is determined that a state must be relocated or segmented into multiple states, there are a number of options, some of which include direct edits to state components such as the state’s backend location, or even the state tree itself. These operations can be destructive (or frustratingly difficult to disentangle) if performed haphazardly, so they should be a late if not last resort. Ideally, one would simply recreate the resource wholesale in the new state and destroy the resource in the old state (possibly in reverse order to prevent specific collisions), rather than manipulating a resource’s remote state directly. However, as this isn’t always acceptable, a serviceable approach is to import the live resource into the new remote state and prune the old state tree using terraform state rm or, in the case of wholesale relocation, to use terraform init -migrate-state[4]. Neither method is atomic[5], which increases the risk of any mistake, so this must be done with extreme caution and care.
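For readers who prefer configuration over CLI commands, newer Terraform releases offer declarative counterparts to terraform import and terraform state rm: import blocks (1.5 and later) and removed blocks (1.7 and later). The sketch below is an alternative to the commands named above, not a description of them, and the AWS bucket, its name, and the directory layout are hypothetical.

```hcl
# --- new-state/main.tf (hypothetical) ---
# Adopt the live resource into the new state on the next apply.
import {
  to = aws_s3_bucket.logs
  id = "example-logs-bucket" # assumed bucket name
}

resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket"
}

# --- old-state/main.tf (hypothetical, separate directory and backend) ---
# The original resource block for aws_s3_bucket.logs is deleted and replaced by:
removed {
  from = aws_s3_bucket.logs

  lifecycle {
    # Forget the resource without destroying the live bucket.
    destroy = false
  }
}
```

Applying the new state first means a failed import leaves the resource still tracked by the old state. The two applies remain separate operations, so the move is still not atomic, and no one should apply the old configuration unchanged in between.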

Module Arrangement and State

Flat

Each individual state tree is small, or at least very short (it’s essentially all leaf nodes). There are likely dozens, if not hundreds, of state trees, possibly with many of them served from the same codebase with unique variable assignments (e.g. by supplying mutually exclusive .tfvars configuration files to the execution of terraform plan as described above). The unique state objects will need to be carefully organized to avoid confusion or state collision.

1-Level Remote Nesting (1LRN)

This typically produces a relatively comfortable state management balance amongst the size and shape of the state tree(s), the time the state is under lock, and the blast radius of any broken state triplet. The biggest risk might be in over-segmenting states to the point of breaking apart logical groupings, unnecessarily injecting an order of operations that Terraform was designed to address in the first place[6]. More on this below.

N-Level Remote Nesting (NLRN)

This also has a high potential to produce a relatively comfortable state management balance, just as in 1LRN. However, whereas 1LRN risks erring on the side of too many small state objects, NLRN might err on the side of state objects that are too large and unwieldy. It is the most challenging to optimize and certainly carries the most overhead. However, as it is essentially the de facto pattern when leveraging many open-source community modules, the overhead may be offloaded. No matter how many remote modules (and development teams thereof) are involved in an active module’s execution, the state object itself should be well understood by the active module’s operators and should be shaped in such a way as to balance breadth, depth, and interdependencies.

1-Level Local Nesting (1LLN)

In theory, because the state tree is shallow[7], the memory Terraform will need to track the state should not become unmanageable, though applying a plan could still be unreasonably time-consuming if an entire IaC Ecosystem is served by one state tree. Also, because these modules are often tightly fit to the single use case for which they were initially developed, a single state object is generally served from the module’s codebase (rather than servicing multiple state objects by e.g. supplying mutually exclusive .tfvars configuration files to the execution of terraform plan).

N-Level Local Nesting (NLLN)

This combines the dangers of both previous approaches. Much like NLRN, this might err on the side of large, unwieldy state trees. It may also result in massive state trees in general if one attempts to manage a single IaC Ecosystem from one tree. In any case, the state tree itself should be well understood by the module’s developers and should also be properly balanced and shaped.

Mixed Approach — Local & Remote Modules

As in Part 2, it is probably most useful to consider the costs and benefits of each individual approach, and to be deliberate (rather than aimless) in targeting a mixed approach. Because this is highly likely to include 3rd-party modules at various depths, consider regularly reviewing the state trees that are created by the active module.

Additional Thoughts

I think it’s fairly obvious that the state itself is the last of the state triplet to be seriously considered. The live, instantiated resources (those comprising the state domain) are certainly the first consideration given that we’re employed to deliver marketable utility. It is natural for software-oriented technologists to next consider the software interface with which we are oftentimes most familiar: code and scripts.

It’s unfortunately not uncommon to get all the way to productionland and find yourself in a mire of sub-optimal/terrible state management and, to quote David Byrne, “you may ask yourself ‘how did I get here?’”. The Terraform state is (thankfully) editable, for the very brave, so this won’t burn down your… never mind, what I mean is you can (probably) get out of whatever jam you get into with your Terraform state. However, it’s certainly not something one wants to have to learn (or shop for) when productionland is already in peril.

State and Order of Operations

It is most likely that the IaC Ecosystem is not a single state tree. It’s equally likely that at least two of the discrete states have some dependency or interdependency. The former is almost unavoidable in large enough IaC Ecosystems and happens e.g. when you have one state tree servicing compute resources and another tree servicing the networking components. So long as the references aren’t too hot, these should be manageable.

The latter is where problems more typically (and persistently/chronically) arise. When two states are interdependent — i.e. either the states’ dependencies are bidirectional[8] or the dependencies create a cycle[9] — a change can require an awkward level of coordination and overhead such as several interstitial, incremental updates to tear down, unclad/unlock[10], or otherwise prepare one or both states for certain, should-be-atomic changes to each respective state[11]. And just to be clear, this is the “happy path”: when such an ecosystem is operating by design.
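To make the bidirectional case concrete, here is a deliberately problematic sketch in which two states read each other’s outputs. The directory names, output names, and terraform_data stand-ins are hypothetical; the shape of the dependency is the point.

```hcl
# --- a/main.tf (hypothetical): identity and bindings ---
data "terraform_remote_state" "b" {
  backend = "local"
  config  = { path = "../b/terraform.tfstate" }
}

output "service_account" {
  value = "svc-a" # assumed value consumed by state B
}

# Stand-in for an optional binding that needs an ID generated by state B.
resource "terraform_data" "binding" {
  input = data.terraform_remote_state.b.outputs.resource_id
}

# --- b/main.tf (hypothetical, separate directory and backend) ---
data "terraform_remote_state" "a" {
  backend = "local"
  config  = { path = "../a/terraform.tfstate" }
}

# Stand-in for a workload that needs the identity produced by state A.
resource "terraform_data" "workload" {
  input = data.terraform_remote_state.a.outputs.service_account
}

output "resource_id" {
  value = terraform_data.workload.id
}
```

Neither configuration can be applied first as written, because each one reads an output that exists only after the other has been applied. Bootstrapping requires temporarily removing the binding from state A, applying A, applying B, and then restoring the binding, which is the kind of interstitial, incremental update described above.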

A better way is to avoid any cyclical/bidirectional dependencies and consider redeveloping any portion of the ecosystem in which such dependencies arise. Remember, while it may feel icky to have to double your resource count (and thus operational expense) for a quarter or play some wacky sliding puzzle game with your networks to keep from running out of RFC 1918 space, this is the flexibility afforded by cloud and infrastructure as code. If the ickiness is truly overwhelming it may be exposing some missing tooling/convention/governance, architectural inefficiencies, or unleveraged mitigation techniques in your ecosystem[12].

ClickOps Driven Development (CODD?)

An emergent phenomenon is what I’ve heard most commonly referred to as “ClickOps Driven Development”[13]: resources are created manually to fully configure and ensure a working solution, and then tagged, named, or labeled to signify readiness for the resource to be incorporated into the IaC Ecosystem as code. While this is wildly inefficient, especially in contrast to well-architected solutions, it can bridge the gap in the case of inexperience, unfamiliarity, or under-coordination (or perhaps “severe asynchronicity”). Perhaps a more economical use case would be to reserve this approach for rapid prototyping or proof of architecture via the ClickOps activities, after which a full IaC Ecosystem architecture can be designed and scaled in Terraform.

Next Time…

We’ll explore logic and expressions in Terraform, considering both the common successful patterns and their merits. As always, we’ll also review anti-patterns and how to avoid them.

Footnotes

[1] It is actually based on countless true stories.

[2] Assuming of course e.g. name or IP collisions are avoided.

[3] Assuming similar resource types are targeted in proportion, equivalent compute and networking capacities are leveraged, and all other things “equal” from an Occam’s Razor perspective.

[4] See this post for more details: Terraform Backend Migration: A Journey Worth Taking.

[5] I.e. it is not possible to do in a single operation, so a failed insertion into the “to tree” won’t un-prune the “from tree”.

[6] A contrived example might be separating the state of a VM from the state of its attached disk, rather than allowing Terraform to manage their interdependence. While operations on the VM might not necessitate an update to the disk, the reverse might not be true; now each change elicits new consideration, possibly requires multiple triggers, and could go wrong if operations are performed out of order. There are probably good reasons to assume this risk, but I would err on the side of expecting them to also be so contrived.

[7] Each resource is 2–3 hops from the root, and any interdependencies (resulting in a graph, rather than a true tree) are in the leaf-most modules.

[8] I.e. where one state includes optional configuration, e.g. IAM/RBAC bindings, that require data generated from the second state, e.g. resource IDs, and that second state depends on data generated from the first state.

[9] Explanation and example left as an exercise to the reader.

[10] I.e. disabling a prevent_destroy or issuing a command to gracefully disengage one or more agents.

[11] I.e. the result of a single successful execution of terraform apply on a valid plan.

[12] In less roundaboutly cordial and more specific terms: if you can’t stand for any part of an internal network to be recreated as part of an IaC Ecosystem redesign, you probably have much bigger risks than even perfectly designed Terraform can mitigate.

[13] A somewhat curious nod to e.g. Test Driven Development.