Terraform Patterns, Observed

Part 3: State Misconceptions & Pitfalls

Published in DevOops Discourse · 10 min read · Aug 20, 2023

Note — This is purely my perspective as a practitioner with firsthand visibility into several working solutions in my career as a software consultant. Much of the vocabulary used in this series is of my own imagination and will surely cede to better nomenclature from the community. Moreover, many implementations I have seen in practice include multiple types of patterns discussed in this series.
Note — This commentary is neither definitive nor endorsed and is in no way representative of the views of my employer, Accenture.

Infrastructure as Code as we know it today has three major components that together comprise a known state: the codebase that is applied (or the IaC tool’s interpretation thereof); the resulting live, instantiated resources (known only as of the moment the IaC tool last read them); and the recorded state object, which Terraform stores in a state file (terms used in this article are defined in the appendix below). I refer to this as the “state triplet” and I believe it to be fundamental to how IaC works[1].

In this post, we will begin our discussion of Terraform state by presenting some common misconceptions and typical pitfalls that can arise. We will also establish some terminology used in this and subsequent posts. This post is intended to provide context for the following post which will discuss typical management of Terraform state.

Common Terraform State Misconceptions

While Terraform is wildly popular, I find that it is largely misunderstood, even by most of its proponents. I’ll discuss some misconceptions I’ve encountered more than once.

State Consistency & Drift

Terraform is generally idempotent[2], meaning the same code should produce the same outcome, no matter how many times it is run. Also, Terraform runs when you tell it to. Even if it’s running on a cron, it’ll only check on the schedule, and be accurate to that point in time. It won’t automatically detect changes you make in other contexts such as the platform’s CLI, UI, or API, or applied to a different Terraform state[3].

Thus, whenever updates to the platform are made outside of Terraform (or the Terraform or provider version is not kept sufficiently up to date), there’s a possibility of the dreaded configuration drift. When drift is detected, Terraform will tell you how it will attempt to correct it, which oftentimes just means undoing the detected change. Sometimes, however, the drift forces a resource to be recreated, which means the existing resource is destroyed[4].
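As a hedged illustration of the two outcomes above, the sketch below assumes the Google provider and a hypothetical bucket named example-app-assets. In that provider, an out-of-band edit to a configured label is typically corrected in place, while a changed location forces replacement. The lifecycle setting is an optional guardrail of my own choosing, not something Terraform applies by default.

```hcl
# Hypothetical resource; the names and values are illustrative assumptions.
resource "google_storage_bucket" "app_assets" {
  name     = "example-app-assets"
  location = "US" # immutable: a different value here forces replacement

  labels = {
    team = "platform" # if edited in the console, the next plan proposes reverting it
  }

  lifecycle {
    # Make the plan fail rather than destroy and recreate the bucket.
    prevent_destroy = true
  }
}
```

With prevent_destroy set, a drift that would require recreation surfaces as a plan error that someone has to look at, instead of a silent destroy during an automated apply.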

State Atomicity

Each Terraform state is completely ignorant/agnostic of any other state. Each terraform apply results in exactly one state. Thus, if a terraform apply execution is one-to-one with a pipeline execution (a “job” or “run” or what have you), then every pipeline execution is completely ignorant/agnostic of any other pipeline execution.

Moreover, a Terraform state won’t detect changes you make in other contexts that don’t directly affect the encoded resources. Even changes to similar resources (e.g. a manually created VM that appears beside the encoded VMs in the platform’s user interface) via the platform’s CLI, UI, API, or another Terraform state will go undetected.

Thus, each state tree can be thought of as atomic[5]. Any connections between Terraform states would need to be implemented outside of the confines of the state triplet components (i.e. state file, Terraform code, and instantiated resources).
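One common way to make such a connection explicit is the built-in terraform_remote_state data source, which reads the root-module outputs recorded by another state. The sketch below rests on several assumptions: the bucket, the prefix, and the network_self_link output are all hypothetical, and the other state must actually declare that output for the reference to resolve.

```hcl
# Read the outputs published by a separately applied "network" state.
data "terraform_remote_state" "network" {
  backend = "gcs"

  config = {
    bucket = "example-tfstate"  # hypothetical state bucket
    prefix = "platform/network" # hypothetical prefix of the other state
  }
}

# A resource in this state that depends on a value from the other state.
resource "google_compute_subnetwork" "app" {
  name          = "app-subnet"
  region        = "us-central1"
  ip_cidr_range = "10.10.0.0/24"
  network       = data.terraform_remote_state.network.outputs.network_self_link
}
```

The consuming state still knows nothing about how the network was built; it only sees whatever the other state last wrote as outputs.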

Not for “1-Offs”

Terraform does not lend itself well to operations that must be run once and only once, or those that are expected to be fire-and-forget. The very nature of Terraform is to support the consistency of the instantiated resources that make up your IaC Ecosystem. Many 1-off operations may be more appropriately performed directly with the underlying platform API request or CLI command.

Also, typically, the entire state under execution is checked any time terraform plan is run. Terraform does support a -target flag, which can be used to “apply changes incrementally”[6] and could conceivably be leveraged for “1-off operations” (maybe rolling some key or incrementing a stored object version). However, I would consider this an advanced feature and I would recommend building governance around its use.

Using Terraform for 1-off operations, especially within a larger, idempotently-consistent state, runs the risk of additional operational overhead, both in encoded guardrails (e.g. building a wrapper around the -target flag for additional control over when it is leveraged[7]) and in clunky source control management (e.g. additional local git pull operations to keep the encoded stored object or encryption key versioning consistent, or to read the Terraform state information locally).

Common Pitfalls

There are several things to watch out for when architecting for a healthy, well-shaped state.

Locked State

A state lock is the mechanism that prevents multiple concurrent operations from writing to the same state. Especially in the case of automated builds, it can prevent unintended actions. Unfortunately, it also blocks intentional actions, potentially requiring otherwise automated jobs to be manually restarted. Ideally, the state is sized and organized in such a way that applying a plan completes before a new plan is attempted. Aborting a terraform apply command may leave the state locked, which may require manual intervention (for example, with terraform force-unlock).

Broken or Inconsistent States

Perhaps worse than a locked state is a broken state. A broken state occurs when Terraform discovers the state object does not reflect reality, and either reports a plan that makes a number of unexpected changes or fails altogether with an error message. It can also arise from mismanaged resource and module referencing or from stale Terraform or provider versions. In any case, a broken state causes confusion and impedes further development until the state is reconciled (or the deltas are ignored and simply bashed away upon the next terraform apply). At the extreme (but by no means rare) end, a state requires field triage surgery, with an intrepid professional making momentous excisions to the state tree.
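Not every reconciliation needs manual surgery. As a sketch, assuming Terraform 1.5 or later (where import blocks are available) and hypothetical bucket names, import and moved blocks can record two common fixes in code: adopting a resource that exists on the platform but is missing from the state, and renaming a state node without destroying it.

```hcl
# Adopt a bucket that was created outside this state.
import {
  to = google_storage_bucket.logs
  id = "example-logs" # hypothetical name of the existing bucket
}

resource "google_storage_bucket" "logs" {
  name     = "example-logs"
  location = "US"
}

# Record that a node was renamed, so the plan shows a move
# instead of a destroy followed by a create.
moved {
  from = google_storage_bucket.builds
  to   = google_storage_bucket.artifacts
}

resource "google_storage_bucket" "artifacts" {
  name     = "example-artifacts"
  location = "US"
}
```

Both blocks go through the normal plan and apply cycle, so the fix is reviewable in source control rather than performed by hand against the state file.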

There are possibly uncountable paths to a broken state. Here is a short list of common missteps:

Misusing for_each and count[8]. Use count only to provision a resource/module conditionally. Use for_each sparingly, and never with a generated map whose keys are determined by an ordering/sequencing function, because a change in order then changes the keys and therefore the resource addresses. This will be discussed further in a subsequent post.
ClickOps development undoing changes applied with IaC (and vice versa).
A data resource (perhaps indirectly) references a managed resource elsewhere in the state tree. The non-deterministic data resource is then referenced in another managed resource for a field that forces the replacement of the resource each time terraform apply is executed[9].
Version constraints are introduced and forgotten; meanwhile, the Terraform CLI version is advanced per organizational policy, resulting in a version mismatch.
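The first item in the list above can be sketched as follows. This is a minimal illustration under my own assumptions (hypothetical variable and bucket names, Google provider): count acts only as an on/off switch, while for_each is keyed by stable, human-chosen names rather than by positions in a list.

```hcl
variable "enable_audit_bucket" {
  type    = bool
  default = false
}

variable "team_buckets" {
  type    = set(string)
  default = ["analytics", "billing"]
}

# count used only as a conditional: zero or one instance.
resource "google_storage_bucket" "audit" {
  count    = var.enable_audit_bucket ? 1 : 0
  name     = "example-audit-logs"
  location = "US"
}

# for_each keyed by stable names. The resulting addresses look like
# google_storage_bucket.team["billing"], so removing "analytics"
# does not shift the address of "billing".
resource "google_storage_bucket" "team" {
  for_each = var.team_buckets
  name     = "example-${each.key}"
  location = "US"
}
```

Had the team buckets been created with count over a list, removing the first element would renumber every later instance and produce a plan full of unrelated changes.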

Uneven State Trees

While there is nothing inherently wrong with a state tree that is lopsided or jagged, I would suggest that uneven distributions of resource definitions increase the overall complexity of troubleshooting. The longer the state hierarchy, the more places that may require attention when resolving a bug. Typically, a new variable definition at a leaf node requires analogous definitions at ancestor nodes. This creates a cascading effect up the state hierarchy, oftentimes all the way to the root. Sometimes it’s not possible to avoid some unevenness, but having a high ratio of nesting cardinality to node cardinality along a longest node ancestry (e.g. a giant module with a high degree of child submodules buried somewhere down a module chain) is more typical of haphazard detachment than of mindful abstraction.
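The cascading effect can be seen in a small sketch. The module paths and the bucket_location variable are hypothetical; the three sections would live in three separate files and are shown together here only for brevity.

```hcl
# ./main.tf (root module)
variable "bucket_location" {
  type = string
}

module "platform" {
  source          = "./modules/platform"
  bucket_location = var.bucket_location
}

# ./modules/platform/main.tf (intermediate module)
variable "bucket_location" {
  type = string
}

module "storage" {
  source          = "./storage"
  bucket_location = var.bucket_location
}

# ./modules/platform/storage/main.tf (leaf module holding the resource)
variable "bucket_location" {
  type = string
}

resource "google_storage_bucket" "data" {
  name     = "example-data"
  location = var.bucket_location
}
```

A single new input at the leaf required a declaration and a pass-through at every ancestor. Each additional level of nesting adds another pair.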

A more measurable outcome of an uneven state tree is a severely uneven distribution of activity throughout the tree. Even in well-balanced state trees, some nodes may see more activity than others; this is neither problematic nor especially interesting on its own. However, when a tree has a few resources that change often/regularly with many resources that change rarely/irregularly, it can suggest an unnatural coupling of the comprising resources. Certainly, including unrelated resources into the same state tree can be especially irksome with locked or broken states, when the offending state node is unrelated to the intended resource changes.

Too Many Cooks

For better or worse, multiple teams working in (sometimes loose) coordination on a single IaC Ecosystem is fairly standard. When a single repository is developed by multiple teams, any locked or broken state can very quickly cause friction. Such states in turn become more likely, as multiple teams compound the likelihood of including dissimilar resources in the same state. Extremely distanced teams or team members can magnify any such discomfort, especially with large disparities in solution exposure.

State Bloat

When the state tree grows too large, reading and planning it can consume so much time and compute that terraform plan and terraform apply slow down severely. Even with enough resources, the running time can be prohibitive for a typical development team collaborating to rapidly construct a single state. Suddenly teams are met with artificial barriers to introducing their latest solutions or, far worse, a much-needed security patch. Avoiding state bloat is addressed further in the next post.

Next Time…

We’ll continue our discussion of state, describing how to manage and organize modules for well-balanced state objects.

Appendix — Terms & Definitions

Remote Resource — a resource maintained by a different Terraform state (or by a different mechanism altogether) can be called a remote resource. This must be referenced using a data block.
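For instance, assuming a network named shared-vpc that is managed elsewhere (a hypothetical name), a data block looks it up when the plan runs and a managed resource then refers to it. This is a sketch using the Google provider, not a prescribed layout.

```hcl
# Look up a network that this state does not manage.
data "google_compute_network" "shared" {
  name = "shared-vpc"
}

# A resource managed by this state that attaches to the remote resource.
resource "google_compute_firewall" "allow_ssh" {
  name          = "allow-ssh"
  network       = data.google_compute_network.shared.self_link
  source_ranges = ["10.0.0.0/8"]

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
}
```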

State Tree — The full representation of all items and relationships that Terraform uses to maintain the desired behavior of instantiated platform resources as recorded by a single terraform apply execution. Note that this would be represented as a collection of state objects in the output of a terraform plan or terraform apply.

Backend State — A representation of a state tree that is stored in a uniquely located file. The unique location is determined by a backend selector that can be configured at runtime (i.e. at the execution of a terraform plan). The backend state can be moved or manipulated using Terraform CLI.
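As a sketch of such a selector, a partially configured gcs backend can leave its prefix unset so that each run supplies it. The bucket name and prefix are hypothetical. Note the assumption that the value is passed when the working directory is initialized with terraform init, which a pipeline would run before terraform plan.

```hcl
terraform {
  backend "gcs" {
    bucket = "example-tfstate" # hypothetical state bucket

    # prefix is intentionally omitted here and supplied per run, e.g.:
    #   terraform init -backend-config="prefix=env/dev"
  }
}
```

Pointing the same code at a different prefix yields a different backend state, which is also how a wrong prefix can make a healthy state look broken.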

State Node — A named resource or module located somewhere in the state tree.

State Object — generically, this can be any complete collection of recorded state elements. In Terraform, the most general state object is the state tree itself. For the purpose of this series, whenever we are discussing Terraform, we will use state object to denote any proper subtree of a state tree.

Root Node — The state node at the root of a given state object. In the case of the state tree, the root node is always a module even in the case of a single resource.

Leaf Node — A special state node at the leaf of the state tree. This is always a resource.

State Hierarchy — The chain of state nodes along a single depth-first search from the root to a leaf.

Parent/Child/Sibling Nodes — Describe relative state nodes in a typical tree configuration; however, in regular conversation, this may represent more of a logical relationship than a strict relationship, given how certain Terraform elements are stored in the state[10].

Locked State — Typically, a state that is in the middle of a running terraform apply command. Sometimes, the lock is left behind when an execution is aborted without care.

Broken State — A state tree that is inconsistent with the live, instantiated resources, thereby breaking the most important relationship of the state triplet. This must be reconciled[11] before safe usage can continue.

State Progeny — The collection of nodes in a state object, with the exception of the state object’s root node.

State Ancestry — The chain of parent nodes from the given state node to the state tree’s root node.

Node Cardinality* — For any given state node, this is the number of child nodes.

Root Cardinality* — For any given state object, this is the node cardinality of the root node.

Leaf Cardinality* — For any given leaf node (i.e. resource), the number of state nodes in the state ancestry.

Nesting Cardinality* — For any given state object, this is the largest value of leaf cardinality.

*Note that this is at least 1.

Footnotes

[1] I don’t believe it to be further decomposable, without compromising integrity.

[2] With I’m sure more than a few exceptions that I won’t enumerate. Let’s say that it can generally be considered idempotent unless a command/field/expression is documented otherwise.

[3] Not to suggest one couldn’t develop a solution that would run Terraform in this way, but it certainly doesn’t do so out of the box.

[4] This can even happen just out of poorly written code, or by using unsafe versions of certain API fields.

[5] It is certainly possible to encode a resource in two locations and import the instantiated resources into two distinct Terraform states; however, I seriously challenge the rationale for doing so beyond contrived oddity.

[6] https://developer.hashicorp.com/terraform/tutorials/state/resource-targeting

[7] And what happens when you don’t want the key to be rolled or have no new object to store as a new version?

[8] These two are common enough to be discussed in the HashiCorp documentation (here and here, respectively), but they bear including.

[9] For some resources, a temporary toggle of its existence may be invisible to the system and any users, but it should generally be avoided, nonetheless.

[10] The simplest example is how a for_each’ed block is represented; the state objects in such cases are technically children of a collection element, but we would logically consider them as children of what is actually their grandparent node.

[11] A healthy reconciliation process might be to ignore the inconsistency and blow away whatever changes were introduced by other means; first be sure the inconsistency isn’t due to something else, e.g. referencing the wrong backend prefix.