Terraform Patterns, Observed
Part 2: Module Arrangement
This is purely my perspective as a practitioner with firsthand visibility into several working solutions over my career as a software consultant. Much of the vocabulary used in this series is of my own invention and will surely give way to better nomenclature from the community. Moreover, many implementations I have seen in practice combine several of the patterns discussed in this series.
This commentary is neither definitive nor endorsed and is in no way representative of the views of my employer, Accenture.
Teams and organizations (i.e. software engineering business units within a larger bureaucratic entity, e.g. a corporation) adopt Terraform to provision and maintain their infrastructure. Because Terraform primarily acts as an abstraction layer over the APIs already exposed by the service provider in which the resources will be built, there are few structural standards imposed by the tool itself. Moreover, while there is a recommended standard for individual modules, there doesn’t seem to be a de facto standard or common industry practice for how to stitch them together (or juggle them in parallel) to make a robust end-to-end solution. This leaves it largely up to the team or organization to determine how many repositories and Terraform state trees are needed, and how to arrange the resource and module blocks efficiently within them.
In this post, we’ll continue the analysis of module structure from the preceding article and focus on module relationships and arrangement. We will also describe some technical aspects of various arrangements, when they are appropriate, and the general cardinality of the resulting repositories. We’ll end with a discussion on module repository management and some thoughts on how to select a fitting overall approach.
Terms
Infrastructure as Code (IaC) Ecosystem — The totality of running resources and managed services that are maintained by Terraform for an entire, self-contained organization[1]. A single IaC Ecosystem would include e.g. the networking and access management components of software delivery lifecycle (SDLC) environments (such as development, integration, QA, UAT, performance, pre-production, and production), as well as self-hosted internal services necessary to deliver said software or to provide productivity and business solutions.
Distributed (Infrastructure) Team — Typically configures and supports compute, storage, and app-specific IAM/networking for each SDLC environment in a standard, controlled fashion for customer-facing products. Oftentimes there is one distributed team per product or product domain.
Centralized (Infrastructure) Team — Typically configures and supports the foundational components of an entire IaC Ecosystem, such as baseline networking, IAM controls, and shared platform services. Oftentimes there is one centralized team, or one per domain (e.g. network security, compliance, etc.). Sometimes a centralized team will build inactive remote modules for distributed teams to consume. In certain cases, the centralized team(s) will build everything.
Module Arrangement
Flat
Types of Modules — All modules are exclusively singleton and terminating composite. All modules are active.
Repositories — Dozens to potentially hundreds of repositories across the engineering teams.
Code — Very low module complexity, but a high chance of heavy code repetition. The modules are therefore ideally developed in a way that allows a high degree of reuse through multiple unique configuration files that are individually referenced when a plan is executed; while this technique is not unique to a flat approach, it is one of the best ways available to mitigate repetition. Ideally, any code complexity lives entirely outside of Terraform (in some kind of fancy state management system using an actual programming language).
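A minimal sketch of that reuse technique, assuming an AWS provider; the variables, AMI ID, and var-file names are hypothetical:

# main.tf — a flat, active module; every resource lives at the root
variable "name_prefix"    { type = string }
variable "instance_count" { type = number }
variable "instance_type"  { type = string }

resource "aws_instance" "app" {
  count         = var.instance_count
  ami           = "ami-12345678"            # placeholder AMI ID
  instance_type = var.instance_type
  tags          = { Name = "${var.name_prefix}-${count.index}" }
}

The same code is then instantiated repeatedly by pointing each plan at a different configuration file (and a different state), e.g. terraform plan -var-file=product-a.tfvars versus terraform plan -var-file=product-b.tfvars.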
1-Level Remote Nesting (1LRN)
Types of Modules — Remote modules are exclusively singleton and terminating composite; active modules are primarily abstracted composite (possibly some wrapping modules).
Repositories — Dozens to potentially hundreds of remote modules owned by a central engineering team, all made available for use by distributed engineering teams via remote module source references; while it may violate the standard module structure, in practice several such inactive remote modules may be packaged together in a single bundled repository. Typically one to no more than a handful of active modules owned by the distributed teams (oftentimes these are tightly fit to purpose).
Code — This can result in relatively low module complexity, assuming no modules are black boxes for consuming teams. Again, ideally, any code complexity is either outside of Terraform code or at least contained in the remote modules and minimized to purely input simplification (more on this in a later post). If done thoughtfully, this also allows a great deal of flexibility in module versioning (letting the centralized team that owns the module adopt new features or techniques without compromising existing solutions), albeit perhaps at the expense of the distributed teams consuming the module, who must chase updates or unwittingly miss out.
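A sketch of what an active, abstracted composite module might look like under 1LRN; the registry path, Git source, versions, and variable names are all hypothetical:

# Active module owned by a distributed team; all nested modules are remote
variable "cidr_block" { type = string }

module "network" {
  source  = "app.terraform.io/example-org/network/aws"   # hypothetical private registry path
  version = "~> 2.1"                                      # pinned; updates are adopted deliberately

  cidr_block = var.cidr_block
}

module "service" {
  source  = "git::https://example.com/platform/terraform-aws-service.git?ref=v1.4.0"  # hypothetical Git source
  subnets = module.network.private_subnet_ids
}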
N-Level Remote Nesting (NLRN)
Types of Modules — All valid types are available. Likely to include 3rd-party modules (this may even be a strategic focal point of the architecture).
Repositories — Similar to 1LRN, this may result in many (mostly) remote module repositories owned by a centralized team (or external 3rd parties) and one to a few (mostly) active modules owned by the distributed teams [2].
Code — This has a high potential for overly complex module relationships. Strong interdependencies should be pushed as far toward the root or leaf modules as possible, to maximize traceability across modules/repositories. Also, because of the likelihood of 3rd-party modules, we may have a great deal of code complexity as well.
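What makes the arrangement N-level is that the remote modules themselves pull in further remote modules. A sketch of the inside of such a remote module (the wrapping module name and variables are hypothetical; the registry module shown is a commonly used community module):

# Inside a remote "platform-storage" module, itself consumed remotely by an
# active module — the nesting below is what produces the extra level
variable "bucket_name" { type = string }

module "bucket" {
  source  = "terraform-aws-modules/s3-bucket/aws"   # 3rd-party registry module
  version = "~> 4.0"

  bucket = var.bucket_name
}

output "bucket_arn" {
  value = module.bucket.s3_bucket_arn
}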
1-Level Local Nesting (1LLN)
Types of Modules — Local modules are exclusively singleton and terminating composite; active modules are primarily nesting composite.
Repositories — Possibly only 1 or a few per team. Each repository would fully determine the shape of the state tree. These are likely at least loosely fit to purpose for a use case but may serve a number of similar instantiations.
Code — Again, hopefully, any complexity is either outside of Terraform or at least contained in the local modules and minimized to purely input simplification. If so, there is little chance of high complexity, by design. As complexity is introduced (especially any looping or value generation), it would be advisable to consider moving to one of the other approaches.
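A sketch of an active, nesting composite module under 1LLN; the directory names, variables, and outputs are hypothetical:

# Active module; nested modules live in the same repository and are versioned with it
variable "cidr_block" { type = string }

module "network" {
  source     = "./modules/network"    # local path within this repository
  cidr_block = var.cidr_block
}

module "service" {
  source  = "./modules/service"
  subnets = module.network.private_subnet_ids
}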
N-Level Local Nesting (NLLN)
Types of Modules — This revolves around the use of coiling composite modules. As such, this is likely an anti-pattern[3].
Repositories — Possibly only 1 or a few per team. Again, each repository would fully determine the shape of the state tree.
Code — Very similar to NLRN, this also has a high potential for overly complex module relationships. As this generally suggests a lack of strict governance or foresight, this is likely to invite logic where other mechanisms might be acceptable.
Mixed Approach — Local & Remote Modules
This is some combination of the previous 4 approaches and may be the most common approach in practice, especially in the absence of a strong point of view or enforcement. Because a combination of approaches (and proportions thereof) is technically possible, I believe it is most useful to consider the costs and benefits of each individual approach.
Note — HashiCorp’s Standard Module Structure is a specific example of this: the standard module has locally nested modules, and would itself be referenced as a remote module in some other active module, thereby mixing the local and remote nesting approaches.
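For reference, a standard module laid out roughly per HashiCorp’s recommendation would look something like the following (repository and nested module names are hypothetical), with the root referenced remotely by active modules while its internals nest locally:

terraform-aws-example/       # referenced remotely (registry or Git) by active modules
├── README.md
├── main.tf
├── variables.tf
├── outputs.tf
└── modules/
    ├── networking/          # local modules nested inside the standard module
    └── iam/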
This typically arises organically, rather than deliberately, often in cases where a team relies heavily on one approach (either traditionally or purposefully), but realizes another approach would accelerate solution delivery. While I won’t claim the ends justify the means, I also won’t suggest the result is inherently invalid. I believe there are wholly valid such approaches (most probably resembling some combination of 1LRN and NLLN[4] but with a profile more reminiscent of NLRN[5]). Some example scenarios:
- Scenario 1 — A team that relies almost exclusively on 3rd-party modules but wants to e.g. introduce some default controls the remote module has chosen to remain agnostic of (a common example might be default RBAC or network controls associated with a compute service instance). In that case, they might introduce an interstitial local module to bundle the remote modules together in some standardized way (see the sketch after this list).
- Scenario 2 — Similarly, a team relying on many local modules may realize they can offload some of their maintenance costs by leveraging 3rd-party remote modules for common services.
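A sketch of the interstitial local module from Scenario 1, wrapping a 3rd-party compute module with an organization-default network control; the module path, variables, and baseline rule are hypothetical, and the community module is used purely for illustration:

# ./modules/hardened-vm/main.tf — bundles a 3rd-party module with default controls
variable "name"          { type = string }
variable "instance_type" { type = string }
variable "subnet_id"     { type = string }
variable "vpc_id"        { type = string }

module "vm" {
  source  = "terraform-aws-modules/ec2-instance/aws"   # 3rd-party remote module
  version = "~> 5.0"

  name                   = var.name
  instance_type          = var.instance_type
  subnet_id              = var.subnet_id
  vpc_security_group_ids = [aws_security_group.baseline.id]   # default control attached here
}

resource "aws_security_group" "baseline" {
  name_prefix = "${var.name}-baseline-"
  vpc_id      = var.vpc_id

  # Hypothetical organization standard: HTTPS egress only
  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}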
Certainly, a deliberate approach is best, and some considerations for finding a good fit between local and remote modules are presented below.
Additional Thoughts
Branching Strategies
There is a notion I’ve encountered more than once of viewing Terraform code as if it were a scripted or object-oriented language and attempting to develop a branching strategy that takes a single codebase through different software lifecycle environments, eventually releasing it to production. But remember: Terraform just sits on top of the service provider from which you consume. You are a customer, an end user. This is all “production” from your service provider’s perspective, and it should all be considered “production” for you.
Unless you are literally building a completely isolated replica ecosystem or sufficiently complete “IaC test harness” to “test” the results of applying your Terraform before introducing it into your “production ecosystem”[6], there is no “development branch” for Terraform[7].
“But Robbie, what if I want to build my ‘DEV’ environment from a ‘development’ branch, my ‘QA’ from a ‘test’ branch, and a ‘PROD’ environment from a ‘production’ branch of the same IaC repo?”
Well, does your DEV environment look and behave exactly like your PROD environment, with the exact same profile of compute and storage resources, the same network performance constraints (and provisioned tier, where available), and an equivalent security boundary protection solution? And where does your environment-specific configuration go? Is it kept on every branch, or is it externalized and fetched (hopefully from another repo, so that changes can be code reviewed)? I don’t claim that such a Branch-per-Environment pattern isn’t possible (for every obstacle I can conjure, I can also conceive a solution), but it all starts getting more and more complicated for less and less material return on your engineering investment.
That being said, what I have seen work really well might be referred to as the Versioned Environment pattern. This uses a set of 3rd-party modules that are versioned (ideally using something like SemVer, rather than standing, environment-oriented branches), together with environment-specific active modules. The active modules then follow a single-branch release model in which all changes are governed by pull requests to a centralized “live” branch that is subsequently applied[8]. This allows highly controlled environments (a sketch follows the list below) without having to:
- Keep multiple branches on the same repository pristine to service multiple distinct Terraform states
- Include additional extraneous-to-the-environment configuration files (.tfvars)
- Fetch uniquely-necessary-for-the-environment configuration files from remote locations[9]
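A sketch of an environment-specific active module under the Versioned Environment pattern; the directory layout, registry path, version, and variable are hypothetical:

# envs/prod/main.tf — prod's active module; DEV and QA mirror this layout with
# their own backend and their own (possibly newer) pinned version
module "platform" {
  source  = "app.terraform.io/example-org/platform/aws"   # hypothetical versioned remote module
  version = "1.7.2"                                        # bumped only via PR to the "live" branch

  environment = "prod"
}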
Another successful pattern I’ve seen in this vein leans heavily on the automation platform, where pipelines are built such that they can be supplied certain information at runtime, such as the backend state location and identifier and the repository and branch from which they pull code. This might even simplify/enable the Branch-per-Environment pattern as described above, or be used to expand the Versioned Environment pattern. An example enhancement could be implementing special pipeline controls that would only allow plans from a specified branch to be applied, allowing other branches to run plan pipelines to understand the results of their work in progress against the Terraform state[10].
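A minimal sketch of the runtime-parameterized approach, assuming an S3 backend and a pipeline that supplies the state location and source branch as variables (all names are hypothetical):

# backend.tf — the backend block is left partially configured on purpose
terraform {
  backend "s3" {}
}

# Pipeline steps (illustrative) — state location and branch are runtime parameters:
#   git checkout "${SOURCE_BRANCH}"
#   terraform init \
#     -backend-config="bucket=${STATE_BUCKET}" \
#     -backend-config="key=${STATE_KEY}" \
#     -backend-config="region=${STATE_REGION}"
#   terraform plan -out=tfplan
#   terraform apply tfplan    # gated so it only runs when SOURCE_BRANCH is the approved branch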
Remote vs. Local
I would expect a perfectly executed approach with fully remote modules to win out over a perfectly executed approach with fully local modules. Remote modules allow for much greater control over access, versioning, and abstraction, and therefore better adhere to common software development practices and principles (e.g. DRY, PoLP, SOP). Unfortunately, remote (especially 3rd-party) modules can die on the double-edged sword of SRP/PoLP if the turnaround time to make updates to modules becomes a bottleneck (or the consuming team will simply replicate the insufficient module, thus again duplicating work). To mitigate this, it’s common and encouraged to take an OSS approach to module development, allowing consuming teams to fork the repository and submit issues and pull requests.
It also takes a great deal of discipline to keep the remote module references up-to-date and fully consistent. Frankly, there are many common behaviors that introduce drag in a remote module approach — poor/stale documentation; inconsistent versioning schema; no module deprecation process; a laissez-faire approach to module distribution and ownership[11] — for which the discipline of the engineering teams may be the only remedy.
In the end, I would generally recommend using a combination of both rather than trying to adhere to a purely remote or purely local approach (which again is in line with the recommendation to implement HashiCorp’s Standard Module Structure). However, even the standard module structure is still very flexible and cannot prevent painting oneself into a corner or creating cruft and cringeworthy code.
The question then becomes when to use a local and when to use a remote module. In a perfect world, where timelines and budgets are infinite, I would say that no active modules (those with .tfvars files and whose root is the target of a terraform plan) should have local modules, and no “standard modules” (those with local modules) should be active. Also, I don’t generally believe any local modules should be considered “private”.
Code Complexity
So far we’ve focused on complexity from a module hierarchy perspective, only briefly referencing the complexity of the code itself (actual expressions, constructs, and logic) and how it might be avoided or where it might be sequestered. Ideally, any module logic should be on the order of, e.g., ingesting a map with a few highly varied key/value pairs and formulating a local map with possibly many more, much less varied key/value pairs for use in a resource definition’s for_each argument. More on code complexity in a subsequent post.
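A sketch of roughly that order of logic, expanding a small, highly varied input map into a uniform map for a for_each (the variable names and CIDR math are hypothetical):

variable "vpc_id" { type = string }

variable "teams" {
  # Few entries, highly varied values
  type = map(object({
    environments = list(string)
    cidr_block   = string
  }))
}

locals {
  # Many more entries, much less varied values: one per team/environment pair
  subnets = merge([
    for team, cfg in var.teams : {
      for env in cfg.environments :
      "${team}-${env}" => {
        cidr_block = cidrsubnet(cfg.cidr_block, 4, index(cfg.environments, env))
      }
    }
  ]...)
}

resource "aws_subnet" "this" {
  for_each   = local.subnets
  vpc_id     = var.vpc_id
  cidr_block = each.value.cidr_block
  tags       = { Name = each.key }
}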
Next Time…
We’ll explore state in Terraform, and discuss how different approaches affect state management. We’ll review common obstacles such as locked and broken states, as well as behaviors and emergent (anti-?) patterns in development related to maintaining a consistent “state triplet”.
Footnotes
[1] There may be more than one under a given corporate entity for a variety of purposes; for the sake of this post, we’ll consider a single IaC Ecosystem per organization (maybe per service provider) for the purposes of delivering the organization’s digital products and services.
[2] It’s neither impossible nor uncommon for a centralized team that primarily develops inactive remote modules to also own active modules or for a distributed team to own inactive remote modules; in some cases, it may even be recommended or necessary.
[3] Although it may be appropriate for sufficiently small IaC ecosystem footprints or teams in an early stage of IaC maturity. I may discuss this in greater detail in a subsequent post.
[4] E.g. some fancy, custom modules, nested locally, with the lowest, most generic modules hosted remotely, or vice versa.
[5] The costs and benefits are very similar, as are the types of organizations most likely to take on the approach. Indeed, this is likely to be a more common (and comfortable) alternative to NLRN in practice.
[6] Which, spoiler alert, may not even work in many service providers if Terraform is issuing perfectly equivalent IDs for everything (rather than varying them, e.g. with a pseudorandom suffix, or having them generated by the service provider), as some resources have universally unique IDs that remain reserved even days after a teardown command is issued.
[7] Develop works in progress on feature/bugfix/spike branches? Yes, please! Leverage PRs and I beg you, peer review your code!
[8] Ideally, this is enforced through branching permissions and “hardcoded” into e.g. Groovy code; for convenience, I would suggest providing some way for changes to the development environment to be at least plan-able from e.g. a feature branch. In the end, discipline and vigilance may win out, tactically.
[9] This may actually be a requirement for security purposes, in which case, let that be the justification for fetching these remote configuration files, rather than some fetish with single repository solutions.
[10] Note, the state will not be checked against what’s live; the plan will only be compared against the state tree found at the supplied backend. Note also that a plan will lock the supplied backend, so this shouldn’t be done willy-nilly; because of this, I recommend that such a “plan only” pipeline share a run ancestry with the “plan and apply” pipeline, if possible.
[11] While an OSS approach is a good way to keep things moving in a highly distributed engineering environment, this isn’t open-source software; this is your meal ticket. Make sure your consumers are updating their configurations in a reasonable fashion.