Advantages and Pitfalls of your Infra-as-Code Repo Strategy

Luis Sousa
Jun 25, 2020 · 9 min read

Following up on the questions I’ve received on my Introduction to Terraform Cloud post, today I want to tackle the never-ending debate around repo structures, some of the common solutions and their inherent problems. For this conversation, I’ll discuss IaC in a mostly Terraform-oriented way, but most of what follows applies equally well to other technologies.

My attempt at a drawing :)

There are 2 main schools of thought when it comes to infra-as-code repository structures:

  1. Mono Repo: One repository to rule them all, containing all your IaC, your modules and any auxiliary automation
  2. Distributed Repos: “self-contained” repositories that hold the components needed for the solution you’re trying to deliver, referring to other repos for reusable components or data variables

Within these 2 categories, there is a wide range of sub-strategies for managing the lifecycle of your IaC across multiple environments. Across my career, I’ve seen both simple and truly horrible implementations, and I hope to discuss some of the reasons why people might have made these choices and what to avoid if possible.

But before we proceed, it’s always helpful to remind people that the best solution is the one that fits your team’s needs and workflow, so take everything here for what it is, an analysis of professional experiences.

Google engineer Jaana Dogan put it best: simple is hard. Any simple solution is hard to arrive at and will require processing tons of information before it can be accomplished. This information can come in the form of requirements, processes, constraints and people’s needs, but it can involve so much more.

With this in mind, let’s break down the requirements we usually have for our infra-as-code repos (this list is by no means exhaustive or in any particular order, but it includes things that I usually look for in my workflows):

  • Ability to reference a common “stack” or “base-config”. This usually happens when you have a Terraform workspace or a separate statefile for your base VPC/network that provisions your subnets and base connectivity. Usually, these come as outputs rather than having to use data resources later in other parts of the code.
  • Ease of promotion of changes from environment A to environment A+1. Being able to quickly compare environments and promote changes safely, whilst still keeping the codebase readable, is a must. In a growing environment, you’ll want to be able to quickly detect “what’s different” or “what’s changed”.
  • Harmony between software products. Nobody uses Terraform in isolation, the same way you don’t use Ansible, Puppet or Kubernetes to run everything in your company. Your repo structure needs to accommodate all the different tools you use and provide engineers with an intuitive, or at least well-documented, walkthrough of how to use it or make changes.
  • Keeping it simple, but not too simple. Referring back to Jaana’s tweet, “simple” ain’t easy, nor in most cases desirable. Over-simplifying or over-optimizing will most likely put you in a position that hinders any meaningful speed. Design to retain the speed of change and ease of testing and modification. Retain your flexibility, ’cause one thing is for sure: new requirements and requests will always come around to blindside you. It’s called maturity for a reason :)
  • Flatten the learning curve. Not everyone will have the same background or level of experience. Design and document for intuitive use and ease of onboarding new engineers. An overly complex solution, “beautiful” as it may be, will suck hours out of your team’s day in onboarding, training and troubleshooting.
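The first requirement above, consuming a base stack’s outputs, is commonly done with a terraform_remote_state data source. Here is a minimal sketch; the backend type, bucket, key and output names are illustrative assumptions, not anything prescribed by this post:

```hcl
# Illustrative: read outputs exported by the base VPC/network stack.
# Backend type, bucket, key and region are assumptions; match your own setup.
data "terraform_remote_state" "base" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state"
    key    = "base-network/terraform.tfstate"
    region = "eu-west-1"
  }
}

resource "aws_instance" "bastion" {
  ami           = var.bastion_ami
  instance_type = "t3.micro"
  # Subnet IDs come straight from the base stack's outputs,
  # so no data lookups are needed in this part of the code.
  subnet_id     = data.terraform_remote_state.base.outputs.public_subnet_ids[0]
}
```

The key point is that the base stack decides what to export as outputs, and every downstream workspace consumes the same names.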

What started as a simple X vs Y question now stands as something more resembling this:

Whatever “lever” you try to pull on will inevitably bring a compromise, for that is engineering at its core.

What not to do

I’ll start this road to a solution by listing out what I think we should avoid doing and why.

It’s tempting to use git submodules or terraform modules nested like babushka dolls, but you can quickly find yourself in a rat’s nest of a situation, trying to figure out the combination of repository/branch/folder that a particular module lives in; and count your blessings if that module doesn’t reference another module that lives somewhere else. A module too far from the codebase is a module that’s rarely updated, and when it is updated it will, at best, slow you down and, at worst, break your whole implementation.

I once saw an implementation that started by symlinking modules into the environment folders they would be run in. If this doesn’t send chills crawling down your spine, then the nightmares of trying to find where a module is referenced, and what your change might impact, should. In the end, said implementation was riddled with provider-module.tf files linked to however many instances of the modules/environments there were. Just use a tool like Terragrunt instead if you value your sanity.
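For reference, a minimal Terragrunt sketch of that idea (one terragrunt.hcl per environment folder, pointing at a pinned module version instead of symlinks) might look like this; the repository URL, module path and inputs are invented for illustration:

```hcl
# dev/ecr/terragrunt.hcl (illustrative)
# Pulls a pinned module version; no symlinks or copied provider files needed.
terraform {
  source = "git::ssh://git@example.com/acme/terraform-modules.git//ecr?ref=v1.3.0"
}

# Inherit remote-state and provider configuration from a parent terragrunt.hcl.
include {
  path = find_in_parent_folders()
}

inputs = {
  environment = "dev"
  name        = "billing-api"
}
```

Finding every consumer of a module then becomes a grep for its source URL rather than a hunt through symlinks.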

Pick a version control strategy and stick with it.

Do you want to use branches to keep your stable and “in-development” modules separate? Go for it! Do you want to use git tags to manage which version of the module is ready to deploy to nonprod? Amazing! Do you want to use folders inside a repo to keep modules organized instead of having a repo for each module? Whatever works for your particular case, as long as it fits your operating model.

But whatever you do, pick one and enshrine it in your development guide. If you want to switch lanes, do it with determination and move from one to another in a decisive fashion. But whatever you do, don’t use multiple strategies and please don’t use different strategies in different areas of your infrastructure covered by the same codebase.

Doing this is a sure-fire way to confuse every new joiner, and makes it virtually certain that someone will destroy or mangle something that’s going to ruin your weekend, day or night. Keep it standard, keep it well documented.

What to do

Whether you’re the platform lead of a branch/channel/orgUnit or the Head of the DevOps/SRE/Platform/Infra/Networks/<insertNameHere> department that caters for a big E-enterprise, your users (read developers, testers, BAs,…) are the ones who will use your platform to provide value for your Customers, and what they want will drive, directly or indirectly, the happiness of your Customers. Listen to them, collect their requirements (“I want to be able to easily tweak this non-breaking param on this environment for perf testing without having to go through 10 steps of hell”), and adapt to or with them. Happy users will, in turn, develop better and make your Customers happier.

If you use git tags for your modules, make sure you use a tagging standard that makes sense. Whenever you see a git tag like 1.0.234 that is full of non-backwards-compatible changes, address it and improve the process.
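As a sketch of what a sane tagging standard can look like, here is a semantic-versioning flow run against a throwaway repo; the module names and messages are invented, and in practice you would tag your real modules repo and push the tags to your remote:

```shell
# Illustrative semver tagging flow, run in a scratch repo.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m "ecr module: initial release"

# Backwards-compatible feature: bump the minor version.
git tag -a v1.3.0 -m "ecr module: add lifecycle policy support"

# Breaking variable rename: bump the major version so consumers opt in explicitly.
git tag -a v2.0.0 -m "ecr module: rename 'name' variable to 'repository_name'"

git tag --list 'v*'
```

Consumers pinned to `?ref=v1.3.0` are then untouched by the breaking v2.0.0 release until they deliberately move their pin.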

Implement controls where they will make the most impact. Keep in mind that you’re working with engineers, who are tinkerers and will always look for an easier and more convenient way to do their jobs; if you put too many roadblocks in front of people, they will find ways around them. Protect the meaningful parts of your infrastructure and add useful reviews and approval gates where you think the decision-making points of promotion should lie.

Are your clone operations taking too long? Do you lose percentages of your day scrolling up and down, or cd-ing in and out of folders in the same repo, searching for what you know is somewhere in there?

If you do, it’s probably time to break apart your repositories and make them more meaningful and easy to use.

There’s no need to create a wrapper Terraform module covering every use-case of an ELB that your company might find or need; that’s what the provider does for you. Code instead for meaningful blocks of your infrastructure that you’ll be repeating across your estate. A few good examples are bastions, definitions of internal and public-facing services, and database configurations, but there are many more.
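To make the contrast concrete, here is a hedged sketch of the difference; every module path, resource and variable name is invented for illustration:

```hcl
# Anti-pattern (illustrative): a wrapper that just re-exposes every load-balancer
# argument adds indirection with no opinion of its own.
#
# module "elb" {
#   source = "./modules/elb-wrapper"
#   # ...every provider argument, passed straight through...
# }

# Better: a module for a repeated slice of your estate, e.g. an internal service
# with its load balancer, security group and DNS record baked in.
module "internal_service" {
  source      = "../modules/internal-service" # path is an assumption
  name        = "billing-api"
  vpc_id      = var.vpc_id
  subnet_ids  = var.private_subnet_ids
  port        = 8443
  environment = var.environment
}
```

The second module has something to say about how an internal service looks in your estate; the first only restates the provider’s documentation.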

A good rule of thumb here is: if you spend more time figuring out what a module does than you’d usually spend on a resource’s page in the Terraform docs, then you probably don’t need that module.

What I like to do

I’ve worked for both Big-E enterprises and smaller/flatter teams and I’m a big proponent of the “it depends” answer. Depending on the size of the problem, the maturity of the teams, their processes and their tooling, I’ll usually go for one of two solutions with a few variations.


The first solution, a two-repo setup, usually works for medium to large enterprises that have different products or functional units, each with their own teams and products to support. In larger companies, this is usually accompanied by a custom assortment of tools and technologies. By keeping it to a two-repo affair with proper versioning and merging permissions, you retain control over promoting changes, whilst still having standard “blocks” that can be reused.

Each “base” stack will then have a vars file or folder representative of the environment being targeted. Since the automation is configured with a pre-defined list of envs, this convention is pretty stable and will allow you to set your pipelines to always target the right environment at the right stage in the lifecycle.
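One way to lay this out, with all names being assumptions, is a tfvars file per environment that the pipeline selects by stage (e.g. `terraform plan -var-file=envs/dev.tfvars`):

```hcl
# envs/dev.tfvars (illustrative)
environment    = "dev"
instance_count = 1
instance_type  = "t3.small"

# envs/prod.tfvars (illustrative) — same variable names, different values,
# which makes "what's different between environments" a one-file diff.
# environment    = "prod"
# instance_count = 3
# instance_type  = "m5.large"
```

Keeping the variable names identical across files is what makes comparing and promoting changes between environments trivial.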

To refer to a specific version of the module, you could then use a code block like this:

module "ecr_<repo_name>" {
  source      = "git::ssh://<repository>/terraform-modules.git//ecr?ref=stable"
  environment = var.environment
  name        = "<repo_name>"
}

This will allow you to iterate on this module safely, whilst still allowing multiple in-flight changes on both the infra and modules repositories. The drawback is that it doubles up on PRs and review sessions: you’ll have to PR your modules to your main branch, and then PR the tag change into your main infra repo. But this is a very sturdy way of working which will protect you against rushed commits.

For smaller units/companies/projects, I’m a fan of the mono-repo approach, with a master and a develop branch splitting the prod from non-prod environments. This approach, coupled with co-located modules, branching permissions and automated applies, provides a quicker turnaround for changes and an easier way of working, at the expense of some segregation of duties.


This approach allows you to streamline the development and promotion process, but relies even more on automation to propagate changes across each “tenant”/“env” combo. With that in mind, peer review and the structure of the modules take on paramount importance. As opposed to the prior solution, the modules here focus more on “functional components” than on abstracting the actual resources: reusing entire stacks rather than load-balancers or security groups.

I prefer this approach if I’m rolling out the same stacks across multiple environments where the only difference is the base network or other minor parameters.
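As an illustrative layout (all folder names are assumptions), such a mono-repo might look like:

```
.
├── modules/              # co-located modules, one per functional stack
│   ├── web-stack/
│   └── data-stack/
├── tenants/
│   ├── tenant-a/         # one folder per tenant, with per-env vars files
│   └── tenant-b/
└── pipelines/            # automated plan/apply per tenant-env combo
```

Here the master and develop branches, not separate folders, split prod from non-prod, so each tenant folder only carries its per-environment variable files.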

But whatever you end up choosing, keep in mind the Do’s and Don’ts that we’ve covered in this post so you can avoid some truly nightmarish refactoring exercises!

I was inspired to write this by the challenge of building a team of platform engineers from the ground up and then having to let it go. These may seem like “common sense” things to do, but when you’re down in the thick of it, it’s nice to have something to aim for. If you’d like to know more about the details of how to set this up, leave a comment or hit me up on Twitter. This is as much for “future me” as it is for the “present you”. If you (dis)agree with me, let me know; I’d love to hear your experiences and how I may have gotten something wrong or right!

Medium's largest active publication, followed by +752K people. Follow to join our community.

Luis Sousa

Written by

DevOps by day, nerd by night — I’m a self-taught cloud platform engineer that loves to tinker with new technologies and build things!

The Startup
