Managing a large GitHub Organization with Terraform (Part 2)

Diego Morales
7 min read · May 23, 2019

--

This is the second part of a series focused on the caveats, problems and challenges we face managing a big GitHub organization (~300 people, 50+ teams, 2000+ repositories) with Terraform, and how we are tackling them. Here's our series outline:

Part 1 (link):

  • A very short intro to Terraform’s GitHub provider usage
  • Using and fighting with terraform’s modules for keeping standards
  • Love and hate on terraformed land: dammed list resources

Part 2 (this one):

  • Roadblock: a huge terraform state, and the need to split it
  • Using GitHub PRs and codeowners feature for an approval flow

Part 3:

  • Automating Terraform’s plan and apply on Azure DevOps Pipelines

Part 4:

  • Lessons learned about Terraform and Infrastructure as Code in general

Let’s start part 2.

Roadblock: a huge terraform state, and the need to split it

Terraform tracks the state of the resources it manages. It compares this known state to the resources described in the code and, when it refreshes the state, to the actual configuration of those resources in the provider (like GitHub). Managing state is a big thing in Terraform, and it's stored locally or remotely (in Amazon S3, Azure Blobs, Consul, or others).
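
For reference, a minimal remote backend configuration looks roughly like this. This is only a sketch assuming Azure Blob storage; the resource group, storage account, container and key names are placeholders, not our real setup.

```hcl
# Illustrative only: keep the state in a remote backend instead of on
# developers' machines. All names below are made-up placeholders.
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "orgterraformstate"
    container_name       = "tfstate"
    key                  = "github-org.tfstate"
  }
}
```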

When we started doing this, we had only one state, and it was fine. But remember I said our GitHub is quite big. As we quickly expanded our GitHub codebase, things started to become slow, very slow. Like taking 3+ minutes just to compile and validate code slow. And 30+ minutes to perform a terraform plan (with refresh turned on). This was when we had around 5k resources tracked in the state, with 2000+ repositories and with each team permission or branch protection being a separate resource. I knew that was only the beginning.

It turns out that having such a huge state is a known Terraform anti-pattern. When you have a lot of Terraform code, you are expected to break your configuration into smaller components, each one with a separate state. In this context, what we did was split our state by team (plus one extra state for org membership). In our code layout, each repo belongs to one team, so that distributes our repos into several states. Some are still quite big, some are very small, but each state became much more manageable. But we have 50+ teams, so 50+ states now … life has consequences.
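
To make the split concrete, here is a sketch of what a per-team root configuration could contain: its own backend key, so each team gets an isolated state. The paths, module name and values are illustrative, not our exact layout.

```hcl
# tribes/some-tribe/team1/main.tf (illustrative path and names)
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "orgterraformstate"
    container_name       = "tfstate"
    key                  = "some-tribe-team1.tfstate" # one state file per team
  }
}

# The team's repositories and permissions live in this root module,
# so they are tracked only in this (smaller) state.
module "team1_repos" {
  source = "../../../modules/team-repos" # hypothetical shared module from part 1
  team   = "team1"
  repositories = [
    "service-a",
    "service-b",
  ]
}
```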

The first consequence was having to split the actual state without destroying and recreating every resource already tracked. Luckily the terraform state mv command supports moving a state item to another state file. We had to write a script to generate all the state mv commands, and it took more than an hour to run, but it worked fine (the problem is that the larger the state, the longer each state mv command takes). If I ever need to do something like this again, I will strongly consider manipulating the state files directly, which would be much faster. They are just JSON files.
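
The idea behind the script was roughly the following, shown here as a simplified sketch; the module prefix, resource addresses and file names are illustrative, not our real ones.

```bash
# Simplified sketch: move every resource that belongs to team1 out of the
# old monolithic state into the team's new state file. "terraform state mv"
# accepts -state (source) and -state-out (destination) for exactly this.
terraform state list -state=old-monolithic.tfstate \
  | grep '^module\.team1\.' \
  | while read -r address; do
      terraform state mv \
        -state=old-monolithic.tfstate \
        -state-out=states/team1.tfstate \
        "$address" "$address"
    done
```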

But there were several other consequences. You cannot reference resources in a different state the regular way; you need a data source to read the state of the resource from the provider and then reference the attributes the data source returns. The example from part 1, repeated below, shows the usage of a team data source to fetch the id of a GitHub team defined in another state, and the returned team id being referenced in a team membership resource declaration.

Example from the previous post that also shows data source usage
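
The original snippet was embedded as an image, so here is a rough reconstruction of the pattern; the team slug and username are placeholders.

```hcl
# Illustrative reconstruction: the team below is created and managed in a
# different state; here we only look it up through a data source.
data "github_team" "platform" {
  slug = "platform" # placeholder team slug
}

# The membership references the team id returned by the data source,
# instead of a resource attribute that lives in another state.
resource "github_team_membership" "platform_jane" {
  team_id  = "${data.github_team.platform.id}"
  username = "jane-doe" # placeholder username
  role     = "member"
}
```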

Also, at that time we were already planning the automation pipelines we would use to validate and execute the proposed PRs. But now we were talking about 100+ pipelines, two for each state (one “build” for validate/plan and one “release” for apply). We will talk more about that in the next post.

Now … besides the large-state issues with Terraform itself, there was another major problem brought by our big org: the public GitHub API rate limit, which is 5k calls per user per hour. By default, Terraform performs a refresh on every plan and apply command. It also has a dedicated refresh command.

The terraform refresh command is used to reconcile the state Terraform knows about (via its state file) with the real-world infrastructure. This can be used to detect any drift from the last-known state, and to update the state file.

So refreshing means hitting the provider (GitHub) API for each resource described in the code to check whether its actual state matches the state file's view of it. At some point before the state split, every plan command was hitting GitHub's 5k limit. And even after the split, it didn't take long for a single state (a special-case one) to cross the 5k limit by itself. Now what?

Poking 5k resources on the API every time a small change is made is completely unreasonable. So we had to disable the refresh, passing -refresh=false on the plan command (the apply command receives the saved plan output, so it does not perform a refresh). I am still not sure how much of a good or bad practice this is in the eyes of the Terraform community, but we are surviving well so far. We make zero changes outside of Terraform code (or very close to that, with exceptions only for known cases of resources not yet managed by Terraform), so it is a reasonably safe trade-off. Later we selectively turned it back on for some cases (more about that in the next post, when explaining the pipelines).
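
In practice the commands look roughly like this (the plan file name is illustrative):

```bash
# Plan without refreshing every resource against the GitHub API,
# and save the plan to a file.
terraform plan -refresh=false -out=team1.tfplan

# Apply the saved plan; no refresh is performed at this point either.
terraform apply team1.tfplan
```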

Using GitHub PRs and codeowners feature for an approval flow

Enough talking about Terraform itself. Let's get into the process we have built around it.

To manage an org this large, it's not practical to have a single person or team managing the Terraform GitHub code. So many repos, configs, changes … we would end up filing tickets to that team, turning them into a big bottleneck. So we placed the code in a GitHub repo of its own and started accepting Pull Requests (PRs) to it, from anyone already in the GitHub org.

Since this code grants or revokes permissions on repos, and we wanted tighter control and tracking over those permissions (as explained in part 1), we needed a way to enforce review by a specific set of people, depending on which part of the code is being changed by the PR. That's where GitHub's code owners feature comes into play.

Code Owners allows you to “define individuals or teams that are responsible for code in a repository” (from the docs). It's based on file patterns (like path or extension), and you can set different owners for different paths. You specify it in a file named CODEOWNERS in either the root, docs/ or .github/ folder of the repo. An example:

GitHub CODEOWNERS example for Terraform GitHub management scenario
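
The original example was also an image, so here is a sketch of what such a CODEOWNERS file can look like; the org and team names are placeholders, chosen to match the discussion below.

```
# Default owners: anything not matched by a later rule,
# including this CODEOWNERS file, falls to the top-level approvers.
*                          @our-org/approvers-top-level

# Each tribe folder has its own set of owners.
/tribes/othertribe/        @our-org/approvers-othertribe

# Team folders override the tribe rule (last match wins).
/tribes/othertribe/team1/  @our-org/approvers-team1
```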

With that in place, any change in, say, tribes/othertribe/team1 will automatically add the approvers-team1 GitHub team as a reviewer (you could use any other team name). You can also specify individual users, by @username or by an email address associated with their GitHub account (teams are referenced as @org/team-name). Order is important: the last matching pattern takes precedence, and only the owners in that match are automatically requested for review. And if multiple paths are changed, review is requested from each associated owner set.

You can then (and we did) make those reviews mandatory for merging PRs into the main branch, using branch protection and the “Require review from Code Owners” setting, shown as the last option in the following picture:

GitHub branch protection configuration interface (partial screenshot)
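
Fittingly, that branch protection can itself be described in Terraform. Here is a rough sketch using the github_branch_protection resource as it existed in the provider at the time; the repository name and values are placeholders, not our actual config.

```hcl
# Illustrative: protect the main branch of the Terraform management repo,
# requiring an approving review from the matching code owners before merge.
resource "github_branch_protection" "terraform_repo_master" {
  repository = "github-management" # placeholder repository name
  branch     = "master"

  required_pull_request_reviews {
    require_code_owner_reviews      = true
    required_approving_review_count = 1
  }
}
```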

We organize our Terraform code layout in tribe folders and team subfolders, and have a specific set of owners for each tribe and team. So any change to a team's repos will need the review of that team's owners. If a team is added, the tribe owners need to review and approve. If a tribe is added, it falls to the top-level approvers. And so on. Each set of owners must always have 2 or more people, so they can take vacations, take a day off, get sick, etc.

Note that since we require only one approving review, just one (any) of the owners needs to approve, but it must be one from each involved owner set.

Changes to the CODEOWNERS file itself are also subject to the patterns, so in the example displayed above they will ask for review from the approvers-top-level team. You should ensure that changes to the approver teams themselves are also subject to review by the same set of owners.

Below you can see a picture of the whole approval and execution flow for the Pull Requests made to our Terraform GitHub management repository. Consider this a teaser: we are going to explain our pipeline setup in the next post of this series.

Our GitHub PR’s approval and execution flow

It would be very desirable to detect simple kinds of changes and fast-track them through a more expedited flow. We haven't done that yet. Ideally that would be done by analysing Terraform's plan output file and taking actions based on what's in it, but that file unfortunately has a binary format, so it's not so straightforward.

Conclusion — Part 2

In this post we wrote about the need to break a huge Terraform state into smaller states, and about our PR approval flow using GitHub's code owners feature, with a teaser of the overall approval + execution flow.

This infrastructure-as-code approach to GitHub management led to, and sustained, an impressive 500+ pull requests in 3 months. That was done with a team of usually two people (for quite some time just one, and sometimes three) mostly focused on this subject, including Terraform module development, workflow automation, and working out problems: both the general ones shown above and the smaller problems with the PRs made by several tens of contributors, many of them new to Terraform, or even new to Git.

That would be impossible without automating the PR validations and Terraform's plan and apply steps in a workflow with execution pipelines (built on Azure Pipelines). We will show that in detail in the next post of this series. Stay tuned!

--

Diego Morales

SRE Tech Lead at Stone. Crazy about automation, DevOps, Agile, barbecue and adventure racing.