Stop being selfish! — Open up Terraform to your team with Atlantis

Florian Dambrine
GumGum Tech Blog
Published in
8 min read · Jun 25, 2020

What is Terraform?

Terraform is a powerful tool from HashiCorp that allows you to write infrastructure as code in a declarative way to provision and manage any cloud, infrastructure, or service.

Terraform Workflow (Source: https://www.terraform.io/)

The terraform plan command (whose output is similar to a git diff) is used to create an execution plan. Terraform performs a refresh, and then determines what actions are necessary to achieve the desired state specified in the source code.

Output generated by the terraform plan command (Source: https://github.com/dmlittle/scenery)

Once the execution plan looks good, the terraform apply command is used to apply the changes required to reach the desired state described in the configuration, according to the plan.
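In practice, this feedback loop boils down to a handful of commands; here is a minimal sketch (the plan file name is arbitrary):

```shell
terraform init              # Download providers and modules
terraform plan -out=tfplan  # Preview the changes and save the execution plan
terraform apply tfplan      # Apply exactly the plan that was reviewed
```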

Ops engineers are usually responsible for operating such infrastructure software, but what if we could change that and unblock developers to build even faster?

Infrastructure ownership?

DevOps engineers are often seen as “Gods” due to the high set of privileges they need to operate a cloud environment. Tools like Terraform often require this level of privilege, especially if the goal is to capture 100% of resources as code (IAM, EC2, RDS, R53, …).

As powerful as it can be, Terraform can also be heavily destructive if not used wisely. This is why, most of the time, its usage is restricted to a limited number of engineers (mostly DevOps) within the company.

Part of the daily routine of a DevOps / Automation engineer working with Terraform is writing Terraform modules (sets of resources logically grouped together) that capture the needs of a new piece of software in order to automate its setup.

Modules are to infrastructure what libraries are to software engineers — a set of functions that can be called with input parameters.

Let’s take an example to better illustrate this statement. Here is an architecture diagram of an existing AWS environment (green) and the infrastructure required by a new project (yellow) that needs to be captured as code. Let’s assume that the green environment is an image metadata API and the yellow one is a new feature to process image metadata asynchronously and store it in S3:

Infrastructure diagram of an existing environment on the left and a new project requirements on the right

The resulting Terraform module may look like this:

terraform-modules
├── autoscaling       # Module used to setup green infra
│   └── *.tf
├── database          # Module used to setup blue infra
│   └── *.tf
├── load-balancer     # Module used to setup purple infra
│   └── *.tf
└── new-project       # New module to setup yellow infra
    ├── s3.tf         #-- Create S3 bucket
    ├── sqs.tf        #-- Create SQS queue
    ├── outputs.tf    #-- Return important values to the caller
    └── variables.tf  #-- Input variables provided by module users
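To make the structure concrete, here is a sketch of what two of the files in the new-project module could contain (resource and variable names are illustrative, not the actual module code):

```hcl
# File: terraform-modules/new-project/variables.tf (illustrative)
variable "environment" {
  description = "Deployment environment, e.g. production"
  type        = string
}

variable "sqs_queue_name" {
  description = "Name of the SQS queue feeding the async workers"
  type        = string
}

# File: terraform-modules/new-project/sqs.tf (illustrative)
resource "aws_sqs_queue" "this" {
  name = var.sqs_queue_name
  tags = {
    Environment = var.environment
  }
}
```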

A module can be instantiated this way using Terragrunt (a thin wrapper for Terraform):

# File: mycloud/production/async-images/terragrunt.hcl

# Terragrunt will copy the Terraform configurations specified by the
# source parameter.
terraform {
  source = "<path>/terraform-modules//new-project"
}

# These are the variables (defined in variables.tf) we have to pass
# in to use the module specified in the source above.
inputs = {
  environment    = "production"
  s3_bucket_name = "s3-async-images"
  sqs_queue_name = "sqs-async-images"
}

Wait a minute… The skills required to write a Terraform module surely live on the Ops team, but any developer with minimal guidance and tooling can easily write the above file (<10 lines worth of content) or contribute to such a repository!

If tomorrow your team needs a similar setup for asynchronous page and video processing, it is as trivial as making new calls to the same module with different input variables!
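For instance, a hypothetical async-pages instantiation would just be another tiny terragrunt.hcl pointing at the same module (the file path and input values here are made up for illustration):

```hcl
# File: mycloud/production/async-pages/terragrunt.hcl (hypothetical)
terraform {
  source = "<path>/terraform-modules//new-project"
}

inputs = {
  environment    = "production"
  s3_bucket_name = "s3-async-pages"
  sqs_queue_name = "sqs-async-pages"
}
```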

There are a lot of advantages to opening up infrastructure-as-code repositories to your developer teams:

  • Raise awareness of the infrastructure setup, helping developers better understand how all the pieces fit together.
  • Reduce troubleshooting time, as developers gain a global vision of the overall infrastructure.
  • Remove potential blockers by enabling more people to do Ops work.
  • Constrain Ops to adopt real software practices for Terraform (pull requests, code reviews, CI / CD pipelines, no Terraform from laptops).

Let’s see how you can enable anyone on your team to run Terraform / Terragrunt automations safely!

Here comes Atlantis!

Atlantis

Atlantis is an application for automating Terraform via pull requests. It is deployed as a standalone application into your infrastructure. No third-party has access to your credentials.

It listens for GitHub, GitLab or Bitbucket webhooks about Terraform pull requests. It then runs terraform plan and comments with the output back on the pull request.

When you want to apply, comment atlantis apply on the pull request and Atlantis will run terraform apply and comment back with the output.
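Concretely, the whole interaction happens through pull request comments; per the Atlantis docs, comments like these are supported (the project name here is illustrative):

```shell
atlantis plan                           # Plan every project affected by the PR
atlantis plan -p dynamodb-pages-reads   # Plan a single named project
atlantis apply -p dynamodb-pages-reads  # Apply that project's saved plan
```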

Atlantis Pull request workflow on Bitbucket Cloud

I won’t detail how to get Atlantis up and running, as it’s already well explained in the documentation. Here is a high-level overview of how Atlantis interacts with your workflow:

Atlantis Workflow

At GumGum, the Machine Learning Engineering team operates an infrastructure made of 80+ Terragrunt instantiations (provisioning ~1,500 AWS resources) using 20 Terraform modules across multiple regions and accounts.

From a simple CloudWatch alarm to the Confluent Platform, Terraform manages our infrastructure on a daily basis.

Given the volume of resources managed as code, and as a former Senior DevOps Engineer who transitioned back to software engineering (at 30%), it was more than obvious to enable the team to run Terraform on their own using Atlantis.

Because the purpose of this post is “stop being selfish!”, I will share some interesting tweaks we made to run Atlantis at scale.

Custom Workflows with Atlantis

As mentioned earlier, we use Terragrunt to keep our Terraform code DRY and maintainable, but can Atlantis fit our needs?

  • Yes! The Atlantis developers thought about this use case and offer a way to define custom workflows. It’s simply a matter of adding the Terragrunt binary to the official Atlantis Docker image.
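As a sketch, such a workflow can be declared in the Atlantis server-side repo configuration; the exact steps below are an assumption based on the Atlantis custom-workflows documentation, not our verbatim production config:

```yaml
# Server-side repos.yaml (sketch): a "terragrunt" workflow that shells
# out to the terragrunt binary shipped in the custom Docker image.
repos:
- id: /.*/
  workflow: terragrunt
workflows:
  terragrunt:
    plan:
      steps:
      - run: terragrunt plan -input=false -out=$PLANFILE
    apply:
      steps:
      - run: terragrunt apply -input=false $PLANFILE
```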

Here is the custom Atlantis Docker image we use in production, which you can find on GitHub:

### Dockerfile

# Grab image to collect tools from
FROM lowess/terragrunt:0.12.24 as tools

# Official atlantis image
FROM runatlantis/atlantis:v0.13.0

COPY --from=tools /usr/local/bin /usr/local/bin
COPY --from=tools /opt/.terraform.d /opt/.terraform.d
COPY --from=tools /root/.terraformrc /home/atlantis/.terraformrc
COPY --from=tools /root/.terraformrc /root/.terraformrc

Rendering atlantis.yaml on the fly

Atlantis relies on an atlantis.yaml configuration file located in your repository’s root folder (this is what defines which projects in the repository fall under its governance).

If you want the ability to run Terraform modules individually, you will need to list every single path where you have module instantiations...

A reason to do this is to speed up atlantis plan / apply, as Atlantis will act on an individual module instead of trying to plan-all / apply-all an entire folder of modules (which does not hurt, but runs for modules that were not changed).

### <repo> terragrunt layout

<repo>/<path>/prod/cloudwatch-alerts/
├── dynamodb-pages-reads
│   └── terragrunt.hcl
└── dynamodb-images-reads
    └── terragrunt.hcl

### <repo>/atlantis.yaml
---
version: 3
automerge: true
projects:
- name: dynamodb-pages-reads
  dir: ./<path>/prod/cloudwatch-alerts/dynamodb-pages-reads
  workflow: terragrunt
  autoplan:
    enabled: true
    when_modified:
    - "./terraform/modules/**/*.tf"
    - "**/*.tf"
    - "**/terragrunt.hcl"
- name: dynamodb-images-reads
  dir: ./<path>/prod/cloudwatch-alerts/dynamodb-images-reads
  workflow: terragrunt
  autoplan:
    enabled: true
    when_modified:
    - "./terraform/modules/**/*.tf"
    - "**/*.tf"
    - "**/terragrunt.hcl"
...
[A LONG LIST OF PROJECTS]
...

Because it’s just inhuman to maintain such a file manually (ours is ~400 lines), here is a GitHub Gist snippet that you can use to generate your atlantis.yaml file with pre-commit or a simple shell script:

Gist for atlantis.yaml auto-generation
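If you prefer rolling your own, the gist’s core idea can be sketched in a few lines of portable shell: walk the repository for terragrunt.hcl files and emit one project entry per folder (the function name and patterns below are mine, not the gist’s):

```shell
#!/usr/bin/env sh
# Sketch of an atlantis.yaml generator: one project per folder that
# contains a terragrunt.hcl (terragrunt caches are skipped).
generate_atlantis_yaml() {
  repo_root="$1"
  printf 'version: 3\nautomerge: true\nprojects:\n'
  find "$repo_root" -name terragrunt.hcl -not -path '*/.terragrunt-cache/*' \
    | sort \
    | while read -r hcl; do
        dir=$(dirname "$hcl")
        rel=${dir#"$repo_root"/}   # Path relative to the repo root
        printf -- '- name: %s\n' "$(basename "$dir")"
        printf '  dir: %s\n' "$rel"
        printf '  workflow: terragrunt\n'
        printf '  autoplan:\n'
        printf '    enabled: true\n'
        printf '    when_modified:\n'
        printf '    - "**/*.tf"\n'
        printf '    - "**/terragrunt.hcl"\n'
      done
}
```

Wire it into a pre-commit hook (or a CI step) that regenerates the file and fails the build when the committed atlantis.yaml is out of date.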

Speed-up Atlantis plan / apply

The Atlantis v0.13 release introduced parallelism options (parallel_plan / parallel_apply) 🎉 Yeah!

Unfortunately, this feature is for now limited to Terraform workspaces, which are not used by Terragrunt… As per this GitHub discussion on #260, it seems doable “in theory” with Terragrunt, but unfortunately I did not get lucky and ran into unreliable terraform init steps…

Feel like we can’t do better? Well, there are other knobs to tweak!

Thanks to this great article from the HashiCorp blog (which I recommend reading), we decided to pre-bake all the plugins we use most in Terraform so that they don’t need to be downloaded from the HashiCorp registry.

Given that it is a recommended practice to constrain only the minimum allowed provider version in modules, we can easily pre-download the ones we use all over the place in order to make them directly available to Terraform. You can find an example in this GitHub repository.
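One simple way to do this, assuming the pre-baked providers live under /opt/.terraform.d (the directory copied into the custom image above), is Terraform’s plugin_cache_dir setting in the CLI configuration file:

```hcl
# File: ~/.terraformrc
# Providers found in this cache are reused instead of being
# re-downloaded from the registry on every terraform init.
plugin_cache_dir = "/opt/.terraform.d/plugin-cache"
```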

This dramatically sped up the Atlantis feedback loop on pull requests!

Conclusion

Running Terraform from CI / CD pipelines is far from trivial if you try to do it on your own… Fortunately, Atlantis comes to the rescue to ease this process.

As of today, I have seen great engagement and enthusiasm from my peers discovering day-to-day DevOps operations!

If you want to be successful with this product, here are my two cents:

  • Run a POC of Atlantis (you can deploy it locally)
  • Identify low-risk infrastructure pieces that can be controlled by the Atlantis workflow
  • Onboard your developers by doing a high-level knowledge transfer on Terraform / Terragrunt
  • Make sure to require the approval of an Ops person before applying code in production
  • Enjoy a better experience and a full developer workflow moving forward!


Florian Dambrine
GumGum Tech Blog

Principal Engineer & DevOps enthusiast, building modern distributed Machine-Learning eco-systems @scale