Can Atlantis make Terraform great again?

Published in

swile-engineering

9 min read · Sep 21, 2022

Quite early at Swile, before our switch to Kubernetes was even a thing, we adopted the Infrastructure as Code approach as one of our standards. Given the scale at which Swile intended to operate, it was the only way to avoid accumulating technical debt in our infrastructure assets over the long run. In case you are not familiar with this industry practice, here’s a quick refresher.

Infrastructure as Code refresher

Infrastructure as Code (IaC) is the management and provisioning of infrastructure through code rather than through manual actions and processes. It brings:

Version Control: Since you describe your infrastructure as lines of code instead of clicks in a web UI, you can commit what your infrastructure looks like to a version control system such as git.
Collaboration: Now that your infrastructure is defined in files your teams can read, they can review what has been done, improve it, and iterate on it in the future.
Automation: Now that your infrastructure is defined as code, you can set up automation around it. Automation can execute your code, which lowers (or removes) the risk of human error in the process, and it can also test your code!
Drift Control: Now that you have defined what your infrastructure should be, you can use that code to detect, prevent and even remediate drift between the infrastructure committed to version control and the infrastructure that actually exists at a given moment.
Modularity: While this aspect isn’t inherently tied to IaC, IaC makes it much easier to split your infrastructure into repeatable and configurable pieces of code. This can considerably reduce the provisioning time for an infrastructure asset, or for a service as a whole.

IaC has been a standard in our industry for some years now, and there are several approaches to it. One of them is the imperative approach, often represented by tools such as Ansible or Chef. It defines the specific commands needed to achieve the desired configuration, and those commands are then executed in the correct order.

The other approach, and the one most commonly used by recent IaC tools, is the declarative approach. Here you define the desired state of the system, including the resources you need and any properties they should have. The IaC tool then works out the steps needed to reach that desired state and carries them out for you. This is the approach we will focus on today.
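To make the declarative idea concrete, here is a minimal sketch of what such a tool does conceptually: it compares the desired state with the current one and derives the steps itself. The resource names and the `plan` function are hypothetical and only illustrate the principle; this is not how Terraform is implemented.

```python
from typing import Dict, List, Tuple

State = Dict[str, Dict[str, object]]  # resource name -> properties


def plan(desired: State, current: State) -> List[Tuple[str, str]]:
    """Return the (action, resource) steps needed to move current to desired."""
    steps: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in current:
            steps.append(("create", name))
        elif current[name] != desired[name]:
            steps.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            steps.append(("delete", name))
    return steps


# Hypothetical resources, for illustration only.
desired = {
    "bucket_logs": {"versioning": True},
    "queue_events": {"retention_days": 7},
}
current = {
    "bucket_logs": {"versioning": False},  # differs from the committed code
    "vm_legacy": {"size": "small"},        # no longer declared
}

print(plan(desired, current))
# [('update', 'bucket_logs'), ('create', 'queue_events'), ('delete', 'vm_legacy')]
```

The same comparison is what makes drift detection possible: an empty list of steps means the real infrastructure matches what is committed.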

And in the cloud native world, one of the tools that established itself as a reference in the domain is Terraform.

Terraform explained

If we stick to the way HashiCorp (the developers behind that beautiful piece of software) defines it:

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. It comes packaged as a single binary and can manage existing and popular service providers as well as custom in-house solutions.

Long story short, Terraform allows you to easily create cloud infrastructure assets, configure tools of all sorts… If it has an API, you probably have a Terraform provider somewhere on the internet to talk to this API and define your configuration through IaC files.

How does it work?

After writing their code, a user usually runs the “terraform plan” command to verify the changes, and then the “terraform apply” command to make those changes live.
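As a rough sketch, the two steps could be scripted as follows. Saving the plan to a file with `-out` and applying that file is one common way to make sure apply runs exactly what was reviewed; the file name `tfplan` and the helper names are assumptions made for this example.

```python
import subprocess
from typing import List


def plan_command(plan_file: str) -> List[str]:
    # -out saves the plan so that apply runs exactly what was reviewed.
    return ["terraform", "plan", "-input=false", f"-out={plan_file}"]


def apply_command(plan_file: str) -> List[str]:
    # Applying a saved plan file does not ask for interactive confirmation.
    return ["terraform", "apply", "-input=false", plan_file]


def plan_then_apply(plan_file: str = "tfplan") -> None:
    """Run plan, then apply the saved plan; raises if either step fails."""
    subprocess.run(plan_command(plan_file), check=True)
    # In practice a human reviews the plan output before this second step.
    subprocess.run(apply_command(plan_file), check=True)


# Only print the commands here; call plan_then_apply() to actually run them.
print(" ".join(plan_command("tfplan")))
print(" ".join(apply_command("tfplan")))
```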

For an individual or a small team, this workflow works just fine. But as soon as the number of DevOps engineers, SREs or Software Engineers in charge of your infrastructure starts to grow, managing all the changes and pull requests (PRs) becomes very difficult, for the following reasons:

To review a Terraform pull request, you want to see what the “terraform plan” output will be, which makes the review process tedious if the plan has to be produced manually for every PR of every team member.
A “terraform apply” command can fail for various reasons even if the “terraform plan” was successful (you tried to delete a protected object, or perhaps there’s a race condition in what you’re trying to accomplish, etc.). Hence, you usually do not want the “terraform apply” command to be tied to the merge of your pull request into your trunk branch. Yet you still want to wait for your PR to be approved before applying your modifications. Otherwise…

Members of the organization should have proper permissions, with the least privileges, to run terraform plans and apply on the perimeter they are responsible for.
Different engineers may have different versions of Terraform on their computers, which can lead to some issues when it comes to how the code is applied.
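The version mismatch in the last point can be illustrated with a small check. This sketch assumes the team has agreed on a pinned version (the value below is hypothetical) and reads the `terraform_version` field from the output of `terraform version -json`.

```python
import json

PINNED_VERSION = "1.2.9"  # hypothetical version agreed on by the team


def local_version(version_json: str) -> str:
    """Extract the version from the output of `terraform version -json`."""
    return json.loads(version_json)["terraform_version"]


def matches_pin(version_json: str, pinned: str = PINNED_VERSION) -> bool:
    return local_version(version_json) == pinned


# Sample output, trimmed to the one field used here.
sample = '{"terraform_version": "1.3.0"}'
print(matches_pin(sample))  # False: this laptop is not on the pinned version
```

Terraform can also enforce a constraint itself through the `required_version` setting, but running every plan and apply from a single place removes the problem at its root.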

After running into these issues a few times in the Platform team at Swile, we started looking for a solution that would not require changing our IaC codebase entirely.

And here comes Atlantis

Atlantis is an open-source tool that addresses the aforementioned issues. It works by listening to events from GitHub (or GitLab, or other version control systems). When a PR is opened on a repository it is configured for, Atlantis checks out the branch, locks the state of the objects you are modifying to avoid accidental modifications through another PR at the same time, runs “terraform plan” for you, and posts its output on the pull request as a comment for you to see and review.

If you are happy with what you see, and your PR has been reviewed and approved, you can then interact with Atlantis via a comment (usually “atlantis apply”) on your PR to apply your Terraform plan. Atlantis runs the “terraform apply” command for you and posts its output on the pull request again, so you can see whether everything ran properly. If you are satisfied with it, you can then merge your PR.
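The workflow described above can be modelled as a small state machine. This is a simplified sketch of the behaviour as described in this article, not Atlantis’s actual implementation; the class names, messages and the single lock per project are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PullRequest:
    number: int
    project: str
    approved: bool = False
    planned: bool = False
    applied: bool = False


@dataclass
class Workflow:
    # project -> number of the PR currently holding its lock
    locks: Dict[str, int] = field(default_factory=dict)

    def plan(self, pr: PullRequest) -> str:
        # The first PR to plan a project takes the lock.
        holder = self.locks.setdefault(pr.project, pr.number)
        if holder != pr.number:
            return f"plan refused: {pr.project} is locked by PR #{holder}"
        pr.planned = True
        return "plan posted as a PR comment"

    def apply(self, pr: PullRequest) -> str:
        if not pr.planned:
            return "apply refused: no plan for this PR"
        if not pr.approved:
            return "apply refused: PR is not approved"
        pr.applied = True
        return "apply output posted as a PR comment"

    def merge(self, pr: PullRequest) -> str:
        if not pr.applied:
            return "merge postponed: plan not applied yet"
        if self.locks.get(pr.project) == pr.number:
            del self.locks[pr.project]
        return "merged, lock released"


flow = Workflow()
first = PullRequest(number=1, project="network")
second = PullRequest(number=2, project="network")

print(flow.plan(first))    # plan posted as a PR comment
print(flow.plan(second))   # plan refused: network is locked by PR #1
print(flow.apply(first))   # apply refused: PR is not approved
first.approved = True
print(flow.apply(first))   # apply output posted as a PR comment
print(flow.merge(first))   # merged, lock released
print(flow.plan(second))   # plan posted as a PR comment
```

The point of the model is the ordering: apply happens before the merge, and the lock keeps two PRs from planning against the same objects at once.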

That is the basic workflow with Atlantis. It is highly configurable for each repository (via an atlantis.yaml file), while still letting you enforce global policies for the whole organization.

Regarding its infrastructure, it’s as lightweight as it can be. Atlantis is a simple Go application, available as a Docker image, a Helm chart, kustomize files and other flavors listed in its documentation. Atlantis can be run with persistent storage, so that plans already run survive a crash of Atlantis, but this isn’t mandatory. In our experience, Atlantis rarely failed us, and the couple of times it did over months and months of usage, we simply re-ran the “atlantis plan” command in our pull request. Since this action is idempotent, the only downside is a few seconds wasted.

Various Atlantis gotchas:

During our journey with Atlantis, we saw it evolve and grow, but we also faced a few hurdles. Here are a few nuggets of advice to take with you if you decide to give Atlantis a shot:

Before thinking about deploying Atlantis across your whole organization and every repository hosting Terraform code, we suggest you start small. That can be done via the “--repo-allowlist” flag (or the “ATLANTIS_REPO_ALLOWLIST” environment variable).
If you work with different teams on GitHub, consider using the “--gh-team-allowlist” flag. By default, any team can plan and apply. You can, for instance, decide that anyone can plan but only some specific teams can apply Terraform code.
Depending on your processes, your team members may use draft PRs before opening them for review. That is the case in our Platform team. To allow Atlantis to run on draft PRs, you can set the “--allow-draft-prs” flag.
GitHub, like any public API, enforces rate limits. If Atlantis authenticates as a GitHub user, you will usually be subject to the standard limit of 5,000 API calls per hour, which you may hit depending on how intensively you use Atlantis. This is why we suggest running Atlantis as a GitHub App instead, which is limited to 15,000 calls per hour per organization.
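To get a feel for the rate-limit point, here is a back-of-the-envelope calculation using the two limits quoted above. The number of API calls per PR event is a hypothetical placeholder; measure it on your own installation.

```python
USER_LIMIT_PER_HOUR = 5_000   # figure quoted above for a GitHub user
APP_LIMIT_PER_HOUR = 15_000   # figure quoted above for a GitHub App


def events_per_hour(limit: int, calls_per_event: int) -> int:
    """How many PR events fit in the hourly API budget."""
    return limit // calls_per_event


CALLS_PER_EVENT = 40  # hypothetical value, not measured on Atlantis

print(events_per_hour(USER_LIMIT_PER_HOUR, CALLS_PER_EVENT))  # 125
print(events_per_hour(APP_LIMIT_PER_HOUR, CALLS_PER_EVENT))   # 375
```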

Our feedback on Atlantis

The Atlantis project is very well documented and evolving quite fast. We still use it to this day on some of our repositories, although we are progressively phasing it out in favor of an in-house solution built around GitHub Actions, GitHub environments and Terraform. We easily recreated what we believe Atlantis does very well (decoupling “terraform apply” from the PR merge, getting feedback directly in the PR, triggering actions via GitHub comments, applying to selected environments to test your changes) and improved on what we felt was lacking for us.

The biggest challenge we met with Atlantis is the way its configuration works. You can either have a central configuration embedded in Atlantis, or have the configuration defined in each of your repositories. You can also choose a hybrid configuration, where the central configuration defines which keys can be overridden at the repository level.
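The hybrid mode can be pictured as a filtered merge: the central configuration provides defaults, and a repository value is accepted only for keys the central side allows. This is a conceptual sketch with illustrative key names, not Atlantis’s actual configuration loader.

```python
from typing import Dict, List

Config = Dict[str, object]


def effective_config(central: Config, repo: Config, allowed_overrides: List[str]) -> Config:
    """Start from central settings; accept repo values only for allowed keys."""
    merged = dict(central)
    for key, value in repo.items():
        if key in allowed_overrides:
            merged[key] = value
    return merged


# Illustrative values: the repository tries to drop the approval requirement.
central = {"workflow": "default", "apply_requirements": ["approved"]}
repo = {"workflow": "staging", "apply_requirements": []}

print(effective_config(central, repo, allowed_overrides=["workflow"]))
# {'workflow': 'staging', 'apply_requirements': ['approved']}
```

In this example the repository may pick its own workflow, but its attempt to remove the approval requirement is ignored because that key is not in the allowed list.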

For us, the central configuration was a no-go since it pulled the definition of the CD away from the repository level. It is one of our core beliefs that repositories should be as self-contained and self-sufficient as they can be, in order to avoid spaghetti code.

We also had the option of pushing all configuration down to the repository level, but then anyone able to open a PR on a repository could, with the right configuration, do whatever they wanted on any environment, which wasn’t ideal at all.

Finally, we studied the hybrid solution, but it quickly felt very cumbersome to maintain. We didn’t manage to get the level of flexibility required for repositories managing different types of assets with different purposes without sacrificing the fine-grained permissions we sorely needed.

Note on fine-grained permissions:

At Swile, it’s not uncommon for feature teams to contribute to the repositories of other feature teams. CODEOWNERS, branch protections, GitHub permissions and code reviews of course make this process more reliable at all levels. GitHub Actions environments add the ability to require different levels of review approval depending on the workflow you trigger.

In our in-house replacement for Atlantis, that meant being able to manage this level of permission not at the file level (as it was with the atlantis.yaml file), but at the repository configuration level instead. These permissions can therefore not be changed via a simple pull request, while the configuration still lives at the repository level. To this day, it is the best of both worlds for us.

Conclusion

Atlantis is another very good product in the DevOps/SRE world, and as it keeps growing and evolving it will likely become a key tool for teams using Terraform. The project is already quite mature and has a lot of traction.

If you do not have the needs or requirements we mentioned earlier regarding permissions, then do not think a minute longer and give Atlantis a shot. It will definitely push your team and tooling in the right direction if you are operating bare Terraform today.

On our side, our in-house GitHub Actions version of Atlantis fulfills our needs, especially for core infrastructure assets and infrastructure assets shared by several microservices.

That said, we strive to provide a single way for our feature teams to configure and define infrastructure code. We are looking closely at several projects that aim to bring this level of infrastructure configuration to Kubernetes. Amazon is developing ACK (AWS Controllers for Kubernetes), for instance, while another project, Crossplane, is gaining in both maturity and popularity. More on that in another blog article!