Scaling Atlantis-CI Horizontally

Pardeep Bhatt
6 min read · Dec 29, 2023

--

In this blog, I will explain how Atlantis can be scaled horizontally to run on multiple nodes without losing any data or locking.

What is Atlantis?

Atlantis is an open-source Terraform pull request automation tool used to run terraform plan/apply commands on pull requests (PRs).

Atlantis workflow as present on https://www.runatlantis.io/

Atlantis can be deployed using any of its supported deployment strategies; in this article, we will focus on the Roll Your Own strategy, where the Atlantis application runs on a node (a VM in the GCP world, an EC2 instance in the AWS world).

Atlantis, by default, works perfectly fine as a single-node setup handling a modest number of pull requests. Problems start to arise when the number of pull requests grows and a single node can no longer handle the load. At that point, we need to scale Atlantis horizontally, i.e., run multiple Atlantis nodes in our backend. Since Atlantis was designed to run on a single node only, we cannot simply roll out new nodes and let them serve traffic. There are certain problems that need to be addressed before horizontally scaling Atlantis. Let us discuss these problems one by one.

Problem 1: Unavailability of plan files between multiple nodes

Whenever a plan request (atlantis plan) lands on an Atlantis node (triggered automatically when we create a PR or push commits to an existing one), a plan file (default.tfplan) is generated on that node based on the changes made to a resource. The plan file path on the node looks like /<atlantis-user-home-dir>/.atlantis/repos/<your-repository-name>/<pull-request-number>/<workspace>/<path-of-your-file-in-the-repo>/default.tfplan. This file is used later by the apply request (atlantis apply). If another node is running behind the same load balancer and the apply request lands on it, the plan file will not be present there, and the apply will fail. That is why, in a multi-node Atlantis setup behind a load balancer, it is important that the nodes share this data with each other.
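To make that layout concrete, here is a sketch of how such a plan file path is assembled, using placeholder values (the home directory, repository name, PR number, and project directory below are hypothetical, not taken from any real setup):

```shell
# Hypothetical values; the real ones come from your repository and PR.
ATLANTIS_HOME=/home/atlantis   # atlantis user's home directory
REPO=my-org/my-repo            # repository full name
PR=42                          # pull request number
WORKSPACE=default              # Terraform workspace
PROJECT_DIR=envs/prod          # path of the changed project inside the repo

# Assemble the path where Atlantis writes the plan for this PR/workspace/project.
PLAN_PATH="$ATLANTIS_HOME/.atlantis/repos/$REPO/$PR/$WORKSPACE/$PROJECT_DIR/default.tfplan"
echo "$PLAN_PATH"
```

Any node that later serves the apply request must be able to resolve exactly this path, which is what makes shared storage necessary.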

Problem 2: Atlantis Locking

Atlantis uses two types of locking to prevent plan/apply operations on a single resource from the same or multiple PRs.

Problem 2.1: Directory/Workspace Locking

When plan is run, the directory and Terraform workspace are locked until the pull request is merged or closed, or the plan is manually deleted.

If another user attempts to plan for the same directory and workspace in a different pull request they'll see this error:

Example locking error as present on https://www.runatlantis.io/docs/locking.html

This locking information is stored by default in BoltDB (an embedded key/value database for Go), which is local to the node where the Atlantis application is running. If multiple nodes are running, this locking information will not be available to the other nodes, and hence multiple plan/apply operations could run on a single resource at the same time from different PRs.

Problem 2.2: Repo-level Locking

Whenever we run consecutive plan or apply operations from a single PR without waiting for the earlier operations to finish, we usually get workspace locking errors.

This happens because, before running a plan or apply operation on a resource, Atlantis checks whether a plan or apply operation is already running for the same PR by looking up an entry for that PR in an in-memory array, which is again local to the node where Atlantis is running. So with multiple nodes, multiple plan or apply operations can be triggered from a single PR, and if they land on different nodes, this can lead to resource inconsistency, errors, or failures.

Solution 1 | How to fix unavailability of plan files between multiple nodes

We use the Network File System (NFS) protocol to share plan files between multiple nodes. A visual representation of the architecture looks like this:

In the above diagram, the two Atlantis nodes (NFS clients), instead of storing plan files on their local disks, store them on a separate node (the NFS server) and read them back from that NFS server. This is analogous to running two servers of the same application that both connect to the same database to store and read data.

To achieve this setup, apart from the two nodes running the Atlantis application, we need one more node, which will export one of its directories over NFS. To configure the NFS server, run the below-mentioned script on the node that will act as the NFS server.

sudo apt install -y nfs-kernel-server
sudo systemctl start nfs-kernel-server.service
MNT_DIR=/atlantis-data-dir
sudo mkdir $MNT_DIR
echo "$MNT_DIR atlantis-nfs-clients.com(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports # atlantis-nfs-clients.com is the domain which points to the atlantis nodes (clients). sudo tee -a is used because with `sudo echo ... >>` the redirection would run without root privileges.
sudo exportfs -a

This script performs the following actions on our NFS server node:

  • Installs the NFS server.
  • Starts the NFS server.
  • Creates a directory named /atlantis-data-dir under the root directory (/) of the node.
  • Configures the directory /atlantis-data-dir to be exported by adding an entry to the /etc/exports file.
  • Applies the new config via exportfs -a.

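Once the script has run, the export can be verified from the server itself with showmount, which ships with the NFS server package installed above:

```shell
# List the directories this NFS server currently exports,
# along with the client hosts allowed to mount them.
showmount -e localhost
```

If /atlantis-data-dir does not appear in the output, re-check the /etc/exports entry and re-run exportfs -a before configuring the clients.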
To use the exported directory /atlantis-data-dir from the NFS server, we need to configure the NFS client on our Atlantis nodes by running the below-mentioned script on each of them.

sudo apt install -y nfs-common
sudo mount atlantis-nfs-server.com:/atlantis-data-dir /<atlantis-user-home-dir>/.atlantis # atlantis-nfs-server.com is the domain which points to the atlantis NFS server node created above.

This script performs the following actions on our NFS client nodes:

  • Installs the nfs-common package, which provides the NFS client utilities.
  • Mounts the directory /atlantis-data-dir exported by the NFS server onto the .atlantis directory inside the home directory of the atlantis user. This is the same path described in the Problem #1 section above, where Atlantis reads and stores plan files. Now, instead of storing plan files on the local storage of an Atlantis node, we store them on the NFS server node, and later, during the apply stage, we read the plan file back from the NFS server node instead of from local storage.

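One caveat worth noting: a mount created with the mount command does not survive a reboot. To make it persistent, an entry can be added to /etc/fstab on each Atlantis node; the sketch below mirrors the mount command above (the hostname is the same assumed domain, and the _netdev option tells the system to wait for the network before mounting):

```shell
# /etc/fstab entry on each Atlantis node (adjust the home directory path)
atlantis-nfs-server.com:/atlantis-data-dir  /<atlantis-user-home-dir>/.atlantis  nfs  defaults,_netdev  0  0
```

After adding the entry, `sudo mount -a` applies it without a reboot and confirms the line parses correctly.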
We can also refer to the official documentation to view all the options available when configuring the NFS server and NFS client.

This way, plan files are shared between multiple nodes, which fixes our Problem #1 🎉

Solution 2.1 | Directory/Workspace Locking information on Redis

Atlantis supports Redis as one of its locking DB types. We set up a Redis instance and connect all the Atlantis nodes to it, so all the locking information is stored in this shared Redis. The locking information is then visible to every node, and locking errors will be thrown on a PR if any other PR holds a lock on the resource. This fixes Problem #2.1 🎉
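Concretely, this is configured through Atlantis server flags. A sketch of the relevant invocation is shown below; the Redis hostname and password are placeholders, and only the locking-related flags are shown (check the Atlantis server configuration reference for your version, as flag names can change between releases):

```shell
# Excerpt of the atlantis server invocation: switch the locking DB
# from the default BoltDB to a shared Redis instance. The same flags
# must be set identically on every Atlantis node.
atlantis server \
  --locking-db-type=redis \
  --redis-host=atlantis-redis.internal \
  --redis-port=6379 \
  --redis-password="$REDIS_PASSWORD" \
  --redis-db=0
```

The same values can also be supplied via the server-side config file or ATLANTIS_* environment variables instead of CLI flags.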

Solution 2.2 | Repo level locking information on Redis

As discussed above, this locking information is stored by default in a local array. It needs to be moved from the local array to an external Redis (the same one used for storing the Directory/Workspace locking information), which involves the code base changes present in this commit; this customized build of Atlantis is then deployed instead of the open-source one.

Note: The changes in this commit do not have unit tests written for them and are backward-incompatible for anyone using BoltDB, and hence require more work before being contributed back to open source Atlantis.
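The idea behind that change can be sketched independently of the commit: the per-PR "operation in progress" flag moves from an in-process array to a Redis key that every node can see. The illustration below uses redis-cli with a made-up key name and TTL; it shows the technique, not the actual keys or commands Atlantis uses:

```shell
# Try to take the repo-level lock for a hypothetical PR #42.
# NX means "set only if the key does not exist", so exactly one node
# succeeds (OK) and every other node's attempt returns nil.
# EX 600 expires the key after 10 minutes as a safety net.
redis-cli SET "running-ops:my-org/my-repo/42" plan NX EX 600

# Release the lock once the plan/apply operation finishes.
redis-cli DEL "running-ops:my-org/my-repo/42"
```

Because the key lives in the shared Redis rather than in one node's memory, a second plan or apply from the same PR is rejected no matter which node it lands on.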

This fixes Problem #2.2 🎉

By making all of the above changes, Atlantis can now be scaled horizontally to run on multiple nodes. 🎉🎉🎉
