Scaling from zero to (DevOps) hero with Atlantis
At GWI, we take our DevOps stuff very seriously. We are very vocal about this in interviews, meetups and internal workshops, so it may come as a surprise that we don’t have a big DevOps team. Instead, we have a DevOps guild, which in practice means that our engineers own a wide range of DevOps-related tasks and responsibilities. These include writing Terraform code (and thus owning a big part of the infrastructure), building CI pipelines, and running performance and monitoring assessments. This lets our small DevOps team focus on bigger tasks uninterrupted (sort of!), while providing guidance and mentoring to everyone else. This approach has worked really well so far, allowing us to scale from a handful of engineers to 80+ with a dedicated DevOps team of 1 to 5 people. This is the first in a series of articles on the toolset that helped us along the way.
You build it, you own it
Guilds are nothing new in the engineering world. In fact, when Spotify established and introduced its model a while back, many of us listened closely. For GWI, the DevOps guild brings its practices closer to the engineering teams, allowing everyone who is interested in the underlying technologies to dive deeper into them and lead by design. This doesn’t mean that every engineer at GWI is (or wants to be) a Terraform expert, but historically many people who enjoyed this type of responsibility jumped in and became our first DevOps champions. Those champions were the first point of contact for anything DevOps-related inside a team, and if things went south or further guidance was needed, the DevOps team would jump on the task. With great power comes great responsibility, so as time passed and people became confident with our best practices, champions became responsible for:
- answering internal questions about Kubernetes
- owning the team’s infrastructure, essentially contributing to our infra repository
- releasing / scaling / debugging their own services
Of course, our DevOps team would jump in to help on multiple occasions, but this structure gave us the flexibility to work mostly with a smaller number of stakeholders, let’s say 10 engineers, while we focused on bigger initiatives and workshops.
The tricky part
While the benefits were enormous (once again, we scaled from a handful of engineers to a team of 80+ with a very small dedicated DevOps team), it is worth mentioning some difficulties we faced along the way:
- Lack of best practices. With many specialists and no generalists who deeply understood all the different stacks, lots of copy-paste in the code, and plenty of freedom, bad practices crept in.
- DevOps couldn’t review every single PR related to infrastructure. Each team was responsible for reviewing internally first and asking for help if needed, but the reviewers often weren’t experienced enough in Terraform, resulting in blind reviews. The fact that people were running terraform plan/apply locally wasn’t helping either.
- The DevOps team wasn’t the owner of the infra repository. Terraform is our source of truth for anything infra related, but as the department grew bigger, we had a lot of incidents of unapplied or unmerged Terraform code.
- Upgrading anything in a unified way, from Terraform to the latest version of any provider, was a nightmare.
The major issue we had to solve late last year was applying Terraform code from a central place. Up until then, engineers were planning and applying code locally from their computers.
Atlantis solved many of the issues above. In its simplest form, Atlantis is a service that listens to your GitHub PRs for Terraform changes and automatically runs terraform plan on them. The output is posted as a comment on your PR. It’s beautiful, really.
It may seem trivial, or simply a nice touch, but the above functionality provides some major wins:
Win 1: DevOps now have better and more easily accessible historical data about the state of the infra before your Terraform changes are applied. In the past, this information would only appear on a local computer and would rarely be documented in a ticket or a PR.
Win 2: The reviewer can now see at a glance the changes your code will make. In the past, people would have to pull the changes and run a plan locally themselves. We can’t really prove it, but we believe that engineers mostly skipped this part :)
If the reviewer is happy, the owner of the PR can comment atlantis apply, and Atlantis will apply the changes for them. In a follow-up comment, you will find the results and the outputs of your changes. In essence, Atlantis is a wrapper for your Terraform plan/apply commands, with some basic locking mechanisms in place.
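The plan, apply and lock flow described above can be modelled in a few lines. This is a toy Python sketch, not how Atlantis is actually implemented: the ToyAtlantis class, its method names, and the rule that a "project" is any directory containing a changed .tf file are all assumptions made for illustration. It shows the idea behind the locking: once one PR has planned a project, another PR cannot plan the same project until the first one is merged or closed.

```python
from pathlib import PurePosixPath


def changed_projects(changed_files):
    """Return the sorted directories that contain at least one changed .tf file."""
    return sorted(
        {str(PurePosixPath(f).parent) for f in changed_files if f.endswith(".tf")}
    )


class ToyAtlantis:
    """Hypothetical in-memory model of a plan/apply wrapper with project locks."""

    def __init__(self):
        self.locks = {}  # project directory -> number of the PR holding the lock

    def plan(self, pr, changed_files):
        """Plan every changed project, taking (or checking) its lock first."""
        comments = []
        for project in changed_projects(changed_files):
            holder = self.locks.setdefault(project, pr)
            if holder != pr:
                comments.append(f"{project}: locked by PR #{holder}")
            else:
                comments.append(f"{project}: plan posted to PR #{pr}")
        return comments

    def apply(self, pr):
        """Apply every project this PR currently holds a lock on."""
        return [
            f"{project}: applied for PR #{pr}"
            for project, holder in sorted(self.locks.items())
            if holder == pr
        ]

    def close(self, pr):
        """Release the PR's locks once it is merged or closed."""
        self.locks = {p: h for p, h in self.locks.items() if h != pr}
```

For example, if PR 1 and PR 2 both touch gke/main.tf, the second plan is refused until close(1) is called, which is the kind of protection that local terraform apply runs never had.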
The tricky part #2
Atlantis was a success for us, and the first step towards having a centralised place where you can apply Terraform code. It was a big change internally, and from our experience, you can measure its success by the number of complaints you get, either because the technology isn’t working as intended or because people don’t want to change their habits. Thankfully, in our case the transition was painless.
What’s the catch, you ask? Well, Atlantis needs elevated permissions to perform almost anything GCP-related, which raises security concerns (what if someone deletes the Terraform code of a cluster and then runs atlantis apply?). Of course, the same concerns apply to the engineers themselves (what if an engineer deletes the Terraform code of a cluster and runs terraform apply?), but at least in that case you can play around with IAM permissions and restrict the impact to a single project. In Atlantis’s case, that’s not possible. Instead, you put additional security mechanisms in place to mitigate the damage:
1. Allow atlantis apply only on approved PRs.
2. Require more than one reviewer, either across the whole repository or on certain projects that include critical resources.
3. Restrict user access to critical parts of your repo. In its most basic form, this means that only certain users can use Atlantis on production projects. With a more sophisticated user management process in place, each team has access only to its own projects.
4. Have multiple Atlantis instances, each one with its own permissions and rules.
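The first three mechanisms boil down to a small decision rule that runs before every apply. Here is a minimal Python sketch of that rule, assuming a made-up ApplyPolicy type, a can_apply helper and two example projects; none of these names come from Atlantis itself, and the sketch only makes the decision logic explicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplyPolicy:
    """Hypothetical per-project rules (illustrative names, not Atlantis settings)."""

    min_approvals: int = 1                  # mechanisms 1 and 2: approvals required
    allowed_users: frozenset = frozenset()  # mechanism 3: empty means everyone


def can_apply(policy, user, approvals):
    """Return (allowed, reason) for one apply request on one project."""
    if approvals < policy.min_approvals:
        return False, f"needs {policy.min_approvals} approval(s), has {approvals}"
    if policy.allowed_users and user not in policy.allowed_users:
        return False, f"{user} may not apply on this project"
    return True, "ok"


# Example rules: production needs two approvals and an allow-listed user.
POLICIES = {
    "staging": ApplyPolicy(),
    "production": ApplyPolicy(min_approvals=2, allowed_users=frozenset({"alice"})),
}
```

With these example rules, can_apply(POLICIES["production"], "bob", 2) is refused because bob is not on the allow list, while the same request from alice goes through.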
Stay awhile and listen
Implementing Atlantis was the first step for the DevOps team to take back ownership of the infra repository. While it was a successful transition, it brought to the surface many underlying problems that had been hidden from us (like Terraform bad practices), but most importantly it highlighted how tightly coupled the Terraform code was to unrelated parts of the infra (like GCP monitoring resources). All these issues were making engineers struggle with Terraform code just to monitor their applications, which led us to transform our monitoring stack as well! But that’s a story for another time :)