Learn how we automate Infrastructure Governance on AWS at Voodoo

Alison Fourlégnie
Published in Voodoo Engineering
Jun 9, 2023

Once a month, our Engineering & Data team organizes its Tech Talks. This event is a great opportunity to talk about what they're working on and share ideas. On this occasion, the Infrastructure Team tackled a very interesting topic: Infrastructure Governance.

This team, today made up of 7 experts, has made a huge impact over the last two years. Previously, the company's infrastructure was neither uniform nor under control. Many resources were created before any processes could be put in place: there were no tagging policies on resources and no well-defined access controls, so too many people simply had access to too many resources. The team wanted to get the infrastructure under control, and much of the work needed to fix all of this has taken place over the past few years.

Now the question is: How can they keep it under control? In this article you will see how our team prevents the infrastructure from going back to its previous state of chaos. They found two complementary ways to avoid this: detecting anomalies and preventing them.

  1. Detecting anomalies is the “easiest” way: you can create new rules, then implement exhaustive checks to verify they're being respected. This method is not intrusive: there's no workflow friction, pull requests are not blocked, and nobody is slowed down.
  2. Preventing anomalies: this method is a bit more difficult to set up and, on top of that, there is always a way to bypass these mechanisms. It’s very useful for some topics — such as security — but it doesn’t work on existing elements.

The first practice periodically generates reports, which are shared on Slack (our internal communication tool). The reports indicate what's not working well and help the team fix every issue they can find. They're created by a custom tool developed by the team, written in Go and run as Kubernetes CronJobs.
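The article doesn't show the tool's internals, but the reporting step is easy to picture. Here is a minimal Go sketch of how such a CronJob could post a report to Slack via an incoming webhook (the payload shape is Slack's standard incoming-webhook format; the function names and webhook handling are illustrative assumptions, not Voodoo's actual implementation):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// slackMessage is the minimal payload accepted by a Slack incoming webhook.
type slackMessage struct {
	Text string `json:"text"`
}

// buildSlackPayload serializes a report string into the JSON body Slack expects.
func buildSlackPayload(report string) string {
	b, _ := json.Marshal(slackMessage{Text: report})
	return string(b)
}

// postReport sends the report to a Slack incoming-webhook URL.
func postReport(webhookURL, report string) error {
	resp, err := http.Post(webhookURL, "application/json",
		bytes.NewBufferString(buildSlackPayload(report)))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	fmt.Println(buildSlackPayload("Drift detected in stack Example"))
}
```

Running each check as its own CronJob keeps failures isolated: a broken drift scan doesn't stop the tag report from going out.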

The tool’s automated tasks are divided into three jobs:

1. Job A: This task runs once a day. It uses the Driftctl project to check whether new resources have been created outside of our Terraform stacks.

Example: Someone creates an EC2 instance via the AWS Console, outside of any Terraform stack. It will appear in the next Job A report.
“One new unmanaged resource detected in example AWS account: EC2 Example”
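Driftctl can emit its scan results as JSON (e.g. with `-o json://stdout`), which makes it straightforward to turn unmanaged resources into report lines like the one above. A hedged Go sketch, assuming a simplified subset of the output shape (the real driftctl JSON contains more fields than shown here):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// driftctlScan mirrors an assumed subset of the JSON that
// `driftctl scan -o json://stdout` emits.
type driftctlScan struct {
	Unmanaged []struct {
		ID   string `json:"id"`
		Type string `json:"type"`
	} `json:"unmanaged"`
}

// unmanagedReport turns a scan result into human-readable report lines.
func unmanagedReport(account string, raw []byte) ([]string, error) {
	var scan driftctlScan
	if err := json.Unmarshal(raw, &scan); err != nil {
		return nil, err
	}
	var lines []string
	for _, r := range scan.Unmanaged {
		lines = append(lines, fmt.Sprintf(
			"One new unmanaged resource detected in %s AWS account: %s %s",
			account, r.Type, r.ID))
	}
	return lines, nil
}

func main() {
	sample := []byte(`{"unmanaged":[{"id":"Example","type":"aws_instance"}]}`)
	lines, _ := unmanagedReport("example", sample)
	fmt.Println(lines[0])
}
```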

2. Job B: This task also runs once a day. It runs the “terraform plan” command in every Terraform stack that our team owns (~300) and highlights any changes to resources that are already managed with Terraform.

Example: Someone has an IAM role policy, managed by Terraform. They want to make changes to this policy. They do this manually via the AWS Console, instead of using Terraform code. Job B will add this to the next report.
“Drift detected in stack Example: 0 to add, 1 to change, 0 to destroy”
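Terraform makes this kind of check convenient: `terraform plan -detailed-exitcode` exits with code 2 when the plan contains changes and 0 when there are none. A minimal Go sketch of how a job could iterate over stack directories this way (the helper names are illustrative, not the team's actual code):

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

// planRe matches the "X to add, Y to change, Z to destroy" summary
// that terraform plan prints when there are changes.
var planRe = regexp.MustCompile(`(\d+) to add, (\d+) to change, (\d+) to destroy`)

// planSummary extracts the plan summary from terraform plan output,
// or returns "" if no summary line is present.
func planSummary(output string) string {
	return planRe.FindString(output)
}

// checkStack runs `terraform plan -detailed-exitcode` in a stack directory.
// Terraform exits 0 for no changes and 2 when the plan contains changes.
func checkStack(dir string) (drifted bool, summary string, err error) {
	cmd := exec.Command("terraform", "plan", "-detailed-exitcode", "-no-color", "-input=false")
	cmd.Dir = dir
	out, err := cmd.CombinedOutput()
	if exitErr, ok := err.(*exec.ExitError); ok && exitErr.ExitCode() == 2 {
		return true, planSummary(string(out)), nil
	}
	return false, "", err
}

func main() {
	out := "Plan: 0 to add, 1 to change, 0 to destroy."
	fmt.Printf("Drift detected in stack Example: %s\n", planSummary(out))
}
```

Across ~300 stacks, jobs like this are usually run with a bounded worker pool so a single slow `terraform plan` doesn't stall the whole report.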

3. Job C: This task uses AWS Resource Explorer to retrieve AWS resources, compares their tags to our tag policy, and then creates a report listing any missing tags, wrong tag values, suggestions, and so on.

Example 1: If an EC2 instance is missing the `Project` tag which is a mandatory tag in our tag policy, we will see in the report:

“EC2 instance i-4583945839485 is missing “Project” tag”

Example 2: If an S3 bucket has the “Team” tag, which is mandatory, but its value is not correct, we will see in the report:

“S3 bucket: “Team” tag is present but value (“blue”) is not correct. Correct values are: red, green, yellow.”
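The tag check itself is simple to express: for each mandatory tag, verify presence, then verify the value against the allowed list. A sketch in Go, assuming a simple in-memory representation of the tag policy (the real policy format and the Resource Explorer integration are not shown in the article):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// tagPolicy maps each mandatory tag key to its allowed values.
// An empty slice means any value is accepted.
type tagPolicy map[string][]string

// checkTags returns one report line per policy violation on a resource.
func checkTags(resource string, tags map[string]string, policy tagPolicy) []string {
	var issues []string
	keys := make([]string, 0, len(policy))
	for k := range policy {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic report order
	for _, key := range keys {
		allowed := policy[key]
		value, ok := tags[key]
		if !ok {
			issues = append(issues, fmt.Sprintf("%s is missing %q tag", resource, key))
			continue
		}
		if len(allowed) > 0 && !contains(allowed, value) {
			issues = append(issues, fmt.Sprintf(
				"%s: %q tag is present but value (%q) is not correct. Correct values are: %s.",
				resource, key, value, strings.Join(allowed, ", ")))
		}
	}
	return issues
}

func contains(values []string, v string) bool {
	for _, x := range values {
		if x == v {
			return true
		}
	}
	return false
}

func main() {
	policy := tagPolicy{"Project": nil, "Team": {"red", "green", "yellow"}}
	for _, issue := range checkTags("S3 bucket example", map[string]string{"Team": "blue"}, policy) {
		fmt.Println(issue)
	}
}
```

Running this against the two examples above would flag both the missing “Project” tag and the invalid “Team” value in a single pass.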

What’s next? Our teams are always coming up with new jobs that might be implemented in the future: checking our GitHub repository configuration, generating reports that suggest resources to clean up, and scanning Kubernetes clusters for anomalies, to name but a few.

Want to join our Engineering & Data team? You can apply directly here: we are actively looking for several Senior Data Engineers (Ad networks) and Senior Data Engineers.
