Trust but Verify with your Infrastructure as Code (IaC)

Your Cloud’s Missing Feedback Loop

Josh Armitage
Sep 29, 2022

By the end of this blog, you will understand why infrastructure drift undermines your initiatives as a cloud platform team, and how to use drift-detection tooling to understand and improve IaC hygiene across your AWS organisation.

Michael Feathers defined legacy code as code without tests. Extending his definition, we can define legacy infrastructure as infrastructure not managed via IaC.

IaC as your Bedrock

When looking at your cloud estate through an operational, security, or delivery lens, IaC is the foundation you build on. Manual changes are operationally risky; it is borderline impossible to apply preventative guardrails to them; and they are an amazing way to create snowflake environments with unique quirks. Only through high levels of IaC adoption and proficiency do you unlock the ability to truly leverage what the cloud offers.

I’ve seen ClickOps erode AWS organisations across many clients, lengthening feedback cycles and producing vastly lower-quality outcomes. At certain levels of maturity, ClickOps is inevitable during incidents, but those quick fixes, if not brought back under management, quickly corrode your infrastructure.

“Short cuts make long delays.” (J.R.R. Tolkien)

When I was leading the platform team at one of the UK’s biggest energy providers, we were posed the following challenge.

Given autonomous delivery teams working in an AWS organisation, how can we ensure proper IaC hygiene is being followed?

The Base Concern

Any benefit you attempt to get from IaC is primarily constrained by how far your infrastructure has drifted. Any IaC-based tool you introduce, such as Checkov or Infracost, will only achieve its value proposition against non-drifted infrastructure. Therefore, having a strategy to manage and minimise drift is the only way to ensure a strong foundation and get the expected return on investment when extending your toolchain.

The Three Forms of Drift

When looking at drift, there are three key forms:

  1. Infrastructure is present in IaC, but the configuration has been changed
  2. Infrastructure is present in IaC, but not present in the account
  3. Infrastructure is not present in IaC, but is present in the account
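The three forms can be captured in a small classification helper. A minimal sketch; the names here are mine, not driftctl's:

```python
from enum import Enum
from typing import Optional

class DriftKind(Enum):
    """The three forms of drift described above."""
    CHANGED = "in IaC, but configuration changed"
    MISSING = "in IaC, but not in the account"
    UNMANAGED = "in the account, but not in IaC"

def classify(in_iac: bool, in_account: bool, config_matches: bool = True) -> Optional[DriftKind]:
    """Map a resource's state to one of the three drift kinds, or None if healthy."""
    if in_iac and in_account:
        return None if config_matches else DriftKind.CHANGED
    if in_iac:
        return DriftKind.MISSING
    if in_account:
        return DriftKind.UNMANAGED
    return None
```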

As a platform team, we were more concerned with the fact that drift within an account had increased than with exactly what had drifted. The specifics of what had drifted we needed to provide to the delivery teams, so they were empowered to self-correct drift efficiently.

Much like we wouldn’t build our own IaC language, we needed a pre-built engine to drive our feedback loop so we could understand where the lapses in hygiene were occurring.

Enter DriftCtl

DriftCtl is an open-source tool that gives you a full understanding of the drift within an account. Feed it Terraform state files and point it at an AWS account, and it outputs a summary outlining how many resources are managed and how many fall under each kind of drift.

By running driftctl scan, you’ll see a list of uncovered resources, and a summary like the following:

...
Found 112 resource(s)
- 0% coverage
- 0 resource(s) managed by Terraform
- 112 resource(s) not managed by Terraform
- 0 resource(s) found in a Terraform state but missing on the cloud provider
...
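driftctl can also emit its report as JSON (via its output flag), which is far easier to post-process than the text summary. A parsing sketch; the summary field names below are assumptions based on driftctl's JSON report format, so verify them against your driftctl version:

```python
import json

def parse_summary(raw: str) -> dict:
    """Extract the headline drift numbers from a driftctl JSON report.

    Field names are assumptions; check them against your driftctl version.
    """
    report = json.loads(raw)
    s = report["summary"]
    total = s["total_resources"]
    return {
        "total": total,
        "managed": s["total_managed"],
        "unmanaged": s["total_unmanaged"],
        "missing": s["total_deleted"],
        "changed": s["total_changed"],
        "coverage_pct": round(100 * s["total_managed"] / total) if total else 0,
    }

# A report matching the summary above
example_report = (
    '{"summary": {"total_resources": 112, "total_managed": 0,'
    ' "total_unmanaged": 112, "total_deleted": 0, "total_changed": 0}}'
)
```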

Powerful in its own right, but from the platform team’s perspective, this summary sitting in a CloudWatch log group wasn’t going to drive the changes we were looking to make. We needed to go one step further.

Scaling Across An Organisation

What we needed was the ability to centralise DriftCtl’s results in one location so we could understand IaC hygiene in the aggregate: to see the trend of IaC coverage and the three kinds of drift over time, not just a point-in-time snapshot per account.

For DriftCtl to give you accurate results, it needs two things:

  1. An assumable AWS IAM role in the account with sufficient privileges to scan the infrastructure, (you can find the policy in the documentation here). Relatively trivial to deploy and manage.
  2. The location of, and the ability to access all the Terraform state files pertaining to the account. A potentially complex proposition.
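The first requirement can be sketched as follows. The role name is hypothetical (use whatever your organisation deploys), and the boto3 STS call returns temporary credentials you can hand to a driftctl process as environment variables:

```python
SCAN_ROLE = "driftctl-scan"  # hypothetical role name; substitute your own

def scan_role_arn(account_id: str, role_name: str = SCAN_ROLE) -> str:
    """Build the ARN of the assumable scanning role in a target account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"

def scan_credentials(account_id: str) -> dict:
    """Assume the scanning role and return credentials as env vars for driftctl."""
    import boto3  # imported lazily so the pure helper needs no AWS dependency

    creds = boto3.client("sts").assume_role(
        RoleArn=scan_role_arn(account_id),
        RoleSessionName="driftctl-scan",
    )["Credentials"]
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }
```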

Finding The State

When looking at any AWS account, you can split resources into two buckets: platform and application. That is, the resources that are centrally managed and deployed, and the resources deployed by the team using the account.

In our case, we stored the platform resource state centrally, making it easy for us to access. For application resources, we needed a mechanism to record where the disparate state files are stored. As the simplest and quickest solution, we chose to store the locations in AWS Parameter Store, one parameter per account.
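A sketch of that lookup. The parameter naming convention and the comma-separated value format are assumptions for illustration, not the original implementation:

```python
PARAM_PREFIX = "/platform/driftctl/state-locations"  # hypothetical naming convention

def state_param_name(account_id: str) -> str:
    """One parameter per account, keyed by account ID."""
    return f"{PARAM_PREFIX}/{account_id}"

def parse_state_locations(value: str) -> list:
    """Assume state locations are stored as a comma-separated list of S3 URIs."""
    return [loc.strip() for loc in value.split(",") if loc.strip()]

def get_state_locations(account_id: str) -> list:
    import boto3  # lazy import keeps the parsing helpers AWS-free

    ssm = boto3.client("ssm")
    value = ssm.get_parameter(Name=state_param_name(account_id))["Parameter"]["Value"]
    return parse_state_locations(value)
```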

This introduced a manual onboarding step for accounts, which you could automate through CI/CD pipeline instrumentation, but rather than over-engineer a solution, we decided to crank the mechanical turk to scale ours.

Running DriftCtl

As DriftCtl comes neatly packaged in a Docker image, we chose AWS CodeBuild as our execution platform, which, in my opinion, remains the simplest way to run a scheduled container on AWS. Using Amazon EventBridge to schedule the daily AWS CodeBuild job, we wrote code to do the following:

1. Get an active account list for the organisation
2. For each account, retrieve the state file locations
3. Assume the role in each account, and perform a DriftCtl scan
4. Convert the scan output into an event and forward on for storage and analytics
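Steps 3 and 4 can be sketched roughly as below. The driftctl flags shown should be checked against your driftctl version, the report field names are assumptions, and the credentials/state-location helpers are the hypothetical ones from earlier:

```python
import json
import subprocess

def run_scan(env: dict, state_locations: list) -> dict:
    """Run a driftctl scan against the given state files and parse the JSON report (step 3)."""
    cmd = ["driftctl", "scan", "--output", "json://stdout"]
    for loc in state_locations:
        cmd += ["--from", f"tfstate+{loc}"]
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=False)
    return json.loads(result.stdout)

def to_event(account_id: str, report: dict) -> dict:
    """Convert a scan report into a flat event for storage and analytics (step 4)."""
    summary = report["summary"]
    return {
        "account_id": account_id,
        "managed": summary["total_managed"],
        "unmanaged": summary["total_unmanaged"],
        "missing": summary["total_deleted"],
        "changed": summary["total_changed"],
        "total": summary["total_resources"],
    }
```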

This ensured that we were gathering data on each account at least once per day, a granularity that fit our purpose well. We didn’t need an update every 5 minutes, but we also didn’t want to wait a week to find out drift had occurred.

To see example code for the conversion in step 4, see this repository.

Visualising IaC Coverage

With our engine purring along, we had the data flowing but no way of seeing what the data was trying to tell us.

To keep the TCO of our platform low, we’d already built internal dashboards on Amazon Managed Grafana, backed by Amazon Timestream. It was then a simple process of handling the events from our daily DriftCtl runs and storing them appropriately in an Amazon Timestream table, which gave us a per-account dashboard like the following:

Not only can we see the explicit numbers from the summary, but we can see the trend for each measure over time.
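Feeding Timestream is a matter of turning each scan event into one record per measure. A sketch, where the database and table names are assumptions and the event shape is the hypothetical one from the conversion step:

```python
import time

def to_records(event: dict) -> list:
    """Build one Timestream record per drift measure from a scan event."""
    now = str(int(time.time() * 1000))  # Timestream expects a string timestamp
    return [
        {
            "Dimensions": [{"Name": "account_id", "Value": event["account_id"]}],
            "MeasureName": measure,
            "MeasureValue": str(event[measure]),
            "MeasureValueType": "BIGINT",
            "Time": now,
        }
        for measure in ("managed", "unmanaged", "missing", "changed")
    ]

def store(event: dict) -> None:
    import boto3  # lazy import keeps the record builder testable offline

    boto3.client("timestream-write").write_records(
        DatabaseName="platform",   # hypothetical database name
        TableName="iac_drift",     # hypothetical table name
        Records=to_records(event),
    )
```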

To ensure we were alerted on drops in IaC coverage, we also added daily Slack messages, where a drop in coverage would notify everyone in the channel. This also gave teams a direct link to the full report of the drift in any given account.
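The alerting side reduces to comparing coverage day over day and posting to a Slack incoming webhook when it drops. A sketch; the webhook mechanism matches Slack's incoming-webhook API, while the message wording and event shape are mine:

```python
def coverage(event: dict) -> float:
    """IaC coverage as a percentage of total resources."""
    return 100 * event["managed"] / event["total"] if event["total"] else 0.0

def coverage_alert(account_id: str, previous: float, current: float) -> str:
    """Build the Slack message; <!channel> pings everyone when coverage drops."""
    if current >= previous:
        return ""
    return (
        f"<!channel> IaC coverage in account {account_id} dropped "
        f"from {previous:.0f}% to {current:.0f}%"
    )

def notify(webhook_url: str, message: str) -> None:
    """Post the message to a Slack incoming webhook."""
    import json
    import urllib.request

    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": message}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```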

Feedback Loop Complete

Between Slack and Grafana, we now had the required feedback loops in place as a platform team to help empower higher levels of IaC adoption across the AWS estate.

Within 24 hours of ClickOps being used in an account, we would have a notification fire, alerting us and providing a full breakdown of the drift within the account.

Through Grafana we could see the trend of any account in our organisation, understanding whether our teams were practicing better hygiene as they became more familiar with Terraform as a tool.
