Detecting and Handling Drift with Terraform and CircleCI is easy

Georgi Nikolov
Published in News UK Technology
5 min read, Aug 24, 2021

In this article I will talk about infrastructure as code with Terraform, the challenge of keeping it in sync with real-world infrastructure, and some tricks for detecting undesired drift. I will explain how we approached the problem and describe our automated solution, which combines Terraform and CircleCI orbs.

Introduction

I work as an engineer on the Wireless digital team, a part of News UK Technology, and we have built an Audio Platform to serve live radio and on-demand digital audio content to listeners across multiple brands (e.g. talkSPORT, Times Radio, Virgin Radio).

My team powers mobile, voice and web products by exposing a GraphQL API. We chose GraphQL so that our clients can send requests to a single endpoint, which means we no longer have to worry about maintaining and versioning multiple REST endpoints. We can also easily aggregate data from multiple sources into one convenient API. For example, our implementation allows us to serve the programme schedule alongside catch-up content for previous days’ programmes, which are stored in different databases, or our podcast content, which is managed by a third-party vendor.

Problem

At NewsUK we use separate AWS accounts for our environments: pre-prod and prod. Multiple teams and stakeholders have access to these accounts for different purposes. Because we are just one team of many, one of the challenges is keeping an up-to-date record of all deployed infrastructure and of the (potentially conflicting) changes made as part of company-wide policies.

For some context: recently, in response to an organisation-wide cost-saving initiative, our colleagues in a stakeholder team changed the retention policy for CloudWatch logs from ‘Never’ to ‘90 days’ across all accounts. Their cross-cutting changes were applied outside our usual deployment process, so had it not been for their excellent communication, we would have overwritten them with our next deployment. Overwriting a change like this has cost implications, and it is not hard to imagine other seemingly innocuous changes or rollbacks with more serious consequences.

Solution

When it comes to managing AWS infrastructure as code, there are two main candidates: AWS CloudFormation and Terraform.

Both tools are similar: you describe your infrastructure in a high-level configuration syntax. At NewsUK we opted for Terraform. Like CloudFormation, it uses configuration files to detail the infrastructure setup, but it goes further by being cloud-agnostic and by allowing multiple providers and services to be combined and composed. Because it supports multiple providers, we can also manage our monitoring through its New Relic provider.

Since our infrastructure is described in Terraform, any change made to it outside the Terraform ecosystem will show up as drift.

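The cool thing about Terraform is that planning and applying infrastructure changes are distinct steps. For example, a configuration file that declares an aws_s3_bucket resource describes an S3 bucket that Terraform will create.

With terraform plan you can see beforehand what Terraform will do when you call apply. For a new bucket, the plan reports one resource to add and nothing to change or destroy.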

The important thing here is that, in order to create an accurate plan, Terraform first refreshes its state against the real-world infrastructure. The state is a file, something like a blueprint of the managed infrastructure. In our case, that file is stored remotely, in an S3 bucket. The refresh happens automatically just before the plan.

We have integrated this workflow into our CircleCI pipeline, with the terraform plan and apply commands running as separate jobs.

Knowing all of that, the only question was: can we take advantage of it?

The answer is yes. If anything has changed unintentionally, we will see it in the plan. Anything that was changed intentionally is already in the source code, so for intentional changes Terraform would not plan to do anything. A non-empty plan for code that has already been applied therefore means drift.

But so far we have been inspecting the plan ourselves, which obviously does not lend itself to automation.

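Luckily for us, Terraform’s plan command has an option called -detailed-exitcode. According to the Terraform docs, when this option is set the command exits with 0 if the plan succeeded with an empty diff (no changes), 1 if there was an error, and 2 if the plan succeeded with a non-empty diff (changes present).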
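
A minimal sketch of acting on that exit code from a script follows. It is not the code of our orb: the function names and the status labels ("in_sync", "drift", "error") are hypothetical, and it assumes the terraform binary is on the PATH and the working directory has already been initialised.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
NO_CHANGES = 0       # plan succeeded, empty diff
ERROR = 1            # plan failed
CHANGES_PRESENT = 2  # plan succeeded, non-empty diff


def classify_plan_exit_code(code: int) -> str:
    """Map a plan exit code to a status label (the labels are hypothetical)."""
    if code == NO_CHANGES:
        return "in_sync"
    if code == CHANGES_PRESENT:
        return "drift"
    # 1 is the documented error code; treat anything unexpected as an error too.
    return "error"


def run_plan(workdir: str) -> str:
    """Run the plan in an initialised Terraform directory and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code is expected when drift exists
    )
    return classify_plan_exit_code(result.returncode)
```

Keeping the classification separate from the subprocess call means the mapping can be checked without Terraform installed; a scheduled job would call run_plan on the checked-out repository and branch on the returned label.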

With this in place, we introduced a scheduled workflow in CircleCI that runs once a day, every morning. It checks out the source code and runs terraform plan -detailed-exitcode. Based on the exit code, it uses the New Relic Event API to send a custom event, on which we have set up alert conditions that notify us in Slack.
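
To illustrate the reporting step, here is a hedged sketch of building and sending such a custom event in Python. The event type TerraformDriftCheck, the attribute names and the project label are all hypothetical, and the endpoint shown is the US-region Event API URL as I understand it, so check it against the New Relic documentation for your account before relying on it.

```python
import json
import urllib.request

# Assumption: US-region Event API endpoint; verify against New Relic's docs.
EVENT_API_URL = "https://insights-collector.newrelic.com/v1/accounts/{account_id}/events"


def build_drift_event(project: str, exit_code: int) -> dict:
    """Build a custom event; the event type and attribute names are hypothetical."""
    return {
        "eventType": "TerraformDriftCheck",
        "project": project,
        "exitCode": exit_code,
        # Sent as a string so it can be matched in an alert condition.
        "driftDetected": "true" if exit_code == 2 else "false",
    }


def send_event(event: dict, account_id: str, api_key: str) -> int:
    """POST one event to the Event API and return the HTTP status code."""
    request = urllib.request.Request(
        EVENT_API_URL.format(account_id=account_id),
        data=json.dumps([event]).encode("utf-8"),
        headers={"Content-Type": "application/json", "Api-Key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

In a CI job, the account ID and API key would come from environment variables configured in the project settings, and an alert condition could then watch for events where driftDetected is 'true'. An exit code of 1 means the plan itself failed, which probably deserves its own alert rather than being reported as "no drift".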

One cool feature of CircleCI is that it provides a way to create shareable packages of jobs and commands, called orbs. We extracted all of this functionality into a private orb that can then be distributed to all of our teams.

In Conclusion

At this point you might be thinking: “what about the infrastructure that is not managed by Terraform?” Indeed, our solution cannot cover that. After all, Terraform knows nothing about infrastructure resources outside of its ecosystem. The approach would be to import the existing infrastructure into the Terraform state first; only then would our solution work. The terraform import command brings an existing resource under Terraform’s management. You can find a step-by-step guide here: https://github.com/ulich/cloudformation-to-terraform-migration

With this automated and reusable drift detection mechanism in place, you can be confident that you will be alerted, at the latest on the next scheduled run, when a change is made to your infrastructure outside of Terraform, and you can act on it.
