State Drift Detection using Terraform

Your infrastructure, just like the real world, is constantly changing. But differentiating between an expected and an unexpected change can be difficult. How can you tell them apart? A firewall rule change may be intentional and expected, or it may be the first sign of malicious activity. Distinguishing between these two types of change was a challenge we faced early on at ACL, but surprisingly, no great solution existed for it.

Distinguishing between expected and malicious activities is a difficult task

Sure, there are lots of solutions that can monitor your infrastructure and notify you of changes, but no great solution that can monitor and filter for the events you care about. This is because most products and services do not understand your infrastructure’s context and requirements. They can either alert you to universally suspicious activities, e.g. non-MFA logins, or simply give you an easy way to see infrastructure changes, e.g. firewall rule changes, so you can spot unexpected activity manually. They focus on events and on making those events easy to digest, rather than on the state of your infrastructure and how it compares to your intended state.

Imagine you had an event stream for important actions in AWS, such as AWS CloudTrail. You would see lots of firewall (security group) changes and lots of permission (IAM) changes, but how would you know at the end of the day that your infrastructure is in the state you expect? That’s where event-stream-based solutions begin to fall apart. They can be valuable when every event represents suspicious activity, but once regular activity is mixed in, an event stream provides minimal value.

AWS CloudTrail shows you everything, irrespective of impact level or intention

At ACL, we weren’t content to accept this limitation and keep developing our infrastructure without assurance that it would remain in its intended state. We wanted to rest assured that, at the end of the day, everything was set up as intended by our infrastructure code. To address this, we devised a simple technique with Terraform, which we internally call State Drift Detection.

Terraform — Infrastructure as Code

ACL has a DevOps culture, and part of that means we develop our infrastructure as code. When it comes to managing AWS infrastructure via code, there are two primary candidates: AWS CloudFormation and Terraform. At ACL, we’ve opted for Terraform. Both of these tools, at their core, are quite simple. You use their syntax to define the AWS resources you’d like to create and, voilà, it’s done.

“Write, Plan, and Create Infrastructure as Code”

For example, the following code creates a Security Group named allow_all that allows inbound traffic on port 80 from your VPC’s network (10.0.0.0/16 here). It’s not a secure example, but it keeps things simple enough to demonstrate the idea.

resource "aws_security_group" "allow_all" {
  name        = "allow_all"
  description = "Allow all HTTP inbound traffic"

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]
  }
}

What’s interesting about Terraform is its distinct steps of planning and applying. You can see beforehand exactly what Terraform plans to do in AWS, and if you are happy with it, you can apply it.

$ terraform plan
terraform plan — intended change to be made by Terraform
$ terraform apply
terraform apply — the change made by Terraform

After applying, Terraform maintains a snapshot of the state of resources it provisioned. This is stored in a .tfstate file and is used as the baseline for future plan and apply executions.

Once applied, a .tfstate file is made which captures the resources Terraform tracks

In order for Terraform to accurately tell you its plan, it will first refresh its state with the real-world infrastructure, so it can tell you exactly what it will do when applied. This was a feature I did not truly appreciate at first.

After using Terraform for a while, I realized that if I ever wanted to know whether something had unintentionally changed in our infrastructure, I just needed to run plan and see whether Terraform intended to do anything. Any intentional change would already be in the source code and applied, so Terraform would have nothing to plan. However, if anyone manually changed any part of the AWS infrastructure that Terraform manages, Terraform’s plan would identify it and let us know. In other words, if our AWS infrastructure drifted from its expected state, Terraform’s plan would detect it.

Hmm… could I automate this too? What if I had our internal task runner (i.e. Jenkins) run this on a continuous basis and let us know if anything changed? That’d give me a lot of sanity! And that’s how our internal capability of State Drift Detection was born.

State Drift Detection

Now that you understand the problem we are trying to solve, let’s demonstrate how easily you can set up State Drift Detection using Terraform.

Terraform’s plan command has an option called -detailed-exitcode.

Full documentation can be found here

With this option, the exit code of terraform plan -detailed-exitcode tells you whether Terraform plans to make any changes: 0 means there is nothing to change, 1 means the plan failed with an error, and 2 means changes are pending. Thus, to set up State Drift Detection, simply have your task runner (e.g. Jenkins) clone your Terraform repository, run terraform plan -detailed-exitcode on it, and fail the task on any non-zero exit code. Once you’ve got that going, hook your task runner into Slack and, voilà, State Drift Detection!

Jenkins will continually execute `terraform plan` and let you know if unexpected changes were detected.
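To make the idea concrete, here is a minimal sketch of what such a task runner step could look like. It is not ACL's actual job: the function name check_drift and the TERRAFORM_BIN variable are illustrative names, and the sketch assumes the project directory has already been initialized with terraform init and that AWS credentials are available to the job.

```shell
#!/usr/bin/env bash
# Sketch of a drift check built on `terraform plan -detailed-exitcode`.
# TERRAFORM_BIN and check_drift are hypothetical names used for illustration.

# Allow the terraform binary to be overridden (handy for testing).
TERRAFORM_BIN="${TERRAFORM_BIN:-terraform}"

# Run a plan in the given directory and translate the exit code:
#   0 = nothing to change (no drift)
#   1 = the plan itself failed
#   2 = changes pending (drift, or unapplied code)
check_drift() {
  local dir="$1"
  local status=0

  # -input=false keeps an unattended job from waiting on prompts.
  (cd "$dir" && "$TERRAFORM_BIN" plan -detailed-exitcode -input=false -no-color) || status=$?

  case "$status" in
    0) echo "OK: no drift detected in $dir" ;;
    2) echo "DRIFT: plan reports pending changes in $dir" ;;
    *) echo "ERROR: terraform plan failed in $dir (exit $status)" ;;
  esac

  # Propagate the status so the task runner marks the build as failed.
  return "$status"
}
```

A Jenkins shell step could source this file and call check_drift with the path to a cloned project. Because the function returns the plan's exit code, both drift (2) and plan errors (1) fail the build, which is what triggers the notification.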

The hardest part of getting State Drift Detection going is having your infrastructure coded in Terraform. If you already have that, you’re all set! If you don’t have that, you need to invest the time. I recommend prioritizing the AWS resources you care about and gradually bringing them into Terraform. Fortunately, Terraform has an import command to speed up the process, and unlike CloudFormation, you can import your AWS resources rather than needing to create new ones. In fact, this is why ACL chose Terraform over CloudFormation in the first place.
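If you are bringing existing resources into Terraform gradually, a small wrapper around terraform import can help. The sketch below is only an illustration: import_all and TERRAFORM_BIN are hypothetical names, and the resource ID in the usage note is a placeholder. It assumes each resource already has a matching resource block in your configuration, since terraform import attaches an existing resource to an address you have already declared.

```shell
#!/usr/bin/env bash
# Sketch of a batch import helper around `terraform import ADDRESS ID`.
# TERRAFORM_BIN and import_all are hypothetical names used for illustration.

TERRAFORM_BIN="${TERRAFORM_BIN:-terraform}"

# Read "ADDRESS ID" pairs from stdin, one per line, and import each one.
# Blank lines and lines starting with '#' are skipped.
import_all() {
  local address id
  while read -r address id || [ -n "$address" ]; do
    case "$address" in
      ''|'#'*) continue ;;
    esac
    echo "Importing $id as $address"
    # Redirect stdin so terraform cannot consume the rest of the list.
    "$TERRAFORM_BIN" import -input=false "$address" "$id" </dev/null || return 1
  done
}
```

For example, piping a line such as aws_security_group.allow_all followed by a placeholder ID like sg-0123456789abcdef0 into import_all would import that security group. Running terraform plan afterwards shows how far your code still differs from the real resource.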

ACL’s Scenario

At ACL, we use Terraform to perform State Drift Detection on dozens of Terraform “projects”. As the person responsible for our infrastructure, it gives me a lot of sanity knowing everything is set up as intended. Without this, it’d be extremely difficult to know whether all five of our production AWS regions and both of our development AWS regions are in sync and configured correctly.

We still complement our State Drift Detection with an event stream solution, since not all important AWS activities are state-based, e.g. root user logins. We simply don’t rely on that event stream to do everything for us. Combining State Drift Detection with a traditional event stream monitoring solution has proved invaluable to our ability to stay sane and secure in our fast-paced DevOps culture.

As a fellow infrastructure developer, I hope this technique gives you some sanity too!