Drift Detection with IaC in Cloud Environments

Animesh Rastogi
Google Cloud - Community
4 min read · Jan 31, 2023
Drift Detection in IaC

Gone are the days when your entire IT infrastructure used to be managed by some brave system admins and a barrage of shell scripts and cron jobs.

In today’s cloud computing world, where APIs and event-driven architectures are the norm, there is a need to optimise how you design, develop and maintain your infrastructure.

That’s where Infrastructure as Code (IaC) comes into the picture. Essentially, with IaC, you write simple configuration files to create and manage the lifecycle of all your infrastructure resources. This addresses two critical needs:

  1. Creating Repeatable and Extensible infrastructure
  2. Changes to your infrastructure are peer reviewed in Git like your application code

These practices help organisations improve key metrics such as speed of deployment and accountability.

There are many tools, both OSS and SaaS, that let you practise IaC, such as Terraform, Crossplane, Deployment Manager (GCP), CloudFormation (AWS) and Pulumi.

In practice, however, adopting one of these tools and its practices is not easy. If your organisation has been creating infrastructure by hand using cloud consoles or CLIs, it takes a huge change management effort to ensure every team follows strict IaC principles.

How do IaC tools work?

In IaC, you write simple configuration files in HCL (HashiCorp Configuration Language), YAML, etc. which define your desired infrastructure state. These files are then translated into the appropriate API calls to the underlying infrastructure provider, which provisions the actual resources.

IaC tools like Terraform maintain a state file of all the resources they create and manage. Any changes made outside the tool need to be imported into the Terraform state by writing code.
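To make the role of the state file more concrete, here is a minimal Python sketch of how a tool could compare the desired configuration with its recorded state and decide which API calls to make. It illustrates the idea only and is not how Terraform is implemented; the `plan` function, the resource names and the attributes are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a resource is a flat set of attribute key/value pairs.
Resource = Dict[str, str]


def plan(desired: Dict[str, Resource], state: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, resource_name) steps needed to make state match desired."""
    actions: List[Tuple[str, str]] = []
    for name, attrs in desired.items():
        if name not in state:
            actions.append(("create", name))   # declared but never provisioned
        elif state[name] != attrs:
            actions.append(("update", name))   # provisioned with different attributes
    for name in state:
        if name not in desired:
            actions.append(("delete", name))   # provisioned but no longer declared
    return actions


# Made-up example data, not real infrastructure.
desired = {
    "vm-web": {"machine_type": "e2-small", "zone": "us-central1-a"},
    "bucket-logs": {"location": "US"},
}
state = {
    "vm-web": {"machine_type": "e2-micro", "zone": "us-central1-a"},
    "vm-old": {"machine_type": "e2-micro", "zone": "us-central1-a"},
}

for action, name in plan(desired, state):
    print(action, name)
```

Note that a resource created by hand never appears in `state`, so a comparison like this cannot see it. That blind spot is where drift comes from.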

Common Problems when adopting IaC practices

  1. Convincing developers of the need for IaC: developers want to focus on their applications without worrying about the lifecycle of the underlying infrastructure. That’s why organisations need to adopt good Platform Engineering practices.
  2. Another tool in the stack for which expertise needs to be built
  3. Outages: during an outage, your focus is to get your system up and running as soon as possible, and suspending best practices is acceptable. However, backporting those manual changes to the IaC state is then another task to be undertaken.

As mentioned above, manual changes by team members can cause major drift between the infrastructure as defined in the IaC state and what is actually present. This can cause serious problems such as failed pipelines, lost configuration and security vulnerabilities.

Approaches to Drift Detection

Drift detection seems easy but is deceptively difficult to implement. A drift detection solution typically takes one of two approaches:

  1. Compare the state defined in the state file with what is actually deployed. Example: driftctl
  2. Detect any changes made to infrastructure that were not made by the IaC tool
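For contrast, the first approach can be sketched in a few lines of Python. This is only an illustration of the comparison and says nothing about how driftctl itself works; the `detect_drift` function and the sample inventories are hypothetical, and a real tool would have to build `actual` by querying the cloud provider's APIs.

```python
from typing import Dict, List

# Hypothetical model: a resource is a flat set of attribute key/value pairs.
Resource = Dict[str, str]


def detect_drift(state: Dict[str, Resource], actual: Dict[str, Resource]) -> List[str]:
    """Compare what the state file records with what is actually deployed."""
    findings: List[str] = []
    for name, recorded in state.items():
        live = actual.get(name)
        if live is None:
            findings.append(f"{name}: in state but missing from the cloud")
            continue
        for key, value in recorded.items():
            if live.get(key) != value:
                findings.append(f"{name}: {key} is {live.get(key)!r}, state says {value!r}")
    for name in actual:
        if name not in state:
            findings.append(f"{name}: exists in the cloud but is not managed")
    return findings


# Made-up example data.
state = {
    "vm-web": {"machine_type": "e2-small"},
    "vm-db": {"machine_type": "e2-medium"},
}
actual = {
    "vm-web": {"machine_type": "e2-standard-2"},  # resized by hand
    "vm-debug": {"machine_type": "e2-micro"},     # created by hand
}

for finding in detect_drift(state, actual):
    print(finding)
```

The hard part in practice is building an accurate `actual` inventory for every resource type, which is one reason this approach is harder to implement than it looks.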

In this blog, we shall focus on the second approach since it’s easy to implement and the same idea applies to any cloud provider that exposes audit logs.

Solution

In this solution, we shall make use of Audit Logs in GCP to trigger alerts whenever someone manually creates or deletes a GCP resource.

Solution Map

Basically, Terraform authenticates itself with GCP by impersonating a Service Account. Any resource created by a principal other than this Service Account triggers an alert, which can be delivered through any notification channel such as Email, Slack, PagerDuty, etc.
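The rule above can be sketched as a small Python function that decides whether a single audit log entry counts as a manual change. This is an illustration of the logic only, not something you need to deploy. The field names follow the audit log fields used in the query in step 2, while the principal emails, the method patterns as Python regular expressions and the sample entries are assumptions made for the example.

```python
import re
from typing import Any, Dict

# Assumed placeholder identities: replace with your Terraform Service Account
# and the Google-managed service agent of your project.
TRUSTED_PRINCIPALS = {
    "terraform-sa@example-project.iam.gserviceaccount.com",
    "123456789012@cloudservices.gserviceaccount.com",
}

# Same method-name patterns as the log query in step 2 (unanchored regex search).
METHOD_PATTERNS = [
    r".instances.insert",
    r".Create.",
    r".instances.create",
    r".Delete.",
    r".instances.delete",
]


def is_manual_change(entry: Dict[str, Any]) -> bool:
    """Return True if the entry is a create/delete call made by an untrusted principal."""
    payload = entry.get("protoPayload", {})
    principal = payload.get("authenticationInfo", {}).get("principalEmail", "")
    method = payload.get("methodName", "")
    if not principal:
        return False  # entries without a principal are ignored, as in the query
    if not any(re.search(pattern, method) for pattern in METHOD_PATTERNS):
        return False  # not a creation or deletion event
    return principal not in TRUSTED_PRINCIPALS


# Hand-written sample entries, not real logs.
by_hand = {"protoPayload": {
    "methodName": "v1.compute.instances.insert",
    "authenticationInfo": {"principalEmail": "alice@example.com"},
}}
by_terraform = {"protoPayload": {
    "methodName": "v1.compute.instances.insert",
    "authenticationInfo": {"principalEmail": "terraform-sa@example-project.iam.gserviceaccount.com"},
}}

print(is_manual_change(by_hand))       # True
print(is_manual_change(by_terraform))  # False
```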

1. Create a Notification Channel

Navigate to Notification Channels from the search bar at the top of the console. Configure a Notification Channel of your preference. I’ll choose Email. Add the details as follows:

Creating Email Notification Channel

2. Go to Logs Explorer and Build your Log Query

Write the following Query in the Query Editor and Click Run Query

logName="projects/<project-id>/logs/cloudaudit.googleapis.com%2Factivity"
AND (
  protoPayload.methodName=~".instances.insert" OR
  protoPayload.methodName=~".Create." OR
  protoPayload.methodName=~".instances.create" OR
  protoPayload.methodName=~".Delete." OR
  protoPayload.methodName=~".instances.delete"
)
AND protoPayload.authenticationInfo.principalEmail!="terraform-sa@gserviceaccount.com"
AND protoPayload.authenticationInfo.principalEmail!="<project-number>@cloudservices.gserviceaccount.com"
AND protoPayload.authenticationInfo.principalEmail:*

Replace the project-id, the project-number and the email of the Service Account that Terraform uses to authenticate in the above query. Once you run the query, you should start seeing matching log entries. If you don’t see any logs, increase the time range.
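If you manage several projects, filling in these placeholders by hand gets tedious. Here is a minimal Python sketch that generates the same query from the three values. The `build_drift_filter` helper and the sample values are hypothetical; the output is plain text that you paste into the Query Editor.

```python
from typing import List

# Method-name patterns taken from the query above.
METHOD_PATTERNS: List[str] = [
    ".instances.insert",
    ".Create.",
    ".instances.create",
    ".Delete.",
    ".instances.delete",
]


def build_drift_filter(project_id: str, project_number: str, terraform_sa: str) -> str:
    """Build the Logs Explorer query for one project as a single string."""
    method_clause = " OR ".join(
        f'protoPayload.methodName=~"{pattern}"' for pattern in METHOD_PATTERNS
    )
    clauses = [
        f'logName="projects/{project_id}/logs/cloudaudit.googleapis.com%2Factivity"',
        f"({method_clause})",
        f'protoPayload.authenticationInfo.principalEmail!="{terraform_sa}"',
        f'protoPayload.authenticationInfo.principalEmail!="{project_number}@cloudservices.gserviceaccount.com"',
        "protoPayload.authenticationInfo.principalEmail:*",
    ]
    return "\nAND ".join(clauses)


# Assumed sample values, replace with your own.
query = build_drift_filter(
    project_id="example-project",
    project_number="123456789012",
    terraform_sa="terraform-sa@example-project.iam.gserviceaccount.com",
)
print(query)
```

The method patterns are grouped in parentheses so that the intent stays explicit when the clauses are joined with AND.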

Please note that this query does not cover all GCP services and it only supports creation and deletion events. I shall add support for more services and API actions later.

3. Configure Log Based Alerts

Create Log Based Alert

Just above the log entries, you will see an option to Create Alert. Click on it and enter the details.

Alert Policy Name: Drift Detection
Keep everything else as default and Select the Notification Channel Configured in Step 1.

That's it. You’re set.

Now test the setup by creating one VM with Terraform and another manually. You should receive an email only for the VM you created manually.

Conclusion

In a perfect world, all your team members follow IaC best practices and no infrastructure is provisioned manually. Ah! What a world that would be. But, in the real world, it rarely works like that.

This solution is an easy way to get started with drift detection in your cloud environments. This example uses GCP, but a similar solution can easily be implemented using AWS CloudTrail logs.
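As a rough idea of what the AWS variant might look like, here is a Python sketch of the same check applied to a CloudTrail event record. It assumes Terraform runs under an assumed IAM role, here hypothetically named `terraform`, and it only inspects the `eventName` and `userIdentity.arn` fields. The list of event-name prefixes is an illustrative assumption and far from complete, and the sample events are hand-written.

```python
from typing import Any, Dict

# Assumed, incomplete set of prefixes for write-style API calls.
WRITE_PREFIXES = ("Create", "Delete", "Run", "Terminate")


def is_manual_aws_change(event: Dict[str, Any], terraform_role_name: str) -> bool:
    """Return True if a write-style event was not made through the Terraform role."""
    event_name = event.get("eventName", "")
    caller_arn = event.get("userIdentity", {}).get("arn", "")
    if not caller_arn:
        return False  # no identifiable caller, skip
    if not event_name.startswith(WRITE_PREFIXES):
        return False  # not a creation or deletion event
    # Assumed-role sessions have ARNs of the form
    # arn:aws:sts::<account-id>:assumed-role/<role-name>/<session-name>
    made_by_terraform = f":assumed-role/{terraform_role_name}/" in caller_arn
    return not made_by_terraform


# Hand-written sample events, not real CloudTrail output.
manual_event = {
    "eventName": "RunInstances",
    "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"},
}
terraform_event = {
    "eventName": "RunInstances",
    "userIdentity": {"arn": "arn:aws:sts::123456789012:assumed-role/terraform/ci-run"},
}

print(is_manual_aws_change(manual_event, "terraform"))     # True
print(is_manual_aws_change(terraform_event, "terraform"))  # False
```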

As more organisations adopt IaC practices and tooling, a more robust real time drift detection solution needs to come forth.

References:

https://driftctl.com/

https://cloud.google.com/logging/docs/view/logging-query-language

https://www.terraform.io/

If you know of any OSS tools that detect drift effectively, kindly let me know in the comments.
