Config Drift: The silent killer

Published in

ELMO Software

4 min read · May 4, 2022

Now we all know IAC is king in the DevOps landscape. If you’re not building your cloud infrastructure using reusable IAC, then why even bother! IAC is the best thing to happen to cloud infrastructure since sliced bread, but that doesn’t mean it’s perfect. Enter config drift, the evil arch-nemesis of IAC. Config drift is where your live cloud infrastructure differs from what was initially deployed and described in your IAC. This can lead to some pretty catastrophic failures.

Config drift can happen at the application level as well, so it doesn’t only apply to cloud infrastructure and IAC. But infrastructure drift is the one that’s caused me the most grief, so it’s the one I’m going to talk about!

How does config drift happen?

Say you deployed your whole cloud infrastructure using multiple airtight IAC pieces. Everything is hunky-dory and running smoothly. Then suddenly, one night at 3 am, you are alerted to a SEV1 incident. (Don’t laugh, this has happened to me haha.) You and the on-call app developer jump online and frantically debug the error, find a few issues and resolve them manually in the AWS console. Maybe you decided you needed to increase your instance sizes. Maybe you changed a few IAM roles to debug issues, or maybe it was a security group rule update. What the change was or why you made it doesn’t really matter at this point. What matters is that the IAC and what is deployed to the cloud are now different, and that introduces CONFIG DRIFT. *boooooooo*

The incident has been resolved and everything is working as expected. Hooray!? But what’s this? I go to make an important update to my IAC, run a state check, and now everything is out of whack. I can’t tell which changes come from my code update and which are being removed or changed because of the config drift. Now I have no idea if what I’m going to push out is going to cause another incident! PAIN.
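To make that confusion concrete, here is a minimal Python sketch of the comparison I wish I had at that moment. It is not how any particular IAC tool works internally. The function name, the dictionary keys and the values are all made up for illustration. It compares three snapshots (what the IAC said at the last deploy, what it says after my update, and what is actually live) and splits the pending changes into "from my code update" and "undoing drift".

```python
from typing import Any


def classify_changes(
    last_applied: dict[str, Any],
    desired: dict[str, Any],
    live: dict[str, Any],
) -> dict[str, dict[str, tuple[Any, Any]]]:
    """Split a pending plan into intended changes and drift reversals.

    last_applied: what the IAC declared at the last deploy
    desired:      what the IAC declares now, after the code update
    live:         what is actually running in the cloud
    Each reported entry maps a setting to (live value, value after deploy).
    """
    report: dict[str, dict[str, tuple[Any, Any]]] = {"code_update": {}, "drift": {}}
    for key in sorted(set(last_applied) | set(desired) | set(live)):
        old, new, actual = last_applied.get(key), desired.get(key), live.get(key)
        if actual == new:
            continue  # live already matches the code, nothing will change
        if new != old:
            # The code itself changed this setting, so the change is intended.
            # (If the setting also drifted, it still lands here.)
            report["code_update"][key] = (actual, new)
        else:
            # The code did not touch this setting, so the difference is drift
            # that the deploy would silently revert.
            report["drift"][key] = (actual, new)
    return report


# Hypothetical values: the instance size was bumped by hand at 3 am,
# while the code update only raises the minimum group size.
last_applied = {"instance_type": "t3.medium", "sg_ingress_port": 443, "min_size": 2}
desired = {"instance_type": "t3.medium", "sg_ingress_port": 443, "min_size": 3}
live = {"instance_type": "t3.xlarge", "sg_ingress_port": 443, "min_size": 2}

report = classify_changes(last_applied, desired, live)
print(report["code_update"])  # {'min_size': (2, 3)}
print(report["drift"])        # {'instance_type': ('t3.xlarge', 't3.medium')}
```

In this made-up case the deploy would not just raise the minimum size, it would also shrink the instances back to what the code says, which is exactly the kind of surprise that causes the next incident.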

How to natively avoid config drift?

Most config drift scenarios can be avoided by putting in place a good DevOps culture and mindset in your teams and a great process for updating and deploying your IAC. Ensuring that all updates and changes are made via IAC and deployed as quickly as possible will keep your configuration drift to a minimum. In short, it gets you by for now.

The next step to minimising config drift is to set up CI/CD for your IAC. Now anytime you commit a change to your IAC, it’s tested and deployed automatically for you. HOW GOOD! Now I don’t have to worry about manually deploying my IAC change because it’s handled by a CD system, and I’m much more confident that the change is going to be successful because of the CI. This gets you 90% of the way to eliminating config drift. There’s still a chance that a rogue developer or DevOps engineer changes things manually, or that the 3 am incident response warrants immediate manual changes. These things still won’t make it back into the IAC.

There are many tools available to help you identify config drift:

driftctl for identifying drift between Terraform and AWS
Kubediff for Kubernetes config drift
CloudFormation, which now has its own native drift detection

All of these will help you detect and action your config drift. You could run these tools on a schedule, like once a day, reporting and alerting if there is any drift. Now you can fix drift before it becomes too large and unmanageable.
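As a rough idea of what such a scheduled check could look like, here is a small Python sketch built around CloudFormation's native drift detection. The function name `detect_drift`, the stack name and the polling defaults are my own assumptions. The client is passed in so the function can be exercised with a stub. With a real `boto3` CloudFormation client it relies on the `detect_stack_drift`, `describe_stack_drift_detection_status` and `describe_stack_resource_drifts` calls.

```python
import time


def detect_drift(cfn, stack_name, poll_seconds=5.0, max_polls=60):
    """Run CloudFormation drift detection and list the drifted resources.

    `cfn` is expected to behave like boto3.client("cloudformation").
    Returns a list of (logical resource id, drift status) tuples.
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    # Detection is asynchronous, so poll until it is no longer in progress.
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"drift detection for {stack_name} did not finish")

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))
    if status["StackDriftStatus"] != "DRIFTED":
        return []

    # Only the first page of results is read here; a fuller version
    # would follow NextToken for large stacks.
    drifts = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )["StackResourceDrifts"]
    return [(d["LogicalResourceId"], d["StackResourceDriftStatus"]) for d in drifts]


# Hypothetical usage from a daily scheduled job (needs boto3 and AWS credentials):
#   import boto3
#   drifted = detect_drift(boto3.client("cloudformation"), "my-stack")
#   if drifted:
#       ...send an alert to the team...
```

Run something like this once a day from whatever scheduler you already have, and wire the non-empty result into your alerting.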

Future State: ELIMINATE ALL DRIFT!

It actually is possible to completely eliminate drift and make it impossible to introduce into your cloud infrastructure. But it takes a bit of a mind shift as well as introducing a few concepts:

GitOps: ensuring that what you merge to your trunk branch gets deployed
IAC: your beloved Infra As Code
Syncing service: the NEW Holy Grail

Now syncing is a new concept here. Syncing in this context is where we continuously check the live infrastructure and “sync” it, so that any changes made outside of IAC are reverted to what is configured in the code.

So for example, the IAC states that my Auto Scaling group (ASG) scales when CPU hits 80%. Now someone decides that it should start scaling at 90% and changes the ASG config manually. The syncing service will see that what’s deployed no longer matches the IAC and will revert the manual change back to 80%.

Once syncing is implemented, however, the 3 am SEV1 fixes have to be made via code during the incident, otherwise they will all be reverted. This is a slight mindset shift during the bad times, but it makes day-to-day life much simpler and easier.

My favorite tool for this use case right now is Crossplane. Crossplane is open source and currently works with all the major cloud providers. I’ll be doing a bit more of a deep dive on Crossplane in a future post.
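The 80% versus 90% example can be sketched as a tiny reconcile step in Python. This is only an illustration of the idea, not how Crossplane or any other syncing service is implemented. The `reconcile` function and the `scale_out_cpu_percent` key are hypothetical, and the "cloud" is just a dictionary.

```python
# What the IAC declares: scale out when CPU hits 80%.
DESIRED = {"scale_out_cpu_percent": 80}


def reconcile(desired, live):
    """Make `live` match `desired` in place and return what was reverted.

    Each reverted entry maps a setting to (drifted value, restored value).
    """
    reverted = {}
    for key, want in desired.items():
        have = live.get(key)
        if have != want:
            reverted[key] = (have, want)
            live[key] = want  # a real service would call the cloud API here
    return reverted


# Stand-in for the live ASG config, initially deployed from the IAC.
live_asg = {"scale_out_cpu_percent": 80}

# Someone changes the threshold by hand in the console.
live_asg["scale_out_cpu_percent"] = 90

first = reconcile(DESIRED, live_asg)   # the manual change is reverted
second = reconcile(DESIRED, live_asg)  # nothing left to do

print(first)   # {'scale_out_cpu_percent': (90, 80)}
print(second)  # {}
```

A syncing service effectively runs this step over and over on a short interval, so a manual change only survives until the next pass.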

Summary

Manual changes = Config Drift = Bad

IAC + CI/CD = Good

IAC + CI + GitOps + Syncing = GOD TIER FUTURE STATE!!

I’d love to hear your comments on this. What has been your experience with config drift? How did you eliminate it or get around it? Let me know in the comments and thanks for reading!

Written by Storkey