A common problem in infrastructure management is configuration drift: a divergence of configuration across environments. It can result from overlaying one configuration change on top of another and accumulating residual elements over time, but more commonly it is caused by allowing manual changes in parallel with the automation, usually small tweaks here and there that are accidentally omitted from the automation scripts and never propagated to higher environments. A good way to prevent configuration drift is to treat your infrastructure as immutable, meaning you recreate it with the new desired configuration rather than modifying existing components. Tools like Terraform favor this approach, as described in The Tao of HashiCorp (the folks behind Terraform). You can learn more about configuration drift and immutable infrastructure from Armon Dadgar, the CTO and co-founder of HashiCorp, in this video.
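To make the idea concrete, here is a minimal, hypothetical sketch in Python (a toy model, not how Terraform or any real tool works internally): drift is essentially a diff between the configuration you declared and the configuration actually running, including resources someone added by hand.

```python
def find_drift(desired: dict, actual: dict) -> dict:
    """Return the settings whose live value differs from the desired configuration."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            drift[key] = {"desired": want, "actual": have}
    # Resources present in the live environment but absent from the desired
    # state are drift too, e.g. a security group someone added manually.
    for key in actual.keys() - desired.keys():
        drift[key] = {"desired": None, "actual": actual[key]}
    return drift

# Illustrative values only: someone resized an instance in the console
# and attached an extra security group outside the automation.
desired = {"instance_type": "t3.medium", "min_size": 2}
actual = {"instance_type": "t3.large", "min_size": 2, "extra_sg": "sg-manual"}
print(find_drift(desired, actual))
```

With an immutable approach, instead of patching each drifted setting in place, you would tear the environment down and recreate it from the desired state.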
Generally, I am a supporter of locking down permissions to prevent manual infrastructure changes and allowing them only through an automated CI/CD process. There is, however, one case where I break this rule: prototyping. When dealing with a new application architecture or infrastructure platform, it is important to shorten the tweak-test-tweak cycle, and bypassing some standard practices is acceptable in the spirit of learning speed. The downside is a high risk of configuration drift.
Another common issue with infrastructure automation is cutting corners by leaving parts of the infrastructure outside the realm of automation and relying on manual configuration, for example, setting memory-consumption thresholds for alerts or whitelisting IP addresses on VPCs. These gaps are easy to forget, can go unnoticed for a long time, and can cause headaches much later, when a new environment needs to be provisioned. While 100% automation is sometimes not feasible due to limitations of the platform itself, it should always be the target. Any required manual modifications should be well documented and avoided like the plague.
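An alert threshold clicked together in the console is exactly the kind of change that gets lost. It can be codified instead; a sketch using Terraform's AWS provider (the alarm name, namespace, and threshold values below are illustrative, and the `CWAgent`/`mem_used_percent` metric assumes the CloudWatch agent is installed):

```hcl
# Codified version of a "small manual tweak": a memory alert threshold.
resource "aws_cloudwatch_metric_alarm" "memory_high" {
  alarm_name          = "app-memory-high"   # illustrative name
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "mem_used_percent"  # published by the CloudWatch agent
  namespace           = "CWAgent"
  period              = 300
  statistic           = "Average"
  threshold           = 80
}
```

Once the threshold lives in code, provisioning a new environment picks it up automatically instead of relying on someone remembering to set it by hand.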
The project at hand suffered from both of the issues above: the AWS platform was a big shift for the application and required rethinking its architecture. Rapid prototyping was necessary to make quick progress, but experimenting with various architectures left a big mess behind. The team was also relatively new to infrastructure automation and unintentionally leaned towards manual changes, which sometimes slipped our minds before they made it into the scripts.
To address these problems, the team adopted an approach that at first felt overly radical: we would periodically wipe our AWS account completely clean. It is important to distinguish this from running terraform destroy, which only removes the components explicitly managed by Terraform. That would defeat the very point of the exercise: identifying the areas that were not yet provisioned automatically. Our nuke of choice was aws-nuke, a well-maintained open-source tool.
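For anyone tempted to try this, aws-nuke is driven by a config file along these lines (the account IDs and filter names below are placeholders; the tool refuses to run without an explicit account blocklist, precisely to keep you from pointing it at production):

```yaml
# nuke-config.yml (illustrative values only)
regions:
  - eu-west-1
  - global

account-blocklist:
  - "999999999999"        # production account: aws-nuke must never touch it

accounts:
  "000000000000":         # the sandbox account to wipe clean
    filters:
      IAMRole:
        - "nuke-runner"   # keep the role the tool itself runs under
```

Running `aws-nuke -c nuke-config.yml` performs a dry run first; the actual deletion requires an extra flag and interactive confirmation.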
The first time we did this, our hands were shaking as we confirmed this very destructive operation, which the tool requires you to do twice as a safety measure. The result was disastrous: an attempt to reprovision the application with our automation scripts failed with multiple incomprehensible errors. Recovery took the better part of a week, and team morale was low; it felt like we had just shot ourselves in the foot and taken a huge step back. What was not immediately clear, however, was that in the process of rebuilding what we had just destroyed, we would benefit in two ways.
First, we discovered all the tweaks we had made but never automated; some we had completely forgotten about, others were simply not documented. The nuke surfaced them so that we could make an explicit decision about each one: automate it or document it.
The second lesson came after we performed the nuke again, this time with a bit more confidence in our automation. Applying the manual changes, now well documented, for a second time, and later a third and a fourth, we experienced the pain of manual configuration first hand. Following the principle that frequency reduces difficulty, and the old CI/CD mantra that if it hurts, you should do it more often, we were strongly incentivized to redouble our efforts to automate them.
If it hurts, do it more often and bring the pain forward.
— Jez Humble, Continuous Delivery
Over time, we performed the nuke many times, and eventually it ceased to stress the team. We grew to appreciate it as another tool in our toolchain, one that helped us sharpen the saw. It increased the quality of our automation and solidified our confidence in the process, in much the same way that frequent production deployments do in a traditional CI/CD setup. We eventually incorporated the nuking into our regular stakeholder showcases, and those finally became pretty uneventful and boring. The good kind of boring 😉
Happy nuking, everyone!