Managing a Terraformed infrastructure at scale: a retrospective
Abstract
Infrastructure-as-Code (or IaC) has dramatically shaped the way modern infrastructures are designed and managed.
We can go bigger and quicker than ever before.
From the early tools such as Puppet (2005) and Chef (2009) to the more recent ones like Terraform (2014) or Pulumi (2018), the way we are deploying and maintaining infrastructures has changed forever.
Right now, anyone can easily deploy a three-tier infrastructure in a few minutes with the right set of tools; however, the simpler it becomes, the harder it is to master.
Setting the canvas
We, at ManoMano, transitioned from bare-metal to the public Cloud in 2019, after one year or so of preparation.
Previously, we only had to provision servers with Ansible and that was it.
Our hosting provider took care of the rest: networking, hardware management and so on. But then every part of the infrastructure became our responsibility, and doing everything manually was not an option; we needed new tools to help us with that.
Enter HashiCorp's duo: Packer and Terraform. Easy to learn and quick to put into action, they were the ideal candidates to fill the gaps in our tool-chain.
And so started our infrastructure-as-code adventure.
We are now in our second year in the Cloud and the results regarding our Infrastructure-as-Code are mixed.
Yes, we were able to quickly deploy, expand and manage our infrastructure (from a dozen servers to several hundred), but sadly some choices made back then are now shackles around our feet, holding back our efforts to innovate.
For instance, the choice was made early on not to use public modules, which would have helped us adopt some of the already existing best practices; we also delegated almost all the business logic to an in-house tool based on Jinja2 templates that generates the actual Terraform files, which is not a bad idea in itself but was alas badly executed.
While it allowed us to quickly and reliably deploy stacks with the same specifications, it made it very difficult, nearly impossible, to deviate from these patterns without breaking everything.
Our recommendations
We have been using Terraform and Packer for almost two years now.
And while most of what follows could be considered standard best practice by some, these points are nonetheless still important to address.
This section will be divided into two parts, one for each tool.
Terraform
Outside of HCL's (HashiCorp Configuration Language) specification and the resource declarations, there are virtually no enforced conventions at all. You can even put everything in a single file if you wish to.
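That said, the community has more or less converged on a standard layout, which Terraform does not enforce but which we found sensible (the file names are purely conventional):

```
my-stack/
├── main.tf       # resources and module calls
├── variables.tf  # input variables
├── outputs.tf    # values exposed to other stacks
└── versions.tf   # Terraform and provider version constraints
```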
We ought to be careful when planning out any infrastructure managed with Terraform, because some mistakes can be everlasting and make your life a living hell.
Get to know the tool
You do not write code when using Terraform; you describe a state that Terraform should then achieve.
Trivial things you would expect from any programming language are not necessarily available, conditional statements for instance (even though some workarounds exist).
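As an illustration, the classic workaround for the missing if statement is to abuse the count meta-argument with a ternary expression; a minimal sketch, the resource and variable being mere examples:

```hcl
variable "create_bucket" {
  type    = bool
  default = false
}

# No real conditional in HCL: the resource exists zero or one time
# depending on the boolean, which emulates an "if".
resource "aws_s3_bucket" "logs" {
  count  = var.create_bucket ? 1 : 0
  bucket = "acme-logs"
}
```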
I won't go into detail regarding this topic; you can find numerous resources online, such as Gruntwork's "tips & tricks" blog post or HashiCorp's very own "Hitchhiker's guide".
Another thing to keep in mind is that Terraform is a passive tool. You execute it, it checks the infrastructure’s state and does what must be done to get things back on track with what you described.
Tip: using the likes of AWS Config to check up on your infrastructure is a good way to detect manual actions (and possible security threats).
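On top of that, running terraform plan -detailed-exitcode on a schedule is a cheap way to be alerted of drift: the command exits with code 2 whenever the live infrastructure no longer matches the described state.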
Keep it “dry”
We do not write code per se, but we should still avoid duplicating things.
The more your infrastructure grows, the harder it becomes to maintain. Thankfully, Terraform offers solutions, like modules.
As the name suggests, a module is a collection of resources bound together to represent a single unit, or business need. From something as small as an S3 bucket to a much bigger Kubernetes cluster, you alone decide what to put in it.
In most cases, you would defer any business logic to them and keep your main files clean and easy to read.
There are two ways to create modules: locally (within the same repository) or externally.
Keeping them inside the same repository is not such a good idea, mainly because it makes controlling their lifecycle next to impossible.
Terraform can retrieve modules from any VCS provider and even from zip archives over HTTP. So why even bother trying to keep them locally?
But what if you want to go faster or do not want to write your own modules? HashiCorp has you covered with the Terraform Registry, a catalog of public, open-sourced modules.
I recommend checking the registry before starting to write a module: with any luck, someone has already written one, saving you quite some time (and headaches).
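To give you an idea, here is what both approaches can look like; the git URL and module inputs are placeholders, while terraform-aws-modules is a real organization on the registry:

```hcl
# In-house module, hosted in its own repository and pinned to a tag
# so its lifecycle is controlled independently (URL is a placeholder).
module "vpc" {
  source = "git::https://github.com/acme/terraform-modules.git//vpc?ref=v1.2.0"

  cidr_block = "10.0.0.0/16"
}

# Community module pulled straight from the Terraform Registry,
# pinned to a version range (version number illustrative).
module "assets_bucket" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "~> 1.0"

  bucket = "acme-assets"
}
```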
Another way to keep things simple: use Terragrunt.
Initially developed as a wrapper providing features like remote backend configuration before HashiCorp finally implemented most of them into Terraform's core, Terragrunt remains a more than useful tool for anybody wishing to go further with Terraform automation.
When your infrastructure reaches a certain size, Terragrunt will surely be of great help, if only for quality-of-life purposes.
For instance, it can execute a cascading plan, running terraform plan for each stack recursively, or push events via its internal hook mechanism.
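To give you a taste of it, here is roughly what the root configuration could look like, based on Terragrunt's documented syntax (the bucket name is a placeholder):

```hcl
# terragrunt.hcl at the root of the repository: every stack inherits
# a remote state whose key mirrors its position in the folder hierarchy.
remote_state {
  backend = "s3"
  config = {
    bucket = "acme-terraform-states"  # placeholder name
    key    = "${path_relative_to_include()}/terraform.tfstate"
    region = "eu-west-1"
  }
}
```

Each stack then only needs an include { path = find_in_parent_folders() } block to inherit it, and running terragrunt run-all plan from the root will plan every stack underneath.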
If you wish to learn more about all its features, please refer to the documentation, which will speak for itself better than I could.
Folder hierarchy
Remember when I said earlier that some mistakes can be everlasting? Having a bad folder organization from the start is one of those mistakes.
It can quickly become confusing and/or complicated when the hierarchy no longer matches the representation of your infrastructure. Good logic on this aspect will gradually develop into a mechanical reflex and help you navigate the stacks.
Depending on the size of your infrastructure, this can be as simple as env/component or a more elaborate region/env/component.
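To make it concrete, the latter would look something like this (names illustrative):

```
eu-west-1/
├── production/
│   ├── networking/
│   └── kubernetes/
└── staging/
    ├── networking/
    └── kubernetes/
```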
While easily fixable when you have a small number of stacks, trying to refactor the folder organization when you have hundreds is not an easy task.
For instance, we recently decided to move from an env/region/app hierarchy to app/env/region in order to better match our recent goal of being multi-region.
It looks quite simple as is, but chances are that you use the same hierarchy for your remote states, which means you should also migrate them: another tedious task.
In our case, it was even harder because we had built a lot of logic on top of this specific hierarchy for some of the other tools we use…
So think ahead and try to adopt a flexible, extensible structure in order not to be stuck later on.
If the hierarchy is not mnemonic enough, you are probably not heading in the right direction.
Going beyond
There will come a time when you will probably ask yourself whether generating Terraform files with a templating language, or even building your own DSL on top of Terraform's, would be a good idea. And you know what? You are absolutely right.
And you are not the only one: HashiCorp recently announced a "bridge" to compile CDK (AWS Cloud Development Kit) code into Terraform's.
Another solution could be to split out the Terraform resources associated with an application and move them directly into the application's repository.
If you already use external modules, there is nothing to change, except maybe your continuous deployment pipeline.
Don't hesitate to check out the Cloud Native Application Bundles (CNAB) initiative and Porter, which was recently adopted into the CNCF (Cloud Native Computing Foundation). It provides a modern approach to bundling up applications and their associated deployment tools (Terraform included).
Speaking of bundling things up, immutability is a big part of any modern infrastructure for some very good reasons, so let us talk about Packer now.
Packer
Because building images (be it for virtual machines or containers) by hand is really no fun and time-consuming, finding a tool to do that in our stead is a must-have nowadays.
Packer is just that, and supports a variety of providers: major public clouds such as AWS, GCP or Azure, as well as more traditional virtualization solutions like VMware or QEMU. It natively integrates with popular provisioning tools too: Chef, Puppet, Ansible or even Salt.
Provisioning
While provisioning your images with shell scripts can be fine, it really lacks reproducibility and stability. Linux distributions, for instance, use different shells and/or different shell versions, which will really make your life harder.
Adopting Ansible, or one of its numerous alternatives, will provide you with a stable provisioning platform, on top of giving you advanced (and more user-friendly) features.
Immutable images play well with Ansible or Salt by nature, and also provide you with testing capabilities (e.g. Molecule for Ansible). Being able to test your changes locally is a real time-saver, and can easily serve as a code-quality gate in your continuous integration workflow.
And to finish, just as the Terraform Registry provides community modules, both of the aforementioned tools offer the same kind of functionality (Ansible Galaxy, for instance).
Image lifecycle
Immutable images are like any other sort of artifact: you need to manage their lifecycle. There is nothing more frustrating than having one simply named myapp or myapp-1607606514.
You can improve the situation with some really easy steps: tag your images (even the ones built manually), use a tagged revision of your provisioning code (mostly for reproducibility and regression checks) and apply metadata to keep track of the build environment (Packer's version and so on).
With that done, finding an image within Terraform becomes much easier:
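Here is a minimal sketch, assuming the image was tagged at build time with the hypothetical Application and Version tags used below:

```hcl
# Fetch the most recent image built for "myapp" in a given version,
# relying solely on the tags applied at build time.
data "aws_ami" "myapp" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "tag:Application"
    values = ["myapp"]
  }

  filter {
    name   = "tag:Version"
    values = ["v1.0"]
  }
}

# The resulting id can then be consumed as usual, e.g.:
# ami = data.aws_ami.myapp.id
```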
Packer allows us to pass variables at run-time, which is perfect for deriving a dynamic version value from either a git tag or a commit hash.
Like so: packer build -var 'version=v1.0'.
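On the template side, this could translate into something like the following sketch, assuming Packer's HCL2 syntax (available since version 1.5) and illustrative values:

```hcl
# Version injected at build time via: packer build -var 'version=v1.0'
variable "version" {
  type = string
}

source "amazon-ebs" "myapp" {
  region        = "eu-west-1"
  source_ami    = "ami-0123456789abcdef0"  # placeholder base image
  instance_type = "t3.micro"
  ssh_username  = "admin"                  # depends on the base image
  ami_name      = "myapp-${var.version}"

  # Tags applied to the resulting image, later used by Terraform.
  tags = {
    Application = "myapp"
    Version     = var.version
  }
}

build {
  sources = ["source.amazon-ebs.myapp"]
}
```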
It is also good practice to rebuild your images frequently (at least the ones used in production) in order to get the latest security patches and package updates.
Though, if you do so, do not use the force_deregister option, as you will effectively be deleting the old image, which is likely to cause issues if you have auto-scaling or other automated processes relying on it.
Continuous Integration
Building a continuous integration pipeline around Packer is not that difficult once you have the configuration figured out.
In my opinion, the hardest part is actually coming up with a workflow suitable for your needs.
Mono-repositories are something I dread when it comes to continuous integration: they make a simple workflow a lot more complex to manage, and also make lifecycle management a real pain.
Having one repository per image is simpler to leverage and simplifies the workflow quite a lot. To enforce guidelines across all your images, use GitHub Actions or GitLab CI templates to centralize them. And if you have common scripts, a git submodule is a good solution to avoid code duplication.
Do I need to test my provisioning before each build? Should I make the new image available to all my environments the second it finishes being created? How do I even test whether the image has been correctly provisioned? These are questions we are all bound to ask ourselves, and for good reason.
Pushing a freshly baked image straight into production can turn into a disaster if something went haywire in the upstream process, which is why having a final testing phase can be useful. Depending on the nature of the image, you could write simple tests to check for open ports, HTTP endpoints and so on; and if the image is tightly coupled with an application, running the application's end-to-end tests would be even better.
Then tag and share the image once everything is OK.
All in all, what has been said here about Packer also applies to every immutable image builder out there.
Takeaways
In the end, the main point is to treat your infrastructure-as-code as you would any other piece of software. This will come as no surprise to people with a solid development background, but not necessarily to others coming from the sysadmin world.
Having proper lifecycle management and continuous integration will greatly help you keep things clean and tidy. And sticking to best practices (when available) will ensure that your building blocks remain maintainable over the long term.
With these in play, a lot of opportunities will present themselves to you.
Such as using Terratest to catch regressions, or writing end-to-end tests that deploy a whole new bare-bones infrastructure to ensure that everything still clicks together.
And lastly, the easier your Infrastructure-as-Code becomes to manage, the less dependent the development teams will be on you.
This will allow you to move your focus elsewhere and for the developers to gain autonomy.
And that’s about it for today.
Writing this post has been a great exercise to help pinpoint our current shortcomings and set new goals regarding our Infrastructure-as-Code.
And I do hope it will help you as well.
Our next steps are already established: a complete overhaul of all our core Terraform stacks, and better continuous integration for earlier regression and side-effect checks.
This will probably lead to a part 2 once we are done with these.
Until then, take care and happy coding.