By: Regis Wilson
This article documents our journey to the cloud over the last three years using Terraform versions 0.6, 0.7, 0.8, 0.9, and 0.11. We haven’t yet made the leap to version 0.12. This blog post is meant to be a fun, tongue-in-cheek stroll through the ups and downs of sticking with an open source project from its early beginnings and watching it grow with us over the years. We do not intend any disrespect nor are we casting aspersions on the Terraform project or the HashiCorp software engineers.
The hope is that this blog post will serve as a humorous look back and a hopeful look forward, while also offering a collection of tips, tricks, and best practices for managing your cloud infrastructure. Terraform supports most cloud platforms (and even non-cloud provider platforms!), but we mostly use AWS as our cloud provider, so this article is based on our interaction with the AWS platform and API.
At TrueCar, we have used Terraform extensively for several years. It has been a wonderful tool, and we cherish it dearly. Our hope is that we can share some best practices and show off our scars so that others can learn from our mistakes and maybe chuckle along the way.
Too often, you read glowing reviews of how great Terraform is (and it is great!), but rarely can you peek behind the curtain and see the dark underbelly lurking about trying to destroy life as we know it. Well, Terraform is pretty great, but when it comes to managing a complex, dynamic infrastructure at scale, cracks in its flawless heroic image begin to appear.
In the Beginning
In the beginning before complex lifeforms evolved, there was the AWS console. It was easy to click around in the console and make new and exciting projects. If you wanted to change a parameter or update your environment, you could just click a few times and, presto, you’d have something better. But what if you clicked the wrong thing? Or what if you wanted to go back to the original configuration you had three weeks ago?
That was when the Command Line Interface (CLI) became interesting. Scripting our changes with the CLI seemed like a viable plan, until you ran into edge cases and complicated data structures that had to get translated into option flags. Should you check if an object existed before creating a new one? How would you know if an action succeeded?
Then you’d break out the API documentation and start coding procedures and functional wrappers to build small pieces of infrastructure. This worked fine for small, well-defined projects, but it didn’t allow for business logic separation. What about IAM policy documents? How did you know that all the fiddly bits in the dependency chain were ready? How did you export information about a resource from one section of code to another? Should you pass around client handlers to every function invocation or use global variables? What if you needed to run the code twice? Were you sure it was idempotent? How did you know where you left off if the code crashed? Could you revert your build and start over? Where is Jimmy Hoffa buried? How does global climate change affect infrastructure as code?
And Then There Was Terraform
Suddenly, Terraform appeared out of the chaos and stood on a hill in the sunlight, posing dramatically with its curly hair blowing gently in the wind. Our hero comes to save the day! The promise of Terraform was too much to resist: We could abstract our infrastructure into text files and check them into our version control system (VCS). We could keep track of which infrastructure pieces were created and when (the so-called state file). We could automatically (in most cases) keep a dependency graph of all the interconnected “fiddly bits” and glue them together in one place. And we could run so-called plans to verify drift or to test changes to infrastructure before applying them.
We thought we were entering a new golden age by choosing Terraform as the base on which to deploy our automation platform. We could write a simplified template in an easy-to-read domain-specific language (DSL) called HashiCorp Configuration Language (HCL) and use version control and continuous integration and deployment to test and roll out changes. We were in love and the world was our oyster.
We were so giddy with excitement that we swallowed whole the illusion that we could simply develop some “code,” run and test it in development, and then deploy it smoothly all the way up to production. Once that simple and painless process was finished (oh, the bliss of ignorance), we could go back to development and start adding and removing features and refactor as needed. But it turned out to be a mirage, like wandering in a desert and seeing water in the distance, only to find out what we thought would be glorious, clear, and clean refreshment was just vapors rising above the lethal hot sand.
Lesson Learned: Don’t believe that any new software, technology, or ecosystem will be a magical solution to all your problems. Any medicine that is actually capable of healing you probably contains a warning label that lists terrible side effects. Any “medicine” that doesn’t have side effects either doesn’t exist or isn’t going to help you.
Terraform Refactoring Is Virtually Impossible
All was not frolicking about in a garden, however. In Terraform versions 0.6 to about 0.8 or 0.9, doing something as simple as renaming a resource could result in days or weeks of refactoring variables and state files by hand or manually changing things via the command line or in the console. Changing a poorly named resource from, say, “main” to “primary” turned into a nightmare of insanity.
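To make the “main” to “primary” rename concrete, here is a minimal sketch in the interpolation syntax of the Terraform versions discussed here. The security group name, description, and the vpc_id variable are hypothetical, and this is one common approach rather than a record of what we actually did at the time. The idea is to change only the local name in the configuration and then move the matching state entry with the “terraform state mv” command, so Terraform does not plan to destroy and recreate the real resource.

```hcl
# Hypothetical input; supply your own VPC ID.
variable "vpc_id" {}

# Before the refactor this block was declared as:
#   resource "aws_security_group" "main" { ... }
# Only the local name changes. The arguments stay identical, so nothing
# about the real security group in AWS needs to change.
resource "aws_security_group" "primary" {
  name        = "example-app"                        # hypothetical name
  description = "Example application security group" # hypothetical
  vpc_id      = "${var.vpc_id}"
}

# After editing the configuration, move the state entry to the new address:
#
#   terraform state mv aws_security_group.main aws_security_group.primary
#
# Then run "terraform plan". Ideally it reports no changes for this resource.
# If it still wants to destroy and recreate, stop and investigate first.
```

Anything that referenced the old address (outputs, other resources, module inputs) has to be updated by hand in the same change, which is where most of the days and weeks went.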
Terraform version 0.6 in particular (which we used when we first started) had a slew of nasty bugs and gotchas that burned us pretty badly. We’d struggle with some problem or other for a few days and finally find an update that addressed our issue, only to find a different problem or regression in the new version.
True, these limitations and problems were not imposed solely by Terraform per se. Some changes are simply impossible to pull off no matter what infrastructure management tool you use. You can’t rename a security group without destroying it, and you can’t destroy it if it’s attached to instances or a load balancer. You can’t rename a Simple Storage Service (S3) bucket, either.
However, the idea of “infrastructure as code” seemed to imply you could just take some infrastructure, fix some bugs, add some features, test and deploy, and then say “done.” Changing the order of dependencies, updating resource names, refactoring code into modules: all of these caused massive headaches and slowed down our building and deployments. This made manual changes to our infrastructure increasingly tempting, which made updating the code harder to maintain, and so on. This vicious cycle was difficult to get out of.
Lesson Learned: Just like application code, infrastructure “code” is susceptible to entropy, maintenance struggles, technical debt, and so forth. Any open source project you rely on is going to change too much or too little, too quickly or too slowly, and possibly all of the above!
Destroy Doesn’t Mean Destroy?
Versions of Terraform prior to around version 0.8 had some nearly fatal flaws in trying to destroy infrastructure. There were a lot of cases in which infrastructure might be deployed (say, in a sandbox environment), the code would be changed and deployed elsewhere, and then we couldn’t back out the changes no matter what we tried without manual intervention or serious time sinks in productivity. Part of the problem was related to the way Terraform inspected the “code” configuration path before inspecting the actual state files, and part of it was caused by unnecessary or buggy dependency looping. If you google “github issues terraform destroy,” you’ll find an endless list of complaints and bug reports. This one in particular is our favorite.
Thankfully, destroy problems are a thing of the past.
Lesson Learned: Bugs and misfeatures are going to occur, and you need to do your best to adapt. Fortunately, the HashiCorp engineers never gave up and never surrendered. Terraform continues to advance and learn from previous mistakes. It gets better and better with every release.
There were a number of chicken-and-egg problems we had to solve when we first started. How did you build an automation platform without having any automation to build your platform? And we ran into even simpler problems than that: We couldn’t create some resources because they didn’t exist in Terraform. If you manually created resources and then ran Terraform, how did you document and maintain these prerequisites? If a particular resource was in the middle of your deploy process (say, building out a Direct Connect link and attaching it to a Virtual Private Gateway), how did you automate that? Then, once the resources were added and supported in Terraform, how did you add them into your existing “code”?
There were subtler issues with bootstrapping, too. When first starting to write a Terraform module, did you purposely make it absolutely generic, then create a separate repo that consumed the generic module? Or did you write a specific module that addressed your immediate needs and then later make it more generic when other use cases presented themselves? Remember that even with the “terraform state mv” command, changing “code” from a resource to a module was not easy or even advisable.
Lesson Learned: No solution will cover 100% of your use cases, period. Not even solutions you write yourself. Engineering tradeoffs are unavoidable, and only thoughtful and careful planning and analysis will guide you to a reasonable compromise.
The VCS “Cure” That Was Worse Than Any Disease
Ironically, the very solution of checking infrastructure text files into git (or any other VCS) actually made our lives more difficult because it mixed together the “code” (really just static template files in a fancy DSL), the configuration values for each environment (the so-called tfvars), and the so-called state files.
True, we could track changes across files to see who changed what and when they changed it. But what if one engineer made a Terraform run, and someone else made a different run before the files were checked in? Even worse, what if you followed good gitflow methods and now your state files were changed or updated in several different feature branches?
Lastly, the state files contain ultra-sensitive information, ranging from account numbers to plain text default passwords to access keys. You simply shouldn’t check these into git, no matter how private you think your repository is.
Lesson Learned: One of the best features introduced around Terraform version 0.8 (and vastly improved in 0.9) was the idea of remote state. This allowed us to move the state files out of VCS and away from the problems inherent in a distributed, parallel development environment. Now the “code” could live in its own repository as a generic module. The “data” part of our configuration could live in a separate repository that consumed the module and added specific variable overrides for each environment. Finally, we could write the state files to S3, safe in the knowledge that they were encrypted at rest, privately managed with strict ACLs, and backed up with the famous 11 nines of durability.
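As a rough illustration of that split, a per-environment “data” repository might contain something like the sketch below. Every name here is an assumption made for the example: the bucket, key, region, lock table, module URL, version tag, and module inputs are all hypothetical, and the lock table in particular is an optional extra that the story above does not mention. The block syntax is the backend configuration available from Terraform 0.9 onward.

```hcl
# Remote state: the state file lives in S3, not in git.
terraform {
  backend "s3" {
    bucket  = "example-terraform-state"              # hypothetical bucket
    key     = "production/network/terraform.tfstate" # one key per environment
    region  = "us-west-2"                            # hypothetical region
    encrypt = true                                   # server-side encryption at rest

    # Optional (assumption): a DynamoDB table for state locking, so two
    # engineers cannot apply against the same state at the same time.
    dynamodb_table = "example-terraform-locks"
  }
}

# The generic "code" lives in its own repository and is pinned to a tag.
module "network" {
  source = "git::https://example.com/infra/terraform-network.git?ref=v1.2.0"

  # Environment-specific "data" overrides (hypothetical module inputs).
  environment = "production"
  cidr_block  = "10.20.0.0/16"
}
```

Pinning the module to a tag lets each environment move to a new module version on its own schedule, at the cost of having to track which environment runs which version.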
Some Things Don’t Belong in Terraform
This is a difficult statement for us to make, and it was a tough and bitter pill to swallow. In our zeal to “AUTOMATE ALL THE THINGS,” we forgot to stop and think whether that was practical or even useful. One general example is the Relational Database Service (RDS). While it is true that you can build RDS instances from a Terraform module, the question is whether you should.
Almost by definition, a database instance is a perfect unique unicorn. Putting such unicorns into so-called “code” only causes misery. You can’t store every single difference of database type in a separate repo or module, and you can’t abstract away differences like engine and database parameters. Even if you do write a module “generic enough” for your use case(s), what happens when something changes on the database outside of Terraform? For example, you might need to do a version upgrade in production. Or worse, you need to prevent a version upgrade in production! Some things can’t be reverted or changed; for example, you can’t change instance type, subnet, or encryption settings (just to name a few) without catastrophic downtime.
Another, more specific, example is Redshift. Redshift is an AWS managed data warehouse solution. Many companies use Redshift to aggregate data from multiple sources for centralized reporting or data discovery. A Redshift instance is an even rarer and more unique unicorn than a regular database instance. Also, you generally can’t change any settings in Redshift without incurring some downtime (and possibly even loss of data). Taking a snapshot of your Redshift data and restoring from a backup at petabyte scale could take days.
Lesson Learned: The overarching theme of this section is that Terraform is really good at managing resources that are generic, repeatable, and resilient to downtime. Resources in your cloud that are unique unicorns, long-lived stable infrastructure that can’t tolerate downtime, or resources that are created once and never updated again are a bad fit.
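If a unicorn database does end up in Terraform despite all of the above, a few guardrails can at least limit the damage. The following is a minimal sketch, not a recommendation and not something taken from our own modules: the identifier, engine, version, sizes, and user name are all hypothetical, and the list syntax for ignore_changes is the quoted form used by the Terraform versions discussed in this article.

```hcl
# Hypothetical input. Remember that the password still ends up in the state file.
variable "db_password" {}

resource "aws_db_instance" "reporting" {
  identifier        = "example-reporting" # hypothetical
  engine            = "postgres"
  engine_version    = "9.6.11"            # hypothetical version
  instance_class    = "db.m4.large"
  allocated_storage = 100
  username          = "example_admin"     # hypothetical
  password          = "${var.db_password}"

  # Keep AWS from applying minor version upgrades on its own schedule.
  auto_minor_version_upgrade = false

  lifecycle {
    # Make any plan that would destroy this instance fail with an error.
    prevent_destroy = true

    # Do not try to "fix" the engine version if it was upgraded
    # outside of Terraform (for example, by hand in production).
    ignore_changes = ["engine_version"]
  }
}
```

Note what this does not solve: Terraform still cannot change things like encryption settings in place, and every ignored attribute is one more place where the “code” and reality quietly disagree.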
Don’t Nest Modules or Make Them Too Big
Our first Terraform modules were small, perhaps only one or two files. But we kept adding to them and expanding them so that they became five or six files. When they got too large, we wisely separated them out into separate modules with different concerns. But then we got ahead of ourselves. We started nesting modules inside other smaller modules. Updating code or resolving conflicts in the matryoshka doll structure became a nightmare. Our internal documentation tool adorning the graph output became an homage to the Flying Spaghetti Monster.
It sounds cool to tell people that you’ve automated the base infrastructure and can deploy a new environment with one command. But is that environment truly flexible? Can you quickly update or add new features across all environments? Can you safely run a small change in production without risking a whole cascade of dependency resolutions or conflicts? Taking the above configuration and stamping out a new environment in a clean account is one thing. How about updating a production environment that hasn’t been updated in a year and holds most of your company’s revenue production in one basket?
Lesson Learned: Keep it simple, stupid!
Updating Multiple Environments Is HARD
Creating one environment and updating it (especially a development environment where you might be able to tolerate some downtime) is easy. Updating two environments that might be slightly different (say, development and QA) is moderately easy. Updating two environments that might be drastically different (say, development and production) is hard. Updating an environment that is different from all the others and keeping everything in sync at different intervals and across multiple versions is very difficult.
Lesson Learned: We already knew operations is hard. That’s why we signed up for the job.
The Dream May Not Be Real
There is a constant tension between hopes and reality. We can easily convince ourselves that because something ought to be true, it is true. Certain ideals we hold as true may not even be possible:
- Infrastructure as code
- Automate everything
- One tool or platform can do everything
- Perfection is a valid goal
- We can change it/fix it later
- Keep everything in sync
Learn what works and what doesn’t, then apply it to your next project and repeat.