Don’t Deploy Applications with Terraform

Paul Durivage
Google Cloud - Community
6 min read · Mar 10, 2021

I know. This is a loaded topic. I can’t believe I’m going there–diving into the depths of a topic so heated, you probably already have strong opinions.

Seriously, Terraform is not application deployment software. Some of you are already using Terraform in this way.

The sky isn’t falling. It’s time for a nuanced discussion. Hear me out.

What is Terraform?

Terraform is–generally speaking–a cloud resource deployment tool. The most common use case for Terraform is to manage the lifecycle of cloud or virtual resources. Many of you already do this: you’re spinning up virtual machines, managing DNS zones, creating storage buckets, and tons of other things. Trust me–it’s a long list.

And you probably already know that.

It uses a configuration language, HashiCorp Configuration Language (HCL), to declaratively define the desired state of your resources. You simply define the resources in HCL, run Terraform, and bam! You end up with cloud resources on a platform somewhere, like GCP, AWS, VMware, Kubernetes, or one of the other myriad supported services.
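To make that concrete, here’s a minimal, hypothetical snippet (the bucket name and location are placeholders, not from any real project):

```hcl
# Desired state: one Cloud Storage bucket. Terraform works out how to get there.
resource "google_storage_bucket" "artifacts" {
  name     = "example-artifacts-bucket" # hypothetical, globally-unique name
  location = "US"
}
```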

So how does HCL become a storage bucket in a cloud somewhere?

When you invoke Terraform to deploy your desired resources, it reads your HCL files full of resource definitions and infers from those definitions a plan. It creates a resource graph–more specifically, a directed acyclic graph–which represents the dependencies between resources. Terraform walks this graph when creating a plan. The plan directly informs the order of operations Terraform uses when it is creating resources, which is referred to as an apply.
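As a quick, hypothetical illustration of what ends up in that graph: when one resource references another, the reference becomes an edge, and the plan orders operations accordingly. In the sketch below, the bucket is created before the object that lives in it.

```hcl
resource "google_storage_bucket" "source" {
  name     = "example-source-bucket" # hypothetical name
  location = "US"
}

resource "google_storage_bucket_object" "archive" {
  name   = "function-source.zip"
  bucket = google_storage_bucket.source.name # implicit dependency: an edge in the graph
  source = "./function-source.zip"           # hypothetical local path to the zipped code
}
```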

Terraform uses a plugin ecosystem to manage the abstraction between it and the infrastructure it manages — these plugins are called providers. The provider contains all the logic necessary to create, read, update, and delete resources on a given platform.
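A typical, hypothetical configuration that pins the Google provider looks like this; the version constraint and project ID are placeholders:

```hcl
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google" # the plugin that holds the CRUD logic for GCP resources
      version = "~> 3.0"           # hypothetical version constraint
    }
  }
}

provider "google" {
  project = "my-example-project" # hypothetical project ID
  region  = "us-central1"
}
```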

A Real World Example

Let’s demonstrate the concept with a simple thought exercise. Presume for a minute that your needs are very simple: you need to deploy and update a Google Cloud Function that is critical in your environment. This is something the Google provider supports via the google_cloudfunctions_function resource, which lets you deploy arbitrary Cloud Functions.
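A minimal definition might look something like this (a hypothetical sketch that reuses the bucket and object from the earlier snippet; the name, runtime, and entry point are placeholders):

```hcl
resource "google_cloudfunctions_function" "critical" {
  name        = "critical-function" # hypothetical function name
  runtime     = "python39"          # hypothetical runtime
  entry_point = "handler"           # hypothetical entry point in your code
  region      = "us-central1"

  # Where the zipped source lives, per the earlier bucket/object sketch.
  source_archive_bucket = google_storage_bucket.source.name
  source_archive_object = google_storage_bucket_object.archive.name

  trigger_http        = true
  available_memory_mb = 128
}
```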

Let’s take a moment to remember how Terraform will deploy your code, because even though you interact with a Cloud Function like any other cloud resource, we’re dealing with code, and this is a code deployment.

Terraform reads your resource definitions, loads the Google provider, and passes the resource definition to a create, read, update, or delete (CRUD) function in the provider. Check out the current CRUD operation mapping in the google_cloudfunctions_function resource in the code on GitHub, if you’re interested.

Scan through the Cloud Function update function code. It’s not complicated–and it’s not magic. At the risk of over-simplifying, it’s a thin wrapper around the functions API PATCH call. The result is a binary state: the cloud function updates or it errors. If it is successful, Terraform moves on. If it fails, Terraform aborts.

Just like any other cloud resource, Terraform has no awareness of your code. It updates a cloud resource. That’s all it knows. It has no awareness of post-deployment elevated error rates or increased request latency. It has no idea if this new deployment of code subtly caused your application to fail. It doesn’t care. It only understands that it can succeed or fail at creating or updating a resource.

If a control plane issue causes your code deployment to fail, Terraform has no concept of a rollback, or even a roll-forward. It fails fast, requiring your immediate, manual intervention.

This behavior isn’t limited to Cloud Functions: from managed instance groups to Cloud Run containers to AppEngine apps, Terraform only sees a cloud resource. Pass/fail. Fail-fast and abort.

Okay, so What’s the Problem?

First and perhaps most importantly, you are entirely at the mercy of the logic of a provider. As I pointed out, most providers’ resource-level CRUD functions are thin wrappers around API operations. If you need more advanced logic and control wrapped around an API call, this probably means you need to write code in a provider. To explain this scenario with our Cloud Functions example, let’s presume you, the developer, would like to wait 60 seconds after an update operation completes to check for elevated invocation error rates before marking the operation a success or failure. Simply put, this cannot be achieved with core Terraform without resorting to hacks, modifying an existing provider, or writing an entirely new one.

Since CRUD functions wrap API operations, perhaps you see where I’m going next: you are entirely dependent on the behavior of the resource’s control plane. This is a good thing for resources like a managed instance group (MIG), where you inherit a managed update model and a control plane to manage the process end-to-end. In isolation with simple use cases this is often enough, but it tends to get complicated quickly for more advanced needs. Remember: Terraform only understands success or error states, and error states cause Terraform to stop. In the event a deployment fails, do you need to perform rollback steps, cleanup, or something to that effect? That’s not something Terraform can handle natively.

While deployment processes are often a precise sequence of steps and checks, Terraform has limited inputs to control its flow. Like I mentioned earlier, it uses a declarative model in its configuration language, HCL, which means the application infers its execution order from your desired state. There are ways to manage this, but I find the end result is often a complicated, sometimes unmanageable web of references and endless null_resource local-exec calls to helper scripts. This complicates your resource definitions, making it harder to understand what actions Terraform will perform when it runs. It also introduces some variability into the equation: Terraform can’t take local-exec results into account in its plan until runtime, making the output of a plan less valuable!
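To picture the kind of workaround I mean, here’s a hypothetical sketch (the helper script and the trigger are placeholders): a null_resource that shells out after the function’s source changes. Terraform can’t plan what the script does, and a non-zero exit simply fails the apply.

```hcl
resource "null_resource" "post_deploy_check" {
  # Re-run whenever the function's source object changes (hypothetical trigger).
  triggers = {
    source = google_cloudfunctions_function.critical.source_archive_object
  }

  provisioner "local-exec" {
    # Hypothetical helper script: sleep, poll metrics, exit non-zero on elevated errors.
    command = "./scripts/check-error-rates.sh"
  }
}
```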

And here’s a personal gripe: ever have to jump in for some reason to fix a Terraform problem, like reverting to an old state file, manually releasing a state lock, manually importing resources, or something like that? Can you imagine having to deal with that on top of fixing your failed application deployment?

Now, I know what you’re thinking: I know Terraform well enough to work around these issues. My provider of choice is good enough. The control plane does what I need it to. My app just isn’t that important.

And maybe that’s true.

What’s the Point Again?

In order to use the right tool for the job, you need to acknowledge both the strengths and the limitations of the tools available. While its simple model for managing resources works great for infrastructure, Terraform is a bit too limited for orchestrating complex deployments without resorting to complex hacks, wrappers, plugins, frameworks, templates, and other strategies far too unwieldy and fragile for something as important as software deployments.

It’s just my humble opinion that, for the most part, this is a job for Continuous Deployment and Delivery software. Maybe it’s a job for Kubernetes. I mean, it does excel at running and orchestrating the deployments and updates of arbitrary workloads. Maybe it’s a Jenkins pipeline. Maybe it’s a pipeline running in Cloud Build or even better: Cloud Deploy. Maybe it’s ArgoCD. Or Tekton. Or Flux. The point of this post isn’t to endorse any one tool in particular–only to highlight where Terraform falls short. Pick — maybe even write — a tool that works for you.

So what’s the point? I guess it’s this: Maybe Terraform isn’t always the wrong tool for deploying apps, but it’s almost never the right one.
