Have you ever had to step in to manually fix your infrastructure deployments with Terraform running from a continuous integration system, like Google Cloud Build?
If so, you know it’s not fun–it’s just a bunch of toil.
Terraform is a stateful application–that is, it keeps a record of the resources that it manages over their lifetime in a file Terraform calls the state file, which is persisted to a backend. And to prevent multiple concurrent Terraform deployments from clobbering each other, it uses a lock to ensure only one Terraform deployment can run at a time. This enables consistency between deployments, prevents race conditions, and ensures that resources created by Terraform continue to be managed by it for their full lifecycle, just as you (probably) intended.
But what happens if Terraform is terminated before it updates its state file and releases the lock?
That’s the toil I was talking about–and that’s where you come in.
Generally, it requires the manual removal of the lock with the terraform force-unlock command, followed by a re-run of terraform apply. Resources created during the last, failed run may be left orphaned: if they were never written to the state file, they remain unmanaged by Terraform.
I’ve lived this experience numerous times for various reasons, as I’m sure many other Terraform users have. I’ve mitigated the risk by being careful not to force-quit or accidentally terminate Terraform mid-deployment. Recently, a customer experienced this while running a deployment from Google Cloud Build: their build timed out, orphaning resources, locking the state, and causing them to jump in to manually correct the issues. I can empathize with their frustration. It feels like we can do better. While this could probably be solved by extending the build timeout, I think the situation merits a more nuanced approach.
Build timeouts can occur during a Terraform deployment for a variety of reasons, but one common and totally reasonable explanation is some kind of slowdown in a cloud provider’s control plane. Occasionally–and for a variety of reasons–operations take longer than you’d expect and a deployment, which might have only lasted a few minutes, can run for multiples of that. There are plenty of valid reasons to constrain your build times–like rapid feedback to developers, or iterative deployments for development teams–and ideally, you want your build’s timeouts to accommodate a shorter (but still reasonable) deployment length.
To solve this specific problem in Google Cloud Build, I wrote a Google Cloud Build Command Wrapper and open-sourced it on GitHub. If you’re interested in learning a bit more, keep reading; otherwise, skip to the GitHub repo, read the docs, and get started!
Google Cloud Build Timeouts
In order to write the command wrapper, I needed to understand how Google Cloud Build behaved when a build timed out. I wrote a quick Python script to simulate a long-running command, catching every valid signal and logging it to the console. To simulate Terraform’s behavior when it catches a signal like SIGTERM, the script also runs a simulated clean-up operation.
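A minimal sketch of such a probe script (not the exact one from this post) might look like the following: it registers a handler for every catchable signal, logs whatever arrives, and runs a fake clean-up phase on SIGTERM, the way Terraform would persist state and release its lock.

```python
import signal
import sys
import time

CLEANUP_SECONDS = 2  # stand-in for Terraform writing state and releasing its lock

def handle_signal(signum, frame):
    """Log any signal received; on SIGTERM, run a simulated clean-up phase."""
    name = signal.Signals(signum).name
    print(f"caught {name}", flush=True)
    if signum == signal.SIGTERM:
        print("starting simulated cleanup...", flush=True)
        time.sleep(CLEANUP_SECONDS)
        print("cleanup complete", flush=True)
        sys.exit(0)

def register_all_handlers():
    """Attach the handler to every catchable signal.

    SIGKILL and SIGSTOP cannot be caught, so they are skipped."""
    registered = []
    for sig in signal.Signals:
        try:
            signal.signal(sig, handle_signal)
            registered.append(sig.name)
        except (OSError, ValueError):
            pass  # uncatchable or invalid on this platform
    return registered

if __name__ == "__main__":
    register_all_handlers()
    while True:  # simulate a long-running deployment
        time.sleep(1)
```

Running it in a Cloud Build step and watching the logs shows exactly which signals the build environment delivers, and when.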
Running this, I discovered that Cloud Build jobs are sent a SIGTERM, as one would reasonably expect, triggering the simulated cleanup phase. Unfortunately, not long after, the container is force-terminated.
That’s bad news for Terraform–and why the command wrapper is necessary.
Behold the Command Wrapper
This utility is intended to be a thin wrapper around your non-interactive commands run only in Cloud Build steps. It requires just two pieces of information: the build ID and the project ID. With those, it reaches out to the Cloud Build API to describe the current build, reads the build’s timeout value and start time, and starts a timer that fires a specified interval before the build is expected to be force-terminated. When that timer fires, it sends a signal to the wrapped process–by default, one that Terraform handles gracefully.
When Terraform receives this signal, it stops creating new resources. Ongoing operations are allowed to complete, but Terraform shifts its focus, gracefully terminating the running deployment to update its state and release the state lock.
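The wrapper’s timing logic boils down to: read the build’s start time and timeout, arm a timer for the deadline minus a grace period, and signal the child process when it fires. Here is a rough Python sketch–with a stubbed describe_build in place of the real Cloud Build API call, a hypothetical 60-second grace period, and SIGINT assumed as the signal (Terraform treats an interrupt as a request to stop gracefully):

```python
import signal
import subprocess
import threading
import time

GRACE_SECONDS = 60  # hypothetical default: signal this long before the deadline

def describe_build(project_id, build_id):
    """Stand-in for the Cloud Build API call the real wrapper makes.

    Returns (start_time, timeout_seconds); here we fake a build that
    started just now with a 10-minute timeout."""
    return time.time(), 600

def run_wrapped(cmd, project_id, build_id, grace=GRACE_SECONDS):
    """Run cmd, signaling it `grace` seconds before the build's deadline."""
    start_time, timeout = describe_build(project_id, build_id)
    seconds_left = max(0, (start_time + timeout) - time.time() - grace)

    proc = subprocess.Popen(cmd)
    # SIGINT is assumed here; on receiving it, Terraform stops launching new
    # operations, persists its state, and releases the state lock.
    timer = threading.Timer(seconds_left, proc.send_signal, args=(signal.SIGINT,))
    timer.start()
    try:
        return proc.wait()
    finally:
        timer.cancel()
```

In a build step, this would wrap something like run_wrapped(["terraform", "apply", "-auto-approve"], project_id, build_id), leaving Terraform the grace period to shut down cleanly before Cloud Build pulls the plug.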
It doesn’t solve the core issue–the timeout is the real problem here–but it does give you a safety net, so to speak, to mitigate the full consequences of a forced termination.
While it’s still a good practice to set your build timeouts to a reasonable length, the command wrapper provides an extra layer of protection: it gives build steps enough time to exit gracefully, persist state, release locks, and run again with minimal toil–ideally saving engineers a bit of time for more important work.