Terraform Isn’t Real: Cloud Provider APIs, ownership, and lifecycle blocks

Haris Khan
Published in Immuta Engineering · 9 min read · Jul 29, 2022


When building a system of any kind, the concept of ownership is an inevitable part of all design discussions. Higher-level discussions may revolve around task and deliverable ownership; lower-level discussions may involve memory safety and race conditions. Infrastructure is just another kind of system, so it is no exception.

When we build and deploy infrastructure via Terraform, or import existing resources into Terraform configurations, we often talk about how the module now “owns” that set of resources. This abstraction is generally useful. We’re using a Terraform configuration to manage that set of resources, and when we want to make changes to the resources, we update our config and apply it.

In reality, there’s no bite behind the growl here; the abstraction is arbitrary and unenforceable. It’s perfectly possible for the same resource to be present in the state files of multiple Terraform configurations at once. There’s no neutral third party that knows who actually owns what, nor any way to signal that a particular resource configuration is deprecated. Managing change is still on the humans writing the configurations and managing the infrastructure.

So, when we import something into Terraform, we are importing the object from the actual source of truth: the cloud provider API. The API, however, doesn’t know which parts of a given resource are owned by which Terraform configuration. It doesn’t intrinsically understand the ownership schema you’ve set up in your Infrastructure-as-Code repositories. It gives you the whole resource definition to import, even if you ultimately want different attributes of that resource to be managed by different modules.

Let me give you an example to illustrate.

Example: Importing Route53 Private Hosted Zones

We had some private AWS Route53 Hosted Zones that were being managed by deprecated Terraform configurations. We wanted to move these resources to new modules.

In order for networked resources in VPCs to be able to resolve DNS records in a private hosted zone, the VPC must be associated with that Hosted Zone. Since a private hosted zone could be associated with any number of VPCs, from a management perspective, it made sense that the VPC module itself “owned” the resources needed to create the association. This ownership model provides several benefits:

  1. In order to create a functional VPC, we don’t need to update two different Terraform modules. If the private hosted zone managed all the associations, we’d have to create the VPC, then go update the hosted zone configuration and re-apply it. Not only does that introduce the possibility of human error if someone forgets, it also makes building future automation more difficult: the automation would have to modify Terraform configuration files and run multiple apply commands in a specific order, instead of simply instantiating one module.
  2. If we destroy the VPC, the association is destroyed with it. This is the inverse of the issue above. We prevent the possibility of dangling resources and broken Terraform configuration (referencing non-existent resources) if the association is created and destroyed alongside the VPC itself.

The exception to this rule would be the singular VPC Association that is required to create the resource in the first place; private hosted zones must be associated with at least one VPC upon creation. This is annoying, but whatever. We’ll deal with it.

resource "aws_route53_zone" "this" {
name = var.domain_name
dynamic "vpc" {
for_each = var.is_private ? ["1"] : []
content {
vpc_id = var.vpc_id
}
}
tags = var.tags
}

This is the way the hosted zone resource is represented in our module. It takes a domain name as an input, and if it’s a private hosted zone (represented by the bool variable is_private), it adds the required VPC association. Thus the module should always manage one, and only one, VPC association. All the others are managed by the VPC modules:

[Diagram: An outline of the ownership schema for the private hosted zone. Arrows indicate where each resource is intended to be managed.]

Side note: You may have spotted a potential cyclical dependency here. The hosted zone can’t be created without VPC-1 existing (because of the one required association it needs upon creation), but the Terraform module used to create VPC-1 also typically creates the association so it can manage it…which would require the hosted zone to already exist.

We get around this by making the creation of the hosted zone association the default behavior in the VPC module, with an optional boolean toggle to turn it off when the VPC will be the first one associated with a hosted zone. The toggle is set to false for VPC-1, so it creates the VPC without setting up the hosted zone association and lets the Hosted Zone module do so instead. This breaks the cycle: VPC-1 is created first, the hosted zone is created along with its association to VPC-1, and every subsequent VPC module creates its own association.
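A minimal sketch of what that toggle might look like on the VPC module side (the variable and resource names here are illustrative, not our actual module interface):

variable "associate_with_hosted_zone" {
  description = "Set to false for the first VPC; the Hosted Zone module creates that association itself."
  type        = bool
  default     = true
}

# Associates this VPC with an existing private hosted zone.
resource "aws_route53_zone_association" "this" {
  count = var.associate_with_hosted_zone ? 1 : 0

  zone_id = var.private_hosted_zone_id
  vpc_id  = aws_vpc.this.id
}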

Back to our story: I was importing our Route53 hosted zones into new modules. I successfully imported the private zone, as well as its delegation records. When I ran a plan to fix the tags, however, I noticed that it would also destroy all the VPC associations, which were included as inline blocks.
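The imports themselves are one-liners; they looked something like this (the zone ID and resource addresses here are hypothetical placeholders):

terraform import aws_route53_zone.this Z0123456789ABCDEFGHIJ
terraform import aws_route53_record.delegation Z0123456789ABCDEFGHIJ_sub.example.com_NS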

(NOTE: the following examples of Terraform output have been redacted with ellipses)

  # aws_route53_zone.this will be updated in-place
  ~ resource "aws_route53_zone" "this" {
        ...
      ~ tags     = {
            ...
        }
      ~ tags_all = {
            ...
        }
        # (4 unchanged attributes hidden)

      - vpc {
          - vpc_id     = "vpc-1" -> null
          - vpc_region = "us-east-1" -> null
        }
      - vpc {
          - vpc_id     = "vpc-2" -> null
          - vpc_region = "eu-central-1" -> null
        }
      - vpc {
          - vpc_id     = "vpc-3" -> null
          - vpc_region = "us-east-1" -> null
        }
        ...
        # (1 unchanged block hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

The plan results are misleading here. Resources are in fact going to be destroyed, but because they’re represented as inline blocks in a parent resource, Terraform doesn’t count them in the plan summary’s destroy tally. If I had applied this plan, I would have broken DNS resolution on this hosted zone for every VPC listed here.

That’s no bueno, especially in a production environment. I could have timed it so I destroyed and recreated the associations within a few minutes of each other, but I didn’t want to schedule an outage of any length with customers (internal or external) just to shuffle resource metadata around. While this change would make our lives easier, we generally don’t want to inconvenience our customers without giving them some kind of tangible benefit.

Moreover, that solution does not fix the underlying problem. Terraform doesn’t know or care that I don’t want the secondary VPC associations to be managed by this resource. The cloud provider API gave Terraform the definition of the object it requested, and it imported it wholesale into tfstate.
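You can see this for yourself by querying the API directly: a GetHostedZone call returns every VPC association on the zone, with no notion of which Terraform configuration manages each one (hypothetical zone ID, output abridged):

aws route53 get-hosted-zone --id Z0123456789ABCDEFGHIJ
{
    "HostedZone": { ... },
    "VPCs": [
        { "VPCRegion": "us-east-1", "VPCId": "vpc-1" },
        { "VPCRegion": "eu-central-1", "VPCId": "vpc-2" },
        { "VPCRegion": "us-east-1", "VPCId": "vpc-3" }
    ]
}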

At this point, you may be wondering: “Why is it trying to delete them? Why doesn’t it just leave the imported configuration alone?” Well, it goes back to that pesky requirement we have: we need the module to manage one of the VPC associations. Since we are already defining a VPC association inline, Terraform will try to overwrite the existing resource state. Let’s take a look at the tfstate file to illustrate this more clearly:

{
  "mode": "managed",
  "type": "aws_route53_zone",
  "name": "this",
  "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
  "instances": [
    {
      "schema_version": 0,
      "attributes": {
        ...
        "vpc": [
          {
            "vpc_id": "vpc-1",
            "vpc_region": "us-east-1"
          },
          {
            "vpc_id": "vpc-2",
            "vpc_region": "eu-central-1"
          },
          {
            "vpc_id": "vpc-3",
            "vpc_region": "us-east-1"
          },
          ...
        ],
        ...
      }
    }
  ]
}

All of the VPC inline blocks are collected into a list labeled vpc. When we run a plan on our configuration, it decides that it needs to replace that list with this one:

...
"vpc": [
  {
    "vpc_id": "vpc-1",
    "vpc_region": "us-east-1"
  }
]
...

…which leads it to destroy the other, existing associations.

Here’s the kicker: If we did not have to define the VPC block at all, Terraform would not do anything. The VPC block creation would be a purely additive operation, so nothing would need to be destroyed. Unfortunately, we don’t have a choice. In order to maintain consistency with new private hosted zones created by this module, the module needs to manage the primary VPC association.

It’s also important to note that this would become a problem for all private hosted zones managed by this module, even ones created by it (i.e. not imported). Let’s say that we come back to make changes to this Route53 Hosted Zone configuration later. Since we originally created it, several VPCs have been created and associated with it. When we run a plan, we notice a message that looks like this:

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the
last "terraform apply":

...

Unless you have made equivalent changes to your configuration, or ignored the
relevant attributes using ignore_changes, the following plan may include
actions to undo or respond to these changes.

…and includes the new additions to the vpc list in tfstate. This puts us in exactly the same position as if we had imported the resource; there are now resource attributes in the refreshed state that conflict with what’s listed in the config, and Terraform will attempt to resolve the configuration “drift” by destroying the externally managed VPC associations.

So, how do we solve this problem?

Lifecycle Blocks

The lifecycle block is used to configure a set of meta-arguments that can be added to any resource to alter the conditions under which it’s created, updated, or destroyed.

resource "example_resource" "example" {
  ...

  lifecycle {
    ...
  }
}

There are four arguments you can use in a lifecycle block:

  1. create_before_destroy: (bool) When the provider API requires a resource to be replaced due to a configuration update (e.g. “renaming” an S3 bucket), the replacement resource is created before the existing one is destroyed.
  2. prevent_destroy: (bool) Terraform will force a failure on any plan that attempts to destroy this resource. The only way to destroy it is to remove the argument (or set it to false) first; see the sketch after this list.
  3. replace_triggered_by: (list) If any of the referenced items in the list changes, this resource will be replaced.
  4. ignore_changes: (list) If any of the listed attributes are modified after the resource is created, the changes are ignored during planning. The attributes are still used when the resource is initially created, but once it exists, external changes to them will not appear as configuration drift.
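As a quick illustration of prevent_destroy, here’s a hypothetical guard on a resource you never want a plan to delete (the bucket name is made up):

resource "aws_s3_bucket" "state" {
  bucket = "example-terraform-state" # hypothetical name

  lifecycle {
    # Any plan that would destroy this bucket fails outright.
    prevent_destroy = true
  }
}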

For our purposes, ignore_changes is perfect; it basically exists to solve this problem. It works not only on created resources, but also imported ones, since importing populates tfstate in the same manner. If we modify our aws_route53_zone resource as follows:

resource "aws_route53_zone" "this" {
name = var.domain_name
dynamic "vpc" {
for_each = var.is_private ? ["1"] : []
content {
vpc_id = var.vpc_id
}
}
tags = var.tags

lifecycle {
ignore_changes = [vpc]
}
}

Any changes to the vpc attribute will be ignored after the resource is imported. Since the meta-argument does not affect resource creation, the module will now work as we want it to!

  • When we use the module to create a new private hosted zone, the specified VPC block will be used to create the required VPC association.
  • If we need to import a private hosted zone, we can do so without destroying its existing VPC associations.
  • If we need to update an existing private hosted zone managed by the module, we will not destroy VPC associations that were created later on.

Indeed, if we go to update our live infrastructure, we see that our plan now looks much less destructive:

  # aws_route53_zone.this will be updated in-place
  ~ resource "aws_route53_zone" "this" {
        ...
      ~ tags     = {
            ...
        }
      ~ tags_all = {
            ...
        }
        # (5 unchanged attributes hidden)
        # (1 unchanged block hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Conclusion

Ownership in Terraform is a construct ultimately created and enforced by human engineers, not software.

How you construct a resource’s configuration in Terraform may not translate directly to the way it is housed in, and output from, the cloud provider API. Using separate resources for S3 bucket lifecycle policies, security group rules, and indeed, VPC hosted zone associations does not dictate how those resources are actually represented in API output. The abstractions and ownership schemas we create are for our own benefit, and all of them may break when we migrate a resource to a new home.
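Security groups are a concrete example of that split: if standalone aws_security_group_rule resources own the rules, the parent group typically needs the same ignore_changes treatment so it doesn’t fight over the inline rule attributes. A sketch (with assumed names, not our production config):

resource "aws_security_group" "this" {
  name   = "example" # hypothetical
  vpc_id = var.vpc_id

  lifecycle {
    # Rules are owned by standalone aws_security_group_rule resources,
    # so ignore drift in the inline rule attributes.
    ignore_changes = [ingress, egress]
  }
}

resource "aws_security_group_rule" "https_in" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.this.id
}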

We’ve also presented just one possible answer to this problem; unfortunately, there’s no one-size-fits-all solution to these situations. You need to carefully consider the implications of distributed resource ownership in Terraform. Just like any other kind of state, it can get messy…so be prepared.

Did you know that we’re hiring? If you want to work on some really cool stuff with smart folks, drop us a line!
