Declare implicit dependencies between BigQuery views when using Terraform
dunnhumby has long been a user of public cloud providers, and our preferred provider is Google Cloud Platform (GCP). We apply the principle of infrastructure-as-code in everything we do, and hence have invested in HashiCorp’s Terraform as our tool for standing up infrastructure in GCP.
Of late we have been using Terraform to deploy views to BigQuery, and there’s an idiosyncrasy of doing so that is worth explaining for those who are learning how to use Terraform. Take a look at the following contrived example of a Terraform file that will deploy a BigQuery dataset (called dataset1) and two BigQuery views (called view1 and view2):
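A minimal sketch of such a file might look like the following. It assumes the project and region variables and the google provider that appear in the graph output further down; the query text of view1 is purely illustrative, and the quoted interpolation style matches the older Terraform syntax used throughout this post.

```hcl
# Assumed input variables (they appear as var.project and var.region in the graph output).
variable "project" {}
variable "region" {}

provider "google" {
  project = "${var.project}"
  region  = "${var.region}"
}

resource "google_bigquery_dataset" "dataset1" {
  dataset_id = "dataset1"
}

resource "google_bigquery_table" "view1" {
  # Referencing the dataset resource gives Terraform the view1 -> dataset1 dependency.
  dataset_id = "${google_bigquery_dataset.dataset1.dataset_id}"
  table_id   = "view1"

  view {
    query          = "select 1 as col1" # illustrative query only
    use_legacy_sql = false
  }
}

resource "google_bigquery_table" "view2" {
  dataset_id = "${google_bigquery_dataset.dataset1.dataset_id}"
  table_id   = "view2"

  view {
    # view1 is named here only as literal text, so Terraform cannot see the dependency.
    query          = "select * from `${var.project}.dataset1.view1`"
    use_legacy_sql = false
  }
}
```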
We can see from the query definition of view2 that it depends upon view1 because it refers to view1 in the FROM clause.
select * from `${var.project}.dataset1.view1`
This dependency means that Terraform must deploy view1 before view2, but there is nothing in this configuration to tell Terraform to do that. We can confirm this by examining the output from terraform graph:
digraph {
    compound = "true"
    newrank = "true"
    subgraph "root" {
        "[root] google_bigquery_dataset.dataset1" [label = "google_bigquery_dataset.dataset1", shape = "box"]
        "[root] google_bigquery_table.view1" [label = "google_bigquery_table.view1", shape = "box"]
        "[root] google_bigquery_table.view2" [label = "google_bigquery_table.view2", shape = "box"]
        "[root] provider.google" [label = "provider.google", shape = "diamond"]
        "[root] google_bigquery_dataset.dataset1" -> "[root] provider.google"
        "[root] google_bigquery_table.view1" -> "[root] google_bigquery_dataset.dataset1"
        "[root] google_bigquery_table.view2" -> "[root] google_bigquery_dataset.dataset1"
        "[root] meta.count-boundary (count boundary fixup)" -> "[root] google_bigquery_table.view1"
        "[root] meta.count-boundary (count boundary fixup)" -> "[root] google_bigquery_table.view2"
        "[root] provider.google (close)" -> "[root] google_bigquery_table.view1"
        "[root] provider.google (close)" -> "[root] google_bigquery_table.view2"
        "[root] provider.google" -> "[root] var.project"
        "[root] provider.google" -> "[root] var.region"
        "[root] root" -> "[root] meta.count-boundary (count boundary fixup)"
        "[root] root" -> "[root] provider.google (close)"
    }
}
Terraform knows that view2 depends on dataset1, but it knows nothing about the dependency on view1. That’s a problem: with no dependency declared between the two views, Terraform is free to create them in either order, so it could attempt to create view2 before view1 exists, in which case a deploy-time error would occur.
So let’s fix this. Terraform allows us to explicitly declare dependencies between resources using the depends_on meta-argument, like so:
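As a sketch, and assuming the resource names used in this post, the view2 block with an explicit dependency could look like this. The quoted reference is the form accepted by older Terraform releases; from Terraform 0.12 onwards the reference is written unquoted, as `depends_on = [google_bigquery_table.view1]`.

```hcl
resource "google_bigquery_table" "view2" {
  dataset_id = "${google_bigquery_dataset.dataset1.dataset_id}"
  table_id   = "view2"

  view {
    # The query still names view1 only as literal text.
    query          = "select * from `${var.project}.dataset1.view1`"
    use_legacy_sql = false
  }

  # Explicit dependency: create view1 before view2.
  depends_on = ["google_bigquery_table.view1"]
}
```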
However, this is not the preferred way to solve this problem, as HashiCorp say themselves:
“Implicit dependencies via interpolation expressions are the primary way to inform Terraform about these relationships, and should be used whenever possible.”
Let us instead solve this using interpolation expressions, like so:
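A sketch of the view2 block rewritten this way, again assuming the resource names used in this post, might be:

```hcl
resource "google_bigquery_table" "view2" {
  # Taking the dataset from view1 (rather than hardcoding it) is itself an implicit dependency.
  dataset_id = "${google_bigquery_table.view1.dataset_id}"
  table_id   = "view2"

  view {
    # Every part of the referenced view's name is read from the view1 resource,
    # so Terraform infers that view1 must exist before view2 is created.
    query          = "select * from `${google_bigquery_table.view1.project}.${google_bigquery_table.view1.dataset_id}.${google_bigquery_table.view1.table_id}`"
    use_legacy_sql = false
  }
}
```

No depends_on is needed here, because the references inside the query string already tell Terraform about the ordering.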
Let’s take a closer look at those interpolation expressions:
${google_bigquery_table.view1.project}.${google_bigquery_table.view1.dataset_id}.${google_bigquery_table.view1.table_id}
Notice we are referring to the project, dataset_id and table_id attributes of view1 to build up the name of the referenced view. Examining the output of terraform graph again shows us that Terraform is now aware of view2's dependency on view1:
digraph {
    compound = "true"
    newrank = "true"
    subgraph "root" {
        "[root] google_bigquery_dataset.dataset1" [label = "google_bigquery_dataset.dataset1", shape = "box"]
        "[root] google_bigquery_table.view1" [label = "google_bigquery_table.view1", shape = "box"]
        "[root] google_bigquery_table.view2" [label = "google_bigquery_table.view2", shape = "box"]
        "[root] provider.google" [label = "provider.google", shape = "diamond"]
        "[root] google_bigquery_dataset.dataset1" -> "[root] provider.google"
        "[root] google_bigquery_table.view1" -> "[root] google_bigquery_dataset.dataset1"
        "[root] google_bigquery_table.view2" -> "[root] google_bigquery_table.view1"
        "[root] meta.count-boundary (count boundary fixup)" -> "[root] google_bigquery_table.view2"
        "[root] provider.google (close)" -> "[root] google_bigquery_table.view2"
        "[root] provider.google" -> "[root] var.project"
        "[root] provider.google" -> "[root] var.region"
        "[root] root" -> "[root] meta.count-boundary (count boundary fixup)"
        "[root] root" -> "[root] provider.google (close)"
    }
}
An associated benefit of this approach is that we don’t have to hardcode certain identifiers, such as the name of the dataset, throughout the files, so we adhere more closely to the DRY principle too. If we ever wanted to change the name of the dataset, for example, we would only have to do so in one place.