Create BigQuery Datasets and Tables with Terraform in an elegant and scalable way

Mazlum Tosun
Google Cloud - Community
10 min read · Jul 4, 2023


1. Explanation of the use case presented in this article

This article presents a use case: creating BigQuery datasets and their tables with Terraform in an elegant and scalable way.

The datasets and tables are configured in a human-readable JSON file.

Each person in the project team can easily update this JSON configuration without knowing the underlying technology.

With this configuration, the Terraform code is less verbose and more scalable. To evolve the infrastructure, we only need to update this configuration file.

Two approaches will be presented:

  • First project with a single Terraform module, to create datasets and tables
  • Second project with two Terraform modules, one for datasets and the other for tables

Why show two different approaches:

  • A single module can be sufficient for this kind of use case
  • Sometimes, when the infrastructure code is more complex, it can be interesting to separate the logic into different modules
  • With the multi-module approach, developers can update only one of them if needed (for example, only the datasets)

The diagram of the first project, using one module:

The diagram of the second project, using two modules:

I also created a video on this topic on my GCP YouTube channel; please subscribe to the channel to support my work for the Google Cloud community:

English version

French version

2. Environment variables

Set the following environment variables for the two projects:

export PROJECT_ID={{GCP_PROJECT_ID}}
export LOCATION=europe-west1
export TF_STATE_BUCKET={{TERRAFORM_STATE_BUCKET}}
export TF_STATE_PREFIX={{TERRAFORM_STATE_PREFIX}}
export GOOGLE_PROVIDER_VERSION="= 4.47.0"

3. Structure of the project for one module

In this first example and project, there is only one Terraform module: datasets_and_tables.
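As an indication, the layout of this first project could look like the following sketch, inferred from the file paths used throughout this section (the variables.tf file is an assumption, implied by the var.project_id variable used later):

datasets_and_tables/
├── locals.tf
├── main.tf
├── variables.tf        # assumed: declares var.project_id
└── resource/
    ├── datasets_with_tables.json
    └── schema/
        ├── team_league_raw/
        │   └── team_stat_raw.json
        └── team_league/
            └── team_stat.json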

3.1 The JSON configuration

The datasets and tables to create are defined in the following JSON configuration, datasets_and_tables/resource/datasets_with_tables.json:

[
  {
    "datasetId": "team_league_raw",
    "datasetRegion": "EU",
    "datasetFriendlyName": "Team league Dataset containing raw data",
    "datasetDescription": "Team league raw Dataset description",
    "tables": [
      {
        "tableId": "team_stat_raw",
        "tableSchemaPath": "resource/schema/team_league_raw/team_stat_raw.json"
      }
    ]
  },
  {
    "datasetId": "team_league",
    "datasetRegion": "EU",
    "datasetFriendlyName": "Team league Dataset containing domain data",
    "datasetDescription": "Team league domain Dataset description",
    "tables": [
      {
        "tableId": "team_stat",
        "tableSchemaPath": "resource/schema/team_league/team_stat.json",
        "partitionType": "DAY",
        "partitionField": "ingestionDate",
        "clustering": [
          "teamName",
          "teamSlogan"
        ]
      }
    ]
  }
]

The JSON file contains a list of datasets with their fields.

Each dataset contains a list of tables.
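Each tableSchemaPath points to a standard BigQuery JSON schema file stored in the repository. As an illustration only (these files are not reproduced in this article), the team_stat.json schema could contain at least the fields referenced by the partitioning and clustering configuration; the types, modes and descriptions below are assumptions:

[
  {
    "name": "teamName",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Name of the team (used for clustering)"
  },
  {
    "name": "teamSlogan",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Slogan of the team (used for clustering)"
  },
  {
    "name": "ingestionDate",
    "type": "TIMESTAMP",
    "mode": "NULLABLE",
    "description": "Ingestion date (used as the DAY partitioning field)"
  }
]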

3.2 The Terraform code

The locals.tf file:

locals {
  datasetsArray = jsondecode(file("${path.module}/resource/datasets_with_tables.json"))
  datasetsMap   = { for idx, val in local.datasetsArray : idx => val }

  tables_flattened = flatten([
    for dataset in local.datasetsMap : [
      for table in dataset["tables"] : {
        datasetId              = dataset["datasetId"]
        tableId                = table["tableId"]
        tableSchemaPath        = table["tableSchemaPath"]
        partitionType          = try(table["partitionType"], null)
        partitionField         = try(table["partitionField"], null)
        expirationMs           = try(table["expirationMs"], null)
        requirePartitionFilter = try(table["requirePartitionFilter"], null)
        clustering             = try(table["clustering"], [])
      }
    ]
  ])
}

The datasets array is retrieved from the JSON configuration with the jsondecode function, reading the file from a path relative to the current module:

datasetsArray    = jsondecode(file("${path.module}/resource/datasets_with_tables.json"))

We transform the array into a map, because Terraform's for_each works on maps (or sets of strings):

datasetsMap      = {for idx, val in local.datasetsArray : idx => val}
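For the configuration above, the keys of the resulting map are simply the list indexes, which is also what appears later in the plan output (datasets["0"], datasets["1"]). A simplified view:

# Simplified view of local.datasetsMap for this configuration:
# {
#   "0" = { datasetId = "team_league_raw", datasetRegion = "EU", tables = [...] }
#   "1" = { datasetId = "team_league",     datasetRegion = "EU", tables = [...] }
# }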

Tables are created by a separate Terraform resource, so we need to compute a flattened list of all the tables at the same level. The tables of each dataset element are assembled and flattened to constitute the full list of tables to create:

tables_flattened = flatten([
  for dataset in local.datasetsMap : [
    for table in dataset["tables"] : {
      datasetId              = dataset["datasetId"]
      tableId                = table["tableId"]
      tableSchemaPath        = table["tableSchemaPath"]
      partitionType          = try(table["partitionType"], null)
      partitionField         = try(table["partitionField"], null)
      expirationMs           = try(table["expirationMs"], null)
      requirePartitionFilter = try(table["requirePartitionFilter"], null)
      clustering             = try(table["clustering"], [])
    }
  ]
])

We nest two for expressions: the first iterates over the datasets, the second over each dataset's tables.

The flatten function is then used to bring all the tables to the same level.
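For the configuration above, the flattened list contains two entries, one per table, each carrying the datasetId of its parent dataset. A simplified view, with the optional fields resolved by try:

# Simplified view of local.tables_flattened for this configuration:
# [
#   { datasetId = "team_league_raw", tableId = "team_stat_raw", partitionType = null,  clustering = [] },
#   { datasetId = "team_league",     tableId = "team_stat",     partitionType = "DAY", clustering = ["teamName", "teamSlogan"] },
# ]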

It’s worth noting that some fields of the JSON configuration are optional. These fields concern per-table clustering and partitioning: these optimizations (for performance and cost) are not always needed, for example for certain categories of tables (staging…).

Retrieving a non-existent key from a map in Terraform causes an error. To prevent this situation, we use the try function:

partitionField         = try(table["partitionField"], null)

This function returns the element if it exists, otherwise the specified default value.
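A quick illustration of the values involved, using a table entry from the raw dataset (shown here as comments only):

# "table" stands for one element of dataset["tables"]:
# table = { tableId = "team_stat_raw" }             # no partitionField key
# try(table["partitionField"], null)                # returns null
# try(table["tableId"], null)                       # returns "team_stat_raw"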

All these local variables are then used in the main.tf file:

resource "google_bigquery_dataset" "datasets" {
for_each = local.datasetsMap

project = var.project_id
dataset_id = each.value["datasetId"]
friendly_name = each.value["datasetFriendlyName"]
description = each.value["datasetDescription"]
location = each.value["datasetRegion"]
}

resource "google_bigquery_table" "tables" {
for_each = {for idx, table in local.tables_flattened : "${table["datasetId"]}_${table["tableId"]}" => table}

project = var.project_id
depends_on = [google_bigquery_dataset.datasets]
dataset_id = each.value["datasetId"]
table_id = each.value["tableId"]
clustering = each.value["clustering"]

dynamic "time_partitioning" {
for_each = each.value["partitionType"] != null ? [1] : []

content {
type = each.value["partitionType"]
field = each.value["partitionField"]
expiration_ms = each.value["expirationMs"]
require_partition_filter = each.value["requirePartitionFilter"]
}
}

schema = file("${path.module}/${each.value["tableSchemaPath"]}")
}

There are two resources, one for datasets and the other for tables.

For datasets, a for_each is done on the datasetsMap local variable. The code is scalable: if we want to update our list of datasets, we only need to change the JSON configuration file.

For tables, the principle is the same: a for_each is applied on the tables local variable, transformed into a map:

for_each = {for idx, table in local.tables_flattened : "${table["datasetId"]}_${table["tableId"]}" => table}
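The key combines the dataset and table identifiers, which keeps the keys unique even if two datasets contain a table with the same name. For this configuration, the resulting keys are:

# for_each keys generated for this configuration:
# "team_league_raw_team_stat_raw"
# "team_league_team_stat"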

We indicate that the tables resource depends on the datasets:

depends_on = [google_bigquery_dataset.datasets]

The next part manages the optional time_partitioning block per table. For nested blocks, a dynamic block can make them optional.

One way to deal with a dynamic block is to drive it with a for_each: if the collection used by the for_each is not empty, the block is generated; otherwise it is not:

for_each = each.value["partitionType"] != null ? [1] : []

If the partitionType field exists, we build a non-empty array so the block is generated; otherwise the array is empty and the block is omitted.

3.3 Terragrunt and Cloud Build to launch the Terraform code

In this case, Terragrunt is not needed (one module), but the second project with the multi-module approach will use it, and we wanted to keep the same way of executing the Terraform code.

For a good and deep explanation of the Terragrunt usage, you can check the following article:

This article also shows the use of Cloud Build to execute the Terraform and Terragrunt code, with the CI part handled by the following configurations (a rough sketch of an apply pipeline is given after this list):

  • plan : terraform-plan-modules.yaml
  • apply : terraform-apply-modules.yaml
  • destroy : terraform-destroy-modules.yaml
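The actual Cloud Build files live in the repository; purely as a hedged sketch (the builder image, the working directory and the exact Terragrunt flags below are assumptions, not the author's configuration), an apply pipeline of this kind could look like:

steps:
  - name: 'alpine/terragrunt'   # assumption: any builder image with Terraform and Terragrunt installed
    entrypoint: 'sh'
    args:
      - '-c'
      - 'terragrunt run-all apply --terragrunt-non-interactive -auto-approve'
    env:
      - 'ENV=$_ENV'
      - 'TF_STATE_BUCKET=$_TF_STATE_BUCKET'
      - 'TF_STATE_PREFIX=$_TF_STATE_PREFIX'
      - 'GOOGLE_PROVIDER_VERSION=$_GOOGLE_PROVIDER_VERSION'
      - 'TF_VAR_project_id=$PROJECT_ID'

The substitutions (_ENV, _TF_STATE_BUCKET, _TF_STATE_PREFIX, _GOOGLE_PROVIDER_VERSION) are the same ones passed in the gcloud builds submit command shown in the next section.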

3.4 Execute the Terraform apply with Cloud Build

We can run the Terraform apply with Cloud Build via a trigger or via a gcloud command from our local machine. For simplicity, we use an execution from the local machine:

gcloud builds submit \
--project=$PROJECT_ID \
--region=$LOCATION \
--config terraform-apply-modules.yaml \
--substitutions _ENV=dev,_TF_STATE_BUCKET=$TF_STATE_BUCKET,_TF_STATE_PREFIX=$TF_STATE_PREFIX,_GOOGLE_PROVIDER_VERSION=$GOOGLE_PROVIDER_VERSION \
--verbosity="debug" .

A sample of Terraform logs:

# google_bigquery_dataset.datasets["0"] will be created
+ resource "google_bigquery_dataset" "datasets" {
    + creation_time              = (known after apply)
    + dataset_id                 = "team_league_raw"
    + delete_contents_on_destroy = false
    + description                = "Team league raw Dataset description"
    + etag                       = (known after apply)
    + friendly_name              = "Team league Dataset containing raw data"
    + id                         = (known after apply)
    + last_modified_time         = (known after apply)
    + location                   = "EU"
    + project                    = "{{project}}"
    + self_link                  = (known after apply)
  }

# google_bigquery_table.tables["team_league_raw_team_stat_raw"] will be created
+ resource "google_bigquery_table" "tables" {
    + clustering          = []
    + creation_time       = (known after apply)
    + dataset_id          = "team_league_raw"
    + deletion_protection = false

The resulting datasets and tables in BigQuery:

4. Structure of the project for multiple modules

In this case, we have two modules, one for datasets and the other for tables.

Sometimes we want to separate the logic and the responsibilities in different places for better readability and maintainability; that's why this second approach is shown.

Under the infra root folder, we have the two modules, datasets and tables, and the resource folder containing the JSON configuration at the same level, because this configuration is shared between the modules.
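An assumed layout for this second project, inferred from the paths used in this section (the position of the root terragrunt.hcl and of the schema folder under the tables module are assumptions, deduced from find_in_parent_folders() and from the ${path.module} prefix used in the tables module's main.tf):

infra/
├── terragrunt.hcl                    # assumed root Terragrunt configuration
├── resource/
│   └── datasets_with_tables.json     # shared JSON configuration
├── datasets/
│   ├── locals.tf
│   ├── main.tf
│   └── terragrunt.hcl
└── tables/
    ├── locals.tf
    ├── main.tf
    ├── terragrunt.hcl
    └── resource/
        └── schema/                   # table schema files (team_stat_raw.json, team_stat.json)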

4.1 The JSON configuration

The JSON configuration is the same as in the previous section, but with different dataset names. Indeed, to be able to create the same infrastructure as in the previous project, we need to avoid duplicate BigQuery datasets:

[
  {
    "datasetId": "team_league_raw_multi_modules",
    "datasetRegion": "EU",
    "datasetFriendlyName": "Team league Dataset containing raw data",
    "datasetDescription": "Team league raw Dataset description",
    "tables": [
      {
        "tableId": "team_stat_raw",
        "tableSchemaPath": "resource/schema/team_league_raw/team_stat_raw.json"
      }
    ]
  },
  {
    "datasetId": "team_league_multi_modules",
    "datasetRegion": "EU",
    "datasetFriendlyName": "Team league Dataset containing domain data",
    "datasetDescription": "Team league domain Dataset description",
    "tables": [
      {
        "tableId": "team_stat",
        "tableSchemaPath": "resource/schema/team_league/team_stat.json",
        "partitionType": "DAY",
        "partitionField": "ingestionDate",
        "clustering": [
          "teamName",
          "teamSlogan"
        ]
      }
    ]
  }
]

We added the multi_modules suffix to the dataset names.

4.2 The Terraform modules

4.2.1 The datasets module

The locals.tf file retrieves the datasets from the JSON configuration, as shown in the previous section:

locals {
  datasetsArray = jsondecode(file("${path.module}/../resource/datasets_with_tables.json"))
  datasetsMap   = { for idx, val in local.datasetsArray : idx => val }
}

There is a small difference here: from path.module, we need to go up to the parent folder to access the resource folder:

"${path.module}/../resource/datasets_with_tables.json"

The main.tf file creates all the datasets:

resource "google_bigquery_dataset" "datasets" {
for_each = local.datasetsMap

project = var.project_id
dataset_id = each.value["datasetId"]
friendly_name = each.value["datasetFriendlyName"]
description = each.value["datasetDescription"]
location = each.value["datasetRegion"]
}

4.2.2 The tables module

The locals.tf file contains exactly the same logic as in the previous section:

locals {
  datasetsArray = jsondecode(file("${path.module}/../resource/datasets_with_tables.json"))
  datasetsMap   = { for idx, val in local.datasetsArray : idx => val }

  tables_flattened = flatten([
    for dataset in local.datasetsMap : [
      for table in dataset["tables"] : {
        datasetId              = dataset["datasetId"]
        tableId                = table["tableId"]
        tableSchemaPath        = table["tableSchemaPath"]
        partitionType          = try(table["partitionType"], null)
        partitionField         = try(table["partitionField"], null)
        expirationMs           = try(table["expirationMs"], null)
        requirePartitionFilter = try(table["requirePartitionFilter"], null)
        clustering             = try(table["clustering"], [])
      }
    ]
  ])
}

The main.tf file:

resource "google_bigquery_table" "tables" {
for_each = {for idx, table in local.tables_flattened : "${table["datasetId"]}_${table["tableId"]}" => table}

project = var.project_id
dataset_id = each.value["datasetId"]
table_id = each.value["tableId"]
clustering = each.value["clustering"]

dynamic "time_partitioning" {
for_each = each.value["partitionType"] != null ? [1] : []

content {
type = each.value["partitionType"]
field = each.value["partitionField"]
expiration_ms = each.value["expirationMs"]
require_partition_filter = each.value["requirePartitionFilter"]
}
}

schema = file("${path.module}/${each.value["tableSchemaPath"]}")
}

The logic is mostly the same as in the previous section, but without the depends_on, because there is only the tables resource here.

The dependency between the two modules is handled by Terragrunt: the terragrunt.hcl file of the tables module declares a dependency on the datasets module:

include "root" {
path = find_in_parent_folders()
}

dependencies {
paths = ["../datasets"]
}

With this dependency system (an example run is shown after the list):

  • In apply mode, the datasets module is created before the tables module
  • In destroy mode, the tables module is destroyed before the datasets module
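Concretely, when Terragrunt runs all the modules under the infra folder, it uses these declared dependencies to order the runs. A minimal illustration (assuming Terragrunt is installed and the backend configuration is in place):

# From the folder containing the datasets and tables modules:
terragrunt run-all apply     # datasets is applied first, then tables
terragrunt run-all destroy   # tables is destroyed first, then datasets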

4.3 Execute the Terraform modules with Cloud Build

gcloud builds submit \
--project=$PROJECT_ID \
--region=$LOCATION \
--config terraform-apply-modules.yaml \
--substitutions _ENV=dev,_TF_STATE_BUCKET=$TF_STATE_BUCKET,_TF_STATE_PREFIX=$TF_STATE_PREFIX,_GOOGLE_PROVIDER_VERSION=$GOOGLE_PROVIDER_VERSION \
--verbosity="debug" .

A sample of Terraform logs:

# google_bigquery_dataset.datasets["0"] will be created
+ resource "google_bigquery_dataset" "datasets" {
    + creation_time = (known after apply)
    + dataset_id    = "team_league_raw_multi_modules"

# google_bigquery_table.tables["team_league_multi_modules_team_stat"] will be created
+ resource "google_bigquery_table" "tables" {
    + clustering    = [
        + "teamName",
        + "teamSlogan",
      ]
    + creation_time = (known after apply)
    + dataset_id    = "team_league_multi_modules"
    + table_id      = "team_stat"

The resulting datasets and tables in BigQuery:

Conclusion

This article showed how to create BigQuery datasets and tables in an elegant and scalable way with Terraform.

Sometimes a configuration file can be added to evolve the Terraform code more easily and to keep the HCL part less verbose.

Any person in the project team can easily update a JSON file without knowing the underlying technology.

We also showed two approaches, with one module and with multiple modules, because with more complexity developers may choose to separate the logic into different modules.

The projects shared in this article are accessible from my GitHub repository.

Example with one module:

Example with multiple modules:

If you like my articles and videos and want to see my posts, follow me on:
