Data Science Collective

Advice, insights, and ideas from the Medium data science community

Get Started with Terraform

- A beginner tutorial on deploying Google BigQuery resources with Terraform (1/2)

Charlotte Patola
Data Science Collective
9 min read · Feb 13, 2025


Terraform & BigQuery

Have you heard about Terraform and Infrastructure as Code (IaC) and become curious, but never received a proper introduction or had a chance to try it yourself? This tutorial is for you!

I had worked for several years in a team where there was no designated DevOps role and our data infrastructure was everyone's responsibility. Many were the days when I sweated over cluster settings without having any clue about best practices or how different settings might depend on one another.

When I switched to a team that used IaC and employed DevOps engineers, I got very curious. Now I did not need to worry about infrastructure settings anymore; they had already been defined at the project and resource-type level, and I just needed to deploy my code. How did this all work?

I decided to start looking into the magic of Terraform and gathered some learnings. However, I realised that there was very little information aimed at curious non-DevOps data analysts like myself. This article is my contribution to filling this void.

This tutorial is aimed at people new to Terraform and describes setting up Terraform to deploy datasets, tables, and views to Google BigQuery.

  • In this tutorial (part 1), the deployment is done manually from a local computer.
  • In the follow-up tutorial (part 2), it is automated with GitHub Actions.

1. Prerequisites

The prerequisites for this demo are a Google Cloud Platform (GCP) account with a project as well as a service account with the rights to create, modify, and destroy resources in the project we will be working with.

2. Our Demo Case

We will create one finance dataset and one HR dataset in BigQuery. Both datasets will contain one table and one view. We will use a service account to authorize the changes and then store the Terraform state file in a GCP bucket.

BigQuery expected end result

3. What are Terraform and Infrastructure as Code?

Infrastructure as Code is a way to manage computing infrastructure with code instead of manually. In the context of GCP and our demo case, this translates into deploying and updating GCP resources, like BigQuery tables and views, using files in our codebase instead of doing it manually in the GCP user interface or console.

The main benefit is increased simplicity. Developers do not have to think about infrastructure and deployment but can focus on developing. The infrastructure is defined in one place, and the code can be reused across the codebase.

Additional benefits are increased speed and decreased error risk as manual processes are kept to a minimum.

Terraform is one of the most well-known solutions for IaC. It is open source and developed by HashiCorp. Terraform interacts with different cloud providers via provider plugins. The infrastructure is defined in Terraform files with the .tf extension. Terraform saves the state of the infrastructure and compares it to the current resource configuration to detect the changes that need to be deployed.

The process of deploying infrastructure with Terraform is done in three steps: init, plan, and apply:

# Init initializes Terraform when you start using it in a repository.
terraform init

# Plan compares the configuration to the current state and shows the pending changes.
terraform plan

# Apply confirms and deploys the changes from the plan.
terraform apply

4. Accessing the GCP Account

For Terraform to be able to communicate with our GCP project, we will create a service account with the rights to manage BigQuery tables and views. Follow the steps listed here to set up your service account. To grant access to BigQuery, assign the IAM role BigQuery Data Editor to the service account.
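If you prefer the command line to the console, the service account and role binding can also be created with the gcloud CLI. This is only a sketch: the account name terraform-deployer is an example, and <GCP-PROJECT-NAME> is your own project ID.

# Create the service account (the name is only an example)
gcloud iam service-accounts create terraform-deployer \
  --display-name="Terraform deployer"

# Grant it the BigQuery Data Editor role on the project
gcloud projects add-iam-policy-binding <GCP-PROJECT-NAME> \
  --member="serviceAccount:terraform-deployer@<GCP-PROJECT-NAME>.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"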

When the service account is configured, we will generate a key for it. Terraform will use this key to authenticate with GCP. Follow the steps listed here to do this. Then, download the JSON file and give it a suitable name, such as terraform_gcloud_keys.json. When the Terraform repository is set up, we will move the key file there.
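With the gcloud CLI, the key can be generated and downloaded in one step (again assuming the example account name from above):

# Create a JSON key for the service account and save it locally
gcloud iam service-accounts keys create terraform_gcloud_keys.json \
  --iam-account="terraform-deployer@<GCP-PROJECT-NAME>.iam.gserviceaccount.com"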

Finally, ensure that the BigQuery API is enabled for your project.
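If it is not already enabled, this can also be done from the command line:

# Enable the BigQuery API for the current project
gcloud services enable bigquery.googleapis.com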

Application Default Credentials

An alternative way to authenticate with GCP is to use Application Default Credentials (ADC). This requires you to have Google Cloud CLI installed and configured. When this is set up, you run gcloud auth application-default login login from the terminal from which you will run your terraform code. A browser window will open, and you will authenticate within the browser window. The authentication has to be redone every time you start working in a new terminal. You can read more about this functionality here.
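If you authenticate with ADC, the credentials argument in the provider.tf file shown later in this tutorial can simply be omitted; the Google provider then falls back to the default credentials. A minimal sketch:

provider "google" {
  project = "<GCP-PROJECT-NAME>"
  region  = "europe-west3"
  zone    = "europe-west3-a"
  # No credentials argument; the provider falls back to Application Default Credentials
}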

5. Installing Terraform

There are several options for installing Terraform. If you choose the manual installation, download the zip package that suits your operating system. There is no installer to run; just add the directory containing the terraform binary to your PATH, or move the binary to one of the locations already on your PATH.

You can verify the installation by opening a new terminal session and typing terraform -help. The Terraform help menu should now be shown.
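For example, both of the following should respond from a fresh terminal if the binary is on your PATH:

# Show the Terraform help menu
terraform -help

# Print the installed Terraform version
terraform version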

6. Terraform File Structure

Terraform files can be organized in different ways. For this demo project, we will create a basic file structure containing the following Terraform files:

  • provider.tf
  • variables.tf
  • terraform.tfvars
  • main.tf

Here you see where in the overall repository structure the above-listed Terraform files will be located (marked in bold):

├── terraform-demo
│   ├── terraform
│   │   ├── providers
│   │   │   ├── terraform.tfstate
│   ├── resources
│   │   ├── schema
│   │   │   ├── finance_budget_schema.json
│   │   │   ├── hr_offboarding_survey_schema.json
│   │   ├── sql
│   │   │   ├── finance_budget_view.sql
│   │   │   ├── HR_offboarding_survey_view.sql
│   ├── .gitignore
│   ├── __main.tf__
│   ├── __provider.tf__
│   ├── README.md
│   ├── terraform_gcloud_keys.json
│   ├── __terraform.tfvars__
│   ├── __variables.tf__
provider.tf

The file provider.tf is our starting point. Here, we define basic Terraform settings for our repository.

In the first part of the file, we define the Google provider, which we need to communicate with GCP. The Terraform registry provides an example of how to get started using a provider.

Terraform quickstart with GCP Provider. Screenshot from https://registry.terraform.io/providers/hashicorp/google/latest/docs

Other configurations typically include project name and geographical settings. When using a service account to access our GCP account, we must also provide the path of our service account keys in the provider file. This is unnecessary if we authenticate to GCP with Application Default Credentials.

Move the service account JSON file to the root of the demo repository (see the overall repo structure overview above) and refer to it in the credentials argument of the provider block. Note: if you initialize git in the repository, add the key file to .gitignore!
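A minimal .gitignore for this repository could look something like this:

# Never commit the service account key
terraform_gcloud_keys.json

# Local Terraform working directory and state
.terraform/
terraform.tfstate
terraform.tfstate.backup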

# Basic Terraform configurations
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "6.10.0"
    }
  }
}

provider "google" {
  project     = "<GCP-PROJECT-NAME>"
  region      = "europe-west3"
  zone        = "europe-west3-a"
  credentials = "terraform_gcloud_keys.json"
}

variables.tf

If several GCP resources use the same settings, such as "location", it makes sense to stick to the DRY principle and define them in one place rather than several. The variables file is where we declare our variables. It is possible to set default values in this file, but it is preferable to set the values in the terraform.tfvars file instead.

# Declaration of variables
variable "project" {
  type        = string
  description = "The Google Cloud project used in the repo"
}

variable "project_region" {
  type        = string
  description = "The region of the project"
}

variable "project_zone" {
  type        = string
  description = "The zone of the project"
}

variable "location" {
  type        = string
  description = "The location for the resource"
}

variable "bq_default_table_expiration_ms" {
  type        = number
  description = "How long a BigQuery table is kept before it expires, in milliseconds"
}

variable "bq_deletion_protection" {
  type        = bool
  description = "Whether tables should have deletion protection"
}

variable "bq_use_legacy_sql" {
  type        = bool
  description = "Whether legacy SQL should be used in view queries"
}

terraform.tfvars

terraform.tfvars is the place to set values for the variables declared in variables.tf. If we use different environments, for example LIVE, QA, and DEV, we could create three versions of the file, one for each environment (see the example after the code block below).

# Assignment of values to variables
project                        = "<GCP-PROJECT-NAME>"
project_region                 = "europe-west3"
project_zone                   = "europe-west3-a"
location                       = "EU"
bq_default_table_expiration_ms = 3600000
bq_deletion_protection         = false
bq_use_legacy_sql              = false
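If you go the multi-environment route, the environment-specific file can be selected with the -var-file flag (the file name below is hypothetical):

# Plan and apply using the DEV variable values
terraform plan -var-file="dev.tfvars"
terraform apply -var-file="dev.tfvars"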

main.tf

In main.tf, the resources that Terraform will manage are listed. As stated in the Demo Case description, we define two BigQuery datasets, each containing one table and one view.

To know how to define each resource type, we need to look at the Terraform GCP resource documentation:

Views are covered in the table resource type. The view block is documented here.

We utilize variables wherever possible (prefixed with var.) and extract schemas and SQL into separate files, which we refer to from main.tf.

# Datasets
resource "google_bigquery_dataset" "ds_finance" {
  dataset_id                  = "finance"
  friendly_name               = "Finance Dataset"
  description                 = "This dataset contains tables and views used by the finance department."
  location                    = var.location
  default_table_expiration_ms = var.bq_default_table_expiration_ms
}

resource "google_bigquery_dataset" "ds_hr" {
  dataset_id                  = "hr"
  friendly_name               = "HR Dataset"
  description                 = "This dataset contains tables and views used by the HR department."
  location                    = var.location
  default_table_expiration_ms = var.bq_default_table_expiration_ms
}

# Tables
resource "google_bigquery_table" "table_finance_budget" {
  dataset_id          = google_bigquery_dataset.ds_finance.dataset_id
  table_id            = "finance_budget_table"
  schema              = file("resources/schema/finance_budget_schema.json")
  deletion_protection = var.bq_deletion_protection
}

resource "google_bigquery_table" "table_HR_offboarding_survey" {
  dataset_id          = google_bigquery_dataset.ds_hr.dataset_id
  table_id            = "HR_offboarding_survey_table"
  schema              = file("resources/schema/hr_offboarding_survey_schema.json")
  deletion_protection = var.bq_deletion_protection
}

# Views
resource "google_bigquery_table" "view_finance_budget" {
  dataset_id = google_bigquery_dataset.ds_finance.dataset_id
  table_id   = "finance_budget_view"

  view {
    # Defining the variables used in the view SQL
    query = templatefile("resources/sql/finance_budget_view.sql",
      {
        project    = var.project,
        dataset_id = google_bigquery_table.table_finance_budget.dataset_id,
        table_id   = google_bigquery_table.table_finance_budget.table_id
      }
    )
    use_legacy_sql = var.bq_use_legacy_sql
  }

  deletion_protection = var.bq_deletion_protection
}

resource "google_bigquery_table" "view_HR_offboarding_survey" {
  dataset_id = google_bigquery_dataset.ds_hr.dataset_id
  table_id   = "HR_offboarding_survey_view"

  view {
    # Defining the variables used in the view SQL
    query = templatefile("resources/sql/HR_offboarding_survey_view.sql",
      {
        project    = var.project,
        dataset_id = google_bigquery_table.table_HR_offboarding_survey.dataset_id,
        table_id   = google_bigquery_table.table_HR_offboarding_survey.table_id
      }
    )
    use_legacy_sql = var.bq_use_legacy_sql
  }

  deletion_protection = var.bq_deletion_protection
}

Resources referred to in main.tf

To keep main.tf uncluttered, it is good practice to move SQL, schemas, and the like into separate files. For our demo case, we need two schema files for our two tables and two SQL files for our two views.

SCHEMA

  • terraform-demo/resources/schema/finance_budget_schema.json

[
  {
    "name": "item",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Budget item name"
  },
  {
    "name": "target_value",
    "type": "NUMERIC",
    "mode": "NULLABLE",
    "description": "Target value for the budget item"
  },
  {
    "name": "year",
    "type": "INT64",
    "mode": "NULLABLE",
    "description": "Time period for the yearly budget"
  }
]
  • terraform-demo/resources/schema/hr_offboarding_survey_schema.json

[
  {
    "name": "question_nr",
    "type": "INT64",
    "mode": "REQUIRED",
    "description": "Number of the offboarding question"
  },
  {
    "name": "question",
    "type": "STRING",
    "mode": "REQUIRED",
    "description": "Offboarding question"
  },
  {
    "name": "answer",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Answer to offboarding question"
  },
  {
    "name": "answer_time",
    "type": "TIMESTAMP",
    "mode": "NULLABLE",
    "description": "Time the question was answered"
  }
]

SQL

  • terraform-demo/resources/sql/finance_budget_view.sql

SELECT * FROM `${project}.${dataset_id}.${table_id}`

  • terraform-demo/resources/sql/HR_offboarding_survey_view.sql

SELECT *, CONCAT(question_nr, ' - ', question) AS question_combined FROM `${project}.${dataset_id}.${table_id}`

Test Run

Now everything is ready for a first test deployment with Terraform! Initialize Terraform with terraform init. Terraform will download the provider plugins and set up the backend to store the infrastructure’s current state.

The next step is to run terraform plan. Now, Terraform provides a plan of the pending changes. In our case, this equals creating our six BigQuery resources. After having inspected the plan, go ahead and deploy it with terraform apply.

Before Terraform deploys, it will prompt you to accept with "yes". After deployment, navigate to your GCP project to verify that the BigQuery resources have been created correctly.

State in GCP Bucket

After we have verified that our Terraform structure is working, we can define a remote backend for storing the Terraform state file. If no backend is defined, Terraform saves the state file locally on the computer from which it is run. To store the state in a GCP bucket, we first need to create a bucket in the project we are working in. After creating the bucket, give your service account read and write access to objects in the bucket, for example via the IAM role Storage Object Admin (a read-only role is not enough, since Terraform must write the state file).
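With a recent gcloud CLI, the bucket and the role binding can be set up roughly like this (the bucket name matches the backend example below; the service account name is the example used earlier):

# Create a bucket for the Terraform state (bucket names must be globally unique)
gcloud storage buckets create gs://chpa_terraformdemo_backend_bucket --location=EU

# Give the service account read/write access to objects in the bucket
gcloud storage buckets add-iam-policy-binding gs://chpa_terraformdemo_backend_bucket \
  --member="serviceAccount:terraform-deployer@<GCP-PROJECT-NAME>.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"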

Now we can update the provider.tf file with a new backend block. At the same time, we can also replace the static values in the provider block with variables. Please note that variables cannot be used in the backend block.

# Basic Terraform configurations
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "6.10.0"
    }
  }

  backend "gcs" {
    bucket      = "chpa_terraformdemo_backend_bucket"
    credentials = "terraform_gcloud_keys.json"
  }
}

provider "google" {
  project     = var.project
  region      = var.project_region
  zone        = var.project_zone
  credentials = "terraform_gcloud_keys.json"
}

For Terraform to switch from local to remote backend, you need to re-initialize the backend: terraform init -migrate-state. When prompted to accept the backend migration, enter yes.

Status

Now we can deploy BigQuery resources from our local computer with the help of Terraform, without having to interact with the GCP UI or console.

You can find the repository with the final code here.
