Version control BigQuery with Terraform (with CI/CD too)
Making sure all changes in BigQuery are accounted for
Introduction
Working with teams using BigQuery and creating views can get messy when people start changing views without informing other members of the team.
Some teams are accustomed to using comments at the top of the view, or even a spreadsheet, to record changes. Obviously, this leads to problems in the long run, as people may forget to comment or not do it properly.
Without further ado, how can I version control my BigQuery project so that I can keep track of changes, roll them back when needed, and better manage the project with my team?
To achieve this, we will be looking into two technologies: Terraform (infrastructure as code) and a version control system of your choice (GitHub, GitLab, etc.). We will be using GitHub today! Do not be too worried if you have never heard of Terraform; this post will sort of be a “quick introduction to Terraform on GCP” too!
Quick Jump
- Getting Started
- Initialize Terraform environment
- Creating our first dataset and view
- Version control setup
- git commit, git push
- Make changes and test if we achieved our goal
- Conclusion
Getting Started
This part will be for environments that have not been set up yet.
Not much to get started! You just need to install:
1. Terraform. They provide a great guide, so just make sure you follow along and you should be able to call terraform in your terminal/command prompt
2. git. Please install git for your OS so we can perform basic commands such as git push
TL;DR of steps
- Create a service account on a GCP project
- Create and download JSON key for that service account
- Set up Google Cloud Storage (GCS) to store the Terraform state
- Grant the service account permission to read, write, and create objects in that bucket
- Set up Terraform to connect to GCS
Details for first-timers
Creating a service account and key
Terraform will be interacting with our Google Cloud Platform (GCP) project on our behalf using something called a service account.
Go ahead and create your GCP project with this tutorial provided by Google. When you reach the prompt to choose an IAM role (Step 7 in that tutorial as of June 2021), grant the BigQuery Data Editor role for this tutorial's sake. As you progress, feel free to narrow the permissions down to only what is required. Click next and done.
After creating the service account, click into it and choose KEYS -> ADD KEY -> Create new key. In the popup that appears, choose JSON and click create. A JSON key file will be downloaded. Keep it safe!
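If you prefer the command line, the same setup can roughly be done with gcloud; note that the service account name terraform-bq and the key file name key.json here are hypothetical placeholders I made up for illustration:

# Hypothetical gcloud equivalent of the console steps above
gcloud iam service-accounts create terraform-bq --project=YOUR_PROJECT_NAME
gcloud projects add-iam-policy-binding YOUR_PROJECT_NAME \
  --member="serviceAccount:terraform-bq@YOUR_PROJECT_NAME.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"
gcloud iam service-accounts keys create key.json \
  --iam-account=terraform-bq@YOUR_PROJECT_NAME.iam.gserviceaccount.com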
Create a Google Cloud Storage (GCS) bucket to store the Terraform state
The Terraform state basically remembers and tells Terraform what your infrastructure looks like right now. I suggest reading the Terraform docs to understand it better. We will be using GCS to store the state for two main reasons:
- Sharing the state on GitHub is generally not a good practice, as it may contain sensitive information
- You do not need to worry about losing it and rebuilding the infrastructure
So let us start by creating a GCS bucket following the tutorial here. Remember the name of the bucket you create, as we will use it later. On that bucket, choose Permissions and add the service account created earlier with the Storage Admin role. You can follow this tutorial on how to do so.
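Again, a rough command-line equivalent for those who prefer it, reusing the hypothetical service account name from above:

# Create the bucket, then grant the service account the Storage Admin role on it
gsutil mb -l US gs://YOUR_BUCKET_NAME
gsutil iam ch serviceAccount:terraform-bq@YOUR_PROJECT_NAME.iam.gserviceaccount.com:roles/storage.admin gs://YOUR_BUCKET_NAME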
Initialize Terraform environment
Create a working folder, and inside that folder, we will create a file called providers.tf. I usually use Visual Studio Code to manage my Terraform configurations. A .tf file is a Terraform configuration file, where all our Terraform stuff will live. Terraform providers are basically the platforms you are going to work with, whether Microsoft Azure, GCP, AWS, or others. I am calling this file providers.tf, as this is where I am keeping all my provider configurations; you can name it however you want, but providers is the better standard.
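Below is a minimal providers.tf sketch matching the explanation that follows; YOUR_BUCKET_NAME and YOUR_PROJECT_NAME are placeholders for your own values:

# Store the Terraform state in the GCS bucket we created earlier
terraform {
  backend "gcs" {
    bucket = "YOUR_BUCKET_NAME"
    prefix = "state"
  }
}

# Use the Terraform Google provider against our GCP project
provider "google" {
  project = "YOUR_PROJECT_NAME"
}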
Paste the above content into your providers.tf. In the first “block”, terraform {…}, we are telling Terraform to store our state in GCS. Change YOUR_BUCKET_NAME to the bucket name you created earlier. The prefix is basically just the folder within the bucket.
The next block, provider, indicates that we are using the Terraform Google provider configuration and initializes it with your YOUR_PROJECT_NAME.
Now, Terraform is not able to read your GCS bucket yet, and this is where our service account comes in. There are a few methods of declaring your service account to Terraform, but the method we will use will also be useful when setting up CI/CD on GitHub later. We will set an environment variable pointing to the JSON key file we downloaded earlier, so move that JSON file into your project folder too. In that same folder, open a terminal/command prompt and set the environment variable:
Windows user:
set GOOGLE_CREDENTIALS=JSON_FILE_NAME.json
Linux/macOS:
export GOOGLE_CREDENTIALS=JSON_FILE_NAME.json
The Terraform Google provider will automatically detect the service account key from the environment variable. Now here comes your very first Terraform command, to initialize the environment!
terraform init
If all goes well, you should see lots of green text in your console. GCS should now show a folder called state with a file inside it that ends with .tfstate.
However, if you are met with errors, do not worry; carefully read what the errors say. If it is something about NewClient() failed, ensure your environment variable is set correctly to the JSON file. If it is about access denied, ensure you granted proper access to your service account on the Google IAM page.
Creating our first dataset and view
Great! Now that we have initialized our environment, we can start managing BigQuery using Terraform.
Just like how we have providers.tf to manage our providers, we are going to create bigquery.tf to manage our BigQuery resources. You can even drill down to have three separate files for datasets, tables, and views, but for the time being, we will just create bigquery.tf. We will also create another folder to store our view SQL at bigquery/views/vw_aggregated.sql.
# Some demo content in vw_aggregated.sql
SELECT
LoadDate, Origin, Destination, COUNT(ID) AS counts
FROM
`terraform-testing-gcp.demoset.DemoTable`
GROUP BY
1, 2, 3
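And here is a bigquery.tf sketch matching the walkthrough below; the description and label values are illustrative assumptions you can change or remove:

# Dataset to hold our views
resource "google_bigquery_dataset" "views" {
  dataset_id  = "views"
  description = "Dataset holding our views"
  location    = "US"

  # Labels are completely optional
  labels = {
    env = "demo"
  }
}

# The view itself, created inside the dataset above
resource "google_bigquery_table" "vw_aggregated" {
  dataset_id = google_bigquery_dataset.views.dataset_id
  table_id   = "vw_aggregated"

  view {
    # Read the SQL from the separate file for maintainability
    query          = file("bigquery/views/vw_aggregated.sql")
    use_legacy_sql = false
  }
}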
Copy and paste the above code into bigquery.tf. To understand what is going on: the first resource block takes two parameters, the first being the resource type (google_bigquery_dataset) and the second being an ID (views) you can define yourself. You can find the available resources for the Google provider here. We are creating a dataset in the US using the “google_bigquery_dataset” block with the ID “views”. I have given it a description and some labels. The labels are completely optional and you can remove them.
Next, we are creating a resource “google_bigquery_table” with the ID “vw_aggregated”. Its dataset_id refers to the views dataset we created earlier on. This time, since we are creating a view, we have to open a view {…} block as described in the tutorial here. The first parameter we pass in is the SQL we want the view to use. There are two ways of doing this, one being to type the SQL directly into bigquery.tf itself, e.g. query = "SELECT * FROM ... WHERE 1 = 1". However, we are looking for maintainability, so we have defined our SQL in a separate file at bigquery/views/vw_aggregated.sql.
Alright! Let us run a command to standardize our Terraform code format, then dry run our configuration. terraform plan will basically “plan” and let us know what resources will be created, deleted, modified, and so on.
terraform fmt
terraform plan
If you see x resources to be created, with x being a number, you are good to go! If you are getting x resources to be deleted, please check and verify whether that is intended. Once you have verified this is the intended action, run terraform apply to apply the configuration.
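For reference, a successful dry run should end with a summary line along these lines (the exact counts depend on your configuration):

Plan: 2 to add, 0 to change, 0 to destroy.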
If done correctly, check your BigQuery UI and you should see the resources you defined in your Terraform configuration created! In our case, a dataset called views with vw_aggregated in it was created.
Version control setup
After we have verified our Terraform configuration works as intended, it is time to version control changes to the configuration. The first step is to create a GitHub project, and set it to a private repo, as you would not want your infrastructure setup exposed to the public :).
Follow this guide by GitHub to create a secret called GOOGLE_CREDENTIALS. The value for this secret will be the contents of the JSON key file we used to set our environment variable just now. Just open that JSON file with a text editor and copy-paste its contents into the value field of the secret.
The “secret” we are keeping is the service account key, which allows GitHub Actions to apply our Terraform configuration every time we commit a change.
GitHub secrets are great in our scenario for two reasons:
- We do not need to commit our JSON key file, which other contributors/members could download and misuse
- GitHub Actions will be applying the configurations, so you do not need to distribute service accounts with write permissions to contributors. This forces contributors/members to always commit their work and changes for them to be applied to production
Next, we will need to tell GitHub Actions what needs to be done once a commit happens. Create a folder called .github/workflows and paste the following content into a file in that newly created directory called terraform.yml.
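As a starting point, here is a minimal sketch of what terraform.yml could look like, loosely based on HashiCorp's standard GitHub Actions setup; the workflow name, step names, and action versions are assumptions you can adapt:

name: Terraform

on:
  push:
    branches:
      - main

jobs:
  terraform:
    runs-on: ubuntu-latest
    env:
      # GitHub injects the secret we created; the Google provider picks it up automatically
      GOOGLE_CREDENTIALS: ${{ secrets.GOOGLE_CREDENTIALS }}
    steps:
      - uses: actions/checkout@v2
      - uses: hashicorp/setup-terraform@v1
      - name: Terraform Init
        run: terraform init
      - name: Terraform Format
        run: terraform fmt -check
      - name: Terraform Plan
        run: terraform plan
      - name: Terraform Apply
        run: terraform apply -auto-approve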
Before pushing to GitHub, we want to exclude some files and folders. In the root folder, create a file called .gitignore and paste the content below. This will exclude the JSON key we copied into the directory earlier on, and any state file if there is one. Feel free to add your own entries too!
*.json
*.tfstate
For those who are still confused about the many files we created earlier, here is how your directory should look if you followed this tutorial exactly.
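Reconstructed from the steps above, the layout should look roughly like this (the JSON key file name is a placeholder and will differ for you):

.
├── .github
│   └── workflows
│       └── terraform.yml
├── .gitignore
├── bigquery
│   └── views
│       └── vw_aggregated.sql
├── bigquery.tf
├── providers.tf
└── YOUR_SERVICE_ACCOUNT_KEY.json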
git commit, git push
Here comes the golden moment. The commands below will initialize git and push your configuration to GitHub. Ideally, if everything works fine, GitHub Actions will start running the minute your code is pushed.
terraform fmt
git init
git add .
git remote add origin LINK_TO_YOUR_GIT_REPO
git branch -M main
git commit -m "Initial commit"
git push -u origin main
git init will create a new, empty git repository in the current directory.
git add will add new and changed files and folders in that directory, except those defined in .gitignore.
git remote add origin https://github.com/… will define the remote repository that your current directory should belong to.
git branch -M main will create a branch called main.
git commit -m “YOUR_MESSAGE_HERE” will create a “changelog” entry, and a good message will help you and your team identify what was changed without reading too much of the code.
git push -u origin main will push your code to the main branch.
Check your GitHub repository now and you should see your code there. Clicking on the Actions tab, you will see the workflow we created earlier at .github/workflows/terraform.yml.
If you see a green tick, or orange as it is running, you are good to go! In the event of an error, just click into the workflow and read the logs from each step; the errors are quite easy to narrow down. As you can see, the steps are as we defined them in .github/workflows/terraform.yml.
Make changes and test if we achieved our goal
We are nearly done with the entire pipeline! Next is to test whether what we planned to achieve has been achieved. Let us make some changes to bigquery/views/vw_aggregated.sql. I have changed my query to:
# Some demo content in vw_aggregated.sql
SELECT
CAST(LoadDate AS DATE) AS LoadDate,
Origin, Destination, COUNT(ID) AS counts
FROM
`terraform-testing-gcp.demoset.DemoTable`
GROUP BY
1, 2, 3
Now we will need to add whatever was changed, write a changelog message, and then push it to our repository so that GitHub Actions can apply it for us. Always run terraform fmt to ensure code standardization.
terraform fmt
git add .
git commit -m "Updated vw_aggregated to CAST LoadDate"
git push origin main
Checking on the repo, we can see the changes that were made by me (jonathanlawhh).
And checking the Actions tab, we can see the workflow for that commit has been completed successfully!
And finally, we can see the view in our BigQuery has been updated as expected.
Conclusion
What we have done, in summary, is to first set up a GitHub repository, initialize a Terraform environment, and push that Terraform configuration to the GitHub repository. At the same time, we have also prepared a CI/CD configuration, terraform.yml, to help deploy our configuration once we push a change. With that, it is safe to say that we have achieved the goal of version controlling BigQuery using Terraform and implemented continuous integration and deployment for our project.
Hopefully, this post is sufficient to kick-start your implementation and help you learn more about what can be achieved with Terraform! Of course, the implementation can be improved based on your environment.
If you have any questions, feel free to reach out or even say hi to me through
Messenger widget: https://jonathanlawhh.com
Email: jon_law98@hotmail.com