Manage BigQuery Resources With GitOps

Erkan Ekser · Trendyol Tech · Dec 29, 2022

Hi there 👋

As the DWH team, we manage a large amount of data at Trendyol. It is our duty to present this data to the teams at Trendyol in the fastest, most accurate, and most reliable way. While managing all of these data flows, it is very important to work systematically: keeping track of what was done, eliminating manual workload, and doing everything in a safe way. In this article, Barış Hasdemir and I will talk about how we do project management in BigQuery.

Our Agenda:

  1. Why is it important to keep records of work done?
  2. Why didn’t we choose Terraform?
  3. Bq Metadata Tool Details

Why is it important to keep records of work done?

We manage thousands of tables at Trendyol, and our teams live on this data, so when an incident happens it can become a big problem for them. When data is lost, it becomes important to see the operations related to that job and take action to resolve the issue, which is why keeping them on record matters. In addition, it is easy to accidentally delete a dataset or table from the BigQuery UI. That’s why it is important to run these operations through a tool and keep a record of them.

Why didn’t we choose Terraform?

We tried HashiCorp’s Terraform to manage BigQuery resources. As a DWH team with many developers, it is quite common for several people to make changes at the same time. While working with Terraform, we decided that handling the occasional manual operation this way would slow us down quite a bit. For example:

  • Customized operations such as changing a column.
  • Operations that require a migration.
  • Dropping and re-creating a table at runtime.

That’s why, in this version of the tool, we decided to simply trigger BigQuery operations. In other words, we aim to trigger the changes we make through GitOps, keep a record of them, and push them forward with review.

You can find the source code here: https://github.com/erkanekser/bq_metadata

Bq Metadata Tool Details

The application includes the following operations:

  • Dataset create / delete / update
  • View create / delete / update
  • Table create / delete

How does it work?

  • The project has a Docker image containing Miniconda, gcloud (to run Google Cloud CLI commands), and our scripts for the validation and deployment pipelines. You should build the Docker image with the Dockerfiles in the project and push it to your container registry.

You can find details about GitLab Container Registry here.
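
For reference, building and pushing the image could look roughly like this (the registry path and tag are placeholders for your own GitLab project):

# Log in to the GitLab Container Registry, then build and push the image
docker login registry.gitlab.com
docker build -t registry.gitlab.com/your-group/bq_metadata:latest .
docker push registry.gitlab.com/your-group/bq_metadata:latest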

  • We have importer code to import your existing structure. It will create all of the folders for your project. You should run:
python importer.py -p your_gcp_project_name

We add these configurations to gitlab-ci.yml:

1. SERVICE_ACCOUNT: the Google service account key. Your project will operate using this service account.

2. BQMETADATA_RUN_PIPELINE: should be True if you want the pipeline to be triggered. You can set a different value if you want to disable it.
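
As a rough illustration (this is only a sketch, not necessarily how the tool’s entry.sh checks it), such a flag can gate the deployment script like this:

# Hypothetical guard at the top of the deployment script
if [ "$BQMETADATA_RUN_PIPELINE" != "True" ]; then
  echo "BQMETADATA_RUN_PIPELINE is not True, skipping deployment"
  exit 0
fi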

  • We copy SERVICE_ACCOUNT into /app/credentials.json in the image:
cat "$SERVICE_ACCOUNT" > /app/credentials.json
  • Activate gcloud with:
gcloud auth activate-service-account --key-file=/app/credentials.json

project_id=$(jq -r '.project_id' /app/credentials.json)

gcloud config set project "$project_id"
  • Create config files to create datasets/tables/views. The folder directory structure should look like this:
bq_metadata
└── project_1
    ├── dataset_1
    │   ├── dataset_1.json
    │   ├── tables
    │   │   ├── table_1.json
    │   │   ├── table_2.json
    │   │   └── ...
    │   └── views
    │       ├── view_1.sql
    │       ├── view_2.sql
    │       └── ...
    └── dataset_2
        ├── dataset_2.json
        └── ...

💥 When you merge a merge request, the pipeline will:

- Run entry.sh to deploy your configs. We get the files that you created, updated, or deleted with:

ADDED_FILE_LIST=($(git diff --no-commit-id --name-only --diff-filter=A -M100% HEAD^ HEAD 'bq_metadata/**/*.json' 'bq_metadata/**/*.sql'))
DELETED_FILE_LIST=($(git diff --no-commit-id --name-only --diff-filter=D -M100% HEAD^ HEAD 'bq_metadata/**/*.json' 'bq_metadata/**/*.sql'))
UPDATED_FILE_LIST=($(git diff --no-commit-id --name-only --diff-filter=M HEAD^ HEAD 'bq_metadata/**/*.json' 'bq_metadata/**/*.sql'))

After collecting these lists, the script runs the operations in the following order (a simplified sketch follows the list):

1) Create step
- Dataset
- Table
- View
2) Update step
- Dataset
- Table
- View
3) Delete step
- View
- Table
- Dataset
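
A simplified sketch of what that dispatch could look like (create_dataset, create_table, and create_view are illustrative helpers, not necessarily the actual function names in the repository):

# Datasets first, so that their tables and views can be created afterwards
for file in "${ADDED_FILE_LIST[@]}"; do
  [[ "$file" == *.json && "$file" != */tables/* ]] && create_dataset "$file"
done
for file in "${ADDED_FILE_LIST[@]}"; do
  [[ "$file" == */tables/*.json ]] && create_table "$file"
done
for file in "${ADDED_FILE_LIST[@]}"; do
  [[ "$file" == */views/*.sql ]] && create_view "$file"
done
# Updates run in the same order over UPDATED_FILE_LIST;
# deletes run in reverse (views, tables, datasets) over DELETED_FILE_LIST.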

Here are examples of the configurations:

1️⃣ Dataset:

bq_metadata
└── project_1
    └── dataset_1
        └── dataset_1.json

{
  "location": "EU",
  "projectId": "project-1",
  "datasetId": "dataset_1",
  "properties": {
    "description": "Test-dataset",
    "default_table_expiration_ms": 86400000,
    "default_partition_expiration_ms": 93600,
    "labels": {
      "type": "test"
    }
  }
}
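
For comparison, a roughly equivalent manual bq CLI command would be the following (note that the CLI takes the table expiration in seconds, while the config above uses milliseconds):

bq --location=EU mk --dataset \
  --description "Test-dataset" \
  --default_table_expiration 86400 \
  --label type:test \
  project-1:dataset_1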

2️⃣ Tables:

bq_metadata
└── project-1
    └── dataset_1
        └── tables
            └── table_1.json

{
  "projectId": "project-1",
  "datasetId": "dataset_1",
  "tableId": "table_1",
  "schema": {
    "fields": [
      {
        "description": "Test description",
        "fields": [
          {
            "description": "Test nested",
            "mode": "NULLABLE",
            "name": "test_nested",
            "type": "STRING"
          },
          {
            "description": "Test nested2",
            "fields": [
              {
                "description": "Test nested3",
                "mode": "NULLABLE",
                "name": "test_nested_3",
                "type": "FLOAT"
              }
            ],
            "mode": "REPEATED",
            "name": "test_nested_2",
            "type": "RECORD"
          }
        ],
        "mode": "NULLABLE",
        "name": "full_name",
        "type": "RECORD"
      },
      {
        "mode": "REQUIRED",
        "name": "age",
        "type": "INTEGER"
      },
      {
        "mode": "NULLABLE",
        "name": "date_column",
        "type": "TIMESTAMP"
      }
    ]
  },
  "labels": {
    "color": "green",
    "type": "test_label"
  },
  "timePartitioning": {
    "field": "date_column",
    "type": "DAY",
    "expirationDays": 30000
  }
}
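
A roughly equivalent manual bq CLI command would look like this, assuming the fields array above is saved separately as table_1_schema.json (bq mk expects just the fields list as its schema file):

bq mk --table \
  --schema table_1_schema.json \
  --time_partitioning_field date_column \
  --time_partitioning_type DAY \
  --label color:green --label type:test_label \
  project-1:dataset_1.table_1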

3️⃣ Views

bq_metadata
└── project-1
    └── dataset_1
        └── views
            └── view_1.sql

create or replace view `project-1.dataset_1.view_1` 
as
select * from `project-1.dataset_1.test_table_1`;
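
Since the view definition is plain standard SQL, the same statement can also be run by hand with the bq CLI, for example:

bq query --use_legacy_sql=false < bq_metadata/project-1/dataset_1/views/view_1.sql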

What is next?

In the next part:

  • We will add validation steps.
  • There is no table update section for now, because table update operations generally require manual work. We will improve this by adding solutions for those cases.

Thanks to Barış Hasdemir for his important contributions to the project.
