DBT BigQuery Performance

Mark Scannell
Jun 25 · 3 min read

Introduction

DBT is a great tool for orchestrating and managing SQL running against Data Warehouses. When using BigQuery, it can be useful to profile the DBT runs and capture the slot usage and the bytes processed to measure the cost.

DBT Enhancement

DBT issue 2808 and PR 2809 added a label to each BigQuery job with the dbt_invocation_id. This can be extracted from BigQuery’s INFORMATION_SCHEMA job logs. The job logs are available on a per-user, project, folder, or organisation basis, depending on your access; they can also be obtained from the BigQuery audit logs. In this example, the JOBS_BY_PROJECT view will be used. Lastly, an on-run-start hook can be used to record DBT metadata alongside each run.

How exactly, you ask? Let me show you!

Tracking Performance, Step by Step

Using Felipe Hoffa’s Getting Started with BigQuery and dbt, the easy way, you can get your first model up and running. This model is called first_model.

Step 1: Create a job log for DBT

Prior to each run, the run log table is created (if it doesn’t exist) and a row is inserted with the current timestamp, invocation_id, and DBT project_name. Other metadata can be added to track many other details.

This code is added to dbt_project.yml and can be added to the first_project. DBT provides a large number of macros, including the invocation_id.

Configuration:
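The original configuration is not reproduced here, so the following is a minimal sketch of what such an on-run-start hook could look like in dbt_project.yml. The dataset and table name (dbt_log.dbt_log) and the exact columns are assumptions; invocation_id and project_name are built-in DBT context variables.

```yaml
# Sketch of an on-run-start hook (dataset/table/column names are assumptions).
# Creates the run log table if needed, then records one row per invocation.
on-run-start:
  - |
    CREATE TABLE IF NOT EXISTS dbt_log.dbt_log
    (
      run_started_at TIMESTAMP,
      invocation_id STRING,
      project_name STRING
    );
    INSERT INTO dbt_log.dbt_log (run_started_at, invocation_id, project_name)
    VALUES (CURRENT_TIMESTAMP(), '{{ invocation_id }}', '{{ project_name }}')
```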

Step 2: Run DBT

Run dbt normally from the first_project directory: ./local/bin/dbt run

Step 3: Analyze results

Keep in mind that INFORMATION_SCHEMA must be qualified by region. In the example below, the EU region (region-eu) is used.

Using a sub-select, we can join the matching label value from the INFORMATION_SCHEMA jobs table to the dbt_log table.

First, we write a query that extracts the DBT runs. It scans only the past two days of data (reducing table scans if the tables are partitioned). The query is as follows:
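A sketch of such an extraction query, assuming the dbt_log.dbt_log table from Step 1 and a project in the EU region, might look like this:

```sql
-- Sketch: pull BigQuery jobs carrying a dbt_invocation_id label and join
-- them to the dbt_log table. Region and dataset names are assumptions.
SELECT
  log.project_name,
  log.invocation_id,
  job.job_id,
  job.start_time,
  job.end_time,
  job.total_bytes_processed,
  job.total_slot_ms
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT AS job
JOIN dbt_log.dbt_log AS log
  -- Sub-select matches the invocation_id against the job's labels array.
  ON log.invocation_id = (
    SELECT l.value
    FROM UNNEST(job.labels) AS l
    WHERE l.key = 'dbt_invocation_id')
WHERE job.creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
```

The creation_time filter matters because the JOBS_BY_PROJECT view is partitioned by creation_time, so restricting it limits the scan.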

Finally, we group this information together to get aggregated statistics:
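Building on the same assumed names, a sketch of the aggregation could be:

```sql
-- Sketch: aggregate slot usage and bytes processed per DBT invocation.
-- Table, region, and column names follow the assumptions above.
WITH dbt_jobs AS (
  SELECT
    log.project_name,
    log.invocation_id,
    job.start_time,
    job.end_time,
    job.total_bytes_processed,
    job.total_slot_ms
  FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT AS job
  JOIN dbt_log.dbt_log AS log
    ON log.invocation_id = (
      SELECT l.value
      FROM UNNEST(job.labels) AS l
      WHERE l.key = 'dbt_invocation_id')
  WHERE job.creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
)
SELECT
  project_name,
  invocation_id,
  COUNT(*) AS job_count,
  SUM(total_bytes_processed) AS total_bytes_processed,
  ROUND(SUM(total_slot_ms) / 1000, 1) AS total_slot_seconds,
  TIMESTAMP_DIFF(MAX(end_time), MIN(start_time), SECOND) AS elapsed_seconds
FROM dbt_jobs
GROUP BY project_name, invocation_id
```

This yields one row per DBT invocation with the job count, total bytes processed (a proxy for on-demand cost), total slot-seconds consumed, and wall-clock duration.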

Query results:

Conclusion

Exposing the DBT invocation_id in the BigQuery job logs enables a pattern of rich data analysis. The BigQuery logs include referenced tables, the full query text, and many other interesting attributes that can be correlated with DBT runs and metadata.

Google Cloud - Community

Google Cloud community articles and blogs


A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Mark Scannell

Written by

Strategic Cloud Engineer at Google Cloud
