DBT BigQuery Performance
DBT is a great tool for orchestrating and managing SQL that runs against data warehouses. When using BigQuery, it can be useful to profile DBT runs and capture the slot usage and bytes processed in order to measure cost.
DBT issue 2808 and PR 2809 added a label to each BigQuery job with the dbt_invocation_id. This can be extracted from BigQuery’s INFORMATION_SCHEMA job logs. The job logs are available on a per-user, per-project, per-folder, or per-organisation basis, depending on your access. You can also get the job logs from the BigQuery audit logs. In this example, the JOBS_BY_PROJECT view will be used. Lastly, this can be combined with an on-run-start hook to track DBT metadata.
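To see what these labels look like, a job’s labels can be inspected directly in the job logs. Here is a minimal sketch (the EU region qualifier and the one-day window are illustrative choices; adjust for your project):

```sql
-- Sketch: list recent jobs that carry a dbt_invocation_id label.
SELECT
  job_id,
  creation_time,
  -- Pull the dbt invocation id out of the label array.
  (SELECT l.value
   FROM UNNEST(labels) l
   WHERE l.key = 'dbt_invocation_id') AS dbt_invocation_id
FROM `region-eu.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND EXISTS (SELECT 1 FROM UNNEST(labels) l
              WHERE l.key = 'dbt_invocation_id')
```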
How exactly, you ask? Let me show you!
Tracking Performance, Step by Step
Using Felipe Hoffa’s Getting Started with BigQuery and dbt, the easy way, you can get your first model up and running. This model is called first_model.
Step 1: Create a job log for DBT
Prior to each run, the run log table is created (if it doesn’t already exist) and a row is inserted with the current timestamp, invocation_id, and DBT project_name. Other metadata can be added to track many other details.
This code is added to dbt_project.yml in the first_project. DBT provides a large number of context variables and macros, including the invocation_id.
```yaml
on-run-start:
  # The column names (ts, invocation_id, project_name) are illustrative.
  - "CREATE TABLE IF NOT EXISTS `<project>.temp.dbt_log` (ts TIMESTAMP, invocation_id STRING, project_name STRING)"
  - "INSERT INTO `<project>.temp.dbt_log` VALUES (CURRENT_TIMESTAMP(), '{{ invocation_id }}', '{{ project_name }}')"
```
Step 2: Run DBT
Run dbt normally from the first_project directory with the dbt run command.
Step 3: Analyze results
Keep in mind the INFORMATION_SCHEMA must be specified by region. In the example below, the region is EU.
Using a sub-select, we can join the matching label value from the INFORMATION_SCHEMA jobs table to the dbt_log table.
First, we write a query that analyzes the DBT runs. It extracts data from only the past two days (reducing table scans if the tables are partitioned). The query is as follows:
```sql
SELECT
  dbt_log.invocation_id,
  dbt_log.project_name,
  SUM(1) AS jobs,
  MIN(jobs.start_time) AS start_time,
  -- If there is no end_time (null), then runtime is 0.
  SUM(IFNULL(TIMESTAMP_DIFF(jobs.end_time, jobs.start_time, SECOND), 0))
    AS runtime_secs,
  SUM(jobs.total_slot_ms) / 1000 AS total_slot_secs,
  SUM(IFNULL(jobs.total_bytes_processed, 0))/POW(2, 30)
    AS total_gb_processed
FROM `<project>.temp.dbt_log` dbt_log
JOIN `region-eu.INFORMATION_SCHEMA.JOBS_BY_PROJECT` jobs
  ON (dbt_log.invocation_id =
    -- Select the value (any value) where key is dbt_invocation_id
    (SELECT ANY_VALUE(l.value) FROM jobs.labels l
     WHERE l.key = 'dbt_invocation_id'))
WHERE jobs.creation_time >
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
  AND dbt_log.ts >
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
GROUP BY 1, 2
```
Finally, we group this information together to get aggregated statistics:
```sql
-- DbtRuns is the per-invocation query from the previous step,
-- wrapped as a CTE.
RunAnalysis AS (
  SELECT
    project_name,
    SUM(1) AS runs,
    SUM(jobs)/SUM(1) AS avg_jobs,
    SUM(runtime_secs)/SUM(1) AS avg_runtime_secs,
    SUM(total_slot_secs) AS total_slot_secs,
    SUM(total_gb_processed) AS total_gb_processed,
    -- BigQuery has large variation in parallelism. This is
    -- a crude measure of how many slots (on average) you're using
    -- in parallel during queries
    SUM(total_slot_secs)/SUM(runtime_secs) AS avg_parallelism,
    MIN(start_time) AS first_run,
    MAX(start_time) AS last_run
  FROM DbtRuns
  GROUP BY project_name
)
SELECT * FROM RunAnalysis
```
By exposing the DBT invocation_id in the BigQuery logs, a pattern of rich data analysis is enabled. The BigQuery logs contain the referenced tables, the full query text, and many other interesting attributes that can be correlated with the DBT runs and metadata.
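As a sketch of that pattern, the full query text and each referenced table can be pulled per invocation using the same label join as above (the dbt_log table and EU region follow the earlier examples):

```sql
-- Sketch: correlate DBT invocations with query text and input tables.
SELECT
  dbt_log.invocation_id,
  jobs.query,        -- full SQL text of the BigQuery job
  ref.dataset_id,
  ref.table_id       -- each table the job read from
FROM `<project>.temp.dbt_log` dbt_log
JOIN `region-eu.INFORMATION_SCHEMA.JOBS_BY_PROJECT` jobs
  ON dbt_log.invocation_id =
     (SELECT ANY_VALUE(l.value) FROM jobs.labels l
      WHERE l.key = 'dbt_invocation_id')
LEFT JOIN UNNEST(jobs.referenced_tables) ref
WHERE jobs.creation_time >
  TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
```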