DBT BigQuery Performance

Mark Scannell
Google Cloud - Community
3 min read · Jun 25, 2021

Introduction

DBT is a great tool for orchestrating and managing SQL running against data warehouses. When using BigQuery, it can be useful to profile DBT runs and capture the slot usage and bytes processed in order to measure cost.

DBT Enhancement

DBT issue 2808 and PR 2809 added a label with the dbt_invocation_id to each BigQuery job. This label can be extracted from BigQuery’s INFORMATION_SCHEMA job views, which are available on a per-user, per-project, per-folder, or per-organisation basis, depending on your access. You can also get the job logs from the BigQuery audit logs. In this example, JOBS_BY_PROJECT will be used (a minimal label-extraction sketch follows below). Lastly, this can be combined with an on-run-start hook that records DBT metadata.
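A sketch (assuming the EU region that is used later in this post and a one-day lookback window) of how the dbt_invocation_id label shows up on each job in JOBS_BY_PROJECT:

SELECT
  jobs.job_id,
  -- Pull out the dbt_invocation_id label, if present.
  (SELECT ANY_VALUE(l.value) FROM jobs.labels l
   WHERE l.key='dbt_invocation_id') AS dbt_invocation_id
FROM
  `region-eu.INFORMATION_SCHEMA.JOBS_BY_PROJECT` jobs
WHERE
  jobs.creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)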

How exactly, you ask? Let me show you!

Tracking Performance Step Through

Using Felipe Hoffa’s Getting Started with BigQuery and dbt, the easy way, you can get your first model up and running. This model is called first_model.

Step 1: Create a job log for DBT

Prior to each run, the run log table is created (if it doesn’t exist) and a row is inserted with the current timestamp, invocation_id, and DBT project_name. Other columns can be added to track additional metadata.

This code is added to dbt_project.yml in the first_project. DBT provides a large number of context variables and macros, including invocation_id.

Configuration:

on-run-start:
  - >
    CREATE TABLE IF NOT EXISTS `<project>.temp.dbt_log`
    (start_time TIMESTAMP,
     invocation_id STRING,
     project_name STRING)
  - >
    INSERT INTO `<project>.temp.dbt_log` VALUES
    (CURRENT_TIMESTAMP(),
     '{{invocation_id}}',
     '{{project_name}}')

Step 2: Run DBT

Run dbt normally from the first_project directory: ./local/bin/dbt run
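After the run completes, you can sanity-check that a row was logged for the invocation. A minimal sketch, reusing the same <project> placeholder as the configuration above:

SELECT
  start_time,
  invocation_id,
  project_name
FROM
  `<project>.temp.dbt_log`
ORDER BY
  start_time DESC
LIMIT 10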

Step 3: Analyze results

Keep in mind that the INFORMATION_SCHEMA views must be qualified with a region. In the example below, the region is EU.

Using a sub-select, we can join the matching label value from the INFORMATION_SCHEMA jobs view to the dbt_log table.

First, we write a query that analyzes DBT runs. It extracts data from only the past two days (reducing table scans if the tables are partitioned). The query is as follows:

SELECT
  dbt_log.invocation_id,
  dbt_log.project_name,
  SUM(1) AS jobs,
  MIN(jobs.start_time) AS start_time,
  -- If there is no end_time (null), then runtime is 0.
  IFNULL(
    TIMESTAMP_DIFF(
      MAX(jobs.end_time),
      MIN(jobs.start_time),
      SECOND),
    0)
    AS runtime_secs,
  SUM(IFNULL(jobs.total_slot_ms, 0))/1e3
    AS total_slot_secs,
  SUM(IFNULL(jobs.total_bytes_processed, 0))/POW(2, 30)
    AS total_gb_processed
FROM
  `temp.dbt_log` dbt_log
JOIN `region-eu.INFORMATION_SCHEMA.JOBS_BY_PROJECT` jobs
  ON (dbt_log.invocation_id =
    -- Select the value (any value) where key is dbt_invocation_id
    (SELECT ANY_VALUE(l.value) FROM jobs.labels l
     WHERE l.key='dbt_invocation_id')
  )
WHERE
  dbt_log.start_time >
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY) AND
  jobs.creation_time >
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
GROUP BY
  1, 2

Finally, we group this information together to get aggregated statistics:

WITH
RunAnalysis AS (
  <analysis query>
)
SELECT
  project_name,
  SUM(1) AS runs,
  SUM(jobs)/SUM(1) AS avg_jobs,
  SUM(runtime_secs)/SUM(1) AS avg_runtime_secs,
  SUM(total_slot_secs) AS total_slot_secs,
  SUM(total_gb_processed) AS total_gb_processed,
  -- BigQuery has large variation in parallelism. This is
  -- a crude measure of how many slots (on average) you're using
  -- in parallel during queries.
  SUM(total_slot_secs)/SUM(runtime_secs) AS avg_parallelism,
  MIN(start_time) AS first_run,
  MAX(start_time) AS last_run
FROM
  RunAnalysis
GROUP BY
  1;

Query results:

Conclusion

Exposing the DBT invocation_id in the BigQuery job logs enables a pattern of rich data analysis. The BigQuery logs include the referenced tables, the full query text, and many other interesting attributes that can be correlated with the DBT runs and metadata.
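For example, here is a sketch along the same lines (assuming the same region and log table as above) that lists the distinct tables referenced by each DBT run:

SELECT DISTINCT
  dbt_log.invocation_id,
  ref.dataset_id,
  ref.table_id
FROM
  `temp.dbt_log` dbt_log
JOIN `region-eu.INFORMATION_SCHEMA.JOBS_BY_PROJECT` jobs
  ON (dbt_log.invocation_id =
    (SELECT ANY_VALUE(l.value) FROM jobs.labels l
     WHERE l.key='dbt_invocation_id'))
-- Each job records the tables it read or wrote.
CROSS JOIN UNNEST(jobs.referenced_tables) ref
WHERE
  dbt_log.start_time >
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY) AND
  jobs.creation_time >
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)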
