Loading and transforming data into BigQuery using dbt

A data engineering tool to build Data Lakes, Data Warehouses, Data Marts, and Business Intelligence semantic layers in BigQuery

Lak Lakshmanan
Google Cloud - Community
5 min read · Jun 4, 2020


I seem to be hearing a lot about dbt recently (most recently in a seminar hosted by Graham Polley and Hamza Khan), and so I decided to see what all the fuss was about.

dbt, short for data build tool, is a data engineering tool that allows you to capture dependencies when you have multiple SQL transformation steps. This is very common when you do ELT: you load the raw data into your SQL data warehouse and then carry out transformations in SQL to clean up the data, create data marts, and what not.

In Chapter 4 of the BigQuery book, we loaded up some data about US colleges from a CSV file and did a bunch of transformations. Let’s see how to automate the production of all the tables and views in that chapter using dbt.

0. Setup

First, you have to do four things:

  • pip install dbt
  • Create a service account for dbt to use and allow that service account to create BigQuery tables and run queries
  • Copy a profile and service account key file into ~/.dbt
  • Create a new dbt project

I’ve captured all these steps for you in a bash script. So, simply git clone my repository and run setup.sh:

git clone https://github.com/GoogleCloudPlatform/bigquery-oreilly-book/
cd bigquery-oreilly-book/blogs/dbt_load
./setup.sh

Second, in profiles.yml, change the project name to your own GCP project and change the location of the keyfile to reflect your home directory (mine says /home/jupyter). These are the two lines to change:

project: ai-analytics-solutions  
...
keyfile: /home/jupyter/.dbt/keyfile.json
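
For context, here is a minimal sketch of what the full profiles.yml might look like for BigQuery with service-account authentication. The profile name, dataset, and thread count below are illustrative assumptions rather than the exact file created by setup.sh:

# ~/.dbt/profiles.yml -- minimal sketch; values are illustrative
college_scorecard:                  # assumed profile name; must match dbt_project.yml
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account      # authenticate with the key created during setup
      project: ai-analytics-solutions           # change this to your own GCP project
      dataset: ch04                              # dataset that the models get written to
      keyfile: /home/jupyter/.dbt/keyfile.json   # change this to your home directory
      threads: 1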

1. Loading the data into BigQuery (“Data Lake”)

The first step of an ELT pipeline is the EL. You have to load the data into BigQuery before you can do all the transformations using dbt.

Because the source data is on Google Cloud Storage and is in a supported format (CSV/JSON/Avro/Parquet/ORC at the time I’m writing this), the simplest way to get the data into BigQuery is to set up an external table. This way, we don’t have to duplicate the raw data, just define a table on top of the GCS files by running load_external_gcs.sh:

./load_external_gcs.sh

What the above script does is:

bq $LOC \
mkdef \
--source_format=CSV \
--noautodetect \
$INPUT \
$SCHEMA \
| sed 's/"skipLeadingRows": 0/"skipLeadingRows": 1/g' \
| sed 's/"allowJaggedRows": false/"allowJaggedRows": true/g' \
> $DEF
bq mk --external_table_definition=$DEF ch04.college_scorecard_gcs

You will now have an external table named college_scorecard_gcs in BigQuery. The actual data is in GCS. We can think of this as our Data Lake.

2. First Transformation (“Data Warehouse”)

Let’s now define the first transformation. This is in models/college_scorecard.sql:

WITH etl_data AS (
  SELECT
    * EXCEPT(ADM_RATE_ALL, FIRST_GEN, MD_FAMINC, SAT_AVG, MD_EARN_WNE_P10)
    , {{target.schema}}.cleanup_numeric(ADM_RATE_ALL) AS ADM_RATE_ALL
    , {{target.schema}}.cleanup_numeric(FIRST_GEN) AS FIRST_GEN
    , {{target.schema}}.cleanup_numeric(MD_FAMINC) AS MD_FAMINC
    , {{target.schema}}.cleanup_numeric(SAT_AVG) AS SAT_AVG
    , {{target.schema}}.cleanup_numeric(MD_EARN_WNE_P10) AS MD_EARN_WNE_P10
  FROM
    ch04.college_scorecard_gcs
)
SELECT * FROM etl_data

Several things to note about this:

  • There is no CREATE OR REPLACE TABLE. dbt will wrap the creation of the table for us and write the result to a table named college_scorecard in the dataset ch04. Why? Because in dbt_project.yml, I have specified (a fuller sketch of this file follows this list):
models:
  college_scorecard:
    materialized: table
    description: ELT of College Scorecard Data
  • A “model” in dbt is what you and I would call a query. The name of the dataset was specified in profiles.yml, and the name of the table matches the name of the SQL file.
  • Notice that I am calling a user-defined function ch04.cleanup_numeric (target.schema is the output dataset, so ch04 here). This is defined in macros/cleanup_numeric.sql:
{% macro cleanup_numeric_macro() %}
CREATE OR REPLACE FUNCTION {{target.schema}}.cleanup_numeric(x STRING) AS
(
  IF ( x != 'NULL' AND x != 'PrivacySuppressed',
       CAST(x as FLOAT64),
       NULL )
);
{% endmacro %}
  • All the macros have to be executed before table creation starts. We do that by adding the following lines to dbt_project.yml, which ensure that the above macro runs at the start of every dbt run:
on-run-start:
  - "{{cleanup_numeric_macro()}}"

The end result is that a table named college_scorecard gets created. This table has all the columns in the CSV file, but five of those columns are converted into numbers. This is cleaned-up data; we might even enrich it or filter it. We can think of this as our Data Warehouse.

The way I’ve used the macro here, to create a UDF, is a bit of a hack. The real purpose of macros in dbt is to capture SQL snippets that you want to reuse. For example, if we keep selecting the same five columns, we could define a macro like this:

{% macro selective_college_features() %}
INSTNM, ADM_RATE_ALL, FIRST_GEN, MD_FAMINC, SAT_AVG
{% endmacro %}

and refer to the macro in all the model SQL files:

SELECT
{{ selective_college_features() }}
, MD_EARN_WNE_P10
...

3. Second Transformation (“Data Mart”)

Assume that some business users want a list of selective colleges that are friendly to first-generation students. We know the SQL we want to run. We can put this in models/selective_firstgen.sql:

SELECT
  INSTNM, ADM_RATE_ALL, FIRST_GEN, MD_FAMINC, SAT_AVG, MD_EARN_WNE_P10
FROM
  {{ ref('college_scorecard') }}
WHERE
  SAT_AVG > 1300
  AND ADM_RATE_ALL < 0.2
  AND FIRST_GEN > 0.1

Notice a few things:

  • We are providing a much smaller subset of the data (only a few columns)
  • The FROM clause refers, via ref(), to the college_scorecard table that dbt created; this is how dbt tracks the dependency between the two models

The output gets materialized into a table named selective_firstgen.

4. Third transformation (“Reporting”)

Let’s say we have a reporting use case where we need to rank these selective, first-generation-friendly colleges based on the median income. We can write one more SQL query in models/selective_firstgen_top10.sql:

{{ config(materialized='view') }}   /* overrides the project definition */
SELECT
  INSTNM, ADM_RATE_ALL, FIRST_GEN, MD_FAMINC, SAT_AVG
FROM
  {{ ref('selective_firstgen') }}
ORDER BY
  MD_FAMINC ASC
LIMIT 10

Again, the FROM clause here refers to the other dbt query. dbt manages the dependencies for us. It will create a view named selective_firstgen_top10 because of the config statement at the top of the SQL file.

5. Run it

To run it, simply do:

cd college-scorecard
dbt run

This will create the function (because of the macro definition), two tables (college_scorecard and selective_firstgen), and one view (selective_firstgen_top10).

(Recall that we had to create college_scorecard_gcs outside of dbt, since dbt only does the T part of ELT.)

Wrap this up into a Docker container, and you can automate the whole thing with Cloud Build, or do CI/CD such that the tables get created every time the SQL statements get modified.
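
As a rough illustration, a Cloud Build configuration that runs dbt on every commit might look something like the sketch below. The image tag, directory, and the way profiles.yml and the keyfile are supplied are assumptions you would adapt to your own setup (in practice, prefer Secret Manager or the build’s own service account over a keyfile checked into the repository):

# cloudbuild.yaml -- minimal sketch; image tag and paths are assumptions
steps:
  - name: 'fishtownanalytics/dbt:0.17.0'         # assumed public dbt image
    entrypoint: 'dbt'                            # run the dbt CLI inside the image
    dir: 'blogs/dbt_load/college-scorecard'      # assumed location of the dbt project
    args: ['run', '--profiles-dir', 'profiles']  # assumes profiles.yml lives in ./profiles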

There seems to be some sort of testing capability built into dbt, but I didn’t explore it. In a production system, you’d probably do that too.
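
For reference, dbt’s built-in schema tests are declared in a YAML file alongside the models, and dbt test then verifies those assertions against the created tables. Here is a minimal sketch; the file name schema.yml and the choice of column to test are my assumptions, not part of the example repository:

# models/schema.yml -- minimal sketch of dbt schema tests
version: 2
models:
  - name: selective_firstgen_top10
    description: Top 10 selective, first-generation-friendly colleges
    columns:
      - name: INSTNM
        tests:
          - not_null    # every row must have an institution name
          - unique      # no college should appear twice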

6. Next steps

In this simple example, we see a few common data engineering patterns. BigQuery functions as a:

  • Data Lake: Federated queries on files stored in Google Cloud Storage
  • Data Warehouse: Highly scalable SQL for cleaned-up, enriched data
  • Data Mart: Data extracts to support business users
  • Business Intelligence semantic layer: Views that capture KPIs consistently

dbt was able to set this all up for us in one go, capturing the dependencies. We can check the project into version control, and do testing, scheduled builds, and CI/CD.

The complete code example is on GitHub.

Also, read this nice intro to dbt by Hamza.

Enjoy!
