Creating “Append” Materialization in dbt (data build tool)

Josi Aranda
Published in tiket.com · Feb 21, 2022 · 4 min read

Why bother?

First things first: why even bother creating a custom materialization? Well, dbt-core (v1.0.0) comes with four built-in materializations:

  • View
  • Table
  • Incremental
  • Ephemeral

More details can be found in the official dbt documentation on materializations.
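For context, a model opts into one of these through its config block (or through dbt_project.yml). A minimal sketch, using a hypothetical one-line model:

{{ config(materialized='view') }}

SELECT 1 AS id

Our goal is to make a fifth option, append_only, usable in exactly the same way.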

What is “Append”?

Or, more specifically, “append-only”. Let’s say you want to append the result of your model to a table without any updates or deletions: just insert the data, even if that produces duplicates (perhaps that’s exactly what you want!). This is perfect if you want to treat your result as a log or as historical data. It means that as time goes on, the table will contain different versions of the same record, reflecting how the record has changed over time. Whatever your use case might be, let’s create an append-only materialization in dbt.
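As a quick illustration (with made-up data), imagine a hypothetical orders model run on two different days; the append-only table keeps both versions of the same record:

id | status   | processed_timestamp
42 | pending  | 2022-02-20 09:00:00 UTC
42 | approved | 2022-02-21 09:00:00 UTC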

Let’s get to it!

Disclaimer: We’ll mainly be using Google BigQuery, since it is one of the most prevalent data warehouses used in enterprises. Check out the dbt-bigquery adapter and dbt’s guide on creating new materializations.

(Image: “step 3: ??? … step 4: profit!” cat meme. Source: memegenerator.net)

To achieve our objective, we can follow a typical, logical sequence of steps:

  1. Check if the target table exists
  2. Create a temporary table based on the model as an intermediary. (We can use this as an execution result log as well)
  3. If the table does not exist, then we generate and execute the table creation query
  4. Else insert the new data
  5. Profit $$$

Create the materialization macros

First, we need to create the macro in the macros folder (or whatever folder you set in your dbt project configuration YAML; see the snippet below). Let’s name our materialization append_only and set the adapter to bigquery, since we only want this macro to work with the BigQuery adapter.
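For reference, the folder dbt scans for macros is controlled by the macro-paths key in dbt_project.yml, and it already defaults to the macros folder:

# dbt_project.yml
macro-paths: ["macros"]

With that in place, here is the materialization itself: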

{% materialization append_only, adapter='bigquery' %}
  {%- set target_relation = this %}
  {%- set partition_by = adapter.parse_partition_by(config.get('partition_by', none)) -%}
  {%- set cluster_by = config.get('cluster_by', none) -%}
  {%- set tmp_identifier = "temp_" ~ target_relation.identifier %}
  {%- set tmp_relation = target_relation.incorporate(path={
      "identifier": tmp_identifier,
      "schema": config.get('temp_schema', default=target_relation.schema)}) -%}
  {%- set existing_relation = load_relation(this) -%}

  {# On the first run (or a full refresh) create the table; otherwise append #}
  {% if existing_relation is none or should_full_refresh() %}
    {%- set build_sql = build_append_only_initial_sql(target_relation, tmp_relation) %}
  {% else %}
    {%- set build_sql = build_append_only_sql(target_relation, tmp_relation) %}
  {% endif %}

  {{- run_hooks(pre_hooks) -}}
  {%- call statement('main') -%}
    {{ build_sql }}
  {% endcall %}
  {{ run_hooks(post_hooks) }}

  {% set target_relation = this.incorporate(type='table') %}
  {% do persist_docs(target_relation, model) %}
  {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}
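Note that the materialization reuses dbt’s built-in should_full_refresh() check, so you can still rebuild the table from scratch when you need to (my_model here is a hypothetical model name):

dbt run --full-refresh --select my_model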

After that, let’s create the macros that will generate the query for both scenarios:

  • When the target does not exist (initial load), and
  • When it already exists.

You can add these macros to the same file as the materialization macro.

For the initial load process, we create the temporary table with our model SQL. Then we define the initial_sql to select from the temporary table while also adding the processed_timestamp column as the execution timestamp.

{%- macro build_append_only_initial_sql(target_relation, temp_relation) -%}
  {# Materialize the model's SQL into the temporary table first #}
  {{ create_table_as(True, temp_relation, sql) }}

  {%- set initial_sql -%}
    SELECT
      *,
      CURRENT_TIMESTAMP() AS processed_timestamp
    FROM
      {{ temp_relation }}
  {%- endset -%}

  {{ create_table_as(False, target_relation, initial_sql) }}
{%- endmacro -%}
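To make the flow concrete, here is roughly what the first run boils down to. The table names are illustrative, and the exact DDL emitted by create_table_as depends on the adapter version, so treat this as a sketch rather than the literal output:

CREATE OR REPLACE TABLE `my-project.my_dataset.temp_my_model` AS (
  -- the compiled SQL of your model goes here
  SELECT 1 AS id
);

CREATE OR REPLACE TABLE `my-project.my_dataset.my_model` AS (
  SELECT
    *,
    CURRENT_TIMESTAMP() AS processed_timestamp
  FROM `my-project.my_dataset.temp_my_model`
);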

When the table already exists, we’ll insert the data from the temporary table and add the processed_timestamp column as well.

{%- macro build_append_only_sql(target_relation, temp_relation) -%}
  {%- set columns = adapter.get_columns_in_relation(target_relation) -%}
  {%- set csv_columns = get_quoted_csv(columns | map(attribute="name")) %}
  {{ create_table_as(True, temp_relation, sql) }}
  INSERT {{ target_relation }} ({{ csv_columns }})
  SELECT DISTINCT
    *,
    CURRENT_TIMESTAMP() AS processed_timestamp
  FROM
    {{ temp_relation }}
{%- endmacro -%}
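Because the column list comes from the existing target table, it already includes processed_timestamp, which lines up with the extra CURRENT_TIMESTAMP() column appended to the SELECT. For a hypothetical model with columns id and status, the append run would issue something along these lines:

INSERT `my-project.my_dataset.my_model` (`id`, `status`, `processed_timestamp`)
SELECT DISTINCT
  *,
  CURRENT_TIMESTAMP() AS processed_timestamp
FROM `my-project.my_dataset.temp_my_model`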

Last but not least

Let’s create a model that uses our new materialization. We’ll just use random dummy data, as shown below.

{{ config(materialized='append_only') }}

SELECT
  x AS id,
  CASE CAST(RAND() * 2 AS INTEGER)
    WHEN 0 THEN 'pending'
    WHEN 1 THEN 'approved'
    WHEN 2 THEN 'rejected'
  END AS status
FROM
  UNNEST(GENERATE_ARRAY(1, 1000)) AS x

Next, execute the dbt run command twice to see that the table is created and that new data is appended. As you can see, there are multiple rows with the same id while the processed_timestamp differs, just as expected.
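If you want to verify this yourself from the BigQuery console, a query along these lines (substituting your own project, dataset, and model names) surfaces the duplicated ids:

SELECT
  id,
  COUNT(*) AS row_count,
  COUNT(DISTINCT processed_timestamp) AS run_count
FROM `my-project.my_dataset.my_append_only_model`
GROUP BY id
ORDER BY id
LIMIT 10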

(Image: query result showing rows with duplicate ids and different processed_timestamp values)

Check out some other articles from my team as well!

tiket.com is hiring! Head over to our career site to find job vacancies: https://www.tiket.com/careers/

Join our journey to simplify tourism.
