Don’t Do Analytics Engineering in Snowflake Until You Read This (Hint: dbt)

Using dbt To Create Tables Using Custom Materialization


by Venkatesh Sekar

Imagine an Analytics Engineering solution (think CI/CD for database objects) that works with the Snowflake Cloud Data Warehouse and is…

  • Open-source
  • Easy to understand and learn if you are SQL savvy (roughly 3 days)
  • Git versionable
  • Designed with visual lineage in mind
  • A great way for your analytics teams to get better visibility into data pipelines

Well…it’s here and it’s called dbt!

As I have continued to work with numerous customers across our Snowflake client base, the need for a SQL-centric data transformation and data ops solution has come up time and again, and dbt has really stepped up to take on the challenge.

I’m going to take you through a great use case for dbt and show you how to create tables using custom materialization with Snowflake’s Cloud Data Warehouse.

Notably, if you haven’t tried out Snowflake yet and are looking to build your data warehouse in the cloud, or to move an existing on-premises data warehouse (Netezza, Teradata, Exadata, etc.) to the cloud, you should hands-down try out Snowflake: it just works.

So let’s first dive into dbt and then focus on our use case.

What is dbt?

dbt is a command line tool based on SQL and is primarily used by analysts to do data transformations. In other words, it does the ‘T’ in ELT.

It facilitates writing modular SELECT statements and takes care of dependencies, compilation, and materialization at run time.
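
For example, a dbt model is just a SELECT statement saved in a .sql file; dbt resolves the ref() calls, builds the dependency graph, and materializes the result. A minimal illustrative model (the table and column names here are made up, not from this post's repo) might look like:

-- models/customer_orders.sql (illustrative example)
SELECT
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id) AS order_count
FROM {{ ref('customers') }} AS c
LEFT JOIN {{ ref('orders') }} AS o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name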

If you are interested in learning more about dbt, the documentation and tutorials on the getdbt.com site are a good place to start.

Source tables in dbt

In dbt, a “Source Table” holds the data on which transformations are performed. The transformations are SELECT statements, which are chained together and then materialized into tables.

When developing models with dbt, you write only SELECT statements; model files contain no CREATE OR REPLACE statements. This means that, out of the box, dbt does not offer a way to issue the CREATE TABLE statements needed for source tables; it is up to the user to define those outside of dbt.

Macros in dbt

But if you look closely at how dbt lets you build customizations and enhancements using macros, you will realize that macros are essentially Jinja templates.

And if you search around the dbt code base on GitHub, you will come across common macros such as:

  • create_view_as
  • get_columns_in_relation
  • drop_relation_if_exists
  • alter_column_type
  • truncate_relation
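
These macros are just named Jinja blocks that render SQL. To give a feel for the shape, here is a minimal sketch in the same spirit (this exact macro is my own illustration, not one of dbt's built-ins):

{% macro drop_table_if_exists(database, schema, table) -%}
    {#- Illustrative helper: renders a DROP TABLE IF EXISTS statement for the named Snowflake table -#}
    DROP TABLE IF EXISTS "{{ database }}"."{{ schema }}"."{{ table }}"
{%- endmacro %}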

Materialization in dbt

Materializations are strategies for persisting dbt models in a warehouse such as Snowflake. There are four types of materializations built into dbt. They are:

  • table
  • view
  • incremental
  • ephemeral

dbt also offers the capability to develop custom materializations. Knowing this, I investigated whether I could use that functionality to develop a custom materialization that would do the following:

  • Process a model file which contains a “CREATE TABLE” statement
  • Identify if a column has been added/updated/dropped in the definition and issue an alter statement accordingly
  • Perform a complete, full refresh of the data when requested
  • Backup the table before doing any modifications
  • Migrate the data after the table has been modified

As mentioned, I’m using Snowflake as the database of choice for this example, but dbt does work with other databases as well — so let’s get going on our example.

Defined Macros

I have defined the following macros in snowflake_helper_macros.
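
The full macro file is available in the git repo linked at the end of this post. As a rough idea of the kind of helper it contains, a backup-name builder might look something like this (the macro name and exact timestamp format are my own illustration, not necessarily the repo's code):

{% macro make_backup_table_name(table_name) -%}
    {#- Illustrative: appends a timestamp suffix, e.g. CONTACT_DBT_BACKUP_20191006125145387106 -#}
    {{ table_name }}_DBT_BACKUP_{{ run_started_at.strftime('%Y%m%d%H%M%S%f') }}
{%- endmacro %}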

Persistent Table Materialization

I have defined the custom materialization persistent_table_materialization to handle the above-defined needs. In short, the implementation has the following logic:
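
A custom materialization is itself a Jinja block. The sketch below shows only its rough shape and, in comments, where the steps enumerated earlier fit in; it is a simplified illustration under my own assumptions, not the full implementation from the repo:

{% materialization persistent_table, adapter='snowflake' %}

    {#- Flags passed in from the model's config block -#}
    {%- set retain_previous_version_flg = config.get('retain_previous_version_flg', default=false) -%}
    {%- set migrate_data_over_flg = config.get('migrate_data_over_flg', default=false) -%}

    {{ run_hooks(pre_hooks) }}

    {#- The real implementation does roughly the following at this point:
        1. back up the existing table when retain_previous_version_flg is true
        2. on --full-refresh, run the model's CREATE OR REPLACE TABLE statement as-is
        3. otherwise, diff the new definition against the existing columns and issue
           the matching ALTER TABLE ... ADD/DROP COLUMN statements
        4. copy data back from the backup when migrate_data_over_flg is true -#}

    {% call statement('main') -%}
        {{ sql }}
    {%- endcall %}

    {{ run_hooks(post_hooks) }}

    {{ return({'relations': [this]}) }}

{% endmaterialization %}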

Our dbt Model

Below is an example of a model file that can now be materialized by dbt. This example defines the CONTACT table:

{{ config(materialized='persistent_table',
          retain_previous_version_flg=false,
          migrate_data_over_flg=true) }}

CREATE OR REPLACE TABLE "{{ database }}"."{{ schema }}"."CONTACT" (
    FIRST_NAME VARCHAR(100),
    LAST_NAME VARCHAR(100),
    EMAIL VARCHAR(100),
    STREETADDRESS VARCHAR(100),
    CITY VARCHAR(100)
);

Walking Through dbt Execution Examples

To see this materialization in action, here is a walkthrough with screenshots on the various facilities.

full-refresh

Let's start off with no tables defined in Snowflake. The '--full-refresh' flag tells dbt to create the table as if nothing existed; should the table already exist, it will be recreated (due to the 'CREATE OR REPLACE' in the model).

dbt -d run -m CONTACT --full-refresh

The table is now created and I have inserted some sample records manually. Here is the screenshot:

full-refresh with Migrate Data Enabled

Let's do a full-refresh with 'migrate_data_over_flg' set to true:

config(materialized='persistent_table', retain_previous_version_flg=false, migrate_data_over_flg=true)

Here is the command you want to issue:

dbt -d run -m CONTACT --full-refresh

Again, the table is recreated and I have inserted some sample records manually.

Backup Previous Version

Let’s go through an example of how to retain the previous copy and see what happens after migration.

The screenshot below reflects the CONTACT table as in INFORMATION_SCHEMA.TABLES before refresh:

For this, we set the flag 'retain_previous_version_flg' to true:

config(materialized='persistent_table', retain_previous_version_flg=true, migrate_data_over_flg=true)

We issue the command to do a full-refresh as usual:

dbt -d run -m CONTACT --full-refresh

The screenshot below reflects the various CONTACT tables in Snowflake as in INFORMATION_SCHEMA.TABLES after refresh:

dbt backed up the table as 'CONTACT_DBT_BACKUP_20191006125145387106' and retained its rows (look at the row count). Because 'migrate_data_over_flg' is true, it has also migrated the previous set of data over into the recreated table.

Add column

Now I want to add a column ‘LAST_UPDATED’ to the definition.

CREATE OR REPLACE TABLE "{{ database }}"."{{ schema }}"."CONTACT" (
    FIRST_NAME VARCHAR(100),
    LAST_NAME VARCHAR(100),
    EMAIL VARCHAR(100),
    STREETADDRESS VARCHAR(100),
    CITY VARCHAR(100),
    LAST_UPDATED DATE
)

Notice that I am not writing an 'ALTER TABLE' statement myself; dbt will take care of that. Issue the command without the full-refresh flag:

dbt -d run -m CONTACT

This results in dbt issuing an ‘ALTER TABLE’, as in the log below:
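
The generated statement is along these lines (an illustration; the exact SQL dbt emits may differ slightly):

ALTER TABLE "{{ database }}"."{{ schema }}"."CONTACT" ADD COLUMN LAST_UPDATED DATE;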

The screenshot reflects the Snowflake table structure after this update:

Also, note that the existing records are not deleted, since this was an ALTER statement.

Drop column

Now let’s remove the ‘LAST_UPDATED’ column.

CREATE OR REPLACE TABLE "{{ database }}"."{{ schema }}"."CONTACT" (
    FIRST_NAME VARCHAR(100),
    LAST_NAME VARCHAR(100),
    EMAIL VARCHAR(100),
    STREETADDRESS VARCHAR(100),
    CITY VARCHAR(100)
)

As before, I am not writing an 'ALTER TABLE' statement myself.

dbt -d run -m CONTACT

As before, dbt issues an ‘ALTER TABLE’, as you can see in the log below:
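
The generated statement is roughly (again, an illustration):

ALTER TABLE "{{ database }}"."{{ schema }}"."CONTACT" DROP COLUMN LAST_UPDATED;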

The screenshot below shows the resulting table structure in Snowflake:

Are There Any Limitations To Be Aware Of?

If you are using this approach with dbt, keep the following points in mind:

  • You cannot use this model in a ‘source’ or a ‘ref’ call
  • Do not ask dbt to do a 'run' across the entire project, as this could result in accidentally recreating the tables (although you would have a backup if the flags were set)

Where Should You Go From Here?

With the capability that dbt brings to the table for creating tables using custom materialization, I feel very good recommending dbt for managing the database objects in your CI/CD pipeline. You can access everything in my git repo here.

You should also check out a recent blog post by John Aven (a fellow Hashmapper) on Using DBT to Execute ELT Pipelines in Snowflake.

If you use Snowflake today, it would be great to hear about the approaches that you have taken for Data Transformation and DataOps along with the challenges that you are addressing.

Need Snowflake Cloud Data Warehousing and Migration Assistance?

If you'd like additional assistance in this area, Hashmap offers a range of enablement workshops and consulting service packages, and we would be glad to work through your specifics.

How does Snowflake compare to other data warehouses? Our technical experts have implemented over 250 cloud/data projects in the last 3 years and conducted unbiased, detailed analyses across 34 business and technical dimensions, ranking each cloud data warehouse.

To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap's podcast Hashmap on Tap on Spotify, Apple, Google, and other popular streaming apps.

Other Tools and Content You Might Like

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here.

Venkat Sekar is Regional Director for Hashmap Canada and is an architect and consultant providing Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers.
