Don’t Do Analytics Engineering in Snowflake Until You Read This (Hint: dbt)

Using dbt To Create Tables Using Custom Materialization

Hashmap
Oct 10

by Venkatesh Sekar

Imagine you had an Analytics Engineering solution (think CI/CD for database objects) that works with Snowflake Cloud Data Warehouse and is…

  • Open source
  • Easy to understand and learn if you are SQL savvy (roughly 3 days)
  • Git versionable
  • Designed with visual lineage in mind
  • A great way for your analytics teams to get better visibility into data pipelines

Well…it’s here and it’s called dbt!

As I have continued to work with numerous customers across our Snowflake client base, the need for a SQL-centric data transformation and data ops solution has come up time and again and dbt has really stepped up to take on the challenge.

I’m going to take you through a great use case for dbt and show you how to create tables using custom materialization with Snowflake’s Cloud Data Warehouse.

Notably, if you haven’t tried out Snowflake yet and are looking to leverage the cloud for your data warehouse, or are moving an existing on-premises data warehouse (Netezza, Teradata, Exadata, etc.) to the cloud, you should hands-down try out Snowflake. It just works.

So let’s first dive into dbt and then focus on our use case.

What is dbt?

dbt (data build tool) facilitates writing modular SELECT statements and takes care of dependencies, compilation, and materialization at run time.

If you are interested in learning more about dbt, the documentation and tutorials on the getdbt.com site are a good place to start.

Source tables in dbt

When developing scripts or models with dbt, you write SELECT statements only; you do not write CREATE OR REPLACE statements in model files. This means that dbt does not, out of the box, offer a way to issue the CREATE TABLE statements needed for source tables; it is up to the user to define these outside of dbt.
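
For context, a regular dbt model is just a SELECT statement wrapped in optional configuration. The sketch below is purely illustrative (the model name, source table, and columns are made up, not from this post), but it shows why the source table itself has to exist before dbt runs:

-- models/contact_emails.sql (hypothetical example of a plain dbt model)
{{ config(materialized='view') }}

select
    first_name,
    last_name,
    email
from MY_DB.MY_SCHEMA.RAW_CONTACTS   -- source table that must already exist outside of dbt
where email is not null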

Macros in dbt

dbt also lets you package reusable SQL as Jinja macros, and it ships with many of them. If you search around the code base on GitHub, you will come across common macros such as the following (a short usage sketch follows the list):

  • create_view_as
  • get_columns_in_relation
  • drop_relation_if_exists
  • alter_column_type
  • truncate_relation
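
As a rough illustration of how macros are used (this sketch is mine, not from the post), a macro defined in your project's macros folder can be called from any model, and the built-in macros above are invoked in the same way:

-- macros/limit_in_dev.sql (hypothetical helper macro)
{% macro limit_in_dev(row_count=100) %}
    {% if target.name == 'dev' %}
        limit {{ row_count }}
    {% endif %}
{% endmacro %}

-- usage inside a model
select * from MY_DB.MY_SCHEMA.RAW_CONTACTS
{{ limit_in_dev(500) }}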

Materialization in dbt

Materializations control how dbt persists a model’s results in the warehouse. Out of the box, dbt ships with four of them (a configuration example follows the list):

  • table
  • view
  • incremental
  • ephemeral
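
You choose a materialization per model in its config block (or project-wide in dbt_project.yml). For example, a minimal sketch of an incremental model, with a made-up source table and timestamp column, looks like this:

-- hypothetical model using the built-in incremental materialization
{{ config(materialized='incremental') }}

select *
from MY_DB.MY_SCHEMA.RAW_CONTACTS
{% if is_incremental() %}
where loaded_at > (select max(loaded_at) from {{ this }})   -- loaded_at is an assumed column
{% endif %}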

dbt also offers the capability to develop custom materializations. Knowing this, I investigated whether I could use this functionality to build a custom materialization that would do the following:

  • Process a model file which contains a 'CREATE TABLE' statement
  • Identify whether a column has been added/updated/dropped in the definition and issue an ALTER statement accordingly
  • Do a complete, full refresh of the data when requested
  • Back up the table before making any modifications
  • Migrate the data over after the table has been modified

As mentioned, I’m using Snowflake as the database of choice for this example, but dbt does work with other databases as well — so let’s get going on our example.

Defined Macros
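
The materialization relies on a handful of helper macros kept in the project's macros folder. As a loose illustration only (this is not the author's actual code), one such helper might build the timestamped backup table name you will see later in this post:

-- macros/make_backup_table_name.sql (hypothetical sketch)
{% macro make_backup_table_name(relation) %}
    {#- suffix like 20191006125145387106: run timestamp down to microseconds -#}
    {%- set suffix = run_started_at.strftime('%Y%m%d%H%M%S%f') -%}
    {{ return(relation.identifier ~ '_DBT_BACKUP_' ~ suffix) }}
{% endmacro %}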

Persistent Table Materialization
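
A custom materialization is declared with dbt's materialization block. The skeleton below is only a rough sketch of the overall shape for the Snowflake adapter, with the interesting logic summarized in comments rather than spelled out; it is not the author's full implementation:

-- macros/persistent_table_materialization.sql (illustrative skeleton only)
{% materialization persistent_table, adapter='snowflake' %}

    {%- set target_relation = this -%}
    {%- set full_refresh_mode = flags.FULL_REFRESH -%}
    {%- set retain_previous_version = config.get('retain_previous_version_flg', default=false) -%}
    {%- set migrate_data_over = config.get('migrate_data_over_flg', default=false) -%}

    {{ run_hooks(pre_hooks) }}

    {#- On a full refresh: optionally back up the existing table (retain_previous_version),
        run the model's CREATE OR REPLACE TABLE (the compiled model SQL), and optionally
        copy the old rows back in (migrate_data_over).
        Otherwise: compare the declared columns with the existing table and issue the
        appropriate ALTER TABLE ... ADD/DROP COLUMN statements. -#}
    {% call statement('main') -%}
        {{ sql }}
    {%- endcall %}

    {{ run_hooks(post_hooks) }}

    {#- newer dbt versions expect the materialization to return the relations it created -#}
    {{ return({'relations': [target_relation]}) }}

{% endmaterialization %}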

Our dbt Model

{{ config(materialized='persistent_table'
         ,retain_previous_version_flg=false
         ,migrate_data_over_flg=true) }}

CREATE OR REPLACE TABLE "{{ database }}"."{{ schema }}"."CONTACT" (
    FIRST_NAME    VARCHAR(100),
    LAST_NAME     VARCHAR(100),
    EMAIL         VARCHAR(100),
    STREETADDRESS VARCHAR(100),
    CITY          VARCHAR(100)
);

Walking Through dbt Execution Examples

full-refresh

dbt -d run -m CONTACT --full-refresh

The table is now created, and I have inserted some sample records manually.
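
The exact rows don't matter; the manual inserts were just something along these lines (database, schema, and values are placeholders I've made up):

INSERT INTO "MY_DB"."MY_SCHEMA"."CONTACT"
    (FIRST_NAME, LAST_NAME, EMAIL, STREETADDRESS, CITY)
VALUES
    ('Jane', 'Doe', 'jane.doe@example.com', '1 Main St', 'Toronto'),
    ('John', 'Roe', 'john.roe@example.com', '2 King St', 'Calgary');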

full-refresh with Migrate Data Enabled

{{ config(materialized='persistent_table', retain_previous_version_flg=false, migrate_data_over_flg=true) }}

Here is the command you want to issue:

dbt -d run -m CONTACT --full-refresh

Again, the table is recreated and I have inserted some sample records manually.
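
Conceptually, with migrate_data_over_flg set to true the materialization copies the rows from the backed-up previous version of the table into the freshly created one. In Snowflake that boils down to something like the statement below (database, schema, and the backup table name are illustrative):

-- illustrative only: copy data from the previous version into the new table
INSERT INTO "MY_DB"."MY_SCHEMA"."CONTACT"
SELECT FIRST_NAME, LAST_NAME, EMAIL, STREETADDRESS, CITY
FROM "MY_DB"."MY_SCHEMA"."CONTACT_DBT_BACKUP_20191006125145387106";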

Backup Previous Version

The screenshot below reflects the CONTACT table as seen in INFORMATION_SCHEMA.TABLES before the refresh:

For this, we set the flag 'retain_previous_version_flg' to true:

{{ config(materialized='persistent_table', retain_previous_version_flg=true, migrate_data_over_flg=true) }}

We issue the command to do a full refresh as usual:

dbt -d run -m CONTACT --full-refresh

The screenshot below reflects the various CONTACT tables in Snowflake, as seen in INFORMATION_SCHEMA.TABLES after the refresh:

dbt backed up the table as 'CONTACT_DBT_BACKUP_20191006125145387106' and retained its rows (look at the row count). Because 'migrate_data_over_flg' was set, it has also migrated the previous set of data over to the new table.
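
If you want to verify this yourself, a query along these lines (database and schema names are placeholders) lists the current table and its backups together with their row counts:

SELECT TABLE_NAME, ROW_COUNT, CREATED
FROM "MY_DB".INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'MY_SCHEMA'
  AND TABLE_NAME LIKE 'CONTACT%'
ORDER BY CREATED DESC;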

Add column

CREATE OR REPLACE TABLE "{{ database }}"."{{ schema }}"."CONTACT" (
    FIRST_NAME    VARCHAR(100),
    LAST_NAME     VARCHAR(100),
    EMAIL         VARCHAR(100),
    STREETADDRESS VARCHAR(100),
    CITY          VARCHAR(100),
    LAST_UPDATED  DATE
);

Notice that I am not writing an 'ALTER TABLE' statement; the model still contains only the CREATE TABLE definition, now with a LAST_UPDATED column, and dbt works out the difference. Issue the run command without the full-refresh flag:

dbt -d run -m CONTACT

This results in dbt issuing an 'ALTER TABLE' behind the scenes, as you can see in the dbt run log.
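
The statement dbt ends up running is essentially the following (database and schema names are placeholders):

ALTER TABLE "MY_DB"."MY_SCHEMA"."CONTACT" ADD COLUMN LAST_UPDATED DATE;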

The screenshot reflects the Snowflake table structure after this update:

Also, note that the existing records are not deleted, since this was an ALTER statement rather than a table rebuild.

Drop column

CREATE OR REPLACE TABLE "{{ database }}"."{{ schema }}"."CONTACT" (
    FIRST_NAME    VARCHAR(100),
    LAST_NAME     VARCHAR(100),
    EMAIL         VARCHAR(100),
    STREETADDRESS VARCHAR(100),
    CITY          VARCHAR(100)
);

Same as before: I am not writing an 'ALTER TABLE' statement, just removing LAST_UPDATED from the CREATE TABLE definition.

dbt -d run -m CONTACT

As before, dbt issues an 'ALTER TABLE', as you can see in the run log.
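
Again, the generated statement is essentially this (placeholders for database and schema):

ALTER TABLE "MY_DB"."MY_SCHEMA"."CONTACT" DROP COLUMN LAST_UPDATED;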

The screenshot below shows the resulting table structure in Snowflake:

Are There Any Limitations To Be Aware Of?

  • You cannot use this model in a 'source' or a 'ref' call (see the sketch after this list)
  • Do not ask dbt to do a 'run' across the entire project, as this could accidentally recreate the tables (although you would have a backup if the flags were set)
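
In other words, a downstream model that tries something like the sketch below will not work, because a persistent_table model cannot be referenced the way ordinary table/view models can (the downstream model name is made up):

-- models/contact_report.sql (hypothetical downstream model)
select city, count(*) as contact_count
from {{ ref('CONTACT') }}   -- ref to the persistent_table model is not supported
group by city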

Where Should You Go From Here?

You should also check out fellow Hashmapper John Aven’s recent blog post on Using DBT to Execute ELT Pipelines in Snowflake.

If you use Snowflake today, it would be great to hear about the approaches that you have taken for Data Transformation and DataOps along with the challenges that you are addressing.

Feel free to share this on other channels, and be sure to keep up with all new content from Hashmap here.

Venkat Sekar is Regional Director for Hashmap Canada and an architect and consultant providing Data, Cloud, IoT, and AI/ML solutions and expertise across industries, working with a group of innovative technologists and domain experts to accelerate high-value business outcomes for our customers.
