How Snowflake makes the dbt Python models shine

Now that dbt has made Python a seamless option within your SQL pipelines, it’s super cool to see how they made this possible with Snowflake. That’s what this post is about. Are you ready to try this out?

Felipe Hoffa

Published in

Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

9 min readJul 26, 2023

What is dbt

Before we start, let me explain dbt — especially for data engineers that still have not worked with it: “dbt” is a tool that helps us manage all the transformations we do with SQL to data in our databases.

What is dbt? | dbt Developer Hub

dbt compiles and runs your analytics code against your data platform, enabling you and your team to collaborate on a…

docs.getdbt.com

For example, let’s say we have a table with all our customer orders, and we want to keep a count of how many of these orders happened during on a Holiday. We could solve this with 2 queries in SQL:

Add a column is_holiday to the table sf100_orders, by joining it to a table with all_holidays. Call the resulting table sf100_orders_annotated.
Count the rows that happen to have is_holiday=true on the table sf100_orders_annotated.

These 2 steps that combine the source data, as visualized by dbt:

Lineage graph — from the 2 source tables a table with a count of the Holidays.

We can use dbt to write these 2 transformations as “dbt models”, which are files that contain SQL and a little bit of dbt configuration (when necessary):

-- sf100_orders_annotated.sql
{{ config(materialized='table') }}

select a.*, b.date is not null is_holiday
from {{ref('sf100_orders')}} a
left join {{ref('all_holidays')}} b
on a.o_orderdate = b.date

-- count_holidays.sql
{{config(materialized='table')}}

select count(*) total_days
    , count_if(is_holiday) holiday_count
    , holiday_count/total_days ratio
from {{ref('sf100_orders_annotated')}}

These 2 files look like typical SQL — with a twist. Instead of having references to tables, dbt allows us to refer to other “models” with the use of annotations{{ref(‘sf100_orders_annotated’)}} . This is how dbt determines the lineage and dependencies between all the transformations our data goes through.

That’s the gist of dbt: Write SQL transformations in a well organized way, to keep your data processes well tested, documented, and versioned.

What’s new in dbt is that we can now write these transformations not only in SQL, but also in Python — while keeping everything else working as expected.

Our first dbt Python model

To show off how easy is to incorporate Python within a dbt pipeline, let’s change the above sf100_orders_annotated.sql into sf100_orders_annotated.py:

# sf100_orders_annotated.py
import snowflake.snowpark.functions as F

def model(dbt, session):
    df_sf100 = dbt.ref('sf100_orders')
    df_holidays = dbt.ref('all_holidays')

    joined_df = df_sf100.join(
        df_holidays, 
        df_sf100['o_orderdate'] == df_holidays['date'], 
        'left'
    )
    joined_df = joined_df.withColumn(
        'is_holiday',
        F.col('date').isNotNull()
    )

    result_df = joined_df.select(df_sf100['*'], 'is_holiday')
    return result_df

What’s interesting in this code:

We were able to replace a SQL dbt model with a model written in pure Python.
dbt still takes care of connecting to all dependencies by using references to them, like indbt.ref(‘sf100_orders’) anddbt.ref(‘all_holidays’).
It doesn’t matter if these dependencies are SQL or Python dbt models. They are just a source of data for our current model.
All we need to return at the end of this model is a dataframe. Then dbt takes care of materializing the dataframe into a table the other models can use — just like it did with the equivalent SQL model.
To produce the desired join between the tables we used familiar PySpark dataframe manipulation functions.
But this is not PySpark — these are Snowpark Python Dataframes, which bring us great advantages (that we will review next).

Showing off that doing this join with Dataframes is familiar to anyone that understands PySpark — even to 2021-bound ChatGPT.

Benchmarking this dbt Python model on Snowflake

The orders table has 150 million rows (straight out of TPC-H-SF100), and we are joining it with a holidays table that has 458 rows. The join should be fast, and then most of the time will be spent re-materializing the table into a new table with the same 150 million rows, now with the new is_holiday column.

With the classic dbt SQL model: 15 seconds, to join the 2 tables and re-materialize the large one.

With the dbt Python model: 15 seconds (+9s of Python code initialization).

This is a nice result: It means Snowflake was able to parallelize the Python code to transform and materialize the resulting table, at virtually the same speed than pure SQL.

How the dbt Python models work on Snowflake

What dbt does to make the Python models work in Snowflake is pretty cool: They take your model code and push it (mostly untouched) into Snowflake, wrapped in a Snowflake Python Stored Procedure. This wrapping code takes care of setting up the right libraries, materializing the resulting data frame, and executing the stored procedure.

You can find exactly how dbt pulls this off by looking into your Snowflake logs:

How dbt wraps and pushes your Python code into a Snowflake Python Stored Procedure

Setting this up is what took dbt and Snowflake around 9 seconds — but then to execute the join as defined in the Python code, it only took 15 seconds while re-materializing the 150m rows. What’s the secret sauce?

This is the “magic” that the Snowpark Python Dataframes can pull off: All the dataframe manipulation (that we coded in Python), gets translated by Snowflake into a SQL query. You can find the executed query in your Snowflake logs too:

The SQL code that Snowpark generated and executed out of our Python data frame transformations

This is really cool:

Your Python experts can keep writing Python, even within dbt pipelines.
Your Python code runs inside Snowflake, as dbt takes care of setting this up.
Snowpark translates the Python data frame manipulations into highly scalable and fast SQL code.

But then you might be wondering: What if you want to run custom Python code that can not be translated into SQL? Will that be fast too?

The answer is yes:

A dbt Python model with code that can’t be translated into SQL

On the next example I want to produce the same results as the previous Python model, but with more custom Python code — Instead of getting the Holidays out of a table join, we are going to use the Python holidays library:

# sf100_orders_annotated.py
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

import holidays


def model(dbt, session):
    dbt.config(
        packages = ["holidays"]
    )

    us_holidays = holidays.US(years=range(1990, 2030))

    @F.udf(input_types=[T.DateType()], return_type=T.BooleanType())
    def is_holiday(x):
        return x in us_holidays

    df = dbt.ref('sf100_orders')
    df = df.withColumn("IS_HOLIDAY", is_holiday("O_ORDERDATE"))

    return df

In the above we are using a Python library that’s already available in Snowflake thanks to Anaconda (holidays).

What’s interesting about it is that we are taking the 150m dbt.ref(‘sf100_orders’) table and adding a column to it, that contains the result of applying the Python function is_holiday(“O_ORDERDATE”). This is a function that we defined right there, with a particular decorator: @F.udf().

What the @F.udf() decorator does is interesting: It tells Snowflake to convert this custom Python function into a Snowflake Python UDF. Then Snowflake can transform the data frame operations into scalable SQL code — that happens to call the just created Python UDF. Like this, as seen on the Snowflake logs:

SQL code generated by Snowpark out of our Python code — including a call to the Python UDF we just defined

Benchmarking the call of a Python UDF

Running this transformation took 36 seconds — not too bad for re-materializing 150 million rows, and calling an arbitrary Python UDF 150 million times:

You can see in the screenshot that the Python code executed 150 million times, taking in total 324 seconds. This was much faster for us, as Snowflake took care of parallelizing these calls.

Notes on performance

Python will be slower than pure SQL — but how much slower will depend a lot on your code. Small optimizations can have a huge impact.

For example, compare this UDF implementation that initializes the Holidays dict outside the UDF:

    us_holidays = holidays.US(years=range(1990, 2030))

    @F.udf(input_types=[T.DateType()], return_type=T.BooleanType())
    def is_holiday(x):
        return x in us_holidays

Versus initializing the Holidays dict inside the UDF:

    @F.udf(input_types=[T.DateType()], return_type=T.BooleanType())
    def is_holiday(x):
        us_holidays = holidays.US(years=range(1990, 2030))
        return x in us_holidays

The first example took 38 seconds to run on a L-wh over 150m rows.

The second example took 3 hours and 38 minutes to run on a L-wh over the same 150m rows (1hr 52m on a XL-wh).

Running an efficient Python UDF vs an inefficient Python UDF can have huge consequences

That’s a huge difference. Moving the initialization outside our UDF saves us a lot of time. Having it inside the UDF makes the UDF run in 5.54ms. Maybe 5.4ms doesn’t sound like much, until you consider that 5.54ms times 150 million becomes 9.6 days of CPU power.

The good news, as seen above: If you take care of initialization outside the UDF, the Snowpark @F.udf() decorator takes care of moving these values from the context into the UDF.

Designing Python UDFs | Snowflake Documentation

This topic helps you design Python UDFs. Note Vectorized Python UDFs let you define Python functions that receive…

docs.snowflake.com

The possibilities are endless

Now you can take these building blocks and unleash your imagination: What could you do in your dbt pipelines, if you could write arbitrary Python code transformations within any arbitrary step?

dbt takes care of pushing your Python code to Snowflake
Snowflake takes care of parallelizing the data frame transformations
Snowflake and Anaconda have made it easy for you to use tons of Python libraries
You can bring your own Python libraries

Next steps

Try the dbt Python models in your Snowflake account, and share your results.
Participate in the dbt community to shape the future of the dbt Python models.
Read the dbt Python model docs, and the Snowflake Snowpark for Python library docs.
Share your finding withs the dbt community on the dbt Slack channels #dbt-core-python-models and #db-snowflake.
Try some dbt+Snowflake quickstarts like “Data Engineering with Snowpark Python and dbt” and “Leverage dbt Cloud to Generate ML ready pipelines using Snowpark Python”.
Compare the simplicity, power, and performance of the dbt Python models on Snowflake — versus the set up that dbt had to pull off to run Python models in other platforms.
Check the previous posts by Jeremiah Hansen and Eda Johnson, “Data Engineering with Snowpark Python and dbt” and “A First Look at the dbt Python Models with Snowpark”. Also phData’s “How to Use dbt With Snowpark Python to Implement Sentiment Analysis” and “How to Build a Python Model in dbt with Snowflake”.
For ML, dbt has an in-depth guide “Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake”, and tropos published “Time-Series Forecasting With Python For Snowpark And Dbt Labs”.

Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake…

Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake

… Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflakedocs.getdbt.com

Want more?

Try this out with a Snowflake free trial account — you only need an email address to get started.
Try dbt Cloud — which I used as a cool web IDE integrated to dbt and Snowflake to develop the examples in this post (it does a lot more than that too, but that’s a story for another post).

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.

Oh, and I’m on threads.net/fhoffa now too :)

Felipe Hoffa (@fhoffa) on Threads

Data is life. Developer Advocate at Snowflake ❄️🥑. 264 Followers.

threads.net