How Snowflake makes the dbt Python models shine

Now that dbt has made Python a seamless option within your SQL pipelines, it’s super cool to see how they made this possible with Snowflake. That’s what this post is about. Are you ready to try this out?

Felipe Hoffa
Snowflake

--

Image generated by AI

What is dbt

Before we start, let me explain dbt — especially for data engineers that still have not worked with it: “dbt” is a tool that helps us manage all the transformations we do with SQL to data in our databases.

For example, let’s say we have a table with all our customer orders, and we want to keep a count of how many of these orders happened during on a Holiday. We could solve this with 2 queries in SQL:

  1. Add a column is_holiday to the table sf100_orders, by joining it to a table with all_holidays. Call the resulting table sf100_orders_annotated.
  2. Count the rows that happen to have is_holiday=true on the table sf100_orders_annotated.

These 2 steps that combine the source data, as visualized by dbt:

Lineage graph — from the 2 source tables a table with a count of the Holidays.

We can use dbt to write these 2 transformations as “dbt models”, which are files that contain SQL and a little bit of dbt configuration (when necessary):

-- sf100_orders_annotated.sql
{{ config(materialized='table') }}

select a.*, b.date is not null is_holiday
from {{ref('sf100_orders')}} a
left join {{ref('all_holidays')}} b
on a.o_orderdate = b.date
-- count_holidays.sql
{{config(materialized='table')}}

select count(*) total_days
, count_if(is_holiday) holiday_count
, holiday_count/total_days ratio
from {{ref('sf100_orders_annotated')}}

These 2 files look like typical SQL — with a twist. Instead of having references to tables, dbt allows us to refer to other “models” with the use of annotations{{ref(‘sf100_orders_annotated’)}} . This is how dbt determines the lineage and dependencies between all the transformations our data goes through.

That’s the gist of dbt: Write SQL transformations in a well organized way, to keep your data processes well tested, documented, and versioned.

What’s new in dbt is that we can now write these transformations not only in SQL, but also in Python — while keeping everything else working as expected.

Our first dbt Python model

To show off how easy is to incorporate Python within a dbt pipeline, let’s change the above sf100_orders_annotated.sql into sf100_orders_annotated.py:

# sf100_orders_annotated.py
import snowflake.snowpark.functions as F

def model(dbt, session):
df_sf100 = dbt.ref('sf100_orders')
df_holidays = dbt.ref('all_holidays')

joined_df = df_sf100.join(
df_holidays,
df_sf100['o_orderdate'] == df_holidays['date'],
'left'
)
joined_df = joined_df.withColumn(
'is_holiday',
F.col('date').isNotNull()
)

result_df = joined_df.select(df_sf100['*'], 'is_holiday')
return result_df

What’s interesting in this code:

  • We were able to replace a SQL dbt model with a model written in pure Python.
  • dbt still takes care of connecting to all dependencies by using references to them, like indbt.ref(‘sf100_orders’) anddbt.ref(‘all_holidays’).
  • It doesn’t matter if these dependencies are SQL or Python dbt models. They are just a source of data for our current model.
  • All we need to return at the end of this model is a dataframe. Then dbt takes care of materializing the dataframe into a table the other models can use — just like it did with the equivalent SQL model.
  • To produce the desired join between the tables we used familiar PySpark dataframe manipulation functions.
  • But this is not PySpark — these are Snowpark Python Dataframes, which bring us great advantages (that we will review next).
Showing off that doing this join with Dataframes is familiar to anyone that understands PySpark — even to 2021-bound ChatGPT.

Benchmarking this dbt Python model on Snowflake

The orders table has 150 million rows (straight out of TPC-H-SF100), and we are joining it with a holidays table that has 458 rows. The join should be fast, and then most of the time will be spent re-materializing the table into a new table with the same 150 million rows, now with the new is_holiday column.

With the classic dbt SQL model: 15 seconds, to join the 2 tables and re-materialize the large one.

With the dbt Python model: 15 seconds (+9s of Python code initialization).

This is a nice result: It means Snowflake was able to parallelize the Python code to transform and materialize the resulting table, at virtually the same speed than pure SQL.

How the dbt Python models work on Snowflake

What dbt does to make the Python models work in Snowflake is pretty cool: They take your model code and push it (mostly untouched) into Snowflake, wrapped in a Snowflake Python Stored Procedure. This wrapping code takes care of setting up the right libraries, materializing the resulting data frame, and executing the stored procedure.

You can find exactly how dbt pulls this off by looking into your Snowflake logs:

How dbt wraps and pushes your Python code into a Snowflake Python Stored Procedure

Setting this up is what took dbt and Snowflake around 9 seconds — but then to execute the join as defined in the Python code, it only took 15 seconds while re-materializing the 150m rows. What’s the secret sauce?

This is the “magic” that the Snowpark Python Dataframes can pull off: All the dataframe manipulation (that we coded in Python), gets translated by Snowflake into a SQL query. You can find the executed query in your Snowflake logs too:

The SQL code that Snowpark generated and executed out of our Python data frame transformations

This is really cool:

  • Your Python experts can keep writing Python, even within dbt pipelines.
  • Your Python code runs inside Snowflake, as dbt takes care of setting this up.
  • Snowpark translates the Python data frame manipulations into highly scalable and fast SQL code.

But then you might be wondering: What if you want to run custom Python code that can not be translated into SQL? Will that be fast too?

The answer is yes:

A dbt Python model with code that can’t be translated into SQL

On the next example I want to produce the same results as the previous Python model, but with more custom Python code — Instead of getting the Holidays out of a table join, we are going to use the Python holidays library:

# sf100_orders_annotated.py
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

import holidays


def model(dbt, session):
dbt.config(
packages = ["holidays"]
)

us_holidays = holidays.US(years=range(1990, 2030))

@F.udf(input_types=[T.DateType()], return_type=T.BooleanType())
def is_holiday(x):
return x in us_holidays

df = dbt.ref('sf100_orders')
df = df.withColumn("IS_HOLIDAY", is_holiday("O_ORDERDATE"))

return df

In the above we are using a Python library that’s already available in Snowflake thanks to Anaconda (holidays).

What’s interesting about it is that we are taking the 150m dbt.ref(‘sf100_orders’) table and adding a column to it, that contains the result of applying the Python function is_holiday(“O_ORDERDATE”). This is a function that we defined right there, with a particular decorator: @F.udf().

What the @F.udf() decorator does is interesting: It tells Snowflake to convert this custom Python function into a Snowflake Python UDF. Then Snowflake can transform the data frame operations into scalable SQL code — that happens to call the just created Python UDF. Like this, as seen on the Snowflake logs:

SQL code generated by Snowpark out of our Python code — including a call to the Python UDF we just defined

Benchmarking the call of a Python UDF

Running this transformation took 36 seconds — not too bad for re-materializing 150 million rows, and calling an arbitrary Python UDF 150 million times:

Query stats

You can see in the screenshot that the Python code executed 150 million times, taking in total 324 seconds. This was much faster for us, as Snowflake took care of parallelizing these calls.

Notes on performance

Python will be slower than pure SQL — but how much slower will depend a lot on your code. Small optimizations can have a huge impact.

For example, compare this UDF implementation that initializes the Holidays dict outside the UDF:

    us_holidays = holidays.US(years=range(1990, 2030))

@F.udf(input_types=[T.DateType()], return_type=T.BooleanType())
def is_holiday(x):
return x in us_holidays

Versus initializing the Holidays dict inside the UDF:

    @F.udf(input_types=[T.DateType()], return_type=T.BooleanType())
def is_holiday(x):
us_holidays = holidays.US(years=range(1990, 2030))
return x in us_holidays

The first example took 38 seconds to run on a L-wh over 150m rows.

The second example took 3 hours and 38 minutes to run on a L-wh over the same 150m rows (1hr 52m on a XL-wh).

Running an efficient Python UDF vs an inefficient Python UDF can have huge consequences

That’s a huge difference. Moving the initialization outside our UDF saves us a lot of time. Having it inside the UDF makes the UDF run in 5.54ms. Maybe 5.4ms doesn’t sound like much, until you consider that 5.54ms times 150 million becomes 9.6 days of CPU power.

The good news, as seen above: If you take care of initialization outside the UDF, the Snowpark @F.udf() decorator takes care of moving these values from the context into the UDF.

The possibilities are endless

Now you can take these building blocks and unleash your imagination: What could you do in your dbt pipelines, if you could write arbitrary Python code transformations within any arbitrary step?

  • dbt takes care of pushing your Python code to Snowflake
  • Snowflake takes care of parallelizing the data frame transformations
  • Snowflake and Anaconda have made it easy for you to use tons of Python libraries
  • You can bring your own Python libraries
Image generated by AI

Next steps

Want more?

  • Try this out with a Snowflake free trial account — you only need an email address to get started.
  • Try dbt Cloud — which I used as a cool web IDE integrated to dbt and Snowflake to develop the examples in this post (it does a lot more than that too, but that’s a story for another post).

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.

Oh, and I’m on threads.net/fhoffa now too :)

--

--

Felipe Hoffa
Snowflake

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.