Data Engineering with Snowpark Python and dbt

Introduction

Data engineering is a broad discipline that includes data ingestion, data transformation, and data consumption, along with the accompanying SDLC best practices (i.e., DevOps). Data engineers employ different tools and approaches depending on the phase. For this blog post (and the related Quickstart), we will focus in particular on the data transformation phase.

Data transformation involves taking source data that has been ingested into your data platform and cleansing it, combining it, and modeling it for downstream use. Historically, the most popular way to transform data has been with SQL, and data engineers have built data transformation pipelines in SQL, often with the help of ETL/ELT tools. But recently, many folks have also begun adopting the DataFrame API in languages like Python for this task. For the most part, a data engineer can accomplish the same data transformations with either approach, and deciding between the two is mostly a matter of preference. That being said, there are use cases where a particular data transform can’t be expressed in SQL and a different approach is needed. The most popular approach for these use cases is Python along with a DataFrame API.

dbt

Enter dbt. dbt is one of the most popular data transformation tools today, and until now it has been an entirely SQL-based transformation tool. But with the release of dbt version 1.3, it’s now possible to create both SQL- and Python-based models in dbt! Here’s how dbt explains it:

dbt Python (“dbt-py”) models will help you solve use cases that can’t be solved with SQL. You can perform analyses using tools available in the open source Python ecosystem, including state-of-the-art packages for data science and statistics. Before, you would have needed separate infrastructure and orchestration to run Python transformations in production. By defining your Python transformations in dbt, they’re just models in your project, with all the same capabilities around testing, documentation, and lineage. (dbt Python models)

Snowflake

Python-based dbt models are made possible by Snowflake’s new native Python support and the Snowpark API for Python (Snowpark Python for short). Snowpark Python includes the following exciting capabilities:

  • Python (DataFrame) API
  • Python Scalar User-Defined Functions (UDFs)
  • Python UDF Batch API (Vectorized UDFs)
  • Python User-Defined Table Functions (UDTFs)
  • Python Stored Procedures
  • Integration with Anaconda

With Snowflake’s Snowpark Python capabilities, you no longer need to maintain, secure, and pay for separate infrastructure or services to run Python code; it can now run directly within Snowflake’s enterprise-grade data platform! For more details, check out the Snowpark Developer Guide for Python.
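To make those capabilities concrete, here’s a minimal standalone sketch, assuming a hypothetical ORDERS table with REGION and AMOUNT columns (the connection parameters are placeholders). It uses the DataFrame API together with a scalar Python UDF, all running inside Snowflake:

    from snowflake.snowpark import Session
    import snowflake.snowpark.functions as F
    from snowflake.snowpark.types import StringType

    # Hypothetical connection parameters; fill in your own account details
    connection_parameters = {
        "account": "<account_identifier>",
        "user": "<user>",
        "password": "<password>",
        "role": "<role>",
        "warehouse": "<warehouse>",
        "database": "<database>",
        "schema": "<schema>",
    }
    session = Session.builder.configs(connection_parameters).create()

    # Register a scalar Python UDF that executes inside Snowflake
    @F.udf(return_type=StringType(), input_types=[StringType()])
    def normalize_region(region: str) -> str:
        return region.strip().upper() if region else "UNKNOWN"

    # Combine the DataFrame API with the UDF; no data ever leaves Snowflake
    orders = session.table("ORDERS")
    result = (
        orders
        .with_column("REGION_CLEAN", normalize_region(F.col("REGION")))
        .group_by("REGION_CLEAN")
        .agg(F.sum("AMOUNT").alias("TOTAL_AMOUNT"))
    )
    result.show()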

Sneak Peek

So how do these new dbt Python models work? Imagine you had the following dbt SQL model:

Sample dbt SQL model
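A minimal sketch of what such a model might look like, assuming a hypothetical stg_orders staging model with customer_id and order_total columns:

    {{ config(materialized='table') }}

    select
        customer_id,
        count(*) as order_count,
        sum(order_total) as total_spent
    from {{ ref('stg_orders') }}
    group by customer_id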

Well, this is how we would rewrite that as a dbt Python model:

Sample dbt Python model
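Here’s a matching sketch of the equivalent Python model, again assuming the hypothetical stg_orders reference. dbt hands the function a Snowpark session, and dbt.ref() returns a Snowpark DataFrame:

    import snowflake.snowpark.functions as F

    def model(dbt, session):
        dbt.config(materialized="table")

        # dbt.ref() returns a Snowpark DataFrame when running on Snowflake
        orders_df = dbt.ref("stg_orders")

        # Same aggregation as the SQL model, expressed with the DataFrame API
        return (
            orders_df
            .group_by("customer_id")
            .agg(
                F.count("*").alias("order_count"),
                F.sum("order_total").alias("total_spent"),
            )
        )

Note that the function must be named model and must return a DataFrame (Snowpark or pandas); dbt takes care of materializing the result as a table in Snowflake.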

And that’s just a simple example: with dbt Python models you can create UDFs and import other Python libraries to power more advanced transformations.
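For instance, a dbt Python model can declare third-party packages from the Snowflake Anaconda channel via dbt.config(packages=[...]). Here’s a simple sketch (the stg_orders model and column names are again hypothetical) that uses the open source holidays package to flag orders placed on a US holiday, something that would be awkward to express in pure SQL:

    import holidays

    def model(dbt, session):
        # Packages are resolved from the Snowflake Anaconda channel
        dbt.config(materialized="table", packages=["holidays"])

        us_holidays = holidays.US(years=range(2020, 2024))

        # Convert to pandas to apply arbitrary Python logic per row
        df = dbt.ref("stg_orders").to_pandas()
        df["IS_HOLIDAY"] = df["ORDER_DATE"].apply(lambda d: d in us_holidays)

        # Returning a pandas DataFrame also works; dbt writes it back to Snowflake
        return df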

All the Details

Want to learn more? To get your hands on this exciting new combination of technologies, please check out my new Snowflake Quickstart, Data Engineering with Snowpark Python and dbt. That guide provides step-by-step instructions for getting started with Snowpark Python and dbt’s new Python-based models!

Have fun, and please share any cool examples/use cases you come up with!

--

Jeremiah Hansen
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

I’m currently a Field CTO Principal Architect at Snowflake. Opinions expressed are solely my own and do not represent the views or opinions of my employer.