Snowflake Stored Procedures vs dbt Python Models
This article is part of a series titled “Weekly Pain Points”, in which I discuss challenges encountered in my job and the solutions I’ve found. These are not necessarily the only or the best solutions, but rather an opportunity to share the fruits of my reflections. I welcome other ideas, solutions, or comments to further enrich the analysis.
I recently faced the following question at my job: how to choose between Snowflake Stored Procedures and dbt Python models for deploying machine learning (ML) models on Snowflake?
In order to deploy ML models on Snowflake using Snowpark, two solutions were available to me:
- Snowflake Stored Procedures
- dbt Python Models
Snowpark is a development framework that lets you build and deploy complex data processing applications natively on Snowflake, without moving data out of the platform, using familiar programming languages (Java, Python, Scala).
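To give an idea of what this looks like in practice, here is a minimal Snowpark Python sketch; the connection parameters and the table name are placeholders, not part of my actual project. The DataFrame operations are translated to SQL and pushed down to Snowflake, so no data leaves the warehouse:

```python
# Minimal Snowpark sketch -- connection parameters and table names are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "RAW",
    "schema": "SALES",
}

# Create a session connected to Snowflake
session = Session.builder.configs(connection_parameters).create()

# The DataFrame API is translated to SQL and executed inside Snowflake
orders = (
    session.table("RAW.SALES.ORDERS")
    .filter(col("STATUS") == "SHIPPED")
    .group_by("PRODUCT_ID")
    .count()
)
orders.show()
```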
1. Snowflake Stored Procedures
Each of the two solutions had its advantages and disadvantages. Stored procedures are Snowflake’s native option for deploying Snowpark applications. Moreover, thanks to Snowflake’s Python Worksheets (Public Preview), you can easily write, test, and deploy your code directly from the Snowsight console.
However, one challenge I faced with this method is the difficulty of isolating and testing the intermediate steps of the machine learning workflow. Before training and running an ML model, various data preparation and transformation steps are required. With stored procedures, these steps tend to be embedded directly in the model code, creating a black box and increasing project complexity.
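To make this concrete, here is a hedged sketch of what such a handler tends to look like, written in the Python Worksheet style; the table name, columns, and model are illustrative only, not my actual code. Preparation, feature engineering, and training all live inside one handler, which is exactly what makes the intermediate steps hard to isolate and test:

```python
# Illustrative sketch of a monolithic Snowpark stored procedure handler
# (Python Worksheet style). Table names, columns, and the model are placeholders;
# scikit-learn would need to be added to the procedure's packages.
import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col


def main(session: snowpark.Session) -> str:
    # Step 1: data preparation -- buried inside the procedure
    raw = session.table("ANALYTICS.STOCK.DAILY_SALES").filter(col("QTY").is_not_null())

    # Step 2: feature engineering -- also buried, hard to test in isolation
    features = raw.to_pandas()
    features["QTY_LAG_7"] = features.groupby("PRODUCT_ID")["QTY"].shift(7)
    features = features.dropna()

    # Step 3: training -- runs inside the same procedure
    from sklearn.linear_model import LinearRegression
    model = LinearRegression().fit(features[["QTY_LAG_7"]], features["QTY"])

    # Only the final return value is visible from outside the procedure
    return f"trained on {len(features)} rows"
```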
Additionally, in a project like mine, where data preparation and transformation are handled in dbt, moving some of those transformations into stored procedures outside the dbt project adds complexity and makes the pipeline harder to maintain.
For this first option, I arrived at the following conclusions:
Pros:
- Native solution for deploying Snowpark code.
- Ease of development and debugging thanks to Snowflake’s Python Worksheets (Public Preview).
Cons:
- Lack of modularity and visibility on the intermediate steps of the stored procedure.
- Silos between the data preparation and transformation pipelines (in dbt) and the pipelines executing the ML models (in stored procedures).
2. dbt Python Models
Contrary to stored procedures, dbt models are not native objects of Snowflake.
dbt (data build tool) is an open-source data transformation framework, written in Python, that runs transformations directly inside a data warehouse such as Snowflake, Redshift, or BigQuery, as well as PostgreSQL.
It’s important to note that dbt Python models are transformed into stored procedures before execution on Snowflake. Thus, the final code executed by these dbt Python models on Snowflake also takes the form of stored procedures.
However, the aim here was to highlight the differences between Snowflake’s native approach and what might be called the “customized” approach offered by dbt.
The major advantage of dbt lies in the modularity of its models. Each transformation step is encapsulated in a dedicated model, making it easier to document and test independently. Moreover, dbt’s ‘ref()’ function establishes a coherent and easy-to-follow execution flow, effectively linking all models, whether SQL or Python.
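For illustration, here is a hedged sketch of a dbt Python model; the upstream model name and the feature logic are placeholders. dbt injects the Snowpark session and ‘dbt.ref()’ resolves the upstream model, so the step fits naturally into the dbt DAG:

```python
# models/ml/stg_sales_features.py
# Illustrative dbt Python model -- upstream model name and logic are placeholders.
import snowflake.snowpark.functions as F


def model(dbt, session):
    # Materialize the result as a table on Snowflake
    dbt.config(materialized="table")

    # ref() resolves the upstream dbt model (SQL or Python) into a Snowpark DataFrame
    sales = dbt.ref("stg_daily_sales")

    # A small, independently testable transformation step
    features = sales.group_by("PRODUCT_ID").agg(
        F.avg("QTY").alias("AVG_QTY"),
        F.max("QTY").alias("MAX_QTY"),
    )

    # The returned DataFrame is what dbt writes to the warehouse
    return features
```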
The second notable advantage of using dbt Python models lies in improving project collaboration and maintenance. The ability to write and execute Snowpark code directly from dbt facilitates the unification of data transformation pipelines and ML model execution processes within a single project.
However, this method has constraints, particularly in terms of Python code development and debugging. Unlike dbt’s SQL models, which can be written and tested step by step in a code editor, Python models do not allow for dynamic testing or the use of functions such as ‘print()’ to display intermediate results. Each Python model must be compiled before execution on Snowflake, and only the final result is available.
I have summarized the second option as follows:
Pros:
- Modularity, documentation, testing of each step of SQL and Python models.
- Unification and simplification of the data preparation, transformation, model training, and model execution pipelines.
Cons:
- Necessity to integrate dbt into the technical stack.
- Need to build expertise in dbt.
- Difficulty in debugging Python models (Python Worksheets can be used as a complement).
Conclusion
Snowflake is one of the best Cloud Data platforms available. The ease with which one can create, manage, and scale computing clusters (virtual warehouses) allows data engineering teams to focus on their core business and forget about infrastructure maintenance concerns.
However, although Snowflake aspires to become an all-in-one solution, as evidenced by recent developments such as Snowpark, Python Worksheets, Streamlit, and advancements in the DevOps field on Snowflake, I am convinced that Snowflake integrates best in an ecosystem including tools like dbt. In my opinion, dbt offers an optimal experience in terms of developing and deploying reliable and robust data transformation pipelines on Snowflake.
Considering the elements presented in this article, I ultimately opted for the second option and deployed a stock forecasting model on Snowflake 📈
The opinions expressed in this article are strictly my own and do not engage anyone else.