A Subscriber Predictive Analytics demo with dbt Snowpark Python models

Date: November 2022

Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.

As of November 2022, Snowpark is officially Generally Available and Snowpark-optimized warehouses are in Public Preview. And of course, during the week of Coalesce, dbt Labs announced support for dbt Python models with the dbt Core v1.3 release in October 2022. (Please check out my first experimentation with dbt Python models in this post.)

I am extremely grateful that I was a part of the wonderful event, Coalesce, this year. dbt Labs has done an incredible job organizing this event and providing an amazing environment for collaboration and support for the wonderful dbt Community.

Empowering Pythonistas with dbt and Snowpark — Coalesce 2022

As online speakers, Venkat Sekar and I had the privilege of presenting the ‘Empowering Pythonistas with dbt and Snowpark’ session (recording posted above; the slide deck is here), where we walked through a Media and Entertainment industry solution demo built with the new dbt Python models and Snowpark. Even though most of the features we used with dbt and Snowpark are brand new, our goal was to show the art of the possible with dbt + Snowpark Python models and explore how we can operationalize a basic ML pipeline within a dbt DAG. Hence, please consider the demo we developed for this session a starting point, and think through better practices when implementing similar scenarios for your own use case. (Please see the dbt Labs GitHub discussion to contribute ideas around dbt Python model best practices.)

As I promised our audience in our session's Coalesce Slack channel, I would like to share some highlights of the demo we developed, along with the code.

Demo Highlights:

  • The use case for the demo is churn prediction for a media company by analyzing subscriber data (a combination of generated datasets as well as the Merkle Datasource dataset from the Snowflake Marketplace). This data makes up the three source tables in our dbt DAG.
  • A setup script first creates the required Snowflake objects, such as databases and warehouses, for our dbt project. We used dbt Cloud to create a dbt project that mixes SQL and Python models in the same DAG. The SQL models come first: staging models that join the source tables and apply light transformations. Our goal was to leverage Python models for the feature engineering steps (e.g. one-hot encoding), model training, and batch inference in the dbt DAG.
dbt Lineage Graph
  • With dbt Python models on Snowpark, the model's Python code is automatically wrapped in a Python stored procedure and executed in Snowflake. The current default for dbt Snowpark models is permanent stored procedures: dbt automatically creates a stored procedure in your database for each Python model. Alternatively, you can configure your dbt project to use anonymous stored procedures for all Python models in the project.

We went with the default option first, but decided to switch to anonymous stored procedures and configured this in our dbt_project.yml. (The execution time with anonymous stored procs was slightly faster, too!)
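To make that concrete, a dbt Python model is just a .py file in your models directory that defines a model(dbt, session) function and returns a DataFrame; dbt takes care of wrapping it in the stored procedure and materializing the result. Below is a minimal sketch of the shape of such a model (the model and column names are placeholders, not the actual demo code):

```python
# models/staging/stg_user_subscriptions.py -- hypothetical example, not from the demo repo
import snowflake.snowpark.functions as F


def model(dbt, session):
    # dbt-level configuration for this model (materialization, packages, etc.)
    dbt.config(materialized="table")

    # dbt.ref() / dbt.source() return Snowpark DataFrames for upstream relations
    users = dbt.ref("stg_users")                  # hypothetical upstream model
    subscriptions = dbt.ref("stg_subscriptions")  # hypothetical upstream model

    # Light transformations with the Snowpark DataFrame API
    joined = users.join(subscriptions, "USER_ID").with_column(
        "IS_ACTIVE", F.col("SUBSCRIPTION_STATUS") == F.lit("ACTIVE")
    )

    # Whatever DataFrame is returned is what dbt materializes for this model
    return joined
```

The permanent-versus-anonymous choice itself does not change the model code; per the dbt-snowflake docs it is a project-level configuration (use_anonymous_sproc: True under the models config in dbt_project.yml).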

  • To determine the feature engineering steps, we started with Exploratory Data Analysis (EDA) in a Hex notebook. Hex is an excellent notebook interface (the best I have ever seen!) that makes collaboration very easy, and it has great integrations with Snowflake/Snowpark as well as dbt Cloud.
  • We also used the open source AutoML Python package AutoGluon in another Hex notebook to compare different models and confirm that XGBoost was the best model for predicting churn with our data (a quick sketch of this step is shown below).
  • In our first Python model, intermediate/core/int_user_subscription_agg.py, we demonstrated how to use a Python package (ftfy), which is already available in the Snowflake Anaconda channel, to clean up bad characters in a source table that had 795 million records. We created an anonymous Python UDF with the Batch API to call the fix_text() function from this package (see the sketch below).
  • The second Python model, ml/features/int_user_encoded.py, applies one-hot encoding to 4 categorical columns that are promising features for predicting churn ('CORE_MARITAL_STATUS', 'CORE_PRESENCE_OF_CHILDREN', 'CORE_HOMEOWNER', 'HAS_HAD_TRIAL'). With this one, we wanted to demonstrate how to use a custom Python package, sp4py_utilities, which provides scikit-learn-style preprocessing functionality for Snowpark Python DataFrames. To use this package in a dbt Python model, we had to save it as a .zip file to a Snowflake stage (e.g. @lib_stg/libs/sp4py_utilities.zip) beforehand. (Please note that this step needs to be done outside of dbt at this time.) A simplified sketch of this model is shown below.
  • The third Python model in our DAG, ml/features/int_train_test.py, demonstrates how we can train an ML model in a dbt DAG. This is obviously a forward-looking capability at this time, but definitely a good starting point. We are using a Snowpark-optimized warehouse in this model, as we are training on 100K records using scikit-learn. The model generates a model artifact and saves it as a joblib file to a Snowflake stage. We are also capturing some model training metadata leveraging Snowflake Python logging (still in Private Preview as of Nov 2022). The model ultimately returns a DataFrame of train and test records with a new column identifying each split (sketched below).

As we mentioned in our session, we believe there is a lot more to do in this model for a real-life model training scenario, incorporating MLOps practices such as data/model versioning, measuring data/model drift, and capturing other Ops metrics. Stay tuned for more improvements in the coming months to address these…
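Before moving on, here are a few minimal sketches of the steps above, in the order they appear in the list; any table, column, stage, or file names that are not mentioned in this post are placeholders. First, the AutoGluon comparison from the Hex notebook roughly looks like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularPredictor

# Placeholder: in Hex this DataFrame would come from a Snowflake/Snowpark query cell
df = pd.read_parquet("user_features_sample.parquet")
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

# AutoGluon trains and compares a suite of models for the label column ("CHURNED" is a placeholder name)
predictor = TabularPredictor(label="CHURNED").fit(train_data)

# The leaderboard ranks the candidate models on the held-out data
print(predictor.leaderboard(test_data))
```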
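For the first Python model (intermediate/core/int_user_subscription_agg.py), the key idea is putting ftfy.fix_text() behind a vectorized (Batch API) UDF so it operates on pandas batches rather than row by row. A minimal sketch, with hypothetical upstream model and column names:

```python
from snowflake.snowpark.functions import pandas_udf, col
from snowflake.snowpark.types import StringType, PandasSeriesType, PandasDataFrameType


def model(dbt, session):
    dbt.config(
        materialized="table",
        packages=["ftfy", "pandas"],
    )

    subs = dbt.ref("stg_user_subscriptions")  # hypothetical upstream model

    # Batch API: the handler receives a pandas DataFrame per batch and returns a pandas Series
    @pandas_udf(
        input_types=[PandasDataFrameType([StringType()])],
        return_type=PandasSeriesType(StringType()),
        packages=["ftfy", "pandas"],
    )
    def fix_text_batch(batch):
        import ftfy  # resolved from the UDF's packages at execution time
        return batch[0].apply(ftfy.fix_text)

    # Add a cleaned-up version of a text column (column name is a placeholder)
    return subs.with_column("PROGRAM_NAME_CLEAN", fix_text_batch(col("PROGRAM_NAME")))
```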
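For the second Python model (ml/features/int_user_encoded.py), I am not reproducing the sp4py_utilities API here; the sketch below deliberately swaps in plain pandas.get_dummies() just to show the shape of such a model, and omits the wiring of the staged .zip package:

```python
import pandas as pd


def model(dbt, session):
    dbt.config(
        materialized="table",
        packages=["pandas"],
    )

    # Assumed upstream dependency in the DAG
    df = dbt.ref("int_user_subscription_agg").to_pandas()

    categorical_cols = [
        "CORE_MARITAL_STATUS",
        "CORE_PRESENCE_OF_CHILDREN",
        "CORE_HOMEOWNER",
        "HAS_HAD_TRIAL",
    ]

    # Stand-in for the sp4py_utilities one-hot encoder used in the demo
    encoded = pd.get_dummies(df, columns=categorical_cols)

    # dbt can materialize a returned pandas DataFrame directly
    return encoded
```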
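And for the training model (ml/features/int_train_test.py), a heavily simplified sketch of the pattern: train on a Snowpark-optimized warehouse, persist the artifact to a stage with joblib, and return the train/test split as the model's output. The warehouse name, stage name, and label column are assumptions on my part, and the Python logging piece is omitted:

```python
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier


def model(dbt, session):
    dbt.config(
        materialized="table",
        packages=["scikit-learn", "xgboost", "joblib", "pandas"],
        snowflake_warehouse="SNOWPARK_OPT_WH",  # hypothetical Snowpark-optimized warehouse
    )

    df = dbt.ref("int_user_encoded").to_pandas()

    label_col = "CHURNED"  # hypothetical binary (0/1) label column
    feature_cols = [c for c in df.columns if c not in ("USER_ID", label_col)]

    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

    clf = XGBClassifier()
    clf.fit(train_df[feature_cols].to_numpy(), train_df[label_col])

    # Persist the trained model to a Snowflake stage as a joblib artifact
    local_path = "/tmp/churn_model.joblib"
    joblib.dump(clf, local_path)
    session.file.put(local_path, "@ml_models_stg", auto_compress=False, overwrite=True)  # hypothetical stage

    # Return train + test records with a column identifying the split
    return pd.concat(
        [train_df.assign(SPLIT="TRAIN"), test_df.assign(SPLIT="TEST")]
    )
```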

  • Finally, the marts/predict/user_churn_predict_xgboost.py model is a good example of how to create a Python UDF with the Batch API for batch prediction using the model artifact created by the int_train_test model. Note that the predict_churn function is registered as a vectorized/batch UDF using the Batch API. (This is the recommended way of creating UDFs for ML batch prediction.) Ultimately, this dbt model returns a new DataFrame with a prediction column indicating whether a user will churn according to our ML model. The table materialization is intended to be consumed by a BI dashboard. A sketch of this model follows below.

We chose this approach for predictions in our demo; however, there are certainly other ways to create a prediction UDF in a dbt DAG like this. One alternative is creating a permanent UDF leveraging a dbt macro, as described here, and then using a SQL model in the dbt DAG that calls the predict() Python UDF to materialize a BI model with predictions.
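As a sketch of the approach we took in user_churn_predict_xgboost.py: register predict_churn as a vectorized UDF with the Batch API, attach the staged joblib artifact through the UDF's imports, and add the prediction column to the output DataFrame. The stage path, artifact file name, and feature columns below are assumptions:

```python
import pandas as pd
from snowflake.snowpark.functions import pandas_udf, col
from snowflake.snowpark.types import FloatType, IntegerType, PandasSeriesType, PandasDataFrameType


def model(dbt, session):
    dbt.config(
        materialized="table",
        packages=["xgboost", "scikit-learn", "joblib", "pandas"],
    )

    users = dbt.ref("int_user_encoded")
    feature_cols = ["CORE_HOMEOWNER_Y", "HAS_HAD_TRIAL_TRUE"]  # placeholder encoded feature columns

    # Vectorized/batch UDF: each call receives a pandas DataFrame with one column per feature
    @pandas_udf(
        input_types=[PandasDataFrameType([FloatType()] * len(feature_cols))],
        return_type=PandasSeriesType(IntegerType()),
        imports=["@ml_models_stg/churn_model.joblib"],  # hypothetical staged artifact
        packages=["xgboost", "scikit-learn", "joblib", "pandas"],
    )
    def predict_churn(features):
        import sys
        import joblib

        # Staged imports are unpacked into the UDF's import directory
        import_dir = sys._xoptions["snowflake_import_directory"]
        clf = joblib.load(import_dir + "churn_model.joblib")
        return pd.Series(clf.predict(features.to_numpy())).astype(int)

    return users.with_column(
        "CHURN_PREDICTION", predict_churn(*[col(c) for c in feature_cols])
    )
```

In a real implementation you would likely cache the loaded model across batches rather than calling joblib.load on every invocation.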

That's it! These approaches will certainly evolve over time, as we are all just scratching the surface when it comes to incorporating ML practices into dbt Snowpark Python models.

I would love to hear your thoughts about some of these approaches and how you use dbt Snowpark Python models in your use case. Please do not hesitate to ping me on Slack or GitHub.

Thank you for your interest in our Coalesce 2022 session.

A special thanks to the wonderful people from dbt Labs and Snowflake (Amy Chen, Doug Beatty, Cody Peterson, Sri Chintala, Venkat Sekar and many others) for their collaboration and support.


Eda Johnson