Snowpark for data science notebooks
Data science at scale with Snowpark & Deepnote
Earlier this year, Snowflake introduced Snowpark — a library for processing data in your warehouse, with the data science languages, code, and scientific tools you already use.
We took Snowpark for a spin, writing an end-to-end data science workflow, and TL;DR — it was incredible. In this article, we’ll walk you through the highlights and show you what makes Snowpark such a valuable tool for data scientists.
💡 Want to run the code for yourself? The full notebook and sample data are available here.
🛠 What we used
In our tutorial, we used notebooks in Deepnote as a control plane for our Snowflake warehouse. Deepnote is a collaborative data workspace that gives your notebooks superpowers — it allows for real-time collaboration, integrates with the rest of your data stack, and enables rapid exploration and prototyping. This way, we benefit from the agility of Deepnote’s interface and the compute of Snowflake’s warehouse.
🧱 What we built
We built an end-to-end data science pipeline — including transformation, data exploration, model training and evaluation, and finally, model deployment. The data pipeline looks like this:
🎬 Let’s get started
1. Setting up Snowflake
In this section, you’ll see how Snowpark is used to connect to and populate your Snowflake warehouse. If you don’t have a Snowflake account, sign up for a free trial here.
To get started, we create the `session` object, which allows us to interact with the Snowflake warehouse via Snowpark’s Pandas-like API. Simply pass your connection details to the `configs` method to connect to your Snowflake instance.
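A minimal sketch of that setup might look like the following. All connection values below are placeholders for your own account details; `Session.builder.configs(...).create()` is Snowpark’s standard entry point.

```python
# Hypothetical connection details -- replace every value with your own.
connection_parameters = {
    "account":   "YOUR_ACCOUNT_IDENTIFIER",
    "user":      "YOUR_USERNAME",
    "password":  "YOUR_PASSWORD",
    "role":      "ACCOUNTADMIN",
    "warehouse": "COMPUTE_WH",
    "database":  "TELCO_DB",
    "schema":    "PUBLIC",
}

def create_session(params):
    """Build a Snowpark session from a dict of connection parameters."""
    # Import deferred so this sketch can be read/loaded without Snowpark installed.
    from snowflake.snowpark import Session
    return Session.builder.configs(params).create()

# session = create_session(connection_parameters)
```

With Deepnote’s native Snowflake integration, these values can come from stored secrets instead of being hard-coded in the notebook.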
🔒 Secrets are kept secret with Deepnote’s native Snowflake integration.
The notebook tutorial comes bundled with a sample parquet file containing 100k rows of telecommunications data. Using the `session` object from Snowpark, we can upload that file as a table in our warehouse. For example:
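One way to do this (a sketch, not the notebook’s exact code) is to read the parquet file with pandas and hand it to Snowpark’s `write_pandas`, which can create the table for us. The table name `TELCO_RAW` is our own placeholder.

```python
def upload_parquet_as_table(session, path, table_name="TELCO_RAW"):
    """Read a local parquet file and write it to Snowflake as a table.

    Assumes a live Snowpark `session`; `write_pandas` creates the
    destination table when `auto_create_table=True`.
    """
    import pandas as pd  # deferred import keeps the sketch self-contained

    df = pd.read_parquet(path)
    session.write_pandas(df, table_name, auto_create_table=True)
    return table_name

# upload_parquet_as_table(session, "telco_data.parquet")
```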
Let’s check out our new table by displaying it as an interactive Pandas DataFrame in Deepnote.
This is perfect. We’ve set the “stage”, so to speak. Now it’s time for data transformation as we work toward a dataset we can use for machine learning.
2. Transforming the data
We begin by transforming our data with Snowpark’s main abstraction — the DataFrame, which contains methods for operating on the data. Importantly, all transformations are computed in the warehouse itself, not in the notebook.
🐼 If you are familiar with Python’s Pandas, you’re going to feel right at home working with Snowpark’s DataFrame abstraction.
You could write pure SQL against the `session` object, but by using Snowpark’s DataFrame we get intelligent code completion, type checking, and a familiar Pandas-like API. At the same time, we don’t have to move our data out of the warehouse; all the heavy lifting is done in the Snowflake warehouse.
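As a small illustration (table and column names are placeholders, not the tutorial’s actual schema), transformations are built up lazily and only execute in the warehouse when an action such as `.show()` or `.collect()` is called:

```python
def avg_charges_by_contract(session):
    """Build a lazy Snowpark transformation: filter, group, aggregate.

    Nothing runs in the notebook itself -- calling .show() or .collect()
    on the returned DataFrame pushes the work down to Snowflake.
    """
    from snowflake.snowpark.functions import avg, col

    return (
        session.table("TELCO_RAW")               # placeholder raw table
        .filter(col("TENURE") > 12)              # customers past their first year
        .group_by("CONTRACT")
        .agg(avg(col("MONTHLYCHARGES")).alias("AVG_MONTHLY"))
    )

# avg_charges_by_contract(session).show()
```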
Below is an overview of the data pipeline and respective transformations. We start with our raw table (loaded from the parquet file), build up several transformations, and finally land on a table called `TELCO_DATASET`, which we can use to build a churn prediction model.
Here’s an example of how the `TELCO_DATASET` was built using the DataFrame’s `join` method from Snowpark 👇:
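The join might be sketched like this. The intermediate table names (`DEMOGRAPHICS`, `SERVICES`) and the selected columns are illustrative assumptions; the pattern of joining on `CUSTOMERID` and then re-selecting follows the walkthrough.

```python
def build_telco_dataset(session):
    """Join two intermediate tables on CUSTOMERID and persist the result.

    Table and column names here are assumed for illustration.
    """
    demo = session.table("DEMOGRAPHICS")
    svc = session.table("SERVICES")

    joined = demo.join(svc, demo["CUSTOMERID"] == svc["CUSTOMERID"])

    # The duplicate join key comes back with a random prefix; re-select
    # and alias it so the final table has a plain CUSTOMERID column again.
    dataset = joined.select(
        demo["CUSTOMERID"].alias("CUSTOMERID"),
        svc["CONTRACT"],
        svc["MONTHLYCHARGES"],
        demo["CHURNVALUE"],
    )
    dataset.write.save_as_table("TELCO_DATASET", mode="overwrite")
    return dataset
```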
👌 Snowpark disambiguates duplicate columns in the joined table by prepending a random string to the original column name. Chaining a new select statement gives us back a column simply called `CUSTOMERID`.
Now we have a table the data scientists can use for machine learning. Specifically, we’re going to use scikit-learn to build a churn prediction model based on the `TELCO_DATASET` table.
3. Exploratory data analysis (EDA)
Now that the transformations are complete, our freshly minted `TELCO_DATASET` can be used for model training and prediction. We will first do the requisite EDA to make sure our dataset is suitable for model training.
In Deepnote, a great deal of the exploratory work happens without writing any code, thanks to some built-in EDA magic 🦄.
- 👀 Rich dataframes: With Deepnote’s built-in DataFrame viewer, we can already examine the dataset for data types, missing values, percentage of given categories, distributions, and more.
- 🔠 Built-in filters: The built-in filters make it easy to get a handle on the data and the relationships that might drive our future model’s predictions.
- 📊 Native charts and visualizations: No-code charts let us examine our target variable as a function of other features. We can shift from code into visualizations, and right back into querying, seamlessly.
4. Model training
After doing the requisite EDA in Deepnote, we can train and evaluate the model. We applied a random forest classifier and took a look at the confusion matrix to evaluate the accuracy of the classifications.
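The training step can be sketched as follows. The synthetic features below stand in for the real columns pulled from `TELCO_DATASET`; the shape of the code (a scikit-learn `Pipeline` ending in a `RandomForestClassifier`, evaluated with a confusion matrix) mirrors the walkthrough.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the telco features and churn label.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy churn signal

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The preprocessing step travels with the model, so the same scaling is
# applied automatically at prediction time -- including later, inside a UDF.
model = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(X_train, y_train)

print(confusion_matrix(y_test, model.predict(X_test)))
```

Bundling preprocessing and classifier into one `Pipeline` object is what later lets us ship a single serialized model to the warehouse.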
After evaluating our model’s accuracy (see full notebook tutorial), we can now deploy this model directly on the Snowflake warehouse — without changing our code or package environment, and without moving our data 🚀.
😱 Thanks to scikit-learn’s pipeline, all the preprocessing we applied during training will also be applied to new predictions made in the warehouse.
5. Model deployment
Snowpark’s UDFs (user-defined functions) are functions that are defined in pure Python but run on Snowflake compute. This is a perfect fit for our newly created model. UDFs are decorated with `@udf` and require a few other details beyond a standard Python function. Read the Snowpark documentation to learn more.
Let’s use a Snowpark UDF to move our model onto the Snowflake server, where it can be used for predictions with incoming data.
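A registration sketch might look like this. The UDF name `predict_churn` comes from the walkthrough; the feature count, types, and package list are assumptions, and the trained `model` is the scikit-learn pipeline from the previous step, which Snowpark serializes and ships to the server with the function.

```python
def register_predict_churn(session, model):
    """Register a trained sklearn pipeline as a Snowpark UDF.

    The decorated function runs on Snowflake compute; packages listed
    here are resolved server-side. Feature names/types are illustrative.
    """
    # Imports deferred so this sketch can be read without Snowpark installed.
    from snowflake.snowpark.functions import udf
    from snowflake.snowpark.types import FloatType, IntegerType

    @udf(
        name="predict_churn",
        session=session,
        packages=["scikit-learn", "numpy"],
        input_types=[FloatType()] * 5,
        return_type=IntegerType(),
        replace=True,
    )
    def predict_churn(f1: float, f2: float, f3: float, f4: float, f5: float) -> int:
        # `model` is captured in the closure and pickled with the UDF.
        return int(model.predict([[f1, f2, f3, f4, f5]])[0])

    return predict_churn
```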
The only thing left to do is use the model to make some predictions. We’ll take a sample from the `TELCO_DATASET` table to use for predictions.
💡 Note that Snowpark’s UDFs are scalar functions, meaning that they return a result for every row in the results set of your query.
👆 The Snowpark `select` statement accomplished the following:

- It selected the `CUSTOMERID`, `CHURNVALUE`, and the output of the `predict_churn` function we wrote earlier (aliased to `PREDICTED_CHURN`).
- It saved the results set to a new table in our warehouse called `CHURN_DETECTION`.
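The scoring step described above might be sketched like this; the feature column names passed to the UDF are assumptions, while `predict_churn` and `CHURN_DETECTION` come from the walkthrough.

```python
def score_customers(session):
    """Call the registered predict_churn UDF over a sample of TELCO_DATASET
    and persist the scored rows as CHURN_DETECTION.

    Feature column names are assumed for illustration.
    """
    from snowflake.snowpark.functions import call_udf, col

    sample = session.table("TELCO_DATASET").sample(n=100)
    scored = sample.select(
        col("CUSTOMERID"),
        col("CHURNVALUE"),
        call_udf(
            "predict_churn",
            col("TENURE"), col("MONTHLYCHARGES"), col("TOTALCHARGES"),
            col("SENIORCITIZEN"), col("PARTNER"),
        ).alias("PREDICTED_CHURN"),
    )
    scored.write.save_as_table("CHURN_DETECTION", mode="overwrite")
    return scored
```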
Now we can query the new `CHURN_DETECTION` table with Deepnote’s native SQL blocks to check it out.
🪄 Et voilà, the `PREDICTED_CHURN` column shows the result of our model predictions alongside the customer ID and actual churn value — in a brand new table.
✅ Conclusion
With Deepnote and Snowpark, the entire database feels like an in-memory object, similar to the experience of using Pandas, but at a much larger scale. The transformation and scientific computation happen where the datasets live, without having to change tools, libraries, or languages.
Whether you’re a data scientist, analytics engineer, data engineer, developer, or analyst — if you’re using Snowpark, Deepnote will be a great addition to your toolbox. And if Python is not your jam, Snowpark also supports a growing number of programming languages, including Scala and Java.
Get Started with Deepnote & Snowflake
Sign up for Deepnote for free.