Snowpark for data science notebooks

Data science at scale with Snowpark & Deepnote

Data pipeline and DAG. Image by author.

Earlier this year, Snowflake introduced Snowpark — a library for processing data in your warehouse, with the data science languages, code, and scientific tools you already use.

We took Snowpark for a spin, writing an end-to-end data science workflow, and TL;DR — it was incredible. In this article, we’ll walk you through the highlights and show you what makes Snowpark such a valuable tool for data scientists.

💡 Want to run the code for yourself? The full notebook and sample data are available here.

🛠 What we used

In our tutorial, we used notebooks in Deepnote as a control plane for our Snowflake warehouse. Deepnote is a collaborative data workspace that gives your notebooks superpowers — it allows for real-time collaboration, integrates with the rest of your data stack, and enables rapid exploration and prototyping. This way, we benefit from the agility of Deepnote’s interface and the compute power of Snowflake’s warehouse.

🧱 What we built

We built an end-to-end data science pipeline — including transformation, data exploration, model training and evaluation, and finally, model deployment. The data pipeline looks like this:

Data pipeline. Image by author.

🎬 Let’s get started

1. Setting up Snowflake

In this section, you’ll see how Snowpark is used to connect to and populate your Snowflake warehouse. If you don’t have a Snowflake account, sign up for a free trial here.

To get started, we create the session object, which lets us interact with the Snowflake warehouse via its Pandas-like API. Simply pass your connection details to the configs method to connect to your Snowflake instance.

🔒 Secrets are kept secret with Deepnote’s native Snowflake integration.
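Here’s a minimal sketch of that connection step, assuming the credentials are exposed to the notebook as environment variables (the variable names, warehouse, database, and schema below are illustrative):

```python
import os

from snowflake.snowpark import Session

# Connection details; with Deepnote's Snowflake integration these come from
# environment variables rather than being hard-coded in the notebook.
connection_parameters = {
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_PASSWORD"],
    "warehouse": "COMPUTE_WH",  # illustrative names
    "database": "TELCO",
    "schema": "PUBLIC",
}

# Create the session that all subsequent Snowpark calls will go through.
session = Session.builder.configs(connection_parameters).create()
```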

The notebook tutorial comes bundled with a sample parquet file containing 100k rows of telecommunications data. Using the session object from Snowpark, we can upload that file as a table in our warehouse. For example:

Uploading a parquet file to Snowflake
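A sketch of one way to do this, assuming the sample file sits next to the notebook (the file name and target table name are illustrative):

```python
import pandas as pd

# Read the bundled sample file locally, then push it to the warehouse.
raw = pd.read_parquet("telco_data.parquet")

# auto_create_table infers the schema and creates the table if it doesn't exist.
session.write_pandas(raw, "TELCO_RAW", auto_create_table=True)
```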

Let’s check out our new table by displaying it as an interactive Pandas DataFrame in Deepnote.

Snowflake table returned as an interactive Pandas DataFrame
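For example, pulling a small preview back into the notebook (using the illustrative table name from the upload step):

```python
# Fetch a preview of the new table as a Pandas DataFrame for inspection.
session.table("TELCO_RAW").limit(100).to_pandas()
```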

This is perfect. We’ve set the “stage,” so to speak. Now it’s time for data transformation as we work toward a dataset we can use for machine learning.

2. Transforming the data

We begin by transforming our data with Snowpark’s main abstraction — the DataFrame, which contains methods for operating on the data. Importantly, all transformations are computed in the warehouse itself, not in the notebook.

🐼 If you are familiar with Python’s Pandas, you’re going to feel right at home working with Snowpark’s DataFrame abstraction.

You could write pure SQL against the session object, but by using Snowpark’s DataFrame we get intelligent code completion, type checking, and a familiar Pandas-like API. At the same time, we never have to move our data out of the warehouse; all the heavy lifting is done in Snowflake.

Below is an overview of the data pipeline and respective transformations. We start with our raw table (loaded from the parquet file), build up several transformations, and finally land on a table called TELCO_DATASET which we can use to build a churn prediction model.

Data pipeline and DAG. Image by author.

Here’s an example of how the TELCO_DATASET was built using the DataFrame’s join method from Snowpark 👇:

👌 Snowpark disambiguates duplicate columns in the joined table by prepending a random string to the original column name. Chaining a new select statement gives us back a column simply called CUSTOMERID.

Inner join of the DEMOGRAPHICS and SERVICES table with Snowpark
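A sketch of what that join can look like, assuming both DEMOGRAPHICS and SERVICES carry a CUSTOMERID column (only that column is spelled out here; the remaining selects are elided):

```python
demographics = session.table("DEMOGRAPHICS")
services = session.table("SERVICES")

# Inner join on customer ID; Snowpark renames the duplicated join column,
# so we select it back out under a single CUSTOMERID alias.
telco = demographics.join(
    services,
    demographics["CUSTOMERID"] == services["CUSTOMERID"],
    "inner",
).select(
    demographics["CUSTOMERID"].alias("CUSTOMERID"),
    # ...the remaining DEMOGRAPHICS and SERVICES columns go here
)

# Persist the joined result as the table used for model training.
telco.write.save_as_table("TELCO_DATASET", mode="overwrite")
```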

Now we have a table that data scientists can use for machine learning. Specifically, we’re going to use scikit-learn to build a churn prediction model based on the TELCO_DATASET table.

3. Exploratory data analysis (EDA)

Now that the transformations are complete, our freshly minted TELCO_DATASET can be used for model training and prediction. We will first do the requisite EDA to make sure our dataset is suitable for model training.

In Deepnote, a great deal of the exploratory work happens without writing any code, thanks to some built-in EDA magic 🦄.

No-code EDA with Deepnote
  • 👀 Rich dataframes: With Deepnote’s built in DataFrame viewer, we can already examine the dataset for data types, missing values, percentage of given categories, distributions, and more.
  • 🔠 Built-in filters: The built-in filters make it easy to get a handle on the data and the relationships that might drive our future model’s predictions.
  • 📊 Native charts and visualizations: No-code charts allow us to examine our target variable as a function of other features without having to write any additional code. We can shift from code into visualizations, and right back into querying seamlessly.

4. Model training

After doing the requisite EDA in Deepnote, we can train and evaluate the model. We applied a random forest classifier and took a look at the confusion matrix to evaluate the accuracy of the classifications.

scikit-learn’s random forest classifier for churn prediction
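Here’s a hedged sketch of that training step, assuming the target column is CHURNVALUE and using an illustrative preprocessing pipeline (the exact feature set and hyperparameters come from the full notebook):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Pull the modelling dataset into the notebook as a Pandas DataFrame.
df = session.table("TELCO_DATASET").to_pandas()
X = df.drop(columns=["CUSTOMERID", "CHURNVALUE"])
y = df["CHURNVALUE"]

# One-hot encode the categorical columns and pass the rest through unchanged.
categorical = X.select_dtypes(include="object").columns.tolist()
model = Pipeline(
    steps=[
        ("encode", ColumnTransformer(
            [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
            remainder="passthrough",
        )),
        ("classifier", RandomForestClassifier(n_estimators=200, random_state=42)),
    ]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)

# Evaluate the classifications on the held-out split.
print(confusion_matrix(y_test, model.predict(X_test)))
```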

After evaluating our model’s accuracy (see full notebook tutorial), we can now deploy this model directly on the Snowflake warehouse—without changing our code, package environment, or moving our data 🚀.

😱 Thanks to scikit-learn’s pipeline, all the preprocessing applied during training will also be applied to new predictions made in the warehouse.

5. Model deployment

Snowpark’s UDFs (user-defined functions) are functions that are defined in pure Python but run on Snowflake compute. This is a perfect fit for our newly created model. UDFs are decorated with @udf and require a few other details beyond a standard Python function. Read the Snowpark documentation to learn more.

Let’s use a Snowpark UDF to move our model onto Snowflake compute, where it can be used for predictions on incoming data.

Deploying a model to Snowflake compute with a UDF
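A sketch of what registering such a UDF can look like. The stage name and feature list are placeholders; the trained pipeline and its scikit-learn dependency are shipped to Snowflake by Snowpark’s serialization and package mechanisms:

```python
import pandas as pd
from snowflake.snowpark.functions import udf

# Make the model's dependencies available on the Snowflake side.
session.add_packages("scikit-learn", "pandas", "numpy")

# The features the pipeline was trained on, in training order (placeholder).
features = list(X.columns)

@udf(name="predict_churn", is_permanent=True, stage_location="@udf_stage", replace=True)
def predict_churn(row: list) -> float:
    # The trained `model` pipeline is captured from the notebook and serialized
    # alongside the UDF, so the training preprocessing is applied here too.
    data = pd.DataFrame([row], columns=features)
    return float(model.predict(data)[0])
```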

The only thing left to do is use the model to make some predictions. We’ll take a sample from the TELCO_DATASET table to use for predictions.

💡 Note that Snowpark’s UDFs are scalar functions, meaning that they return a result for every row in the results set of your query.

Returning predictions as a new table in the warehouse
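A sketch of that scoring step, assuming the hypothetical predict_churn UDF and features list from the previous step:

```python
from snowflake.snowpark.functions import array_construct, call_udf, col

telco = session.table("TELCO_DATASET")

# Score a sample of rows with the UDF and persist the results as a new table.
(
    telco.sample(n=1000)
    .select(
        col("CUSTOMERID"),
        col("CHURNVALUE"),
        call_udf(
            "predict_churn",
            array_construct(*[col(f) for f in features]),
        ).alias("PREDICTED_CHURN"),
    )
    .write.save_as_table("CHURN_DETECTION", mode="overwrite")
)
```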

👆 The Snowpark select statement accomplished the following:

  • It selected CUSTOMERID, CHURNVALUE, and the output of the predict_churn function we wrote earlier (aliased as PREDICTED_CHURN).
  • It saved the results set to a new table in our warehouse called CHURN_DETECTION.

Now we can query the new CHURN_DETECTION table with Deepnote’s native SQL blocks to check it out.

Returning results into the notebook with SQL
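The same check can also be done from Python with session.sql, shown here as a small sketch; Deepnote’s SQL blocks accomplish this without writing Python at all:

```python
# Query the scored table and return the results as a Pandas DataFrame.
session.sql(
    "SELECT CUSTOMERID, CHURNVALUE, PREDICTED_CHURN FROM CHURN_DETECTION LIMIT 100"
).to_pandas()
```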

🪄 Et voilà, the PREDICTED_CHURN column shows the result of our model predictions alongside the customer ID and actual churn value—in a brand new table.

✅ Conclusion

With Deepnote and Snowpark, the entire database feels like an in-memory object, similar to the experience of using Pandas, but at a much larger scale. The transformation and scientific computation happen where the datasets live, without having to change tools, libraries, or languages.

Whether you’re a data scientist, analytics engineer, data engineer, developer, or analyst — if you’re using Snowpark, Deepnote will be a great addition to your toolbox. If Python is not your jam, Snowpark also supports a growing number of programming languages, including Scala and Java.

Get Started with Deepnote & Snowflake
Sign up for Deepnote for free.
