Diamond Price Prediction: End-to-End Machine Learning with Snowpark ML in Snowflake Notebooks

Snowflake Notebooks (currently in Private Preview) provide a native development experience directly in the Snowsight UI. Notebooks allow analysts, data scientists, and data engineers to streamline their workflows, develop AI/ML models with direct access to flexible, distributed compute, and accelerate the path from prototype to production across teams, all within the Snowflake platform. Machine learning practitioners in particular benefit from how easy it is to start processing data, building models, and deploying to production, all natively in Snowflake using Snowflake Notebooks.

Last year, as part of Snowpark ML becoming Generally Available, we published the Intro to Machine Learning with Snowpark ML for Python Quickstart, which showed how easy it is to build end-to-end machine learning workflows in Snowflake from anywhere. In this post, we'll walk through building that same workflow directly in Snowsight using the new Snowflake Notebooks. Working with the classic Diamonds dataset, we'll build and deploy a machine learning model that predicts a diamond's price from its characteristics.

Note: the notebooks and instructions for everything shown in this blog post are publicly available on GitHub and in the Snowflake Solutions Center.

Getting started with notebooks in Snowsight is just as easy as spinning up a SQL worksheet. From Snowsight, simply navigate to the Projects > Notebooks menu option:

Notebooks menu

From there, you can create a new notebook or upload an existing .ipynb file. Let's upload the first notebook from our quickstart, which contains our data preprocessing workflow:

Create or import an existing notebook
Configure your notebook name, location, and warehouse

Once you click Create, the notebook is created and you are taken directly to it. From here, you can write and execute both SQL and Python code in one place, annotate your code with Markdown, and more.

A Snowflake Notebook!

What if you need additional packages in your Notebook? Snowflake Notebooks come pre-configured with a number of common packages, such as the Snowpark Python Client API. For our first data preprocessing notebook, we also need the Snowpark ML (snowflake-ml-python) package. To add it to our notebook, simply select it from the Packages drop-down menu:

Select additional packages

Once you’ve selected the packages you need, click “Start”. The virtual warehouse for your notebook will spin up, and you can start executing your code.

Dataframe-style processing with Snowpark is incredibly simple in Snowflake Notebooks. The notebook comes with an active Snowpark session, which can be retrieved with the get_active_session() function:

Use a Snowpark Session
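
To make that concrete, here's a minimal sketch of what that first cell might look like. The DIAMONDS table name is a placeholder for however you've loaded the quickstart data into your account:

```python
# Minimal sketch: grab the notebook's active Snowpark session and read a table.
# "DIAMONDS" is a placeholder table name, not necessarily the quickstart's exact name.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Read a table into a Snowpark DataFrame (lazily evaluated, pushed down to Snowflake)
diamonds_df = session.table("DIAMONDS")
diamonds_df.show()
```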

Once we have our session, we can do all kinds of normal dataframe-style processing using Python, visualize our output, and persist data to tables in our Snowflake account.

Dataframe transformations
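
For example, a cleanup cell might look something like the following sketch; the column names and transformations are illustrative assumptions based on the Diamonds dataset, not the quickstart's exact code:

```python
# A hedged sketch of dataframe-style processing with Snowpark.
# Column names (CUT, CARAT, PRICE) are assumptions from the Diamonds dataset.
import snowflake.snowpark.functions as F

clean_df = (
    diamonds_df
    .with_column("CUT", F.upper(F.col("CUT")))           # normalize casing
    .filter(F.col("CARAT") > 0)                          # drop invalid rows
    .with_column("PRICE", F.col("PRICE").cast("float"))  # ensure numeric type
)

# Persist the cleaned data back to a table in our Snowflake account
clean_df.write.mode("overwrite").save_as_table("DIAMONDS_CLEAN")
```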

In an end-to-end ML workflow, once our basic data preparation is done, we also need to do some additional analysis and feature engineering to get our data ready for model training. The second notebook in our quickstart does just that. In addition to the preprocessing methods provided by Snowpark ML, Snowflake Notebooks also support common visualization libraries such as matplotlib and seaborn that we can use to visually analyze our training data and perform feature selection:

Plot and visualize data using common libraries
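
As a rough sketch, a visualization cell might pull a sample of the data into pandas and plot it locally; the column names are again assumptions based on the Diamonds dataset:

```python
# Sketch: sample the Snowpark DataFrame, convert to pandas, and plot locally.
import matplotlib.pyplot as plt
import seaborn as sns

# Bring a manageable sample into memory for plotting
pdf = clean_df.sample(n=10000).to_pandas()

fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(data=pdf, x="CARAT", y="PRICE", hue="CUT", alpha=0.5, ax=ax)
ax.set_title("Diamond price vs. carat")
fig
```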

We can bundle our preprocessing steps together into a Pipeline object and save it as an artifact to reuse later during training and inference.

Build a preprocessing pipeline
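
Here's a sketch of what that might look like using Snowpark ML's preprocessing classes. The column lists and the choice of encoder and scaler are assumptions for illustration, and you might additionally upload the saved artifact to a stage so other notebooks can load it:

```python
# Sketch: bundle Snowpark ML preprocessing steps into a Pipeline and save it.
# Column lists are assumptions based on the Diamonds dataset.
import joblib
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import OrdinalEncoder, MinMaxScaler

CATEGORICAL_COLS = ["CUT", "COLOR", "CLARITY"]
NUMERICAL_COLS = ["CARAT", "DEPTH", "X", "Y", "Z"]

preprocessing_pipeline = Pipeline(
    steps=[
        # Encode ordered categorical features in place
        ("encoder", OrdinalEncoder(input_cols=CATEGORICAL_COLS,
                                   output_cols=CATEGORICAL_COLS)),
        # Scale numeric features to a common range
        ("scaler", MinMaxScaler(input_cols=NUMERICAL_COLS,
                                output_cols=NUMERICAL_COLS)),
    ]
)

# Persist the pipeline as a local artifact for the training notebook to reuse
joblib.dump(preprocessing_pipeline, "preprocessing_pipeline.joblib")
```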

Once our training data is prepared, we can use Snowpark ML to fit a predictive model to our data, and even perform parallelized hyperparameter tuning (e.g., using GridSearchCV), all from the notebook interface:

Run parallelized hyperparameter tuning
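
Here's a hedged sketch of what the tuning cell might look like. train_df stands in for the prepared training DataFrame, and the parameter grid and scoring metric are illustrative rather than the quickstart's exact values:

```python
# Sketch: distributed hyperparameter tuning with Snowpark ML's GridSearchCV
# wrapping an XGBoost regressor. train_df is assumed to be the prepared
# Snowpark DataFrame; column lists carry over from the pipeline sketch above.
from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBRegressor

FEATURE_COLS = CATEGORICAL_COLS + NUMERICAL_COLS
LABEL_COLS = ["PRICE"]
OUTPUT_COLS = ["PREDICTED_PRICE"]

grid_search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid={
        "n_estimators": [100, 200, 500],
        "learning_rate": [0.1, 0.3],
    },
    n_jobs=-1,
    scoring="neg_mean_absolute_percentage_error",
    input_cols=FEATURE_COLS,
    label_cols=LABEL_COLS,
    output_cols=OUTPUT_COLS,
)

# fit() pushes the search down to the Snowflake warehouse and runs it in parallel
grid_search.fit(train_df)
print(grid_search.to_sklearn().best_params_)
```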

We can also use the Snowflake native Model Registry (currently in Public Preview) to log, version, and compare different iterations of our models.

Register and version models
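
A minimal sketch of logging the tuned model, assuming the session and model objects from the earlier sketches; the model and version names are placeholders:

```python
# Sketch: log a fitted Snowpark ML model to the Snowflake Model Registry.
from snowflake.ml.registry import Registry

reg = Registry(session=session)

model_version = reg.log_model(
    grid_search,                        # the fitted model from the tuning sketch
    model_name="DIAMOND_PRICE_MODEL",   # placeholder name
    version_name="V1",
    comment="XGBoost regressor tuned with GridSearchCV",
)

# Inspect and compare registered models and their versions
reg.show_models()
reg.get_model("DIAMOND_PRICE_MODEL").show_versions()
```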

Once we have a model that we want to deploy for inference, we can do so with a single line of code. Deployed models can then be invoked using Python APIs:

Deploy models for inference
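
For example, batch inference against a held-out Snowpark DataFrame (here a hypothetical test_df) might look like this sketch:

```python
# Sketch: retrieve a registered model version and run batch inference.
# test_df and the model/version names are assumptions from the sketches above.
mv = reg.get_model("DIAMOND_PRICE_MODEL").version("V1")

# Predictions come back as additional columns on the returned DataFrame
predictions_df = mv.run(test_df, function_name="predict")
predictions_df.show()
```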

One of the great features of Snowflake Notebooks is the ability to intermingle SQL and Python code. So while we can call our model using Python APIs, we can also invoke it directly via SQL using a SQL cell in the notebook:

Combine SQL and Python cells in the same notebook
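
For illustration, here's a sketch of that call wrapped in session.sql() so the example stays in Python; in a notebook you could put the same query directly in a SQL cell. The model-method syntax and the placeholder table name are assumptions and may need adjusting for your account:

```python
# Sketch: invoke the registered model from SQL. In a SQL cell you would write
# the query directly; session.sql() is used here only to keep the example in Python.
# DIAMONDS_TEST and the column list are placeholders; the <model>!PREDICT(...)
# method-call syntax is an assumption based on the Model Registry's SQL interface.
sql_predictions = session.sql("""
    SELECT
        *,
        DIAMOND_PRICE_MODEL!PREDICT(CUT, COLOR, CLARITY, CARAT, DEPTH, X, Y, Z)
            AS PREDICTION
    FROM DIAMONDS_TEST
""")
sql_predictions.show()
```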

Snowflake Notebooks let you run full end-to-end machine learning workflows without ever leaving the Snowsight UI. You get all the flexibility, price-performance, and ease of use you're accustomed to with Snowflake and Snowpark, now delivered natively in the Snowflake UI through a familiar, easy-to-use notebook interface.

For more information about getting started with Snowflake Notebooks, check out the docs and our Snowflake Notebooks 101 YouTube playlist!

--

Caleb Baechtold
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

ML/AI Field CTO @ Snowflake. Mathematician, artist & data nerd. Alumnus of the Johns Hopkins University. @clbaechtold — Opinions my own