Diamond Price Prediction: End-to-End Machine Learning with Snowpark ML in Snowflake Notebooks

Snowflake Notebooks (currently in Private Preview) provide a native development experience directly in the Snowsight UI. Notebooks allow analysts, data scientists, and data engineers to streamline their workflows, develop AI/ML models with direct access to flexible, distributed compute, and accelerate the path from prototype to production across teams, all within the Snowflake platform. Machine learning practitioners in particular benefit from how easy it is to start processing data, building models, and deploying to production, all natively in Snowflake using Snowflake Notebooks.

Last year, as part of Snowpark ML becoming Generally Available, we published the Intro to Machine Learning with Snowpark ML for Python Quickstart, which showed how easy it is to build end-to-end machine learning workflows in Snowflake from anywhere. In this post, we'll walk through building that same workflow directly in Snowsight using the new Snowflake Notebooks. Working with the classic Diamonds dataset, we'll build and deploy a machine learning model that predicts a diamond's price from its characteristics.

Note: the notebooks and instructions for everything shown in this blog post are publicly available on GitHub and in the Snowflake Solutions Center.

Getting started with notebooks in Snowsight is just as easy as spinning up a SQL worksheet. From Snowsight, simply navigate to the Projects > Notebooks menu option:

Notebooks menu

From there, you can create a new notebook or upload an existing .ipynb file. Let's upload the first notebook from our quickstart, which contains our data preprocessing workflow:

Create or import an existing notebook
Configure your notebook name, location, and warehouse

Once you click Create, the notebook is created and you are taken directly to it. From here, you can write and execute both SQL and Python code in one place, annotate your code with Markdown, and more.

A Snowflake Notebook!

What if you need additional packages in your Notebook? Snowflake Notebooks come pre-configured with a number of common packages, such as the Snowpark Python Client API. For our first data preprocessing notebook, we also need the Snowpark ML (snowflake-ml-python) package. To add it to our notebook, simply select it from the Packages drop-down menu:

Select additional packages

Once you’ve selected the packages you need, click “Start”. The virtual warehouse for your notebook will spin up, and you can start executing your code.

Dataframe-style processing with Snowpark is incredibly simple in Snowflake Notebooks. The notebook comes with an active Snowpark session, which can be retrieved with the get_active_session() function:

Use a Snowpark Session
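
To make that concrete, here's a minimal sketch of what that first cell might look like. The DIAMONDS table name is a placeholder for however you've loaded the quickstart data into your account:

```python
# Minimal sketch: grab the notebook's active Snowpark session and read a table.
# "DIAMONDS" is a placeholder table name, not necessarily the quickstart's exact name.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Read a table into a Snowpark DataFrame (lazily evaluated, pushed down to Snowflake)
diamonds_df = session.table("DIAMONDS")
diamonds_df.show()
```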

Once we have our session, we can do all kinds of normal dataframe-style processing using Python, visualize our output, and persist data to tables in our Snowflake account.

Dataframe transformations
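
For example, a cleanup cell might look something like the following sketch; the column names and transformations are illustrative assumptions based on the Diamonds dataset, not the quickstart's exact code:

```python
# A hedged sketch of dataframe-style processing with Snowpark.
# Column names (CUT, CARAT, PRICE) are assumptions from the Diamonds dataset.
import snowflake.snowpark.functions as F

clean_df = (
    diamonds_df
    .with_column("CUT", F.upper(F.col("CUT")))           # normalize casing
    .filter(F.col("CARAT") > 0)                          # drop invalid rows
    .with_column("PRICE", F.col("PRICE").cast("float"))  # ensure numeric type
)

# Persist the cleaned data back to a table in our Snowflake account
clean_df.write.mode("overwrite").save_as_table("DIAMONDS_CLEAN")
```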

In an end-to-end ML workflow, once our basic data preparation is done, we also need to do some additional analysis and feature engineering to get our data ready for model training. The second notebook in our quickstart does just that. In addition to the preprocessing methods provided by Snowpark ML, Snowflake Notebooks also support common visualization libraries such as matplotlib and seaborn that we can use to visually analyze our training data and perform feature selection:

Plot and visualize data using common libraries
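
As a rough sketch, a visualization cell might pull a sample of the data into pandas and plot it locally; the column names are again assumptions based on the Diamonds dataset:

```python
# Sketch: sample the Snowpark DataFrame, convert to pandas, and plot locally.
import matplotlib.pyplot as plt
import seaborn as sns

# Bring a manageable sample into memory for plotting
pdf = clean_df.sample(n=10000).to_pandas()

fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(data=pdf, x="CARAT", y="PRICE", hue="CUT", alpha=0.5, ax=ax)
ax.set_title("Diamond price vs. carat")
fig
```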

We can bundle our preprocessing steps together into a Pipeline object and save it as an artifact to reuse later during training and inference.

Build a preprocessing pipeline
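
Here's a sketch of what that might look like using Snowpark ML's preprocessing classes. The column lists and the choice of encoder and scaler are assumptions for illustration, and you might additionally upload the saved artifact to a stage so other notebooks can load it:

```python
# Sketch: bundle Snowpark ML preprocessing steps into a Pipeline and save it.
# Column lists are assumptions based on the Diamonds dataset.
import joblib
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import OrdinalEncoder, MinMaxScaler

CATEGORICAL_COLS = ["CUT", "COLOR", "CLARITY"]
NUMERICAL_COLS = ["CARAT", "DEPTH", "X", "Y", "Z"]

preprocessing_pipeline = Pipeline(
    steps=[
        # Encode ordered categorical features in place
        ("encoder", OrdinalEncoder(input_cols=CATEGORICAL_COLS,
                                   output_cols=CATEGORICAL_COLS)),
        # Scale numeric features to a common range
        ("scaler", MinMaxScaler(input_cols=NUMERICAL_COLS,
                                output_cols=NUMERICAL_COLS)),
    ]
)

# Persist the pipeline as a local artifact for the training notebook to reuse
joblib.dump(preprocessing_pipeline, "preprocessing_pipeline.joblib")
```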

Once our training data is prepared, we can use Snowpark ML to fit a predictive model to our data, and even perform parallelized hyperparameter tuning (e.g., using GridSearchCV), all from the notebook interface:

Run parallelized hyperparameter tuning
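
Here's a hedged sketch of what the tuning cell might look like. train_df stands in for the prepared training DataFrame, and the parameter grid and scoring metric are illustrative rather than the quickstart's exact values:

```python
# Sketch: distributed hyperparameter tuning with Snowpark ML's GridSearchCV
# wrapping an XGBoost regressor. train_df is assumed to be the prepared
# Snowpark DataFrame; column lists carry over from the pipeline sketch above.
from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBRegressor

FEATURE_COLS = CATEGORICAL_COLS + NUMERICAL_COLS
LABEL_COLS = ["PRICE"]
OUTPUT_COLS = ["PREDICTED_PRICE"]

grid_search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid={
        "n_estimators": [100, 200, 500],
        "learning_rate": [0.1, 0.3],
    },
    n_jobs=-1,
    scoring="neg_mean_absolute_percentage_error",
    input_cols=FEATURE_COLS,
    label_cols=LABEL_COLS,
    output_cols=OUTPUT_COLS,
)

# fit() pushes the search down to the Snowflake warehouse and runs it in parallel
grid_search.fit(train_df)
print(grid_search.to_sklearn().best_params_)
```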

We can also use the Snowflake native Model Registry (currently in Public Preview) to log, version, and compare different iterations of our models.

Register and version models
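
A minimal sketch of logging the tuned model, assuming the session and model objects from the earlier sketches; the model and version names are placeholders:

```python
# Sketch: log a fitted Snowpark ML model to the Snowflake Model Registry.
from snowflake.ml.registry import Registry

reg = Registry(session=session)

model_version = reg.log_model(
    grid_search,                        # the fitted model from the tuning sketch
    model_name="DIAMOND_PRICE_MODEL",   # placeholder name
    version_name="V1",
    comment="XGBoost regressor tuned with GridSearchCV",
)

# Inspect and compare registered models and their versions
reg.show_models()
reg.get_model("DIAMOND_PRICE_MODEL").show_versions()
```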

Once we have a model that we want to deploy for inference, we can do so with a single line of code. Deployed models can then be invoked using Python APIs:

Deploy models for inference
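
For example, batch inference against a held-out Snowpark DataFrame (here a hypothetical test_df) might look like this sketch:

```python
# Sketch: retrieve a registered model version and run batch inference.
# test_df and the model/version names are assumptions from the sketches above.
mv = reg.get_model("DIAMOND_PRICE_MODEL").version("V1")

# Predictions come back as additional columns on the returned DataFrame
predictions_df = mv.run(test_df, function_name="predict")
predictions_df.show()
```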

One of the great features of Snowflake Notebooks is the ability to intermingle SQL and Python code. So while we can call our model using Python APIs, we can also invoke it directly via SQL using a SQL cell in the notebook:

Combine SQL and Python cells in the same notebook
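
For illustration, here's a sketch of that call wrapped in session.sql() so the example stays in Python; in a notebook you could put the same query directly in a SQL cell. The model-method syntax and the placeholder table name are assumptions and may need adjusting for your account:

```python
# Sketch: invoke the registered model from SQL. In a SQL cell you would write
# the query directly; session.sql() is used here only to keep the example in Python.
# DIAMONDS_TEST and the column list are placeholders; the <model>!PREDICT(...)
# method-call syntax is an assumption based on the Model Registry's SQL interface.
sql_predictions = session.sql("""
    SELECT
        *,
        DIAMOND_PRICE_MODEL!PREDICT(CUT, COLOR, CLARITY, CARAT, DEPTH, X, Y, Z)
            AS PREDICTION
    FROM DIAMONDS_TEST
""")
sql_predictions.show()
```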

Snowflake Notebooks let you run full end-to-end machine learning workflows without ever leaving the Snowsight UI. You get all the flexibility, price-performance, and ease of use you're accustomed to with Snowflake and Snowpark, now delivered natively in the Snowflake UI through a familiar, easy-to-use notebook interface.

For more information about getting started with Snowflake Notebooks, check out the docs and our Snowflake Notebooks 101 YouTube playlist!

--

Caleb Baechtold
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

ML/AI Field CTO @ Snowflake. Mathematician, artist & data nerd. Alumnus of the Johns Hopkins University. @clbaechtold — Opinions my own