Getting Started with Snowpark Using a Jupyter Notebook and the Snowpark DataFrame API

Introduction

At Snowflake Summit 2021, Snowflake announced a new developer experience called Snowpark for public preview.

Snowpark brings deeply integrated, DataFrame-style programming to the languages developers like to use, along with functions that help you address more data use cases easily, all executed inside Snowflake. Snowpark support starts with the Scala API, Java UDFs, and external functions.

With Snowpark, developers can program using familiar constructs like the DataFrame, bring in complex transformation logic through UDFs, and then execute directly against Snowflake’s processing engine, leveraging all of its performance and scalability characteristics in the Data Cloud.

Snowpark provides several benefits over how developers have designed and coded data-driven solutions in the past:

  • Simplifies architecture and data pipelines by bringing different data users onto the same data platform, where they can process the same data without moving it around.
  • Accelerates data pipeline workloads by executing them with the performance, reliability, and scalability of Snowflake’s elastic performance engine.
  • Eliminates maintenance and overhead with managed services and near-zero maintenance.
  • Creates a single governance framework and a single set of policies to maintain by using a single platform.
  • Provides a highly secure environment with administrators having full control over which libraries are allowed to execute inside the Java/Scala runtimes for Snowpark.

The following tutorial shows you how to get started with Snowpark in your own environment through several hands-on examples using Jupyter Notebooks. You will find installation instructions for all necessary resources in the Snowflake Quickstart Tutorial.

For better readability of this post, the code sections are screenshots, i.e. the code cannot be copied from them. But don’t worry: all of the code is hosted by Snowflake-Labs in a GitHub repo. So if you would like to run, copy, or just review the code, head over to the GitHub repo and copy it directly from the source.

Prerequisite

Before you can start the tutorial, you need to install Docker on your local machine. However, if you can’t install Docker on your local machine, you are not out of luck: just follow the instructions below on how to create a Jupyter Notebook instance in AWS.

Running Jupyter on a Laptop using Docker

If you do have permission to install Docker on your local machine, follow the instructions on Docker’s website for your operating system (Windows/Mac/Linux). All of the following instructions assume that you are running on Mac or Linux; the Windows commands differ only in the path separator (forward slash vs. backslash). Please note that the code for the following sections is available in the GitHub repo.

1. Download and install Docker

2. Clone the GitHub repo
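A minimal sketch, assuming the repo referenced above lives under the Snowflake-Labs organization on GitHub and that you clone it to the directory used in step 4:

    git clone https://github.com/Snowflake-Labs/sfguide_snowpark_on_jupyter.git ~/DockerImages/sfguide_snowpark_on_jupyter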

3. Build the Docker container
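For example, assuming the Dockerfile sits at the root of the cloned repo (adjust the path if it lives in a subdirectory) and using snowparklab as an arbitrary image tag:

    cd ~/DockerImages/sfguide_snowpark_on_jupyter
    docker build -t snowparklab .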

4. Start your Jupyter environment
Type the following commands to start the container and mount the Snowpark Lab directory to the container. The commands below assume that you have cloned the repo to ~/DockerImages/sfguide_snowpark_on_jupyter. Adjust the path if necessary.
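A sketch of such a command; the image tag, container name, and mount target (which assumes a Jupyter docker-stacks style image) are carried over from the build step as assumptions, so adjust them to match your setup:

    docker run -it --rm --name snowparklab -p 8888:8888 \
        -v ~/DockerImages/sfguide_snowpark_on_jupyter:/home/jovyan/snowparklab \
        snowparklab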

Start a browser session (Safari, Chrome, …). Paste the line with the localhost address (127.0.0.1) printed in your shell window into the browser’s address bar, and update the port (8888) to your port if you changed it in the step above.

5. Stop your Jupyter environment
Type the following command into a new shell window when you want to stop the tutorial. All changes/work will be saved on your local machine.
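Assuming you named the container snowparklab as in the start command above:

    docker stop snowparklab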

Part 1: The Snowpark DataFrame API

This is the first notebook of a series to show how to use Snowpark on Snowflake. This notebook provides a quick-start guide and an introduction to the Snowpark DataFrame API. The notebook explains the steps for setting up the environment (REPL), and how to resolve dependencies to Snowpark. After a simple “Hello World” example you will learn about the Snowflake DataFrame API, projections, filters, and joins.

The full code for all examples can be found on GitHub in the notebook directory. The complete code for this post is in part1.

Prerequisite

To get started you need a Snowflake account and read/write access to a database. If you do not have a Snowflake account, you can sign up for a free trial. It doesn’t even require a credit card.

Quick Start

First, we have to set up the environment for our notebook. If you want to learn more about each step, head over to the Snowpark documentation in section configuring-the-jupyter-notebook-for-snowpark. Otherwise, just review the steps below.

Step 1

Configure the notebook to use a Maven repository for a library that Snowpark depends on.
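In an Ammonite-based kernel such as almond, this can be done roughly as follows; the repository URL is an assumption here, so use the one given in the Snowpark documentation:

    // Add a Maven repository the interpreter should use when resolving dependencies
    interp.repositories() ++= Seq(coursierapi.MavenRepository.of("https://repo1.maven.org/maven2/"))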

Step 2

Create a directory (if it doesn’t exist) for temporary files created by the REPL environment. To avoid any side effects from previous runs, we also delete any files in that directory.

Note: Make sure that you have the operating system permissions to create a directory in that location.

Note: If you are using multiple notebooks, you’ll need to create and configure a separate REPL class directory for each notebook.
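A sketch using the os-lib utilities bundled with the Ammonite-based kernel; the directory name repl_classes is an assumption:

    // Directory that will hold the classes generated by the REPL
    val replClassPath = os.pwd / "repl_classes"

    // Create it if it does not exist yet
    os.makeDir.all(replClassPath)

    // Delete any leftover files from previous runs to avoid side effects
    os.list(replClassPath).foreach(p => os.remove.all(p))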

Step 3

Configure the compiler for the Scala REPL. This does the following:

  • Configures the compiler to generate classes for the REPL in the directory that you created earlier.
  • Configures the compiler to wrap code entered in the REPL in classes, rather than in objects.
  • Adds the directory that you created earlier as a dependency of the REPL interpreter.
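A sketch of that configuration, reusing the replClassPath value from Step 2; the exact calls are assumptions based on the Ammonite interpreter API, so compare them against the Snowpark documentation:

    // Generate REPL classes into the directory created earlier
    interp.configureCompiler(c => c.settings.outputDirs.setSingleOutput(replClassPath.toString))

    // Wrap code entered in the REPL in classes rather than objects
    interp.configureCompiler(c => c.settings.Yreplclassbased.value = true)

    // Add the class directory as a dependency of the REPL interpreter
    interp.load.cp(replClassPath)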

Step 4

Import the Snowpark library from Maven.
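With an Ammonite-based kernel, the library can be pulled in with an ivy import; the version shown here is only a placeholder, so use the latest Snowpark release:

    // Resolve Snowpark from Maven and bring the API into scope
    import $ivy.`com.snowflake:snowpark:0.6.0`
    import com.snowflake.snowpark._
    import com.snowflake.snowpark.functions._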

To create a session, we need to authenticate ourselves to the Snowflake instance. Though it might be tempting to just override the authentication variables below with hard coded values, it’s not considered best practice to do so. If you share your version of the notebook, you might disclose your credentials by mistake to the recipient. Even worse, if you upload your notebook to a public code repository, you might advertise your credentials to the whole world. To prevent that, you should keep your credentials in an external file (like we are doing here).

Then, update your credentials in that file and they will be saved on your local machine. Even better would be to switch from user/password authentication to private key authentication.

Copy the credentials template file creds/template_credentials.txt to creds/credentials.txt and update the file with your credentials. Put your key files into the same directory or update the location in your credentials file.
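Creating the session from that credentials file can then look like this (the file layout follows the template in the repo):

    // Reads account, user, authentication, role, warehouse, etc. from the external properties file
    val session = Session.builder.configFile("creds/credentials.txt").create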

Step 5

Add the Ammonite kernel classes as dependencies for your UDF.
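What exactly needs to be registered depends on your kernel installation, so treat the following as a rough sketch: it registers the REPL class directory from Step 2 plus the jar a given runtime class was loaded from. The helper name and the class name are illustrative, and the reflection trick assumes the class actually comes from a jar on the local file system.

    // Register the REPL class directory so the classes generated for UDFs can be uploaded
    session.addDependency(replClassPath.toString)

    // Illustrative helper: find the jar a class was loaded from and register it as a dependency
    def addJarOf(className: String): Unit = {
      val jar = Class.forName(className).getProtectionDomain.getCodeSource.getLocation.getPath
      session.addDependency(jar)
    }
    addJarOf("pprint.TPrintColors")   // example class from the Ammonite/almond runtime; adjust as needed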

Hello World

Congratulations! You have successfully connected from a Jupyter Notebook to a Snowflake instance. Now we are ready to write our first “Hello World” program using Snowpark. That is as easy as the line in the cell below.
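A minimal version of that cell, using the session created in Step 4, might look like this:

    // Runs the SQL statement in Snowflake and prints the single-row result
    session.sql("select 'Hello World!'").show()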

Note that Snowpark has automatically translated the Scala code into the familiar “Hello World!” SQL statement. This means that we can execute arbitrary SQL by using the sql method of the session class.

However, this doesn’t really show the power of the new Snowpark API. At this point it’s time to review the Snowpark API documentation. It provides valuable information on how to use the Snowpark API.

Let’s now create a new Hello World! cell that uses the Snowpark API, specifically the DataFrame API. To use the DataFrame API, we first create a row and a schema, and then a DataFrame based on the row and the schema.
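A sketch of such a cell, with illustrative row, schema, and variable names:

    import com.snowflake.snowpark.types._

    // A single row holding the greeting and a schema describing the one column
    val helloRow = Row("Hello World!")
    val helloSchema = StructType(Seq(StructField("GREETING", StringType)))

    // Build a DataFrame from the row and the schema, then evaluate it with show()
    val helloDf = session.createDataFrame(Seq(helloRow), helloSchema)
    helloDf.show()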

Snowflake DataFrames

After having mastered the Hello World! stage, we can now query Snowflake tables using the DataFrame API. To do so, we will query the Snowflake Sample Database included in every Snowflake instance.

As you may know, the TPCH data sets come in different sizes from 1 TB to 1 PB (1000 TB). For starters we will query the orders table in the 10 TB dataset size. Instead of writing a SQL statement we will use the DataFrame API. The advantage is that DataFrames can be built as a pipeline. Let’s take a look at the demoOrdersDf.
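In the line below, demoDataSchema is assumed to be a sequence naming the sample database and the TPC-H schema you want to query; adjust the schema name to whichever dataset size you picked:

    // Assumed definition of the database/schema prefix for the sample data
    val demoDataSchema = Seq("SNOWFLAKE_SAMPLE_DATA", "TPCH_SF10")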

val demoOrdersDf = session.table(demoDataSchema :+ "ORDERS")

In contrast to the initial Hello World! example above, we now map a Snowflake table to a DataFrame. The definition of a DataFrame doesn’t take any time to execute. It’s just defining metadata. To get the result, for instance the content of the Orders table, we need to evaluate the DataFrame. One way of doing that is to apply the count() action which returns the row count of the DataFrame. In this case, the row count of the Orders table. Another method is the schema function.
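Continuing with demoOrdersDf, for example:

    // count() runs a query in Snowflake and returns the number of rows as a Long
    val orderCount = demoOrdersDf.count()

    // schema only describes the columns; no table data is read
    println(demoOrdersDf.schema)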

Next, we want to apply a projection. In SQL terms, this is the select clause. Instead of getting all of the columns in the Orders table, we are only interested in a few. This is accomplished by the select() transformation. Note that we can just add additional qualifications to the existing demoOrdersDf DataFrame and create a new DataFrame that includes only a subset of columns. Lastly, instead of counting the rows in the DataFrame, this time we want to see its content. To do so we again need to evaluate the DataFrame, this time using another action, show().
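A sketch of such a projection, using a few of the TPC-H ORDERS columns (the new DataFrame name is illustrative):

    // Keep only a subset of the ORDERS columns and evaluate the resulting DataFrame
    val demoOrdersProjectedDf = demoOrdersDf.select(
      col("O_ORDERKEY"), col("O_CUSTKEY"), col("O_ORDERSTATUS"), col("O_TOTALPRICE"), col("O_ORDERDATE"))
    demoOrdersProjectedDf.show()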

Let’s now assume that we do not want all the rows but only a subset of rows in a DataFrame. We can accomplish that with the filter() transformation.
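For example, keeping only the orders whose status is 'F':

    // filter() adds a where clause to the underlying query; show() evaluates it
    val demoOrdersFilteredDf = demoOrdersProjectedDf.filter(col("O_ORDERSTATUS") === "F")
    demoOrdersFilteredDf.show()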

And lastly, we want to create a new DataFrame which joins the Orders table with the LineItem table. Again, we are using our previous DataFrame that is a projection and a filter against the Orders table. We can join that DataFrame to the LineItem table and create a new DataFrame. We then apply the select() transformation. Again, to see the result we need to evaluate the DataFrame, for instance by using the show() action.
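A sketch of that pipeline, joining on the order key (the column choices and DataFrame names are illustrative):

    // Map the LINEITEM table to a DataFrame and join it to the filtered ORDERS DataFrame
    val demoLineItemsDf = session.table(demoDataSchema :+ "LINEITEM")
    val demoJoinedDf = demoOrdersFilteredDf
      .join(demoLineItemsDf, demoOrdersFilteredDf("O_ORDERKEY") === demoLineItemsDf("L_ORDERKEY"))
      .select(col("O_ORDERKEY"), col("O_ORDERDATE"), col("O_TOTALPRICE"), col("L_QUANTITY"), col("L_EXTENDEDPRICE"))
    demoJoinedDf.show()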

Conclusion

Snowpark is a brand new developer experience that brings scalable data processing to the Data Cloud. In Part 1 of this series, we learned how to set up a Jupyter Notebook and configure it to use Snowpark to connect to the Data Cloud. Next, we built a simple Hello World! program to test connectivity using embedded SQL. Then we enhanced that program by introducing the Snowpark DataFrame API. Lastly, we explored the power of the Snowpark DataFrame API using filter, projection, and join transformations.

In the next post of this series, we will learn how to create custom Scala based functions and execute arbitrary logic directly in Snowflake using user defined functions (UDFs) just by defining the logic in a Jupyter Notebook!
