Snowpark for Python

Snowpark for Python: a secure and rich ecosystem.

What is Snowpark?

Snowpark is a development framework from Snowflake. Its goal is to meet developers where they are, letting data engineers, data scientists, and data developers code in a familiar way in the language of their choice, and to execute data pipelines, ML algorithms, and data applications faster and more securely on a single platform inside Snowflake. The Snowpark library provides an intuitive API for querying and processing data at scale in Snowflake.

Snowflake currently provides Snowpark libraries for three languages: Java, Python, and Scala. Using the library for any of these languages, you can build applications that process data in Snowflake without moving the data to the system where your application code runs, and process it at scale as part of the elastic and serverless Snowflake engine. In this blog we will focus primarily on Snowpark for Python.

Why Snowpark?

Traditional & Streamlined Approach Comparison

As we can see from the above diagram, the traditional approach has several challenges:

  1. Running separate Spark processing clusters for different languages.
  2. Complexity involved in capacity management.
  3. Lots of data movement and data silos.
  4. Security and governance loopholes.

Snowflake solves all of these challenges with a modern approach designed to deliver the following benefits:

  1. A single platform for all languages, with no need to configure separate clusters.
  2. Simple and easy capacity management.
  3. A streamlined architecture where teams collaborate on the same data, mitigating data silos.
  4. Robust governance and security controls.

Snowpark for Python:

Pic courtesy → Snowflake

Now that we have an insight into what Snowpark is and how it enables the data ecosystem, it is essential to understand how to use it with Python. As shown in the picture above, Snowpark supports familiar syntax through a dataframe abstraction. We can easily access hundreds of packages via Anaconda, making it a rich ecosystem, and we also get a very secure sandbox environment to work in.

Pic courtesy → Snowflake

As shown above, we can use Snowpark for Python in three different ways: (1) the client API, (2) UDFs, and (3) stored procedures. In this blog we will discuss the client API in more detail and how to set it up using an Anaconda virtual environment.

Dataframe API query → Pic courtesy Snowflake

Using the dataframe API, we can query Snowflake data with Python. These are very familiar dataframe constructs with 100% pushdown to Snowflake, so we get all the advantages of the platform, such as performance and scalability.

What is a dataframe and a Snowpark dataframe?

Pic courtesy → Snowflake

A dataframe offers a functional style of programming that yields more efficiency with less coding effort. It has a very concise syntax which can be easily understood and debugged.

A Snowpark dataframe is a lazily evaluated relational data set. It is considered lazy because it encapsulates the computation required to produce the relational data set rather than the data itself; the computation is not performed until an action is explicitly invoked. This is discussed with practical examples in a subsequent section of this blog.

Detailed steps to configure the Snowpark API using Jupyter notebooks:

Step 1: The Snowpark API requires Python 3.8.

We can create a Python 3.8 virtual environment using tools like Anaconda, Miniconda, or virtualenv.

In the example below we use Anaconda to create this virtual environment. The command is given below:

(base) C:\Users\user>conda create --name py38_env --override-channels -c https://repo.anaconda.com/pkgs/snowflake python=3.8 numpy pandas
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

environment location: C:\Users\user\anaconda3\envs\py38_env

added / updated specs:
- numpy
- pandas
- python=3.8


The following packages will be downloaded:

package | build
---------------------------|-----------------
blas-1.0 | mkl 6 KB https://repo.anaconda.com/pkgs/snowflake
bottleneck-1.3.5 | py38h080aedc_0 117 KB https://repo.anaconda.com/pkgs/snowflake
ca-certificates-2023.01.10 | haa95532_0 158 KB https://repo.anaconda.com/pkgs/snowflake
certifi-2022.12.7 | py38haa95532_0 152 KB https://repo.anaconda.com/pkgs/snowflake

....,
....,
vs2015_runtime-14.27.29016 | h5e58377_2 2.2 MB https://repo.anaconda.com/pkgs/snowflake
wheel-0.38.4 | py38haa95532_0 77 KB https://repo.anaconda.com/pkgs/snowflake
wincertstore-0.2 | py38haa95532_2 15 KB https://repo.anaconda.com/pkgs/snowflake
------------------------------------------------------------
Total: 237.7 MB

The following NEW packages will be INSTALLED:

blas pkgs/snowflake/win-64::blas-1.0-mkl
bottleneck pkgs/snowflake/win-64::bottleneck-1.3.5-py38h080aedc_0
...,
wincertstore pkgs/snowflake/win-64::wincertstore-0.2-py38haa95532_2

Proceed ([y]/n)? y

Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate py38_env
#
# To deactivate an active environment, use
#
# $ conda deactivate

Step 2: Install the Snowpark Python package into the Python 3.8 virtual environment using “conda”, as shown below:

(py38_env) C:\Users\user>conda install snowflake-snowpark-python
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

environment location: C:\Users\user\anaconda3\envs\py38_env

added / updated specs:
- snowflake-snowpark-python


The following packages will be downloaded:

package | build
---------------------------|-----------------
arrow-cpp-8.0.0 | py38hbd6f097_1 7.0 MB
asn1crypto-1.5.1 | py38haa95532_0 167 KB
aws-c-common-0.4.57 | ha925a31_1 147 KB
....,
....,
Downloading and Extracting Packages
requests-2.28.1 | 98 KB | ############################################################################################################################ | 100%

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Step 3: Setting up Jupyter Notebook:

(py38_env) C:\Users\user>pip install notebook
Collecting notebook
Downloading notebook-6.5.3-py3-none-any.whl (529 kB)
---------------------------------------- 529.7/529.7 kB 1.8 MB/s eta 0:00:00
Collecting argon2-cffi
Downloading argon2_cffi-21.3.0-py3-none-any.whl (14 kB)
Collecting jupyter-client>=5.3.4
Downloading jupyter_client-8.1.0-py3-none-any.whl (102 kB)
---------------------------------------- 102.9/102.9 kB 3.0 MB/s eta 0:00:00

Step 4: Starting a Jupyter notebook:

(py38_env) C:\Users\user>jupyter notebook


Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions.

https://jupyter-notebook.readthedocs.io/en/latest/migrate_to_notebook7.html

Please note that updating to Notebook 7 might break some of your extensions.

[I 21:47:43.990 NotebookApp] Serving notebooks from local directory: C:\Users\user
[I 21:47:43.990 NotebookApp] Jupyter Notebook 6.5.3 is running at:
Starting a Jupyter Notebook

Step 5: Creating a new Python notebook using Jupyter.

Once all of the above steps, i.e., (1) to (4), are complete, we can create a Python notebook as shown below.

Creating a session for Snowpark:

This is usually done in three steps, as given below:

Step 1: Importing the Snowpark Session class.

Step 2: Constructing the dictionary of connection parameters.

Step 3: Invoking the Snowpark session.

# Importing the Snowpark Session class
from snowflake.snowpark import Session

# Constructing the dictionary of connection parameters
conn_config = {
    "account": "VRZUKJU-JX06300",
    "user": "XXXXXXXXXX",
    "password": "XXXXXXXX",
    "role": "accountadmin",
    "warehouse": "compute_wh",
    "database": "demo_db",
    "schema": "demo_schema"
}

# Invoking the Snowpark session to establish the connection
new_session = Session.builder.configs(conn_config).create()
Snapshot from Jupyter notebook

I did face one challenge while connecting to Snowflake: the format of the value passed in the “account” parameter. It expects the account identifier in the <organization_name>-<account_name> form (as in the example above), not the full account URL. Below is what we need to do:

Setting up of the connection
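
As a quick sanity check (a minimal sketch using the new_session object created above; the query shown is only an example), we can run a trivial query to confirm that the session is connected and has picked up the context from conn_config:

# Simple query to confirm the session is connected.
print(new_session.sql("select current_version()").collect())

# Confirm the database and schema the session picked up from conn_config.
print(new_session.get_current_database(), new_session.get_current_schema())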

Dataframe operations using Snowpark:

Now that all the setup is ready, below are some of the amazing dataframe operations we can perform using the Snowpark framework. There are primarily two categories of operations:

  1. Transformations: These produce a new dataframe from one or more existing dataframes. Examples include methods like ‘filter’, ‘sort’, and ‘join’.
  2. Actions: Because the dataframe is lazily evaluated, Snowpark sends the SQL query for the dataframe to the server only when you call a method that performs an action. Examples include df.show(), df.collect(), and df.count(). (A small sketch of how to peek at the generated SQL follows this list.)
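
Because transformations are only recorded until an action runs, we can inspect the SQL that Snowpark has built up so far. The sketch below assumes the new_session object from earlier and uses the dataframe’s queries property (available in recent versions of the Snowpark library); the filter value is illustrative:

from snowflake.snowpark.functions import col

# A chain of transformations: nothing is executed in Snowflake yet.
df = new_session.table("demo_cust_tbl").filter(col("C_CUSTOMER_SK") > 100).select(col("C_CUSTOMER_ID"))

# Peek at the SQL that will be pushed down once an action is invoked.
print(df.queries["queries"])

# The action: only now is the SQL sent to Snowflake for evaluation.
print(df.count())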

Some examples of operations using Snowpark are as below:

Operation 1 → Create a dataframe from a SQL query.

#Create a DataFrame from a SQL query
query_1 = new_session.sql("select * from demo_cust_tbl limit 5")
query_1.show()
The o/p of the above dataframe query.

Operation 2 → To create a dataframe from a complete table present in Snowflake.

#To create a dataframe from a table present in Snowflake.
df_table = new_session.table("demo_cust_tbl")
df_table.show()
The o/p

Operation 3 → To create a dataframe with one named column from a specified list of values.

#To create a dataframe with one column named from a specified list of values.
df2 = new_session.create_dataframe([1, 2, 3, 4]).to_df("a")
df2.show()
The o/p
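
As a small extension of this (a sketch only; the column names are illustrative), create_dataframe can also build a multi-column dataframe by passing rows together with a schema:

# Creating a dataframe with two named columns from in-memory rows.
df3 = new_session.create_dataframe([[1, "snow"], [2, "park"]], schema=["id", "name"])
df3.show()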

Operation 4 → To specify which rows should be returned from a table, we use the ‘filter’ method. Here the transformation is ‘filter’ and the action is ‘show()’.

from snowflake.snowpark.functions import col
df = new_session.table("demo_cust_tbl").filter(col("C_CUSTOMER_SK") == 69912739)
df.show()
The o/p

Operation 5 → To specify which columns should be returned, we call the ‘select’ method. Here the transformation is ‘select’ and the action is ‘show()’.

from snowflake.snowpark.functions import col
df = new_session.table("demo_cust_tbl").select(col("C_CUSTOMER_SK"), col("C_CUSTOMER_ID"), col("C_CURRENT_CDEMO_SK"), col("C_CURRENT_HDEMO_SK"))
df.show()
The o/p
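
Beyond displaying results, a dataframe can also be pulled into the client as a pandas DataFrame or written back to Snowflake as a table. The sketch below assumes the df from Operation 5; the target table name is illustrative:

# Bring the (small) result set back to the client as a pandas DataFrame.
pandas_df = df.to_pandas()
print(pandas_df.head())

# Or persist the result as a new table in Snowflake.
df.write.mode("overwrite").save_as_table("demo_cust_subset")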

Snowpark code in Python Worksheets:

This is a recently launched feature of Snowflake. As I write and publish this blog it is a ‘preview feature’. With this release we can now create Python worksheets within Snowflake using Snowsight.

Some basic steps on how to create these worksheets and use them:

Step 1: Create a Python worksheet.

Python worksheets are created using Snowsight

Step 2: Write the Snowpark code.

Snowpark code within Python worksheets

As you can see, the Anaconda packages are already included, so we can start writing code as soon as we create these worksheets. There is also a provision to add more libraries to the Python worksheets.
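
For reference, a Python worksheet executes a handler function that receives a Snowpark session. The sketch below follows the shape of the default template that Snowsight provides (the table name and filter are illustrative, and the exact template may vary by release):

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col

def main(session: snowpark.Session):
    # Snowsight passes an already-authenticated session to this handler.
    df = session.table("demo_db.demo_schema.demo_cust_tbl").filter(col("C_CUSTOMER_SK") > 100)

    # Print a sample of the dataframe to the worksheet output.
    df.show()

    # The returned dataframe is displayed in the Results tab.
    return df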

Step 3: Deploying the Snowpark code as a stored procedure.

Above is how we can create a “Stored Procedure” out of the Python code with just one click. Snowflake automatically wraps the Python code in a stored procedure and creates a new SQL worksheet, which can then be executed.

More reads:

https://docs.snowflake.com/en/developer-guide/snowpark/python/setup

THANK YOU !!

Please keep reading my blogs; it only encourages me to post more such content. You can find me on LinkedIn here and on Medium here. Happy learning :)

Awarded “Data Superhero” by Snowflake for the year 2023; click here for more details.

Disclaimer: The views expressed here are mine alone and do not necessarily reflect the view of my current, former, or future employers.
