Getting Started with Snowpark using Docker
This article describes how to easily set up your Snowpark environment using Docker and Jupyter notebooks. We also provide guidelines for when Snowpark may be especially useful for your organization.
Create a Docker image that contains both Snowpark and Jupyter notebooks
We will show you a straightforward way to start developing solutions in Snowpark. First, create a Docker image that contains both Snowpark and Jupyter notebooks. A simplified Dockerfile might look like this:
FROM jupyter/minimal-notebook:latest
RUN conda create --name snowpark --override-channels -c https://repo.anaconda.com/pkgs/snowflake -y python=3.8 \
    snowflake-snowpark-python \
    numpy \
    pandas
SHELL ["conda", "run", "-n", "snowpark", "/bin/bash", "-c"]
RUN pip install jupyter ipython && \
    ipython kernel install --name "snowpark" --user
We are going to use jupyter/minimal-notebook as the base image. Then we create an Anaconda environment dedicated to Snowpark using the channel provided by Snowflake. Remember to use Python 3.8, which is required by Snowpark. Snowflake's Anaconda channel guarantees that all imported external dependencies (e.g. numpy, pandas) are available in the same versions as in Snowflake's execution environments. The complete list of packages can be checked in the Anaconda repository (https://repo.anaconda.com/pkgs/snowflake/) or by running the following query in Snowflake:
select * from information_schema.packages where language = 'python';
After creating the Anaconda environment, we activate it, install the packages required by Jupyter, and finally register the environment as a Jupyter kernel. You can build the Docker image by running the standard command from the directory containing the Dockerfile:
docker build -t snowpark-notebook:latest .
Then, you can run the built image:
docker run -it -p 10000:8888 snowpark-notebook:latest
The Jupyter notebook will be available on port 10000 on the machine running the Docker image. You can open it in a local browser, for example:
http://localhost:10000/lab?token={token_written_in_the_console}
In the kernel list you should see the Snowpark kernel.
Now you are ready to connect to Snowflake and begin to explore the possibilities it has to offer:
from snowflake.snowpark import Session

# Fill in your Snowflake account identifier and credentials
connection_parameters = {
    "account": "",
    "user": "",
    "password": ""
}

sf_session = Session.builder.configs(connection_parameters).create()
sf_session.sql('SELECT current_version()').collect()
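Once the session is established, you can also try out the DataFrame API. Below is a minimal sketch; the table and column names (MY_DB.MY_SCHEMA.ORDERS, AMOUNT, etc.) are placeholders, not objects that exist in your account:

from snowflake.snowpark.functions import col

# Placeholder table name: point this at a table you own
orders = sf_session.table("MY_DB.MY_SCHEMA.ORDERS")

# Transformations are lazy; they are translated to SQL and executed in Snowflake
big_orders = (
    orders
    .filter(col("AMOUNT") > 100)
    .select("ORDER_ID", "CUSTOMER_ID", "AMOUNT")
)

# Only now is the query run, and a small sample is pulled back to the notebook
big_orders.show(10)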
Last but not least, we would like to point out when Snowpark can be especially useful for you: scenarios in which a piece of the processing logic can be implemented as user-defined functions (UDFs) that Snowpark deploys in Snowflake. Imagine you have the following pipeline:
1. Query data from Snowflake
2. Collect results in a Python application
3. Apply processing logic (e.g. ML preparation, scoring)
4. Transfer the output to Snowflake
This process requires you to transfer the entire data set twice over the Internet, and the Python application can become a bottleneck when it runs on a single machine. If you could move all or part of the processing into UDFs, the pipeline would run entirely in Snowflake, and the processing logic would scale with the Snowflake execution environment instead of being limited by a single client machine.
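To illustrate, here is a minimal sketch of how such a step could be moved into a Snowpark UDF. The function name, table, column names, and scoring logic are placeholders for your own processing:

from snowflake.snowpark.functions import col
from snowflake.snowpark.types import FloatType

# Placeholder scoring logic; in practice this could wrap your ML preparation or model
def score(amount: float) -> float:
    import numpy as np
    return float(np.log1p(amount))

# Register the function as a temporary UDF; numpy is resolved from Snowflake's Anaconda channel
score_udf = sf_session.udf.register(
    score,
    name="score_udf",
    return_type=FloatType(),
    input_types=[FloatType()],
    packages=["numpy"],
    replace=True,
)

# The scoring now runs inside Snowflake, next to the data
orders = sf_session.table("MY_DB.MY_SCHEMA.ORDERS")
orders.with_column("SCORE", score_udf(col("AMOUNT"))).show(10)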
Check out the official Snowpark Documentation to learn more about how to make this happen:
https://docs.snowflake.com/en/developer-guide/snowpark/python/setup.html