How Does It Work? Python Package Versions, Snowpark and How to Avoid Dependency Purgatory

Photo by Brett Jordan on Unsplash

TLDR: Be specific with your package and Python versions whenever you’re running Python in Snowpark

Specifying Python package versions matters. Uncontroversial statement? Sure, but despite good backward compatibility in most of the Python ecosystem, it’s worth stating. Whether it’s new functionality, a changed API, or different performance from harnessing a fast C++ implementation under the hood, every change could result in failed or (worse yet) different outputs.

Fortunately, most of the time a version issue occurs, you will be told about it in stark terms… Unfortunately, stark doesn’t always mean easy to interpret. You might get an error about a parameter you’ve never had to set before or a parameter that doesn’t even exist (but that you’re certain does*). This isn’t quite dependency hell, but it feels a bit like it sometimes. Let’s call it “dependency purgatory”.

This guide aims to help you avoid these issues. By delving into the inner workings of Snowpark’s package management, I’ll provide you with a clear path to error-free Python Stored Procedures (SProcs), User Defined Functions (UDFs), and Model Registrations. With Snowpark, you can navigate through ‘dependency purgatory’ with ease.

* at least in the version you think you’re running…

A Brief Primer

When we talk about Snowpark, there are two different but connected components:

  1. Client Side: where you author all your instructions for Snowflake, often a local environment where you are writing Python
  2. Server Side: where those instructions get executed inside Snowflake
(Diagram: Snowpark’s Client and Server Side, at the highest level of abstraction)

I describe this pair as a puppeteer act, where you pull the strings on the Client Side, but the actual movement or compute happens on the Server Side.
Go a level deeper, and we can see a little more going on. For example, if you write Snowpark DataFrame operations, they will feel a lot like local Python (and Spark more generally, given the near-identical syntax), but under the hood it’s all getting converted to SQL. You can even see this in your Snowflake GUI, where the translated instruction is observable as SQL.
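You can inspect that translation yourself from the Client Side. A minimal sketch, assuming an existing Snowpark session object and a made-up table name:

from snowflake.snowpark.functions import col

df = session.table("MY_DB.MY_SCHEMA.ORDERS").filter(col("AMOUNT") > 100)  # hypothetical table

# nothing has executed yet; this just prints the SQL Snowpark generated for the Server Side
print(df.queries["queries"])
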
On the other hand, if you are writing SProcs or UDFs, the code gets pushed down into a Python Secure Sandbox.

Note there is a bit of overlap in these concepts, as a DataFrame might be manipulated with a UDF.

Snowpark does this by:

  1. Inferring the Python version from your Client Side environment
  2. Serialising the code locally using cloudpickle on the Client Side
  3. Sending the serialised code to the Server Side for execution

What exactly is serialised depends on how you author the SProc or UDF, and it’s the “what” that presents the potential problems.
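To make that concrete, here is a rough Client Side sketch of what steps 1 and 2 amount to. It is illustrative only, not Snowpark’s actual internals:

import sys
import cloudpickle  # the serialiser Snowpark uses under the hood

def add_one(x: int) -> int:
    return x + 1

payload = cloudpickle.dumps(add_one)  # roughly what gets shipped to the Server Side
print(sys.version_info[:3])           # the Python version Snowpark infers from your client
print(f"{len(payload)} bytes serialised")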

What Can Go Wrong?

Simply put — version mismatches between the environment the code is designed to run in (inferred from the Client Side) and the environment the code will actually run in (the Server Side). Often, nothing terrible happens, though you might encounter a warning like this:

A typical warning from Snowpark. Given it’s numpy, and most of the vanilla stuff doesn’t change often, you’ll probably get away with ignoring the warning. Probably.

Generally speaking, the occasional pink warning in a Jupyter Notebook isn’t enough to put me off, especially if the code still ran. 99% of the time, the code does what it’s supposed to do despite this warning. This breeds indifference, even complacency…

Unfortunately, sometimes this mismatch causes a breaking issue. The root cause will be some change in the package that impacts the code execution. An example I see pretty often is Random Forests in Scikit-Learn, which have had material changes over the lifetime of the package. For example, a model pickled with scikit-learn 1.2.2 in your Client Side environment and unpickled by a newer release on the Server Side will get you this error:

snowflake.snowpark.exceptions.SnowparkSQLException: (1304): 01b30194-0000-d2e9-0000-f149006ed352: 100357 (P0000): Python Interpreter Error:
ValueError: node array from the pickle has an incompatible dtype:
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
- got : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')] in function CreateModule-b166ece4-cf07-4487-9be0-9be9edc77cc5 with handler predict.infer

because “missing_go_to_left” isn’t in Scikit-Learn 1.2.2.
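A simple guardrail (a sketch, not the only fix): derive the pin from whatever version you actually trained with on the Client Side, and pass that string in the packages list when you register the SProc, or in conda_dependencies when you log the model:

import sklearn

# mirror the Client Side training environment when registering on the Server Side
pinned = f"scikit-learn=={sklearn.__version__}"
print(pinned)  # e.g. 'scikit-learn==1.2.2'; pass this in packages=[...] or conda_dependencies=[...]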

How To Avoid Purgatory

There are five main scenarios to consider, illustrated with SProcs (but UDFs would be the same). I’ve crudely ranked them here from best to worst:

1. Register Version Specified + Import In SProc

Top of the class! This is the best thing you can do, and for extra credit, make sure you set the version to the same as that in the Client Side environment (in this case, make sure your Client Side environment also has numpy 1.26.4).

from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc
import snowflake.snowpark.types as T

# assumes an existing Snowpark Session object called `session`
@sproc(name='my_sproc',
       packages=['snowflake-snowpark-python==1.13.0', 'numpy==1.26.4'],  # versions pinned explicitly
       is_permanent=True,
       replace=True,
       stage_location='@PIPELINE',
       session=session)
def my_sproc(session: Session) -> T.Variant:
    import numpy  # resolved from the pinned Server Side package
    result = numpy.mean([10, 11, 12])
    return str(result)

my_sproc()

2. Register Version From Local + Import In SProc

Pretty good. Snowpark will serialise the local package as part of the register process, so you won’t have any mismatch issues. Note we’re not specifying the string ‘numpy’ (which would point to the Snowflake Anaconda channel); we’re passing the numpy module itself, i.e. the local library in your Client Side environment.

import numpy  # numpy from the local environment

@sproc(name='my_sproc',
       packages=['snowflake-snowpark-python', numpy],  # refer to the module numpy, not the string 'numpy'
       is_permanent=True,
       replace=True,
       stage_location='@PIPELINE',
       session=session)
def my_sproc(session: Session) -> T.Variant:
    result = numpy.mean([10, 11, 12])  # refers to the local, serialised version
    return str(result)

my_sproc()

Note I wouldn’t recommend this if you can do 1. It is specific, but it adds unnecessary points of failure along the way (since you can just pin the version and import numpy inside the SProc).

3. Register Version Unspecified + Import In SProc

Still pretty good. Snowpark will assume you want the latest version available (extra credit as before: make sure your Client Side environment matches that version). The downside is that because you’re letting Snowpark choose the latest version, something could go wrong when that latest version changes. SProcs themselves are immutable, but you might deploy a SProc again as part of a pipeline, thereby replacing it and silently picking up the new latest version (you can check what was actually resolved, as shown after the code below).

@sproc(name='my_sproc',
       packages=['snowflake-snowpark-python', 'numpy'],  # no version pinned: Snowflake resolves the latest available
       is_permanent=True,
       replace=True,
       stage_location='@PIPELINE',
       session=session)
def my_sproc(session: Session) -> T.Variant:
    import numpy
    result = numpy.mean([10, 11, 12])
    return str(result)

my_sproc()
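A quick check from the Client Side of what was actually resolved; for Python procedures the DESCRIBE output lists the resolved packages, though the exact property names can vary between Snowflake releases:

# inspect the registered SProc, including the package versions Snowflake settled on
for row in session.sql("DESCRIBE PROCEDURE my_sproc()").collect():
    print(row)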

4. Register Version Unspecified + Import Outside of SProc

Uh-oh, you’re at risk here. You’re coding as if you want to locally serialise the package, like in 2., but you’re telling Snowflake to use something different in the Secure Sandbox. If your local version and the most up-to-date version in the Snowpark Anaconda Channel happen to be the same you’re OK, but you’re playing with fire.

import numpy  # imported on the Client Side; the SProc body references this name

@sproc(name='my_sproc',
       packages=['snowflake-snowpark-python', 'numpy'],  # but the Sandbox installs whatever the channel's latest is
       is_permanent=True,
       replace=True,
       stage_location='@PIPELINE',
       session=session)
def my_sproc(session: Session) -> T.Variant:
    result = numpy.mean([10, 11, 12])
    return str(result)

my_sproc()

5. Register Version Specified + Import Outside of SProc

Just as bad as 4., the only difference being that you’re specifying a version.

import numpy  # imported on the Client Side; the SProc body references this name

@sproc(name='my_sproc',
       packages=['snowflake-snowpark-python', 'numpy==1.26.4'],  # pinned, but possibly not what your local numpy is
       is_permanent=True,
       replace=True,
       stage_location='@PIPELINE',
       session=session)
def my_sproc(session: Session) -> T.Variant:
    result = numpy.mean([10, 11, 12])
    return str(result)

my_sproc()

Special Case: The Model Registry

The Model Registry is a special case of the above. As part of the model build process, the Snowpark ML package will create a base model from the Client Side version. The versions don’t usually matter for models that haven’t changed much in the last few years, like Logistic Regression (that’s not an excuse not to specify). However, packages like xgboost and sklearn’s random forest implementation have changed over their lifetime, and a version mismatch may cause a serious problem.

# reg is a snowflake.ml Registry, clf is the fitted model, train_features is sample training data
mv = reg.log_model(
    clf,
    model_name="my_model",
    version_name="v1",
    conda_dependencies=["scikit-learn==1.3.0"],  # make sure this version is also on your local machine
    comment="My awesome ML model",
    metrics={"score": 96},
    sample_input_data=train_features,
)
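Two checks I would bolt on, sketched under assumptions: mv and train_features come from the snippet above, and run’s exact signature depends on your snowflake-ml-python version:

import sklearn

# the pin passed to conda_dependencies should match what the model was trained with locally
assert sklearn.__version__ == "1.3.0", f"local scikit-learn is {sklearn.__version__}"

# inference then runs Server Side against that pinned environment
predictions = mv.run(train_features, function_name="predict")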

Bonus Tips

Build Your Environment With the Snowpark Channel

Since it all starts with your Client Side environment, build it with versions that you know are available. You can check the PACKAGES view in the INFORMATION_SCHEMA of any Snowflake database (which will tell you the supported Python and package version combos), or do a quick check against Snowpark’s Anaconda Channel if you’re in a hurry and don’t need the details.
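From the Client Side, that check can be as simple as querying the PACKAGES view; the column names below follow the current INFORMATION_SCHEMA.PACKAGES view, so adjust if yours differs:

# which numpy builds does the Snowflake channel offer, and for which Python runtimes?
rows = session.sql("""
    SELECT PACKAGE_NAME, VERSION, RUNTIME_VERSION
    FROM INFORMATION_SCHEMA.PACKAGES
    WHERE LANGUAGE = 'python' AND PACKAGE_NAME = 'numpy'
    ORDER BY VERSION DESC
""").collect()
for row in rows:
    print(row)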

This build, via conda, will look something like this:

>> conda create --name some-demo-env -c https://repo.anaconda.com/pkgs/snowflake python=3.11
>> conda activate some-demo-env
>> conda install -c https://repo.anaconda.com/pkgs/snowflake snowflake-snowpark-python==1.12.0 pandas==2.1.4 snowflake-ml-python==1.2.0

Server Side Serialisation

Serialisation happens on the Server Side only when the same function is wrapped in an SQL statement. In that case you can also specify the Python version explicitly, but it comes at the cost of a clunkier UX and more headaches when running the code locally as a test.

CREATE OR REPLACE PROCEDURE my_sproc()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
PACKAGES = ('snowflake-snowpark-python','numpy')
HANDLER = 'my_sproc'
AS
$$
import numpy

# the signature and return type are declared in the SQL above, so no Snowpark type hints are needed
def my_sproc(session):
    result = numpy.mean([10, 11, 12])
    return str(result)
$$

Or in Python (which you could run Client Side)

session.sql('''CREATE OR REPLACE PROCEDURE my_sproc()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
PACKAGES = ('snowflake-snowpark-python','numpy')
HANDLER = 'my_sproc'
AS
$$
import numpy

def my_sproc(session):
    result = numpy.mean([10, 11, 12])
    return str(result)
$$''').collect()

Note: Snowsight worksheets and Notebooks (coming soon) will also count as Server Side.

Doing it in dbt

If you’re using dbt, you will be working in a sort of hybrid between the SQL and Python scenarios. Specifically, dbt’s design philosophy avoids code running locally, so you can specify the Python Version. It’ll look something like this:

def model(dbt, session):
    dbt.config(
        materialized="table",
        python_version="3.11"
    )
    # build and return a DataFrame as usual; "my_upstream_model" is a hypothetical ref
    df = dbt.ref("my_upstream_model")
    return df

More details here, a nod to Ernesto Ongaro for his help on this (and his swift updates on the docs!).

Requesting a Package

If you don’t want to import something directly (another blog post will follow shortly on this), you can also request the addition of new packages. Simply go to the Snowflake Ideas page in the Snowflake Community. Select the Python Packages & Libraries category and check if someone has already submitted a request. If so, vote on it. Otherwise, click New Idea and submit your suggestion.

Coming Soon

As with any blog post, this is a snapshot in time. Some of this will be old news with the advent of the custom package usage session parameter, which will enable direct pip imports. For now, this is an experimental feature, but it will make some of this headache disappear.

Thanks for reading!

Thanks to Mats Stellwall, Michael Gorkow, Chase Ginther, Sri Chantala and Yijun Xie for their inputs.
