Snowpark Python Advanced: How to navigate your project's environment dependencies and “import” your existing Python libraries into Python UDFs.

TLDR:

  1. Move non-essential library imports into the function itself to reduce the project's package dependency list.
  2. Clear out __init__.py files that propagate specific Python objects up the module (e.g., from mymodule.file import func, cls) at all necessary folder levels.

DISCLAIMERS:

(1) This article assumes that you have been working with Snowflake's Snowpark for Python feature ecosystem (API, UDFs, Stored Procedures, etc.) and have done so while adding third-party packages provided by our embedded environment management partner, Anaconda.

(2) I am an active contributor to the open-source project feast. To give readers the feel of working in an actual decent-sized codebase, the snippets I show will relate to this project … I'm also recovering a little bit from the pain of having to do this discovery on my own, so hopefully I can save you the trouble.

BACKGROUND:

The reality is, most of the Python code you care about today lives within a version control system (like GitHub). And chances are, your production Python code is organized in Python's standard structure, which allows that code to be packaged up into a .whl file for easy distribution (publicly, via pip install).

Typically, you have a requirements.txt file. In this case, requirements is a folder that contains a .txt file for each supported version of Python (3.7, 3.8, 3.9, etc.).

Here we can see the essentials: setup.py, requirements, README.md, and feast. The project where I am adding Snowpark Python support is called feast, so the folder feast contains all of the source code. If you wanted to try out the package and test new code changes simultaneously, all you would need to run is:

pip install -e .

and setup.py would install the right requirements.txt for your Python version and you would be good to go.
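As a minimal sketch of how that can work (the per-version requirements file naming here is an assumption, not feast's exact layout):

# setup.py (illustrative sketch)
import sys
from pathlib import Path
from setuptools import setup, find_packages

# e.g., requirements/py3.8-requirements.txt; this naming convention is hypothetical
py_version = f"py{sys.version_info.major}.{sys.version_info.minor}"
install_requires = Path(f"requirements/{py_version}-requirements.txt").read_text().splitlines()

setup(
    name="feast",
    packages=find_packages(),
    install_requires=install_requires,
)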

Look at a few of those delicious dependencies.

PROBLEM:

When users are working with Python, the majority of the time there is little worry about security. But as easy as it is to pip install a package, it is just as easy to install a malicious one. Snowflake unfortunately can't take that risk and must restrict access to a finite list of Python packages that have been deemed safe through the joint partnership with Anaconda.

The result of this is the very real chance that you will always be missing “that one” dependency needed to fit your Python project into Snowflake Python UDFs.

This article is here to help you bypass dependency hell as best we can.

LET'S DIG IN:

(1) Confirm your code's package dependencies and adjust imports where possible.

This is a Python Batch UDF that I want to use.

Within the feast project that I am looking to contribute to, I want to give feast the ability to run this function, which converts a Snowflake binary data type to a Google protobuf value, in a Snowflake UDF instead of pulling the data down to my local laptop and running the function there.
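As a rough sketch, the function looks something like this (the handler name and body are illustrative; the imports are the ones we will trace through the rest of this article):

from _snowflake import vectorized  # Snowflake's batch (vectorized) UDF decorator
import pandas

from feast.protos.feast.types.Value_pb2 import Value as ValueProto
from feast.type_map import python_values_to_proto_values
from feast.value_type import ValueType

@vectorized(input=pandas.DataFrame)
def feast_to_proto(df):  # hypothetical handler name
    # Convert the batch of Snowflake BINARY values into protobuf Values,
    # then serialize each one back to bytes for Snowflake to return
    proto_values = python_values_to_proto_values(df[0].to_list(), ValueType.BYTES)
    return pandas.Series([ValueProto.SerializeToString(v) for v in proto_values])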

We can see this function relies on the following dependencies and functions (starting from top to bottom):

  1. @vectorized()
  2. pandas.DataFrame
  3. ValueProto.SerializeToString
  4. python_values_to_proto_values()
  5. ValueType.BYTES

For the first two, the good news is that @vectorized() is native to Snowflake, as this is how it turns regular scalar Python UDFs into batch UDFs, and pandas is one of the packages available via the Snowflake Anaconda channel.

The remaining three come from other Python files within the feast project.

SIDE NOTE: The ideal best practice here, as you look to move your code into Snowflake, is to just .zip up the entire project. What we don't want is for you to create a “second” side-cart project that has been slimmed down to make this work in Snowflake. No one likes having to replicate code.

Remember, our terminal's current path is set so that our feast folder is discoverable.

Narrowing in, we have dependencies on code coming from different parts of our project.

Let's look at where ValueType comes from.
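Abridged from feast's value_type.py (the full member list is much longer), it looks something like this:

# feast/value_type.py (abridged)
import enum
from typing import Union

from feast.protos.feast.types.Value_pb2 import ValueType as ValueTypeProto

class ValueType(enum.Enum):
    """Feature value type, used to define feature data types."""
    UNKNOWN = 0
    BYTES = 1
    STRING = 2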

It's a custom class with more dependencies. See why this can spiral out of hand pretty quickly?

Because we are looking to import ValueType into our Python UDF from this class-creation file, there is nothing we can do other than verify that these packages are also available inside of Snowflake. The packages enum and typing ship natively with vanilla Python, so we are in the clear. But again, there is another import from another feast-specific code path.

The import “from feast.protos.feast.types.Value_pb2” has now occurred twice, once in our main Python UDF file and again through a helper file. Let's look:

We can see we are finally rid of imports from other parts of feast. To our delight, Snowflake UDFs allow users to import the protobuf package.

Looking back at our initial imports, two of the three will work for us.

For from feast.type_map import python_values_to_proto_values, we need to go look at that specific file.
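Abridged, the top of that file reads roughly like this (the neighboring imports are illustrative; pyarrow is the one that matters):

# feast/type_map.py (abridged)
from typing import Any, Dict, List

import numpy as np
import pandas as pd
import pyarrow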

Looking at the imports, we finally see an unsupported library for Snowflake Python UDFs: pyarrow.

The main question to answer is: does the function we need to import, python_values_to_proto_values, actually require pyarrow?

  1. Yes: hate to break it to you, but you will not be able to reuse this function as it stands; look to create a new implementation without the unsupported package.
  2. No: move the package import statement into the function(s) itself, where it makes sense.

In our example, pyarrow is not used in python_values_to_proto_values; it is only used in another function within the same type_map.py called feast_value_type_to_pa. This is a spot where it makes sense to move the pyarrow import into the function itself. No harm caused.
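A before-and-after sketch of that change (function signatures abridged):

# Before: a module-level import forces pyarrow on everything that touches type_map.py
import pyarrow

def feast_value_type_to_pa(value_type):
    ...

# After: pyarrow is imported only when this specific function is actually called
def feast_value_type_to_pa(value_type):
    import pyarrow
    ...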

(2) Clear out __init__.py files that propagate specific Python objects up the module when pushing the code to Snowflake.

Going back to our original new Python file:

There is one other potential thing to trip us up.

When importing from other Python files within the repo, as we just inspected, each of the corresponding __init__.py files per module is executed. This is a core principle of Python, and what makes a collection of .py files an actual Python module.
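A quick way to see this behavior for yourself, with a made-up package layout:

# pkg/__init__.py
print("pkg/__init__.py executed")

# pkg/sub/__init__.py
print("pkg/sub/__init__.py executed")

# main.py: importing from the leaf module runs both __init__.py files first
from pkg.sub.helpers import helper_func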

See this link for a deep dive into __init__.py

Here is a snippet of the __init__.py file that sits at feast/__init__.py (the topmost directory of the project).
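Abridged to the kind of line we care about, it looks something like this:

# feast/__init__.py (abridged; one of many similar re-export lines)
from feast.offline_stores.snowflake_source import SnowflakeSource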

This allows someone to do “from feast import SnowflakeSource” as opposed to “from feast.offline_stores.snowflake_source import SnowflakeSource”

For example, when we run from feast.infra.key_encoding_utils import serialize_entity_key in our Snowflake UDF file, the __init__.py files at both feast/__init__.py and feast/infra/__init__.py are triggered.

Because of this, we now have to verify that all of the dependencies pulled in by the object imports in our __init__.py files also exist in our Snowflake packages list. This is where things get out of hand.

So what are my options now if these types of imports exist in my __init__.py files?

Blank them out … don't delete the files; rather, just have empty __init__.py files and rely on absolute import statements in your Snowpark Python UDF file. This resolves the issue when you go to make imports in your Snowflake UDF Python file.

It's blank.
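With the __init__.py files blanked, the imports in the Snowpark Python UDF file stay fully qualified, for example:

# Absolute imports in the UDF file; nothing is pulled in via __init__.py re-exports
from feast.type_map import python_values_to_proto_values
from feast.value_type import ValueType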

Over time, as Snowflake broadens its Python package ecosystem, these two issues highlighted when doing Snowpark Python development will become less and less of a problem. Snowpark Python, after all, is only two months old at this point.

While it's much easier to accomplish the first fix in your code by moving import statements into functions, the latter fix for __init__.py files does require some “deployment” work.

How I have solved this problem today is through GitHub Actions. On a commit to master or a PR merge, simply have a program that wipes out the proper __init__.py files before you go to upload and register your Snowpark Python UDFs. Check out this Medium post for more of those details.
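As a minimal sketch of that cleanup step (the script name and the blanket treatment of every __init__.py under feast/ are my assumptions, not details from the linked post):

# blank_init_files.py (hypothetical CI helper, run before zipping and uploading the UDF code)
from pathlib import Path

for init_file in Path("feast").rglob("__init__.py"):
    init_file.write_text("")  # keep the file so the package stays importable, just empty it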
