Snowpark on Wheels (or the story about using .whl files in Snowpark)

Mauricio Rojas
6 min read · May 27, 2024


This blog post is about how to leverage .whl files with Snowpark.

First some background…

Snowpark procedures execute in a secure sandbox environment to guarantee the integrity of your data.

To extend the functionality of your Python code, you'll often need to install additional packages. Python offers a vast repository, PyPI, for this purpose. However, it's important to note that PyPI alone provides no guarantees about the quality or reliability of those packages.

That is why Snowflake partnered with Anaconda to give you a curated list of packages.

What happens if I need code that is not in Anaconda?

In that case you have several options.

Snowpark allows you to add import references to code on a stage.

For example, you can point directly to script files on a stage, or in a Git repository using Git integration.

Diagram from snowflake docs showing Git repository exchanging files with development tools and Snowflake.

You can also use the Snow CLI to repackage your code as a zip bundle and take advantage of Python's zipimport mechanism, which enables you to add zip files directly to your sys.path. The Snow CLI facilitates this process, and you can find an informative post detailing this approach.
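As a quick illustration of how zipimport works (this builds a toy zip bundle; the file and module names are hypothetical, not from the Snow CLI):

```python
import sys
import zipfile

# Build a toy zip bundle containing a single module
with zipfile.ZipFile("bundle.zip", "w") as z:
    z.writestr("toy_module.py", "def greet():\n    return 'hello from the zip'\n")

# Adding the .zip file itself to sys.path makes its modules importable
sys.path.insert(0, "bundle.zip")
import toy_module
print(toy_module.greet())  # → hello from the zip
```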

However ZipImport has limitations.

Zip Imports Limitations

When you use ZIP files to distribute Python code, you need to consider a few limitations of Zip imports:

  • Loading dynamic files, such as .pyd, .dll, and .so, isn’t possible.
  • Using __path__, __file__ or __package__ to build relative paths might not work.
  • Some relative imports in your package may also fail.

Modules or packages sometimes need to be distributed with non-code data such as images, configuration files, and default data. Typically, the __file__ attribute of the module is used to locate these files relative to the installed code.

If the module runs code like:

import os
import example_package

data_filename = os.path.join(os.path.dirname(example_package.__file__),
                             'README.txt')

The __file__ of the package refers to the ZIP archive, not a directory, so we cannot simply build up the path to the README.txt file.

Instead, when dealing with ZIP archives, access the file using the get_data() method of the module's loader:

import sys
sys.path.insert(0, 'zipimport_example.zip')
import os
import example_package
print(example_package.__file__)
print(example_package.__loader__.get_data('example_package/README.txt'))

This way it is possible to read data embedded within a ZIP archive, such as configuration files or images.

Note that the __loader__ attribute is only set for modules loaded via zipimport, making this approach specific to such scenarios.

You might need to add conditional logic to use the file locally or from a zip import.

import os
import sys

def load_data_from_package(package, filename):
    """
    Load data from a given package, handling both filesystem and ZIP imports.

    :param package: Module object of the package.
    :param filename: Name of the file to load data from.
    :return: Data from the file.
    """
    try:
        # Attempt to construct the file path directly
        data_path = os.path.join(os.path.dirname(package.__file__), filename)
        if os.path.exists(data_path):
            with open(data_path, 'rt') as file:
                return file.read()
        else:
            # Path does not exist, likely a zip import
            raise FileNotFoundError
    except (AttributeError, FileNotFoundError):
        # __file__ is not set or the path doesn't exist, use the loader.
        # get_data() expects a path relative to the root of the ZIP archive.
        if hasattr(package, '__loader__'):
            archive_path = package.__name__ + '/' + filename
            return package.__loader__.get_data(archive_path).decode('utf-8')
        else:
            raise Exception(f"Failed to load {filename} from {package.__name__}")

# Example usage
import example_package
data = load_data_from_package(example_package, 'README.txt')
print(data)

In those scenarios you might still consider using a Python wheel.

How can I use Python wheels 🐍 🛞🛞?

Let's return to the main topic of the article.

Yes, many people use wheels as part of their development cycle, for example in their CI/CD pipeline:

# GitHub Actions workflow to build
# and push wheel files
on:
  push:
    branches:
      - main
      - master
jobs:
  build_wheel:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.9
      - name: Build wheel and install
        run: |
          python -m pip install --user --upgrade build
          python -m build
          #pip install .
          find ./dist/*.whl | xargs pip install
          python simple_test.py
      - name: Configure Git
        run: |
          git config --global user.email "someone@email.com"
          git config --global user.name "someone"
      - name: Commit and push wheel
        run: |
          git add -f ./dist/*.whl
          git commit -m 'pushing new wheel'
          git push

And wheels can be used* inside a Snowpark procedure.

In general a wheel is just a zip with your code. The steps you need are:

  • open the wheel file
  • uncompress it into a folder
  • add that folder to your sys.path

These simple steps are already coded in wheel_loader.py, which you can check on GitHub.
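The steps above can be sketched in a few lines. This is a simplified stand-in for the real wheel_loader.py, not its actual code, and `add_wheel` is a hypothetical name:

```python
import sys
import tempfile
import zipfile

def add_wheel(wheel_path):
    """Extract a .whl (which is just a zip archive) and make it importable."""
    target = tempfile.mkdtemp()
    # Open the wheel file and uncompress it into a folder
    with zipfile.ZipFile(wheel_path) as whl:
        whl.extractall(target)
    # Add that folder to sys.path so its packages can be imported
    sys.path.insert(0, target)
```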

An Example using wheels

Let's create an example. I will use the pydub library, which can help extract audio from a video. The ffmpeg library is available in Snowpark Anaconda, but pydub is not yet there.

To create the sample, first run snow snowpark init wheels_sample
This will initialize a Snowpark project with some boilerplate code. In your project folder you will find some Python files under the app folder. We only need the procedures.py file. Let's replace its contents with something like this:

import wheel_loader
wheel_loader.add_wheels()

from snowflake.snowpark.files import SnowflakeFile
from snowflake.snowpark import Session
from pathlib import Path
import os

def extract_audio_from_video(session: Session, filename: str):
    from pydub import AudioSegment
    stage_location = "@mystage"
    only_filename, ext = os.path.splitext(os.path.basename(filename))
    with SnowflakeFile.open(filename, "rb", require_scoped_url=False) as f:
        audio = AudioSegment.from_file(f, format="mp4", codec="aac")
        audio = audio.set_channels(1).set_frame_rate(16000).set_sample_width(2)
        audio.export(f"/tmp/{only_filename}.wav", format="wav")
        session.file.put(f"/tmp/{only_filename}.wav", stage_location, overwrite=True, auto_compress=False)
    return "@mystage/%s.wav" % only_filename


if __name__ == "__main__":
    from unittest.mock import patch
    # For testing locally we mock SnowflakeFile to return a handle to a local file
    with Session.builder.getOrCreate() as session:
        with patch("snowflake.snowpark.files.SnowflakeFile.open", return_value=open(os.path.expanduser("~/Downloads/big_buck_bunny_720p_1mb.mp4"), "rb")):
            print(extract_audio_from_video(session, "@mystage/big_buck_bunny_720p_1mb.mp4"))  # type: ignore

Notice the two lines at the top. That is all the code you need to add to use those .whl files.

I will use pip to download the .whl file to a local folder.

pip download pydub -d packages_folder

Please note that pip might download not only the .whl you need but also its dependencies, some of which might already be in Anaconda. You can use a simple script to determine which dependencies need a .whl:

#!/bin/bash
# Define the folder containing your files
FOLDER="packages_folder"
# Change to the folder
cd "$FOLDER" || exit
# Iterate over each file in the folder
for file in *; do
  # Extract the package name (the part before the first '-')
  package_name=$(echo "$file" | sed 's/-.*//')
  # Run the snow snowpark package lookup command
  snow snowpark package lookup "$package_name"
done

In this case there is only one file:

folder with the .whl file

Next we upload the package to a stage:

And adjust the snowflake.yml and requirements.txt files:

The next steps are easy: build your project, deploy it, and run it. With the Snow CLI these are typically commands along the lines of:

snow snowpark build

snow snowpark deploy

snow snowpark execute procedure ...
And voilà, you have your code running and using a library from a wheel!

NOTE: Adding a .whl does not enable binary dependencies; those are still restricted. However, if you have libraries you already developed as pure-Python .whl files, you might consider this approach.
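A quick way to tell whether a wheel is pure Python is its filename: wheel names follow PEP 427 (name-version[-build]-pythontag-abitag-platformtag.whl), and a pure-Python wheel carries the abi tag "none" and the platform tag "any". A small sketch of such a check (the function name is my own, not from wheel_loader.py):

```python
def is_pure_python_wheel(wheel_filename):
    """Return True if a wheel filename has the pure-Python tags none-any (PEP 427)."""
    stem = wheel_filename[:-len(".whl")]
    parts = stem.split("-")
    abi_tag, platform_tag = parts[-2], parts[-1]
    return abi_tag == "none" and platform_tag == "any"

print(is_pure_python_wheel("pydub-0.25.1-py2.py3-none-any.whl"))                  # → True
print(is_pure_python_wheel("numpy-1.26.4-cp39-cp39-manylinux_2_17_x86_64.whl"))  # → False
```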

Summary

Thank you for reading this post. I hope the explanations provided here offer a clearer understanding of the options available for leveraging additional libraries in Snowpark, and how the wheel_loader.py script can assist you. While this approach doesn’t support the use of binary dependencies, it does overcome some of the limitations associated with zip imports. I plan to illustrate more applications of this technique in future posts, and I hope you found this one both informative and entertaining.

If you have any questions, please post them in the comments below or contact me on LinkedIn.


Mauricio Rojas

Engineer with a passion for software design & AI. Expert in Java, C++, & big data. MSc in Computer Science. Currently exploring SnowPark.