Snowpark on Wheels (or the story about using .whl files in Snowpark)
This blog post is about how to leverage .whl files with Snowpark.
First some background…
Snowpark procedures execute in a secure sandbox environment to guarantee the integrity of your data.
To enhance the functionality of your Python code, you'll often need to install additional packages. Python offers a vast repository known as PyPI for this purpose. However, it's important to note that PyPI alone does not provide any guarantees regarding the quality or reliability of these packages.
That is why Snowflake partnered with Anaconda, ensuring you have a list of curated packages available.
What happens if I need code that is not in Anaconda?
In that case you have several options.
Snowpark allows you to add import references to code on a stage.
You can, for example, point directly to script files on a stage, or to files in a git repository with git integration.
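As a minimal sketch (the stage name mystage and the module my_module.py are assumptions for illustration), the Snowpark Session API lets you register staged code as an import:

```python
from snowflake.snowpark import Session

# Build a session from your local connection configuration
session = Session.builder.getOrCreate()

# Make a script that lives on a stage importable from the
# procedures and UDFs registered with this session
session.add_import("@mystage/my_module.py")
```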
You can also use the Snow CLI to repackage your code as a zip bundle and take advantage of Python's ZipImport mechanism, which enables you to add zip files directly to your sys.path. The Snow CLI facilitates this process, and you can find an informative post detailing this approach.
However ZipImport has limitations.
Zip Imports Limitations
When you use ZIP files to distribute Python code, you need to consider a few limitations of Zip imports:
- Loading dynamic files, such as .pyd, .dll, and .so, isn't possible.
- Using __path__, __file__, or __package__ to build relative paths might not work.
- Some relative imports in your package will also fail.
Modules or packages sometimes need to be distributed with non-code data such as images, configuration files, and default data. Typically, the __file__ attribute of the module is used to locate these files relative to the installed code.
If the module runs code like this:

import os
import example_package

data_filename = os.path.join(os.path.dirname(example_package.__file__),
                             'README.txt')
The __file__ attribute of the package refers to the ZIP archive, not a directory, so we cannot simply build up the path to the README.txt file.
Instead, when dealing with ZIP archives, access the file using the get_data() method of the module's loader:
import sys
sys.path.insert(0, 'zipimport_example.zip')
import os
import example_package
print(example_package.__file__)
print(example_package.__loader__.get_data('example_package/README.txt'))
This way it is possible to read data embedded within a ZIP archive, such as configuration files or images.
Note that reading files this way relies on the zipimport loader's get_data() implementation, making this approach specific to modules loaded from a ZIP archive.
You might need to add conditional logic to use the file locally or from a zip import.
import os
import sys

def load_data_from_package(package, filename):
    """
    Load data from a given package, handling both filesystem and ZIP imports.

    :param package: Module object of the package.
    :param filename: Name of the file to load data from.
    :return: Data from the file.
    """
    try:
        # Attempt to construct the file path directly
        data_path = os.path.join(os.path.dirname(package.__file__), filename)
        if os.path.exists(data_path):
            with open(data_path, 'rt') as file:
                return file.read()
        # Path does not exist, likely a zip import
        raise FileNotFoundError(data_path)
    except (AttributeError, FileNotFoundError):
        # __file__ is not set or the path doesn't exist, so use the loader.
        # zipimport's get_data() expects a path relative to the archive root.
        if hasattr(package, '__loader__'):
            archive_path = package.__name__.replace('.', '/') + '/' + filename
            return package.__loader__.get_data(archive_path).decode('utf-8')
        raise Exception(f"Failed to load data from {filename}")

# Example usage
import example_package

data = load_data_from_package(example_package, 'README.txt')
print(data)
In those scenarios you might still consider using a Python wheel.
How can I use Python wheels 🐍 🛞🛞?
Let's return to the main topic of the article.
Yes, many people use wheels as part of their development cycle, for example in their CI/CD pipelines:
# GitHub Actions workflow to build
# and push wheel files
on:
  push:
    branches:
      - main
      - master

jobs:
  build_wheel:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.9"
      - name: Build wheel and install
        run: |
          python -m pip install --user --upgrade build
          python -m build
          # pip install .
          find ./dist/*.whl | xargs pip install
          python simple_test.py
      - name: Configure Git
        run: |
          git config --global user.email "someone@email.com"
          git config --global user.name "someone"
      - name: Commit and push wheel
        run: |
          git add -f ./dist/*.whl
          git commit -m 'pushing new wheel'
          git push
And wheels can be used* inside of a Snowpark procedure.
In general, a wheel is just a zip archive with your code. The steps you need are:
- open the wheel file
- uncompress it into a folder
- add that folder to your sys.path
These simple steps are already coded in the wheel_loader.py script, which you can check on GitHub; a simplified sketch of the idea follows.
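For intuition, here is a minimal stand-in for wheel_loader.py (the add_wheel helper below is hypothetical; the real script also handles staged wheels):

```python
import sys
import tempfile
import zipfile
from pathlib import Path

def add_wheel(wheel_path: str) -> None:
    """Extract a .whl (which is just a ZIP archive) and put it on sys.path."""
    target = Path(tempfile.mkdtemp()) / Path(wheel_path).stem
    with zipfile.ZipFile(wheel_path) as whl:
        whl.extractall(target)         # open and uncompress into a folder
    sys.path.insert(0, str(target))    # make its packages importable

add_wheel("packages_folder/pydub-0.25.1-py2.py3-none-any.whl")
from pydub import AudioSegment  # now resolves from the extracted wheel
```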
An Example Using Wheels
Let's create an example. I will use the pydub library, which can help extract audio from a video. The ffmpeg library is available in Snowpark Anaconda, but pydub is not yet there.
To create the sample, first run:

snow snowpark init wheels_sample
This will initialize a Snowpark project with some boilerplate code. In your project folder you will find some Python files under the app folder. We only need the procedures.py file. Let's replace its contents with something like this:
import wheel_loader
wheel_loader.add_wheels()

from snowflake.snowpark.files import SnowflakeFile
from snowflake.snowpark import Session
from pathlib import Path
import os

def extract_audio_from_video(session: Session, filename: str):
    from pydub import AudioSegment
    stage_location = "@mystage"
    only_filename, ext = os.path.splitext(os.path.basename(filename))
    with SnowflakeFile.open(filename, "rb", require_scoped_url=False) as f:
        audio = AudioSegment.from_file(f, format="mp4", codec="aac")
        audio = audio.set_channels(1).set_frame_rate(16000).set_sample_width(2)
        audio.export(f"/tmp/{only_filename}.wav", format="wav")
    session.file.put(f"/tmp/{only_filename}.wav", stage_location, overwrite=True, auto_compress=False)
    return "@mystage/%s.wav" % only_filename

if __name__ == "__main__":
    from unittest.mock import patch
    # For testing locally we mock SnowflakeFile to return a handle to a local file
    with Session.builder.getOrCreate() as session:
        with patch("snowflake.snowpark.files.SnowflakeFile.open", return_value=open(os.path.expanduser("~/Downloads/big_buck_bunny_720p_1mb.mp4"), "rb")):
            print(extract_audio_from_video(session, "@mystage/big_buck_bunny_720p_1mb.mp4"))  # type: ignore
Notice the two lines at the top: that is all the code you need to add to use those .whl files.
I will use pip to download the .whl file to a local folder.
pip download pydub -d packages_folder
Please notice that pip might download not only the .whl you need but also its dependencies, some of which might already be in Anaconda. You can use a simple script to determine which dependencies still need a .whl:
#!/bin/bash

# Define the folder containing your files
FOLDER="packages_folder"

# Change to the folder
cd "$FOLDER" || exit

# Iterate over each file in the folder
for file in *; do
    # Extract the package name (the part before the first '-')
    package_name=$(echo "$file" | sed 's/-.*//')
    # Run the snow snowpark package lookup command
    snow snowpark package lookup "$package_name"
done
In this case there is only one file:
Next we upload the package to a stage:
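One way to do this, sketched below, assumes a stage named mystage already exists; the Snow CLI also offers dedicated stage upload commands whose syntax varies by version:

```bash
# PUT runs through the Snowflake connector; "snow sql" is one way to execute it
snow sql -q "PUT file://packages_folder/pydub-0.25.1-py2.py3-none-any.whl @mystage AUTO_COMPRESS=FALSE OVERWRITE=TRUE"
```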
And adjust the snowflake.yml and requirements.txt files:
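As an illustration only (the snowflake.yml schema varies across Snow CLI versions, and the stage path and wheel filename below are assumptions), the adjustment could look roughly like this:

```yaml
# snowflake.yml (illustrative excerpt; verify the schema for your Snow CLI version)
definition_version: 1
snowpark:
  project_name: wheels_sample
  stage_name: mystage
  src: app/
  procedures:
    - name: extract_audio_from_video
      handler: procedures.extract_audio_from_video
      signature:
        - name: filename
          type: string
      returns: string
      imports:
        - "@mystage/pydub-0.25.1-py2.py3-none-any.whl"
```

requirements.txt then lists only the packages the Anaconda channel can satisfy; in this example, ffmpeg.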
The next steps are easy: build your project, deploy it, and run it.
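With the Snow CLI those three steps look roughly like this (a sketch; exact command syntax varies between Snow CLI versions, and the procedure signature is the one defined above):

```bash
snow snowpark build      # package the project code and dependencies
snow snowpark deploy     # create the procedure in your account
snow snowpark execute procedure "extract_audio_from_video('@mystage/big_buck_bunny_720p_1mb.mp4')"
```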
And voilà, you have your code running and using a library from a wheel!
NOTE: Adding a .whl does not enable the use of binary dependencies; those are still restricted. However, if you have libraries you already developed as pure-Python .whl files, you might consider this approach.
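A quick way to tell whether a wheel is pure Python: its filename ends with the none-any tag pair (no compiled ABI, any platform). A tiny sketch:

```python
def looks_pure_python(wheel_filename: str) -> bool:
    # Pure-Python wheels carry "none" (no ABI) and "any" (any platform) tags
    return wheel_filename.endswith("-none-any.whl")

print(looks_pure_python("pydub-0.25.1-py2.py3-none-any.whl"))               # True
print(looks_pure_python("numpy-1.26.4-cp39-cp39-manylinux2014_x86_64.whl")) # False
```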
Summary
Thank you for reading this post. I hope the explanations provided here offer a clearer understanding of the options available for leveraging additional libraries in Snowpark, and how the wheel_loader.py script can assist you. While this approach doesn’t support the use of binary dependencies, it does overcome some of the limitations associated with zip imports. I plan to illustrate more applications of this technique in future posts, and I hope you found this one both informative and entertaining.
If you have any questions, please post them in the comments below or contact me on LinkedIn.