Let’s talk about some “better practices” with Snowpark Python, Python UDFs and Stored Procs

Date: September 2022

Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.

Snowpark Python has been in Public Preview since June 2022 (expected to be Generally Available for production use in a couple of months). As the Snowpark Python developer community grows rapidly, more developers are looking for “best practices”. While it might be a little early to talk about best practices before Snowpark Python is Generally Available, we can explore some “better practices” with Snowpark, Python UDFs and Python Stored Procedures, and how to use all of these constructs together to make Python development in Snowflake easier, more efficient and more secure.

Snowpark, Python Stored Procedures and UDFs

Better Practice 1: Think about the generated SQL when working with Snowpark data frames.

Even though the Snowpark client API is very similar to the PySpark API, it is important to remember that Snowpark generates a SQL query behind the scenes, so when using Snowpark it is always a good practice to keep SQL best practices in mind, such as:

  • Avoiding SELECT * statements: session.table("CUSTOMER_PROFILES") on a wide table selects every column, so when the workload only requires a handful of columns, use session.table("CUSTOMER_PROFILES").select(col("col1"), col("col2"), …) instead (see the sketch after this list).
  • Avoiding overuse of df.cache_result() calls on larger data frames to minimize I/O, unless a data frame is reused many times.
  • Materializing computationally intensive transformations.
  • Following general SQL performance optimization practices, like clustering tables and using Snowflake constructs appropriately.
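
Below is a minimal sketch of that first point (the table and column names are hypothetical, and session is assumed to be an existing Snowpark Session):

```python
from snowflake.snowpark.functions import col

# Generates SELECT * FROM CUSTOMER_PROFILES -- pulls every column of a wide table
df_all = session.table("CUSTOMER_PROFILES")

# Generates SELECT "COL1", "COL2" FROM CUSTOMER_PROFILES -- only the columns needed
df_narrow = session.table("CUSTOMER_PROFILES").select(col("COL1"), col("COL2"))
```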

During development and testing, take a look at the SQL query generated by Snowpark in the Query History to make sure the order of chaining of data frame functions is appropriate and queries are executed efficiently based on the query plan.
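
You can also inspect the generated SQL directly from the client. A small sketch (df is assumed to be an existing Snowpark data frame):

```python
# The SQL statement(s) Snowpark will submit for this data frame
print(df.queries)

# Print the query plan Snowflake will use to execute the data frame
df.explain()
```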

Better Practice 2: Try to maximize server-side execution with Python Stored Procedures and UDFs to minimize data transfer to the client-side.

Snowpark API has a client-side component, so you can certainly execute Snowpark in your local or SaaS-based IDE/notebook interface during experimentation and development, connecting to your Snowflake account. However, the true value of Snowpark comes into play when you wrap your Snowpark code in a Python Stored Procedure (anonymous or named) and execute it entirely on the server side for deployment in production. (See some best practices for creating stored procs here.) You can use Snowflake Tasks or any other orchestrator for orchestration and scheduling. If your stored proc is for one-time execution, Anonymous Stored Procedures (in Private Preview) are a better choice. Please note that there are a few exceptions today that still require code to run on the client side (e.g. using packages with native code, use of GPUs, external connections), where it might make sense to convert Snowpark data frames to pandas data frames and execute code on the client side.
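A minimal sketch of wrapping a Snowpark transformation in a named, permanent stored procedure (the procedure, stage and table names are hypothetical, and registration assumes an active Snowpark session):

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc, col

@sproc(name="transform_profiles", is_permanent=True, replace=True,
       stage_location="@sproc_stage", packages=["snowflake-snowpark-python"])
def transform_profiles(session: Session) -> str:
    # Everything inside runs on the warehouse; no data is pulled to the client
    df = session.table("CUSTOMER_PROFILES").select(col("COL1"), col("COL2"))
    df.write.save_as_table("CUSTOMER_PROFILES_SLIM", mode="overwrite")
    return "CUSTOMER_PROFILES_SLIM refreshed"

# Later, e.g. from a Snowflake Task or an external orchestrator:
# session.call("transform_profiles")
```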

On the other hand, User-Defined Functions (UDFs) are mechanisms to execute functions on the server side without transferring any data to the client. UDFs can be anonymous (for one-time execution in a session) or named, and named UDFs can be created as temporary (session level) or permanent. It is a better practice to choose unique names for the named Python UDFs in the same schema and avoid overloading. (See all the other best practices to keep in mind when creating UDFs here.)

It is also generally a better practice to create a named permanent UDF as a standalone object and manage the deployment of the function independently from the code calling this UDF, unless there is a good reason to bundle them.
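
A minimal sketch of registering a named, permanent UDF on its own (the UDF name, stage and logic are hypothetical, and registration assumes an active Snowpark session):

```python
from snowflake.snowpark.functions import udf

@udf(name="normalize_amount", is_permanent=True, replace=True,
     stage_location="@udf_stage", packages=["numpy"])
def normalize_amount(amount: float) -> float:
    import numpy as np
    # Example transformation; callers can now use NORMALIZE_AMOUNT() from SQL or Snowpark
    return float(np.log1p(amount))
```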

Last but not least, if your UDF is accessing a static file resource in a stage, it is a better practice to use the cachetools package to cache the file contents as shown below:

https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/_autosummary/snowflake.snowpark.udf.html
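
A sketch of that pattern (the stage, file and UDF names are hypothetical), based on the staged-file-import approach in the Snowflake documentation:

```python
import os
import sys
import cachetools
from snowflake.snowpark.functions import udf

@cachetools.cached(cache={})
def read_lookup_file(filename: str) -> str:
    # Read the staged file once per process and cache it for subsequent rows
    import_dir = sys._xoptions.get("snowflake_import_directory")
    with open(os.path.join(import_dir, filename)) as f:
        return f.read()

@udf(name="lookup_udf", is_permanent=True, replace=True,
     stage_location="@udf_stage", packages=["cachetools"],
     imports=["@udf_stage/lookup.csv"])
def lookup_udf(key: str) -> str:
    contents = read_lookup_file("lookup.csv")
    return "found" if key in contents else "not found"
```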

Better Practice 3: Manage third party Python dependencies wisely.

Snowflake has a partnership with Anaconda to provide secure access to 3rd party packages from the Snowflake Anaconda channel. If the Python package you’d like to use is not in this channel, you can request that it be added via https://community.snowflake.com/s/ideas and, until it is added, you can download the package (as long as it is pure Python), upload it into a stage and use it in your stored proc or UDF.

It is a better practice during development to build your code using the latest stable versions of third party packages without explicitly listing their dependencies (conda can resolve those). See here for more on third party packages and here for using them in your UDFs. It is a recommended practice to leverage the DESC FUNCTION py_udf(); command and check the installed_packages parameter to see what versions of dependencies are used at run-time.
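
A sketch of how this comes together (the UDF name and packages are illustrative): specify only the top-level packages at registration time, then inspect what conda actually resolved.

```python
from snowflake.snowpark.functions import udf

@udf(name="py_udf", replace=True, packages=["pandas", "scikit-learn"])
def py_udf(x: float) -> float:
    import pandas as pd  # resolved from the Snowflake Anaconda channel
    return float(pd.Series([x]).sum())

# Check the exact dependency versions resolved at creation time:
# session.sql("DESC FUNCTION py_udf(FLOAT)").collect()  # see the installed_packages row
```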

Better Practice 4: Use Vectorized UDFs for feature transformations and ML scoring.

Vectorized UDFs and the batch API are designed to execute scalar UDFs faster. You should use the Python UDF batch API if you are using 3rd party Python packages that operate more efficiently on batches of rows. Feature transformations applied across rows (via a third party Python package) and ML batch inference are typically the workloads where you should be thinking about using vectorized UDFs.
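
A minimal sketch of a vectorized UDF using Snowpark type hints (the UDF name and logic are hypothetical): the function receives a pandas Series per batch of rows instead of one scalar at a time, which is what makes batch ML inference efficient.

```python
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import PandasSeries

@udf(name="score_batch", replace=True, packages=["pandas"], max_batch_size=10000)
def score_batch(amounts: PandasSeries[float]) -> PandasSeries[float]:
    # Operates on the whole batch at once, e.g. model.predict(amounts.to_frame())
    return amounts * 0.5
```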

Better Practice 5: Use High Memory Warehouses (Preview feature) for workloads that are memory intensive.

High Memory Warehouses (HMWHs) were introduced during the Snowflake Summit 2022 in June and they are in Private Preview as of September 2022. HMWHs are your friend if you are working on large datasets for train/test data splits or model training, particularly when using packages like scikit-learn. Consider using an HMWH when you run into the “100357 (P0000): UDF available memory exhausted” error during development and/or ML experimentation.

It is also a better practice not to mix other workloads with the workloads that require HMWHs, and if you do need to mix them, consider calling the session.use_warehouse() method to switch back to a regular virtual warehouse.
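
A small sketch of switching warehouses within a session (the warehouse names are hypothetical):

```python
# Run the memory-intensive step (e.g. model training) on a high-memory warehouse
session.use_warehouse("HIGH_MEM_WH")
# ... train/test split, model fitting, etc. ...

# Switch back to a regular virtual warehouse for the rest of the workload
session.use_warehouse("STANDARD_WH")
```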

These are some (definitely not all) initial “better practices” that have great potential to become “best practices” when Snowpark Python is Generally Available in the near future. In addition to these, please always keep in mind the general, well-known development best practices, like storing secrets (e.g. connection parameter values) in a secret repository, writing modular code, proper error handling and logging (see the Snowpark logging feature, which is in Preview), and following CI/CD best practices like using source control, writing unit tests for each function and automating the build and deployment of your code.

For more information, please check out the documentation and other great resources:

Snowpark API: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/index.html

Quickstarts: https://quickstarts.snowflake.com/ (Search for Snowpark)

Snowpark Python Deep Dive session: https://events.snowflake.com/snowday/americas/agenda/session/651839

Under the Hood Snowpark UDFs: https://www.youtube.com/watch?v=-W1aL8XwvkE


Eda Johnson
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

NVIDIA | AWS Machine Learning Specialty | Azure | Databricks | GCP | Snowflake Advanced Architect | Terraform certified Principal Product Architect