Operationalizing Snowpark Python: Part One

Caleb Baechtold
Snowflake
Nov 8, 2022 · 14 min read

In part one of this post, we will outline some of the unique challenges associated with putting Snowpark Python code into production, discuss the components in more detail, and outline some code design principles to follow when building with Snowpark Python.

In Part Two, we will discuss relevant CI/CD concepts for Python developers working with Snowpark Python.

Core Challenges

Why is managing Snowpark code in production different from managing traditional/legacy code in production? In many ways, it is not. In fact, if you are already familiar with building applications and capabilities that are powered by Snowflake, and have used things like Stored Procedures and User-Defined Functions (whether in SQL or in Snowpark for a different language), then managing Snowpark Python code is not fundamentally different from how you do the same work today. There are some environmental constraints on the server-side Python runtime in Snowflake, but these are similar to how JavaScript and Java capabilities are constrained in Snowpark, so the same principles and practices you follow for your existing Powered by Snowflake capabilities apply to Snowpark Python.

The differences and challenges become more apparent for teams and applications that are built on Python, but not built on Snowflake. Even if Snowflake acts as a data source and/or sink within your existing architecture, Python capabilities that run externally to Snowflake (in serverless functions, cloud-hosted VMs, etc.) differ in some key ways from those built on Snowpark Python. Fundamentally, capabilities built in Snowpark are just Python code, so many of the same practices and tools that teams use today for managing their existing code bases are still relevant. The primary considerations associated with building on Snowpark stem from the fact that the end code and application are deployed into a fully managed SaaS platform, so the infrastructure and compute frameworks in which your code must work are not entirely open to your control and discretion. Additionally, the application is fully integrated with your data platform, which yields many benefits but is a different paradigm from the one most applications are developed in, so the recommended design patterns differ. Deploying Python code into a SQL-first environment also places limitations on what kinds of data can be passed between modules, functions, etc., and how. Snowpark applications also differ from applications designed to be up and running continuously, 24/7: Snowpark is a better fit for on-demand and/or scheduled/orchestrated jobs and applications. Much of the following discussion centers on (1) differences in managing deployment of Snowpark code to production environments and (2) designing Python applications to run in Snowpark compared to traditional compute environments.

Snowpark Components

The broad term Snowpark Python refers to two fundamentally different, but tightly coupled, components. One is a client dataframe API, which is available as a Python package, can be installed into any Python environment (support for Python 3.8 is GA as of Nov 7, 2022), and can be leveraged by any Python application to push down dataframe transformations into Snowflake compute. In contrast, Snowpark also has a server-side runtime, which allows you to deploy more arbitrary Python code into a Snowflake-managed runtime to be executed right alongside your data. There are several different mechanisms for deploying Python code into, and executing it in, the server-side runtime, which we will dive into in more detail below. This distinction is important because, while there are shared challenges that should be addressed for Snowpark Python broadly, each of these components also has specific and unique challenges that we will aim to address throughout this document. We will use Snowpark Python to refer to both components together, the client and/or dataframe API to refer to the installable Python library, and the server-side runtime to refer to the Snowflake-managed Python execution environment. In this post we will summarize the core components, but you should refer to the Snowpark Developer Guide for Python for additional detail, code examples, documentation, and more.

Client Dataframe API

The Snowpark Client Dataframe API is a Python library that can be installed into any environment where you can run a Python kernel and that has connectivity to your Snowflake account (note: currently the dataframe API is only Python 3.8 compatible). The Dataframe API provides dataframe-style Python syntax for performing push-down of queries, transformations, and more into Snowflake’s managed compute environment.

Fundamentally, the Snowpark API provides corresponding Python methods for common SQL operations, in a syntax that will be familiar to users of the PySpark Dataframe API, with considerable overlap in functionality with what you can do in PySpark today. It is important to note that Snowpark uses a fundamentally different back-end compute engine from PySpark, but syntactically and functionally it will look and feel very similar to the PySpark Dataframe API. The API operates on Snowpark Dataframes, which are pointers to tables, views, etc. inside of Snowflake. The corresponding operations for a Snowpark API call are lazily executed on a Snowflake virtual warehouse. This means that no Snowpark dataframe operations are executed in the Python client’s compute environment directly, so the computational requirements of the client can be extremely low. Snowpark also provides convenient methods to bring Snowpark dataframes in-memory on the client as pandas dataframes, and vice versa (writing pandas dataframes back to Snowflake). The Snowpark API requires that the source data be located in Snowflake in order to perform push-down of transformations into a Snowflake virtual warehouse. In addition to methods corresponding to SQL operations, the Snowpark API contains other helper functions, along with the ability to invoke server-side Snowpark Python objects (UD(T)Fs and Sprocs, which are described in more detail below) from a Python client runtime.
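
To make the lazy push-down behavior concrete, here is a minimal sketch of a function that builds a Snowpark dataframe without fetching any rows. It assumes `session` is an already-created `snowflake.snowpark.Session`; the ORDERS table and its AMOUNT and CUSTOMER_ID columns are hypothetical names chosen for illustration.

```python
def large_order_counts(session, min_amount=100):
    """Build (but do not execute) a query counting large orders per customer.

    Assumes `session` is a snowflake.snowpark.Session. The ORDERS table and
    its AMOUNT / CUSTOMER_ID columns are hypothetical.
    """
    # session.table returns a pointer to the table; no rows are fetched.
    orders = session.table("ORDERS")
    # filter accepts a SQL expression string, which becomes a WHERE clause.
    large = orders.filter(f"AMOUNT > {int(min_amount)}")
    # group_by + count become GROUP BY / COUNT in the generated SQL.
    return large.group_by("CUSTOMER_ID").count()
```

Nothing runs in the warehouse until an action such as `collect()` or `to_pandas()` is called on the returned dataframe, which is why the client itself needs very little compute.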

UD(T)Fs & Stored Procedures

The Snowpark Python server-side runtime makes it possible to write Python Stored Procedures and User-Defined (Table) Functions (UD(T)Fs) that are deployed into Snowflake, are available to invoke from any Snowflake interface, and execute in a secured Python sandbox on Snowflake virtual warehouses.

Python UDFs and UDTFs scale out processing associated with the underlying Python code to occur in parallel across all threads and nodes comprising the virtual warehouse on which the function is executing. There are three different UDF-type mechanisms with Snowpark Python:

  • User-Defined Functions (UDF) are one-to-one scalar operations: for a row of input data passed to the function, a single output is produced. Rows of data get processed in parallel across Python processes on each node within a virtual warehouse.
  • Vectorized/Batch UDFs are, like the UDFs described above, one-to-one scalar operations. The difference is that standard UDFs parallelize the operation on individual rows of data, whereas vectorized UDFs parallelize it on batches of data (multiple rows). Functionally, vectorized UDFs still produce a single output for each input row; however, the data is batched to individual instances of the UDF (many rows passed simultaneously, and many batches processed in parallel). The reason for this is that many Python operations based on pandas, NumPy, SciPy, etc. that may be used in Python UDFs are optimized to run as vector-style operations. When individual rows are processed, as in a standard UDF, the UDF does not take full advantage of the array-based optimizations built into the underlying Python libraries. Vectorized UDFs let you do so.
  • User-Defined Table Functions (UDTFs) are for Python functions that require stateful operations on batches of data. Vectorized UDFs batch data arbitrarily for more optimal, accelerated processing, but they do not allow the user/developer to determine which data gets batched together or how the batch as a whole gets processed: they still operate only on individual rows. UDTFs allow many-to-many and many-to-one row processing. A Python UDTF handler has a process method and an end_partition method. The process method defines what work is done as individual rows in a batch are processed (you may or may not return some sort of output per row, depending on the underlying functionality). The end_partition method defines what work is done on the entire batch of data after all individual rows have been processed, and may use state that has been built up as the batch was processed. When UDTFs are invoked, the partition by expression allows you to specify which fields in the underlying data are used to batch the data, i.e. if you partition by COUNTRY, then all records with COUNTRY=US are processed in the same batch, all records with COUNTRY=CHINA are processed in the same batch, etc.
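
The process / end_partition split described for UDTFs can be sketched as a plain Python handler class. This is an illustration only: the class name, the customer and amount inputs, and the "top spender" logic are all hypothetical, and registration of the handler in Snowflake is omitted.

```python
class TopSpenderPerPartition:
    """Handler sketch for a hypothetical UDTF invoked with PARTITION BY COUNTRY."""

    def __init__(self):
        # A fresh handler instance is created for each partition,
        # so this state never mixes rows from different countries.
        self.totals = {}

    def process(self, customer, amount):
        # Called once per input row: accumulate state, emit no output rows yet.
        self.totals[customer] = self.totals.get(customer, 0) + amount

    def end_partition(self):
        # Called once after the last row of the partition: emit one summary row.
        if self.totals:
            best = max(self.totals, key=self.totals.get)
            yield (best, self.totals[best])
```

Because `process` emits nothing and `end_partition` emits a single tuple, this handler is many-to-one; yielding tuples from `process` as well would make it many-to-many.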

Each type of UD(T)F should be used under different circumstances, but the above describes fundamental functional differences between them. For more information on how Snowpark UDFs work under the hood, check out this video from Snowflake engineering.
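
As a rough illustration of the difference between a standard and a vectorized UDF, the two handlers below compute the same result; one receives a single value per call, the other a pandas Series holding a batch of rows. The function names and the temperature conversion are hypothetical, and registering either handler as a UDF is left out.

```python
import pandas as pd


def to_celsius_scalar(temp_f: float) -> float:
    # Standard UDF shape: invoked once per row with a single value.
    return (temp_f - 32) * 5 / 9


def to_celsius_vectorized(temps_f: pd.Series) -> pd.Series:
    # Vectorized UDF shape: invoked once per batch; the arithmetic is applied
    # to the whole Series at once using pandas/NumPy array operations.
    return (temps_f - 32) * 5 / 9
```

Both still return one output per input row; the vectorized form simply lets the underlying array libraries do the work in bulk.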

In addition to UD(T)Fs, Snowpark Python’s server-side runtime provides support for Python stored procedures. Stored procedures can be thought of as a more arbitrary script that gets executed on a Snowflake virtual warehouse. A key difference is that stored procedures are single-node; so, to perform transformations or analysis of data at scale inside of a stored procedure, stored procs should leverage the client dataframe API or other deployed UD(T)Fs to scale the corresponding compute across all nodes of a virtual warehouse. This is in contrast to pulling all data from a query into the stored procedure and manipulating it there, e.g. with pandas. This is further illustrated in the code design principles below: you should avoid pulling data directly into a stored procedure to the maximum extent possible (with some exceptions, detailed later in this article). What stored procedures do provide, however, is a simple way to deploy a script-like program flow into the Snowpark server-side runtime, which can be “kicked off” as a job via tasks or direct SQL “CALL sproc” statements.
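
A stored procedure that follows this advice might look like the sketch below: the handler only directs the flow, and the filtering and writing happen in the warehouse. It assumes the first argument is the Snowpark session that Snowflake passes to Python stored procedure handlers; the table names and the AMOUNT column are hypothetical.

```python
def refresh_large_orders(session, source="ORDERS", target="LARGE_ORDERS"):
    """Stored-procedure-style handler: control flow only, heavy work pushed down.

    Assumes `session` is a snowflake.snowpark.Session. The table names and
    the AMOUNT column are hypothetical.
    """
    # Lazy dataframe: describes the query, fetches nothing into the procedure.
    df = session.table(source).filter("AMOUNT > 100")
    # The write is executed inside Snowflake, so no rows pass through the
    # single node that runs this procedure (unlike calling df.to_pandas()).
    df.write.save_as_table(target, mode="overwrite")
    return f"refreshed {target}"
```

Once deployed, a handler like this could be invoked from a task or a CALL statement, as described above.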

Code Design

Core Snowpark Python capabilities can be thought of in five buckets.

Design Principles

  1. The guiding principle for building operational capabilities with Snowpark Python is to not pull data out of Snowflake and process it in a client environment wherever possible. Instead, leverage the Snowpark client dataframe API for push-down of SQL-like operations, and the server-side runtime to deploy more arbitrary code as UD(T)Fs that can scale across Snowflake virtual warehouses and process data more efficiently. This can be thought of as a similar programming practice to pushing down SQL via JDBC in other tools or applications.
  2. Use the Snowpark client dataframe API in your applications when querying and transforming data, rather than fetching all of the data into a client application and processing it there. This can effectively scale out your data transformations regardless of whether the entire application is running inside the Snowpark server-side runtime environment or in a Python environment external to Snowflake. It is worth noting that you can use SQL directly instead of the dataframe API; however, many developers find it cumbersome to author and use SQL within an application written in another language, whereas the dataframe API provides a familiar, Pythonic syntax to achieve the same results at scale.
  3. Code that manipulates, analyzes, or transforms data should, to the maximum extent possible, be built as UD(T)Fs (the decision between UDF vs. UDTF largely comes down to functionally what the code/transformation is actually doing) or should be implemented using the Snowpark client dataframe API. This is because the dataframe API and UD(T)Fs allow you to scale and parallelize the computational work being performed across the virtual warehouse.
  4. Python Sprocs are best suited for control flow of Python programs. Think of the sproc as the main script or application: it initializes objects, maintains state, etc. Within the sproc, any data-intensive computation should call the dataframe API or use UD(T)Fs that have been separately built and deployed. Similar to the above principle of “do not extract data to the client,” you should generally avoid pulling data into the stored procedure’s memory, as this limits your ability to scale. Instead, push the work performed in the sproc to the virtual warehouse using the dataframe API and UD(T)Fs. The sproc can be thought of as a “unit” of work within your application, which you might orchestrate using tasks or an external orchestration service (e.g. Airflow). Stored procedures don’t scale horizontally the way that UD(T)Fs and DataFrame API operations do, so high-concurrency / bursty logic may be initiated by a stored procedure, but should live in DataFrame operations or UD(T)Fs for optimal throughput.
  5. Just to re-emphasize: you should avoid pulling data directly into the Sproc to the maximum extent possible. Sprocs are restricted to executing on a single node in the virtual warehouse, and thus are restricted to the computational resources of that node. Snowpark-optimized warehouses (now in Public Preview) will extend the ability for sprocs to perform more memory/data-intensive work, but as a rule of thumb, sprocs should not be tasked with performing significant computations themselves, and within a sproc you should offload work to the Snowpark API and/or UD(T)Fs.
  6. The exception to principle 5 is computation that cannot be performed in a distributed manner and requires access to the entirety of a dataset (or a large sample) in order to be performed. For example, single-node machine learning model training should generally be performed in the context of a Stored Procedure, but this is a rare example where data-intensive computation should be performed in a Stored Procedure. Without digging into too much detail, there may be use-cases around model training specifically where parallelizing via UD(T)Fs makes sense, but this is a small and specific set of use-cases.
  7. Standard Python code design principles should be followed around modularity of code, reusability, etc. It is good practice to bring commonly-used utilities in your application as a 3rd-party custom package dependency, so that the code can be developed/maintained once and potentially leveraged throughout your ecosystem of Snowpark applications.
  8. Custom classes and objects that you use throughout your application need to be assessed for compatibility with UD(T)Fs. For example: suppose you have a part of your application that takes in a bunch of data, does some analysis on it, then constructs a custom Python class object based on the output of your analysis. This is potentially a good fit for a UDTF; however, we don’t support returning arbitrary Python objects from Snowpark functions. As such, you will need to consider how to serialize your object to, and deserialize it from, supported SQL data types, e.g. by implementing to_json() and from_json() methods so that you can initialize your class objects from Snowflake table data. You might also consider binary serialization. Regardless of approach, this needs to be considered.
  9. Instances of UD(T)Fs will be re-used within a single query-set. You can take advantage of this by moving initialization code, or variables / temporary state that can be reused across executions, outside of the function method and into the global / static portion of the code. On subsequent executions, those variables or that temporary state can then be re-used without having to be re-initialized each time. This will be especially important as things like external access become available, where you will want to put HTTP clients or connection pools outside of the function declaration (see AWS Lambda / Azure Functions for similar design patterns and best practices).
  10. Vectorized (batch) UDFs allow you to perform the same operations as UDFs, while taking advantage of Python’s own internal array-based optimizations on Pandas and Numpy-type objects (this is described in more detail above). As a rule of thumb, it is generally best practice to deploy UDFs as Vectorized UDFs if any of the underlying operations on the data rely on Numpy/Pandas or are implemented using vector operations. Doing so simply optimizes data distribution to the server-side runtime, and allows Snowpark to take advantage of built-in Python optimizations.
  11. In addition to relying on Snowflake’s own caching layers, Snowpark Python code should generally follow Python caching best practices, with the exception of needing to cache query results (as Snowflake will do this for you at various layers in the architecture). For example, if you have a UDF that performs text tokenization in your application, and to use it you need to load a built tokenizer from stage, you should wrap the function that loads the tokenizer in a cache using cachetools, to avoid repeatedly loading the tokenizer from stage into the UDF. This is the most common use-case for caching within Snowpark code: when an artifact has to be loaded into a UD(T)F from stage.
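
Principle 8 can be illustrated with a small class that round-trips through a JSON string, which could be stored in a string or VARIANT column. This is a sketch under assumptions: the AnalysisResult class and its fields are hypothetical, and JSON is only one of the serialization options mentioned above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AnalysisResult:
    """Hypothetical result object produced by some analysis step."""

    segment: str
    score: float

    def to_json(self) -> str:
        # Serialize to a JSON string that a UD(T)F can return as a SQL value.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "AnalysisResult":
        # Rebuild the object from a value read back out of a Snowflake table.
        return cls(**json.loads(payload))
```

A UDTF could then return `result.to_json()` per partition, and downstream code could call `AnalysisResult.from_json(...)` on the stored value.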

When designing and building greenfield applications on Snowpark, these design principles should guide decision-making around structuring and organizing your code base.
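
Principles 9 and 11 can be sketched together. The article suggests cachetools for caching; to stay within the standard library, this example swaps in `functools.lru_cache`, which serves the same purpose here. The tokenizer file name, the stopword list, and the use of `str.split` as a stand-in tokenizer are all assumptions made for illustration.

```python
import functools

# Module-level state is built once when the handler module is loaded and is
# then reused by every invocation that runs in the same UDF instance.
STOPWORDS = frozenset({"the", "a", "an"})


@functools.lru_cache(maxsize=None)
def load_tokenizer(path: str):
    # Stand-in for an expensive load, such as unpickling a tokenizer file
    # that was staged with the UDF. Here the "tokenizer" is just str.split.
    return str.split


def tokenize(text: str) -> list:
    # Only the first call pays the load cost; later calls hit the cache.
    tokenizer = load_tokenizer("tokenizer.pkl")  # hypothetical file name
    return [tok for tok in tokenizer(text) if tok not in STOPWORDS]
```

The same shape applies to any artifact loaded from stage, such as a trained model: keep the load behind a cached function and keep the per-row handler cheap.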

Porting Existing Applications to Snowpark

Many customers have been asking how they can port existing capabilities and applications to run in Snowpark, both to improve scalability, performance, governance, and security and to simplify their architecture by removing the complexity associated with owning and managing infrastructure. Beyond the initial assessment of “can this application and/or code be supported by Snowpark”, there should be extensive discussion of how best to move code over. In the case of migrating PySpark applications specifically to Snowpark (versus other more generic Python apps), Snowflake has partnered with Mobilize.net to provide a free PySpark to Snowpark code analysis tool that can help determine whether your existing codebase is a good candidate for migration to Snowpark. Additionally, Snowflake’s Professional Services teams and SI partners can provide full migration support.

In the case of more generic Python applications, Python scripts could just be dropped into Python sprocs and are likely to more or less “work,” but this is likely to be an extremely inefficient implementation unless the particular script is computationally light. In particular, these existing applications are likely pulling data out of Snowflake, which, as we emphasize above, is not the recommended pattern for building applications on Snowpark. The two primary questions to consider are: (1) how does code that extracts data from Snowflake need to be refactored to leverage more compute push-down, and (2) which functions, methods, classes, etc. (especially in data-intensive applications) are better served by Python UD(T)Fs that take advantage of the scalability and performance of virtual warehouses? Existing Python applications may not inherently follow the above code design principles, so assessments should be performed to identify how to refactor into a code design that fits these principles without introducing a significant amount of overhead and friction.

Beyond pure code analysis, another approach to this migration assessment is to start with a code base and produce a logical flow diagram of the application/script, broken up into individual computational components. Many teams may already have this kind of software design/architecture diagram available, but if not, it is a good start for understanding how data and computational work are distributed throughout your application. This diagram may also include custom class objects, etc. For each function, class, or method that is used, you should evaluate what data must be provided and what output is produced. Is the output compatible with Snowflake data types? What will consume the output and how will it be used? Is the computation intensive enough to warrant being performed on a cluster of compute nodes in a virtual warehouse? How much custom configuration is required upon each iteration of the function/object’s usage?

As you fill out this diagram, it will become clear what should exist in the logical flow layer of the application (which may be a good fit for a Snowpark Python sproc), and what needs to be offloaded to a UD(T)F, and thus potentially refactored. This will also indicate which custom classes and objects need serialization methods for writing to/initializing from Snowpark table structure, as described in the design principles above.

From this point, you can begin to understand what code can be lifted and shifted vs. what needs to be refactored, reimplemented, or modified. Additionally, Snowflake Professional Services has offerings for more hands-on assistance with migrating applications to Snowpark that go beyond principles and best practices.

Conclusion

In this post, we detailed what Snowpark Python is (client dataframe API and server-side runtime) and how it can best be used in Python applications. We outlined core design principles that should be followed when designing or migrating applications for Snowpark Python. In part two, we will look at how to incorporate these new capabilities into existing CI/CD and DevOps practices, including how Snowpark Python may be both similar to, and quite different from, other development frameworks.

Caleb Baechtold
Snowflake

ML/AI Field CTO @ Snowflake. Mathematician, artist & data nerd. Alumnus of the Johns Hopkins University. @clbaechtold — Opinions my own