Operationalizing Snowpark Python: Part Two

Check out Part One if you haven’t already: https://medium.com/snowflake/operationalizing-snowpark-python-part-one-892fcb3abba1

In this post we will outline relevant CI/CD topics for Snowpark Python developers. In particular, we highlight the challenges that Snowpark introduces compared to traditional Python development practices (some of which will be familiar from existing Snowflake SQL CI/CD practices), and we point out where existing DevOps and CI/CD practices for Python application development still apply to Snowpark Python capabilities. Some of the specific topics that we discuss are:

  • Code Versioning
  • Packaging & Deployment of Snowpark Capabilities
  • Dependency Management
  • Testing
  • Orchestrating Snowpark
  • Monitoring & Logging

For a more introductory overview of what Snowpark Python is, and code design principles for building with Snowpark, check out Part One of this series.

Version Control Systems (VCS) Integration

Snowpark code integrates well with all of your standard VCSs: you can write your Snowpark UD(T)Fs and sprocs in notebook environments or in standard IDEs as .py files, and version control these artifacts using Git just like you version control other code. The same can be said for branch management. Storing UDF definitions in .py files also allows you to use the Snowpark client API to deploy functions from Python files into the Snowpark server-side runtime, as in the sketch below.
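For example, a minimal sketch of deploying a UDF from a version-controlled .py file via the client API might look like the following (the file, function, UDF, and stage names are hypothetical, and connection_parameters is assumed to be defined elsewhere):

    from snowflake.snowpark import Session
    from snowflake.snowpark.types import FloatType

    # connection_parameters: your account/user/auth settings (assumed defined elsewhere)
    session = Session.builder.configs(connection_parameters).create()

    # Register the function defined in a Git-tracked .py file as a permanent UDF in Snowflake.
    session.udf.register_from_file(
        file_path="src/udfs/pricing.py",       # the .py file under version control
        func_name="discounted_price",          # the Python function inside that file
        name="DISCOUNTED_PRICE",               # the UDF created in the server-side runtime
        return_type=FloatType(),
        input_types=[FloatType(), FloatType()],
        is_permanent=True,
        stage_location="@udf_stage",
        replace=True,
    )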

Code that leverages the Snowpark client dataframe API should also be version controlled in the same way the application would normally be controlled. Fundamentally, you will be writing Python code in .py files that import and use Snowpark client API methods just like the many other modules/packages you are likely using. You should capture this dependency in your code’s requirements.txt (or other corresponding file), and version control the code as you normally would.
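For example, a client application’s requirements file might pin the Snowpark client alongside its other dependencies (the versions below are purely illustrative):

    # requirements.txt
    snowflake-snowpark-python==1.0.0
    pandas==1.5.3
    pytest==7.2.0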

Snowpark server-side runtime code should also be version controlled using existing VCSs; stored procedures and UDFs should be defined in .py files from your IDE of choice, and either decorators or separate deployment scripts in CI/CD pipelines should be responsible for deploying this code into the server-side runtime. The code itself should be managed just like other Python application code, with your CI/CD pipelines (described in more detail below) responsible for pushing updated server-side capabilities based on actions in the VCS (e.g. pushes to main, release cuts, etc.).

Native GitHub integrations are also on Snowflake’s roadmap, both to further establish Snowsight as a potential Snowpark IDE of choice for some customers and to make hosted code available natively in Snowflake without external CI/CD tooling.

As a rule of thumb, customers should version control Snowpark code today the same way they do other application code, with whatever VCS is integrated into their existing CI/CD pipelines.

Automating Deployment of Server Side Objects

The underlying code that backs a UD(T)F or sproc should be produced, version-controlled, etc. using common software engineering IDEs and VCSs (e.g. GitHub). As a result, one should consider how standard deployment practices for traditional software engineering products should be automated and implemented for Snowpark code. In particular, repositories for versioning Snowpark server-side objects should include automated deployment mechanisms to a Snowflake environment upon pushes to protected branches (e.g. main). You can use standard tooling such as GitHub Actions to perform these automated deployments, but there are some Snowpark-specific questions to consider when integrating these deployments into your existing processes:

What objects get deployed upon initiation of your CI/CD pipelines? It is easy to determine which objects in your code base could be deployed or re-deployed, but a separate question is: what set of objects should be deployed or re-deployed? Should only modified functions be re-deployed? Should everything be re-deployed? Should only new functions be deployed? Should sprocs with dependencies on UDFs be redeployed, or only the UDFs? These sorts of questions need to be asked and answered when integrating with your CD pipelines.

From a technology perspective, because all of these objects can be deployed to Snowflake via code, it is very straightforward to implement the deployment mechanism within your existing CI/CD pipelines, be they GitHub Actions, Jenkins, etc. The primary questions to consider are: what should be deployed, when, how often, and under what circumstances? It is not as simple as typical application deployments, where you can redeploy all relevant services/components of an application upon pushes/merges with little consequence. In the case of Snowpark server-side objects, you will need to implement logic within your deployment pipeline that determines how and what should be deployed from a codebase (a minimal deployment-script sketch is shown below). You also begin to introduce database-specific challenges with DevOps for code + data when deploying Snowpark server-side objects (this is discussed in more detail below). Snowflake has published existing guides for CI/CD with Snowflake that still apply for Snowpark Python; the fundamental difference with Snowpark is just that the underlying source code is Python. Otherwise, existing CI/CD practices with Snowflake are generally applicable to Snowpark Python server-side components and the applications that use them.
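As one illustration, a deployment script invoked by a CI/CD job (GitHub Actions, Jenkins, etc.) might look roughly like the sketch below. The modules, object names, role, warehouse, and stage are all hypothetical, and the logic for deciding which objects are in scope is deliberately left to your pipeline:

    import os
    from snowflake.snowpark import Session
    from snowflake.snowpark.types import FloatType, StringType

    # Hypothetical version-controlled modules containing the UDF and sproc handlers.
    from my_project.udfs import clean_value
    from my_project.sprocs import run_pipeline

    session = Session.builder.configs({
        "account": os.environ["SNOWFLAKE_ACCOUNT"],
        "user": os.environ["SNOWFLAKE_USER"],
        "password": os.environ["SNOWFLAKE_PASSWORD"],
        "role": "DEPLOY_ROLE",
        "warehouse": "DEPLOY_WH",
        "database": "ANALYTICS",
        "schema": "APP",
    }).create()

    # (Re)deploy the UDF as a permanent object backed by a stage.
    session.udf.register(
        clean_value,
        name="CLEAN_VALUE",
        return_type=FloatType(),
        input_types=[FloatType()],
        is_permanent=True,
        stage_location="@deploy_artifacts",
        replace=True,
    )

    # (Re)deploy the stored procedure that orchestrates the UDF.
    session.sproc.register(
        run_pipeline,
        name="RUN_PIPELINE",
        return_type=StringType(),
        input_types=[],
        packages=["snowflake-snowpark-python"],
        is_permanent=True,
        stage_location="@deploy_artifacts",
        replace=True,
    )

    session.close()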

Automating Deployment of Apps that Use the Snowpark Client API

Applications that use the Snowpark client API for push-down of dataframe-style operations on data, or for invocation of server-side Snowpark capabilities, should be deployed using the same automated deployment techniques as traditional applications, and as such introduce far less complexity to your existing CI/CD pipelines than server-side artifacts do. These applications still execute in a Python runtime outside of Snowflake, using the client API to connect to Snowflake and push down computation. Whether they are deployed via containers, onto cloud VMs, or otherwise, the Snowpark client API is fundamentally just a library dependency of the application code. As a result, the application code itself (and its corresponding dependencies) should be deployed using standard Python CI/CD practices. The primary consideration for applications that only leverage the client API becomes credential handling, just as it is for applications that connect to Snowflake via JDBC/ODBC rather than a Snowpark Python Session. For more information on authentication with Snowflake, please refer to the documentation.
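For instance, a client application deployed on a VM or in a container might construct its Session from credentials injected by its deployment platform; a minimal sketch (the environment variable names are assumptions) follows:

    import os
    from snowflake.snowpark import Session

    # Credentials are injected by the deployment platform (e.g. from a secrets manager),
    # never hard-coded or committed to version control.
    session = Session.builder.configs({
        "account": os.environ["SNOWFLAKE_ACCOUNT"],
        "user": os.environ["SNOWFLAKE_USER"],
        "password": os.environ["SNOWFLAKE_PASSWORD"],  # or key-pair/OAuth per your security standards
        "role": os.environ.get("SNOWFLAKE_ROLE"),
        "warehouse": os.environ.get("SNOWFLAKE_WAREHOUSE"),
    }).create()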

SnowCLI for Deploying Snowpark Code

Snowflake has also built and open-sourced a command-line tool called SnowCLI to simplify managing Snowflake applications. It is not a Snowpark-specific tool, but it includes support for Snowpark Python server-side UD(T)Fs and Python stored procedures, allowing you to auto-deploy, wrap 3rd-party or custom dependencies, and more, all from a developer CLI.

You can find SnowCLI and accompanying documentation on Snowflake-Labs GitHub, and a video demonstrating the tool on YouTube from Jeff Hollan in Snowflake Product.

Server-Side Artifact Versioning

How do you know what versions of your Snowpark capabilities are actually deployed into Snowflake? In many ways, Snowpark objects can be thought of as individual microservices, or as functional components of a microservice, so many DevOps and automation principles around common software architecture patterns still apply. There is one fundamental difference with server-side Snowpark objects, however: the deployed artifacts ultimately exist as database objects inside of Snowflake. What does deployment as database objects mean for your DevOps practice? For Python-focused developers and applications, it is worth familiarizing yourself with DevOps on Snowflake for SQL; fundamentally, your Snowpark Python server-side artifacts are deployed as SQL-like objects (functions and stored procedures that reside in databases and schemas, and may be orchestrated using database components) that actually execute Python code. Consequently, DevOps practices for large-scale, data-intensive applications and databases apply to Snowpark Python in ways that may differ from traditional Python DevOps. Please refer to Snowflake’s existing guides on these topics for more insight.
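One lightweight pattern, sketched below under the assumption that your deployment script has a live session and that the object name is hypothetical, is to stamp each deployed object with the Git commit it was built from, so you can later query what is actually live:

    import subprocess

    # Capture the commit that this deployment was built from.
    git_sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"]
    ).decode().strip()

    # Record it on the deployed object; SHOW PROCEDURES (or the information schema)
    # will then surface the comment alongside the object.
    session.sql(
        f"COMMENT ON PROCEDURE RUN_PIPELINE() IS 'deployed from commit {git_sha}'"
    ).collect()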

Dependency Management

There are multiple layers of dependency management to consider in Snowpark code, specifically:

Snowpark Objects’ (UD(T)Fs/Sprocs) dependencies on Anaconda packages and libraries

Snowflake’s partnership with Anaconda simplifies dependency management for server-side capabilities that rely on packages provided in the Snowpark/Anaconda channel. This is the simplest set of dependencies to manage with Snowpark: package dependencies are simply declared at object creation, and Snowflake and Anaconda handle the rest for you out of the box. Refer to our documentation for more detail.
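For example, declaring Anaconda-channel packages at UDF creation is a one-line addition; in the sketch below (the function name and package choices are illustrative, and an active Session is assumed), Snowflake resolves and provisions the packages server-side:

    from snowflake.snowpark.functions import udf
    from snowflake.snowpark.types import FloatType

    @udf(name="SCORE", packages=["pandas", "scikit-learn"],
         return_type=FloatType(), input_types=[FloatType()], replace=True)
    def score(x):
        # pandas is resolved from the Snowflake Anaconda channel inside the server-side runtime.
        import pandas as pd
        return float(pd.Series([x]).mean())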

Snowpark Objects’ (UD(T)Fs/Sprocs) dependencies on custom, home-grown Python packages and libraries

To support continuing to use Python development best practices, you are likely to have home-grown custom Python packages that your Snowpark server-side code may import and use. Some of these may be application-specific; others may be generic utility modules that your team shares and uses. It is straightforward to use these in Snowpark: the underlying custom modules can be uploaded to Snowflake stages and specified as import dependencies for your Snowpark UD(T)Fs and sprocs (see the sketch below). The primary concern then becomes: how are updates to your custom modules propagated to your Snowpark code? How do you verify compatibility with changes, identify breaking changes, etc.? This again becomes a primarily CI/CD-oriented question: if you push changes to your utility module’s repository, how do you identify which Snowpark elements need to be updated as part of the CI/CD around that module? Dash Desai published a blog demonstrating how one might do this using GitHub Actions, and you can envision how the same approach could extend to other CI/CD frameworks. Again, you will have to assess which elements need to be automatically updated based on these changes, and establish a framework for identifying and automating those updates when appropriate. The BYO-code approach also relies on specifying, managing, and handling the dependencies of your custom code on other packages, which may or may not be available inside of Anaconda. Those two scenarios are discussed next.
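A sketch of this pattern follows; the stage, zip archive, module, and object names are hypothetical, an active session is assumed, and the zip is assumed to have been uploaded to the stage by your CI/CD pipeline:

    from snowflake.snowpark.types import FloatType

    def enriched_value(x):
        # Imported from the staged archive at runtime inside the server-side sandbox.
        from my_utils.cleaning import normalize
        return normalize(x)

    session.udf.register(
        enriched_value,
        name="ENRICHED_VALUE",
        return_type=FloatType(),
        input_types=[FloatType()],
        imports=["@code_stage/my_utils.zip"],  # home-grown package, version-controlled and staged by CI/CD
        is_permanent=True,
        stage_location="@udf_stage",
        replace=True,
    )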

Dependencies of custom code used in Snowpark on 3rd-party libraries available in the Snowflake Anaconda channel

When your custom code dependencies (outlined above) are themselves dependent on 3rd-party libraries that are available in the Snowflake Anaconda channel, managing those dependencies is straightforward: you simply specify those Anaconda dependencies in the definition and registration of the Snowpark UD(T)Fs/sprocs that leverage your custom code. You will have to check for and maintain version compatibility, but fundamentally you just declare the packages as dependencies on the Snowpark objects that use your custom code. Handling 3rd-party dependencies that are not available in the Snowflake Anaconda channel is a much more complex situation.

Dependencies of custom code used in Snowpark on 3rd-party libraries that are not available in the Snowflake Anaconda channel

The same way that you need to BYO-code for dependencies on custom code in your Snowpark objects, you will also need to do so for your custom code’s dependencies if they are not available in the Snowflake Anaconda channel, and for those packages’ dependencies, and so on, until you reach a set of requirements that are available in Anaconda or essentially equate to the Python stdlib. Suffice it to say, as your custom code dependencies become more complex, so does the overhead of managing them in Snowpark if they fall outside of the Anaconda ecosystem. Customers should leverage our Anaconda integration to the maximum extent possible, and be very careful about what dependencies they introduce via custom and/or 3rd-party libraries not available in our Anaconda channel, primarily from an overhead/management perspective, but also from a security perspective (there have been many instances of malicious code being introduced into PyPI packages).

Testing

A common question that comes from Snowflake customers with respect to Snowpark Python is: how do we write unit tests for our Snowpark code? The important thing to consider about this question is: what is their definition of unit tests?

In the truest, traditional sense of unit testing, where the Snowpark code is tested in a separate, isolated environment that is independent of actually integrating with a Snowflake instance, Snowpark does not support unit testing. Typically, for code that interacts with a database, as part of your unit testing you would mock a database object in your unit test setup. Snowpark does not support a local context that would allow you to mock the server side runtime, or to mock a Snowpark dataframe that would be loaded from a local test data file, for example. You can use the Snowpark client’s create_dataframe method to create temporary dataframes, but these fundamentally still require connectivity to Snowflake. Traditional engineering teams may find this unsatisfactory, but it is the current state with respect to local context-supported unit testing.

In our opinion, testing Snowpark code using the above, rigid definition of unit tests doesn’t make a lot of sense; testing with Snowpark should instead be thought of and accepted as integration testing. Developers should write tests for isolated blocks of functionality that rely on Snowpark to verify that functions/operations produce expected results; running these tests, however, will require connectivity to Snowflake. While this may not fit an especially rigid definition of unit testing, it is the proper way to test Snowpark code: you can test your Snowpark code anywhere that you (1) can run Python code and (2) can connect to Snowflake. These tests can then be incorporated into your CI/CD pipelines very easily, as in the sketch below.
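As a sketch of what such a test might look like with pytest (the fixture, environment variable names, and the transformation under test are all illustrative):

    import os
    import pytest
    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import col

    # The "unit" under test: an isolated block of dataframe logic from your application.
    def double_values(df):
        return df.with_column("A_DOUBLED", col("A") * 2)

    @pytest.fixture(scope="session")
    def session():
        # Connectivity to a dev/test Snowflake account is required to run these tests.
        s = Session.builder.configs({
            "account": os.environ["SNOWFLAKE_ACCOUNT"],
            "user": os.environ["SNOWFLAKE_USER"],
            "password": os.environ["SNOWFLAKE_PASSWORD"],
            "warehouse": os.environ.get("SNOWFLAKE_WAREHOUSE"),
        }).create()
        yield s
        s.close()

    def test_double_values(session):
        df = session.create_dataframe([[1], [2], [3]], schema=["A"])
        result = double_values(df).collect()
        assert [row["A_DOUBLED"] for row in result] == [2, 4, 6]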

Orchestration

Snowflake Internals vs. External Tools

If your Snowpark code design fits the pattern of several Python stored procedures executed as a DAG, Snowflake internal tasks may serve as a convenient way to orchestrate that application. You can build a DAG of tasks that simply invoke these stored procedures in the relevant order and manner, and that may work extremely well. There are a number of new announcements around pipeline visibility and management in Snowflake that will further push this as an attractive path forward for orchestrating jobs in Snowflake, be they Snowpark or SQL. Additionally, the Snowpark client API contains methods and functions beyond just dataframe manipulation that allow it to be the authoring and orchestration tool to develop, build, and deploy server-side objects that are eventually executed as DAGs. Because the API allows you to deploy and execute these objects from any Python IDE or environment of your choosing via code, you can very easily integrate it into your existing CI/CD and orchestration tooling, as in the sketch below.
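For example, two existing Python stored procedures could be wired into a simple task DAG directly from the client API; in the sketch below, the procedure, warehouse, and task names are all hypothetical, and an active session is assumed:

    # Root task: load raw data on a schedule.
    session.sql("""
        CREATE OR REPLACE TASK LOAD_TASK
          WAREHOUSE = MY_WH
          SCHEDULE = 'USING CRON 0 2 * * * UTC'
        AS CALL LOAD_RAW_DATA()
    """).collect()

    # Child task: transform after the load completes.
    session.sql("""
        CREATE OR REPLACE TASK TRANSFORM_TASK
          WAREHOUSE = MY_WH
          AFTER LOAD_TASK
        AS CALL TRANSFORM_DATA()
    """).collect()

    # Child tasks are resumed before the root task so the DAG starts in a consistent state.
    session.sql("ALTER TASK TRANSFORM_TASK RESUME").collect()
    session.sql("ALTER TASK LOAD_TASK RESUME").collect()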

By the same token, teams that are already using orchestration tools for their Python code (e.g. Airflow, Metaflow) can continue using those tools for Snowpark applications while significantly reducing the infrastructure overhead and management burden those tools carry today. In particular, Airflow tasks can consist of Snowpark client API code and/or invocations of Snowpark server-side components, where 100% of the actual compute associated with those jobs is pushed down into Snowflake. As a result, the additional infrastructure needed to run your Airflow jobs becomes minimal: the Airflow tasks and DAGs become a simple, lightweight puppeteer of code that ultimately results in computation being performed in Snowflake’s fully managed compute layer. This allows you to continue using your Python-focused external orchestration tools as you do today, while decreasing the cost and overhead associated with them by taking greater advantage of Snowflake’s managed offering.
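As a rough sketch of how that might look in Airflow (the DAG, task, helper, and stored procedure names are all assumptions), note that the Airflow worker does nothing heavier than open a session and call the procedure:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from snowflake.snowpark import Session

    def call_transform_sproc():
        # get_connection_parameters() is a hypothetical helper that reads credentials
        # from your secrets backend; all heavy compute happens inside Snowflake.
        session = Session.builder.configs(get_connection_parameters()).create()
        try:
            session.call("TRANSFORM_DATA")
        finally:
            session.close()

    with DAG(
        dag_id="snowpark_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        transform = PythonOperator(task_id="transform", python_callable=call_transform_sproc)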

Monitoring & Logging

Snowpark Python now supports logging from sprocs and UD(T)Fs to a Snowflake event table (currently in private preview, or PrPr). It is straightforward to implement: use Python’s stdlib logging module inside your Snowpark Python code and include standard logger.info(), logger.debug(), etc. statements. These statements are logged as part of a JSON message in the Snowflake event table, which also includes substantial metadata about the job that produced the log statements. For more information on this topic, speak with your Snowflake account team to get access to the Python Logging PrPr.
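A minimal sketch of what this looks like inside a stored procedure handler (the logger name and table are hypothetical, and the event table and log level must be configured per the PrPr documentation):

    import logging
    from snowflake.snowpark import Session

    logger = logging.getLogger("my_app.pipeline")

    def run_pipeline(session: Session) -> str:
        logger.info("pipeline started")
        row_count = session.table("RAW_EVENTS").count()  # RAW_EVENTS is a hypothetical table
        logger.debug("RAW_EVENTS contains %d rows", row_count)
        logger.info("pipeline finished")
        return f"processed {row_count} rows"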

Snowpark Dataframe API calls are logged by default in Query History in Snowflake, and can be viewed in the UI and queried via the information schema with appropriate permissions. Developers building applications that use the dataframe API externally to Snowflake will need to design and manage their own logging if they would like application-level logs separate from what Query History provides by default for API calls. These log statements would be captured in whatever application-level logging mechanism you have implemented, and would need to be separately ingested into Snowflake if that is the desired destination for your log messages. Generally speaking, this logging should be thought of as consistent with standard Python application logging practices.
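For instance, an application can inspect the pushed-down queries it generated in its current session via the information schema (a sketch; the column selection is illustrative, and an active session with appropriate privileges is assumed):

    history = session.sql("""
        SELECT query_id, query_text, start_time, total_elapsed_time
        FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_SESSION())
        ORDER BY start_time DESC
        LIMIT 20
    """).collect()

    for row in history:
        print(row["QUERY_ID"], row["QUERY_TEXT"][:80])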

Conclusion

In this post we discussed a number of DevOps concepts as they relate to Snowpark Python. Many of these concepts do not change in the shift from Snowflake SQL to Snowpark Python, though we also pointed out some of the key differences between Snowpark Python applications and other Python applications. Check out the previous post for an overview of Snowpark Python and best practices/code design principles for building applications with Snowpark.
