Migrating from PySpark to Snowpark Python Series — Part 2

Dated: July 2022

Contributors: Naveen Alan, Prathamesh Nimkar, Ritabratasaha, Eda Johnson, Su Dogra, Chinmayee Lakkad, Vikash Kumar, Venkat Sekar

Overview

During Snowflake Summit 2022, Snowflake announced that a new developer experience, Snowpark for Python, had entered public preview. In the first part of the series, you learned how easily you can get started with migrating code written in PySpark to Snowpark Python.

If you missed the first blog of the series, we recommend going through it before this one, as this blog continues directly from the first, which is just as interesting as this one. (Trust me!)

Now, as promised in the first part of this series, in this second part we cover how different types of functions (User-Defined Functions and User-Defined Table Functions) can be migrated from PySpark to Snowpark, along with the core function parity between PySpark and Snowpark for Python.

Image by Author — Topics Covered

Please note, we are providing an exhaustive live mapping document covering the full PySpark DataFrame API (Spark 3.2.1 at the time of writing, July 2022) and its equivalent functions in Snowpark for Python. This can be a great reference document during your migration journey to Snowpark, a framework that brings your code to where you need it: closer to your data (which, seen through the governance lens, is quite important!).

We have included code snippets for each API to show their usage within Snowflake. Please also note that this document is up to date with the latest version of Snowpark for Python (v0.7.0) available for preview.

Since Snowpark for Python is in public preview, you may find a few functions from the open-source PySpark framework that are not yet supported in Snowpark for Python. In the mapping document we have included code blocks, wherever possible, as workarounds for the functions that are not supported yet. Each workaround is just one of many ways the functionality can be implemented; you are not confined to a single way of expressing your business logic.

Please note, this blog assumes you understand the core of Snowpark for Python and have tried developing with it. If not (don't worry!), here is the link to a quickstart guide that will help you get started with this great feature.

You can also find the Snowpark for Python converted code for multiple Spark-based notebooks here. Please note we will be adding more examples for different use cases over time, so be sure to bookmark this resource as well.

User Defined Function Types

From an extensibility perspective, Snowpark, like PySpark, supports writing custom functions: User-Defined Functions (UDFs) and User-Defined Table Functions (UDTFs).

  • UDFs — operate on a single row and produce a single row as output.
  • UDTFs — operate on a single row and can produce multiple rows as output.

The following sections show how each of these function types in PySpark can be migrated to Snowpark Python.

UDFs

In this section you will learn how to create UDFs in PySpark and Snowpark and how to use them on DataFrame columns. The constructs for creating UDFs in PySpark and Snowpark Python are similar, as shown below.

Using udf.register function

Both PySpark and Snowpark for Python provide a udf.register method for registering lambdas and custom functions as UDFs in the database. The snippet below shows a lambda that converts its string argument to uppercase being registered as a permanent function, which can subsequently be used in SQL queries.

Image by Author — udf.register function migration
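As a textual companion to the image, here is a minimal sketch of the same migration. It assumes an existing PySpark `spark` session and a Snowpark `session`; the function name upper_udf and the stage @udf_stage are illustrative assumptions, not prescribed names:

```python
# PySpark: register a lambda as a UDF usable from SQL.
from pyspark.sql.types import StringType

spark.udf.register("upper_udf", lambda s: s.upper(), StringType())
spark.sql("SELECT upper_udf('snowpark')").show()

# Snowpark for Python: register the same lambda as a permanent UDF.
from snowflake.snowpark.types import StringType

session.udf.register(
    lambda s: s.upper(),
    return_type=StringType(),
    input_types=[StringType()],
    name="upper_udf",
    is_permanent=True,            # keep it in the database for later SQL use
    stage_location="@udf_stage",  # hypothetical stage holding the UDF's code
    replace=True,
)
session.sql("SELECT upper_udf('snowpark')").show()
```

Apart from keyword naming and the permanence options, the registration call reads almost the same on both platforms.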

Using the @udf decorator

In both PySpark and Snowpark for Python, we can use the @udf decorator to create temporary user-defined functions. As you can see from the snippet below, the constructs for creating UDFs are similar. In this example we create a temporary function that converts a string column value to lowercase and later use it in DataFrame API queries. There are no or only minimal changes in how you create temporary functions on the two platforms.

Image by Author — udf function migration
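Again, as a textual sketch of the image: assuming an active session on each platform and a DataFrame df with a string column NAME (both hypothetical), the decorator usage looks like this:

```python
# PySpark: temporary UDF via the @udf decorator.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def lower_udf(s):
    return s.lower()

df.select(lower_udf(col("NAME")).alias("NAME_LOWER")).show()

# Snowpark for Python: the construct is nearly identical.
from snowflake.snowpark.functions import col, udf
from snowflake.snowpark.types import StringType

@udf(return_type=StringType(), input_types=[StringType()])
def lower_udf(s):
    return s.lower()

df.select(lower_udf(col("NAME")).alias("NAME_LOWER")).show()
```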

UDTFs

Snowpark supports User-Defined Table Functions (UDTFs) as well. While UDFs are scalar functions (one row in, one row out), UDTFs return tabular results (one row in, multiple rows out).

UDTFs are not directly supported in PySpark; they are instead enabled through Spark SQL's integration with Hive UDTFs.

In Snowpark Python, UDTFs can be created with the @udtf decorator, as shown below:

Image by Author — UDTFs in Snowpark Python
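As a textual sketch of the image, here is one way a Snowpark Python UDTF might look, assuming an active `session`; the function name split_words and its one-column schema are illustrative:

```python
from snowflake.snowpark.functions import udtf
from snowflake.snowpark.types import StringType, StructField, StructType

# One input row (a sentence) produces one output row per word:
# a classic "one row in, multiple rows out" table function.
@udtf(
    name="split_words",
    output_schema=StructType([StructField("word", StringType())]),
    input_types=[StringType()],
    replace=True,
)
class SplitWords:
    def process(self, sentence: str):
        # Each yielded tuple becomes one output row.
        for word in sentence.split():
            yield (word,)

# Call the UDTF from SQL like any other table function.
session.sql("SELECT * FROM TABLE(split_words('hello snowpark world'))").show()
```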

Function Parity — Snowpark Python vs. PySpark

In this section we will cover functions related to the following classes only, though the sheet covers many other functions supported in Snowpark Python. Always refer to this spreadsheet when you are working on migrations: Snowpark for Python is evolving very fast, new functions are added frequently, and our team will keep the document updated accordingly. We will also keep extending the mapping spreadsheet with other class methods.

  • DataFrame
  • Column
  • Windows
  • Grouped Data

DataFrame

This section details the most important class in the Spark world: DataFrame. The table below compares a few DataFrame functions between PySpark and Snowpark Python, and this link has the full list of functions for DataFrames.

Image by Author — DataFrame Functions

As you can see in the preceding table, many of the DataFrame functions are identical on both platforms. This helps customers easily port their code into Snowpark and lets developers use the language of their choice. We have also provided sample code in the spreadsheet, so you can try it yourself :).
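To make that concrete, here is a minimal sketch against a hypothetical SALES table, assuming an existing PySpark `spark` session and a Snowpark `session`; the DataFrame calls themselves are identical:

```python
# PySpark
from pyspark.sql.functions import col

df = spark.table("SALES")
df.filter(col("AMOUNT") > 100).select("REGION", "AMOUNT").sort("AMOUNT").show()

# Snowpark for Python: only the entry point and import path change.
from snowflake.snowpark.functions import col

df = session.table("SALES")
df.filter(col("AMOUNT") > 100).select("REGION", "AMOUNT").sort("AMOUNT").show()
```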

Column

The PySpark Column class provides several functions to use with DataFrames, allowing us to operate on and perform various transformations over column values. This section covers the Column functions in Spark and their equivalents in Snowpark Python. You can find the complete list of PySpark functions and the mapped Snowpark Python functions here. The table below lists a few functions that are similar between PySpark and Snowpark Python.

While many functions are similar on both platforms, some are not yet supported in Snowpark. But please note that Snowpark for Python is only in public preview at the moment; Snowflake is adding new functions continuously, and the parity gap between the platforms will keep shrinking.

Image by Author — Column Functions
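As a quick illustration, here is a minimal sketch of Column methods that carry over unchanged; the EMPLOYEES table and its columns are hypothetical:

```python
# PySpark
from pyspark.sql.functions import col

spark.table("EMPLOYEES").select(
    col("NAME").alias("EMP_NAME"),  # rename a column
    col("SALARY").cast("double"),   # change its data type
).filter(col("NAME").like("A%")).show()

# Snowpark for Python: the Column expressions are unchanged.
from snowflake.snowpark.functions import col

session.table("EMPLOYEES").select(
    col("NAME").alias("EMP_NAME"),
    col("SALARY").cast("double"),
).filter(col("NAME").like("A%")).show()
```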

Windows

A window function operates on a group ("window") of related rows. Each time a window function is called, it is passed a row (the current row in the window) and the window of rows that contains the current row. The window function returns one output row for each input row; the output depends on the individual row passed to the function and on the values of the other rows in the window. The table below lists a few functions that are similar between PySpark and Snowpark Python. You can find the entire list of window functions here.

Image by Author — Windows Functions
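For example, here is a minimal sketch ranking rows within each region of a hypothetical SALES table by amount. Note that Snowpark uses snake_case method names (partition_by, order_by) where PySpark uses camelCase:

```python
# PySpark
from pyspark.sql.functions import col, rank
from pyspark.sql.window import Window

w = Window.partitionBy("REGION").orderBy(col("AMOUNT").desc())
spark.table("SALES").select(
    "REGION", "AMOUNT", rank().over(w).alias("RNK")
).show()

# Snowpark for Python: same shape, snake_case method names.
from snowflake.snowpark import Window
from snowflake.snowpark.functions import col, rank

w = Window.partition_by("REGION").order_by(col("AMOUNT").desc())
session.table("SALES").select(
    "REGION", "AMOUNT", rank().over(w).alias("RNK")
).show()
```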

Grouped Data

This class contains a set of methods for aggregations on a DataFrame, including first-order statistics such as mean and sum, which are among the most commonly used functions in data transformations. It represents an underlying DataFrame whose rows are grouped by common values and can be used to define aggregations on those grouped DataFrames. The table below lists a few functions; the entire list can be found here.

Image by Author — Grouped Data Functions
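For instance, a minimal sketch aggregating a hypothetical SALES table; note groupBy in PySpark versus group_by in Snowpark:

```python
# PySpark
from pyspark.sql.functions import avg, sum as sum_

spark.table("SALES").groupBy("REGION").agg(
    sum_("AMOUNT").alias("TOTAL"),
    avg("AMOUNT").alias("AVERAGE"),
).show()

# Snowpark for Python
from snowflake.snowpark.functions import avg, sum as sum_

session.table("SALES").group_by("REGION").agg(
    sum_("AMOUNT").alias("TOTAL"),
    avg("AMOUNT").alias("AVERAGE"),
).show()
```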

Conclusion

With the goal of bringing the code to the data (data governance) and giving developers the freedom to code in the language of their choice (developer friendliness), Snowflake has taken a great leap in the technology industry, setting great steps to follow. This blog series guides developers, partners, and customers in executing a PySpark to Snowpark migration the right way. In this blog we covered how UDFs can be migrated and the function parity between PySpark and Snowpark.

One more amazing thing about Snowpark is that you can write one UDF in Python, another in Java, and another in Scala (imagine three teams developing a pipeline based on their skill sets, or porting existing algorithms already written in Java or Scala into Snowpark) and call them from one another: a Python UDF can be called from a Java UDF, or in SQL directly. This gives you the flexibility to make algorithms in different languages (Java, JavaScript, Python, SQL) work together in a single application. As most Snowflake customers say, "IT JUST WORKS" :)

So what’s next?

We are going to use the outcome of part 2 to develop an analyzer script that scans a PySpark script and generates the list of functions that need workarounds. This can help you assess how much effort is needed to migrate your existing Spark code to Snowpark Python. There are various other items to follow too, so stay tuned!

Image by Author — What’s next in the series

PS: Snowpark Python is in public preview, and we highly recommend trying it out with your existing Spark workloads.
