Migrating from PySpark to Snowpark Python Series — Part 3

Dated: Oct 2022

Co-Author: Su Dogra

Contributors: Naveen Alan, Prathamesh Nimkar, Chinmayee Lakkad

This is the third and final blog post of our series on Migrating from PySpark to Snowpark Python. In case you missed the first two, the links are below, and we highly recommend going through them before continuing.

Migrating from PySpark to Snowpark Python Series — Part 1

Migrating from PySpark to Snowpark Python Series — Part 2

In the earlier blog posts of this series, we showed you how to get started with migrating your workload from PySpark to Snowpark Python. We also provided a function parity mapping document that can help you build accelerators or tools for both code assessment and code conversion. The most important step before code conversion is to assess what percentage of your code can be reused and how much needs to be rewritten.

In this blog we will talk about how you can perform the code assessment using the Mobilize.Net SnowConvert for Spark Qualification Tool. This tool scans through your PySpark code locally on your machine and gives you a migration readiness score, which tells you what percentage of the code can be converted to Snowpark Python with no or minimal changes. And the most interesting part is that the assessment is free of cost: anyone can download the tool and install it locally on their machine without any license cost for code assessment. The Mobilize.Net SnowConvert for Spark Qualification Tool is a joint effort between Snowflake and Mobilize.Net to help customers and partners understand how ready their Spark workloads are for Snowpark.

Migration Process Overview

The snippet below gives you a high-level overview of the steps involved in migrating from PySpark to Snowpark Python. As you can see, the first two steps are about discovery and mitigation, which are the most important steps before you can start converting your code and performing data migration. In these two steps, you identify the tools required for assessment as well as the modules and code dependencies of your existing code base. Then you identify the workarounds and determine the manual effort required to rewrite the code for the APIs that cannot be reused as-is in Snowpark Python.

Image by Author — Migration Phases

Each phase involved in the overall migration has a set of steps. One of the important steps in the discovery phase is getting the readiness score, and the SnowConvert for Spark Qualification Tool helps you achieve that. Once the discovery phase is completed, we need to start working on the mitigation phase, where we:

  1. Identify the workarounds required for the unsupported APIs.
  2. Determine how to reuse the custom modules used in the PySpark code.

The flowchart below can help you understand at what stage(s) mitigation might be required.

Image by Author — Mitigation Scenarios

Let’s talk about the major blocks from the flowchart:

  1. Checking for Limitations

You might need to identify the limitations along with the alternative options and the time and effort required to implement them in Snowpark. Some of the limitations are:

  • PySpark APIs that are not available today in Snowpark Python.
  • Certain data science workloads that need GPU-based compute.

  2. Identify Packages Used

  • There could be certain packages, such as scikit-learn, which you have installed using the pip or Conda package manager. To use these packages in Snowpark, we need to check if they are available in the Snowflake Conda channel managed by Anaconda. If there are packages you are using that are not available in the channel, then you might need to look for a workaround (e.g., by importing the libraries you need through the Snowpark Session, as sketched below).
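
To make this concrete, below is a minimal sketch of that Session-based workaround. The connection parameters and the module path are placeholders you would replace with your own details:

    from snowflake.snowpark import Session

    # Placeholder connection details for your environment.
    connection_parameters = {
        "account": "<your_account>",
        "user": "<your_user>",
        "password": "<your_password>",
        "warehouse": "<your_warehouse>",
        "database": "<your_database>",
        "schema": "<your_schema>",
    }
    session = Session.builder.configs(connection_parameters).create()

    # scikit-learn is available in the Snowflake Conda channel, so it can
    # be declared directly and used inside UDFs and stored procedures.
    session.add_packages("scikit-learn")

    # A pure-Python module that is NOT in the channel can often be shipped
    # with the session instead, so server-side code can still import it.
    session.add_import("/path/to/my_custom_module.py")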

Manually checking for the PySpark APIs that are used in your codebase but not supported in Snowpark Python today is a time-consuming process. To help accelerate the PySpark code assessment, you can use the SnowConvert for Spark Qualification Tool.
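
To get a feel for what the tool automates, here is a toy sketch (hypothetical, not part of SnowConvert) that merely inventories the names a codebase imports from pyspark. Cross-checking every reference against the Snowpark Python API surface, and scoring the result, is the part the qualification tool does for you:

    import ast
    from collections import Counter
    from pathlib import Path

    def pyspark_imports(root: str) -> Counter:
        """Count names imported from pyspark across all .py files under root."""
        counts = Counter()
        for path in Path(root).rglob("*.py"):
            tree = ast.parse(path.read_text())
            for node in ast.walk(tree):
                if isinstance(node, ast.ImportFrom) and node.module \
                        and node.module.startswith("pyspark"):
                    for alias in node.names:
                        counts[f"{node.module}.{alias.name}"] += 1
        return counts

    # e.g. [('pyspark.sql.functions.col', 42), ('pyspark.sql.SparkSession', 7), ...]
    print(pyspark_imports("./my_pyspark_project").most_common(10))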

As mentioned earlier, this tool will help you analyze your PySpark code and give you a migration readiness score for converting it to Snowpark Python. The migration readiness score is a metric that tells you what percentage of your code can be converted to Snowpark with no or minimal changes. Please note that this tool does not perform any code conversion while you are assessing your code. While the code is being assessed, the tool only extracts the PySpark APIs used in the codebase you have provided as input.

Downloading and Running the SnowConvert for Spark Qualification Tool

You can go through the recording below to understand how to download the tool, run it, and look at the files it outputs. Please note that your PySpark code is not copied to any location outside the environment or machine from which you are running the tool. Only the PySpark API references used in your PySpark code are uploaded to the SnowConvert database.

Download URL: https://www.mobilize.net/products/database-migrations/snowconvert/spark-assessment-download

The video below will walk you through the steps involved, from downloading and running the tool all the way to the next steps after the run completes.

High Level Steps from the Video

This tool works on Windows and macOS. Once the form is submitted and the tool is downloaded, you will get an email with the license key. This license key is unique to every user.

  • Extract the downloaded Mobilize.Net SnowConvert executable program file and install the tool.
  • Use this license key when launching SnowConvert for the first time.
Image by Author — SnowConvert License Activation
  • If a new update is available, an update button appears in the notification message at the top right of the tool. Press the Update Now button to download the newest version of the Spark Qualification Tool.
Image by Author — SnowConvert Version Update option

As mentioned in the recordings, the files and folders below are the only outputs generated by your execution. There are no references to the actual code/logic used in your codebase in any of these files.

Image by Author — SnowConvert Files and Folders Generated

Inventories: This folder contains several .pam files. These are text files containing information output by the tool, including the information used to build the output reports.

Logs: Log files contain the details of the tool’s execution. Any processes, validations, or tasks run by the tool will be printed in the log files.

Reports: The Reports folder holds the key results of running SnowConvert in assessment mode. There are three files in this folder:

  • File Inventory — This generates an inventory of all the files present in the tool’s input directory. This could be any file, not just Spark source code. You will get a breakdown by filetype that includes the source technology, code lines, comment lines, and size of the source files.
  • Keyword Counts — The keyword counts file contains the counts of recognized keywords present in each file. For example, if you have CREATE statements in a SQL file, this file will keep track of all of them. You will get a count of how many of each keyword you have by filetype.
  • Spark Reference Inventory — Finally, you will get an inventory of every reference to the Spark API present in your Scala or Python code. These references form the basis for assessing the level of conversion that can be applied to a given codebase.

Generating the Report and Performing Analysis

After the tool completes the assessment, you will receive an email from Mobilize with the subject line “Snowspark Qualification Report,” as shown below. The email states whether or not your workload is a good candidate for migration. For each run, the user will get an email with a unique Session Id, as seen in the screenshot below.

Image by Author — Qualification tool Email

You can reach out to your Snowflake account team or Partner SE team to get the details of the readiness score and identify the next steps. They will help you generate a summary and a detailed report, which include the migration readiness score along with how many supported and unsupported APIs were identified in your codebase. Below is the Summary Report of the assessment.

Image by Author — Summary Assessment Report

Using the reports to identify effort and estimates for workarounds

The Summary and Detailed reports that your account or SE team will generate will help you identify the APIs that require no or minimal code changes and how many APIs need a workaround. When you read the readiness score from the above screenshot as 97.8%, it means that 97.8% of the PySpark APIs used in your codebase are available in Snowpark Python and can be used with no or minimal changes.
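
As an illustration of how to read that number, the figures below are hypothetical, chosen only to reproduce the 97.8% from the screenshot:

    # Hypothetical figures: the readiness score is the share of PySpark
    # API references that map to Snowpark Python with no or minimal changes.
    supported_refs = 443   # references convertible as-is or with minor edits
    total_refs = 453       # all PySpark API references found in the codebase

    readiness_score = 100 * supported_refs / total_refs
    print(f"Migration readiness score: {readiness_score:.1f}%")  # 97.8%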

When you look further into the detailed report, shown in the screenshot below, you will find an analysis based on PySpark API usage. This report helps you identify and segregate effort by API and gives you more clarity on where more effort will be required with respect to code rewrites.

Image by Author — Detailed Assessment Report

The SnowConvert for Spark Qualification Tool will accelerate the overall effort required to identify missing APIs and enable you to focus on estimating the workarounds required for the ones not supported in Snowpark Python. Below are a couple of examples of workarounds for APIs that are not supported today in Snowpark Python.
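
For illustration, here is a hedged sketch of two such workarounds, assuming a Snowpark session already exists. As of this writing, PySpark’s explode has no direct Snowpark counterpart but can typically be replaced with DataFrame.flatten, and row-level logic written against df.rdd.map can be wrapped in a UDF. The supported API surface keeps growing, so re-check before relying on these:

    from snowflake.snowpark.functions import col, udf
    from snowflake.snowpark.types import StringType

    df = session.create_dataframe(
        [(1, ["a", "b"]), (2, ["c"])], schema=["id", "items"]
    )

    # PySpark:  df.select("id", explode("items").alias("item"))
    # Snowpark: DataFrame.flatten joins each row to the elements of its
    # array column and exposes each element in the VALUE column.
    exploded = df.flatten(col("items")).select(
        col("id"), col("VALUE").cast(StringType()).alias("item")
    )

    # PySpark's df.rdd.map(...) has no Snowpark equivalent; wrap the
    # row-level logic in a UDF and apply it column-wise instead.
    @udf(return_type=StringType(), input_types=[StringType()])
    def clean_item(item: str) -> str:
        return item.upper()

    exploded.select(col("id"), clean_item(col("item")).alias("item")).show()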

Please do reach out to the Snowflake account team or Partner SE team to know more about the code conversion options.

Conclusion

In this blog we covered the high-level steps involved in migrating from PySpark to Snowpark Python, the stages at which mitigation is required, and how to use the SnowConvert for Spark Qualification Tool to perform the code assessment and get the readiness score for your PySpark workload. This concludes our blog series on Migrating from PySpark to Snowpark Python, and we hope it helps you migrate your workloads from PySpark to Snowpark Python.
