Snowpark or PySpark for Local Environment Setup?

In the realm of data engineering, where the ocean of information meets the shores of analytics and processing, tools like Snowflake Snowpark API and Apache PySpark stand as lighthouses, guiding practitioners through the waves of big data challenges.

For industry newcomers embarking on their data engineering journey, navigating these technologies can be daunting. Fear not, as this article serves as your compass, guiding you through the initial steps of setting up a local environment for Apache PySpark and Snowflake Snowpark. Whether you’re aiming to wrangle massive datasets or unlock the insights hidden within, this guide will illuminate the path forward, ensuring a smooth voyage into the world of data engineering.

Comparing Initial Setup

This guide assumes that the primary IDE will be Microsoft VS Code.

Snowpark for Python

You can create a Python virtual environment for a particular Python version using tools like Anaconda, Miniconda, or virtualenv.

First, we create a virtual environment, add the Snowflake conda channel, and install the numpy and pandas packages. Creating a new conda environment locally with the Snowflake channel is recommended for the best experience when using UDFs.

conda create --name py310_env --override-channels -c https://repo.anaconda.com/pkgs/snowflake python=3.10 numpy pandas pyarrow

Then, we install the Snowpark Python package into the Python virtual environment, along with Jupyter Notebook:

pip install snowflake-snowpark-python
pip install notebook

That’s it! We are now ready to start our first Snowpark Python project.
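To quickly confirm that the installation worked, you can run a small import check from a Python shell or notebook (a minimal sanity check, nothing Snowflake-specific yet):

# Quick sanity check that the Snowpark package is importable
from importlib.metadata import version
import snowflake.snowpark
print("snowflake-snowpark-python", version("snowflake-snowpark-python"))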

Apache PySpark

For the Apache PySpark setup, we need to download a few things and then update the environment variable paths. These are:

  • Java Download and Installation
  • Apache Spark Download and Extraction
  • Winutils Download

Once we have completed the above installation and extraction steps, we have to set some environment variables on our local machine. Those are:

HADOOP_HOME = C:\Users\divya\Downloads\spark
JAVA_HOME = C:\Program Files\Java\jdk-22
PYSPARK_PYTHON = C:\Users\divya\anaconda3\python.exe
SPARK_HOME = C:\Users\divya\Downloads\spark

Path should include:
%HADOOP_HOME%\bin
%JAVA_HOME%\bin
%SPARK_HOME%\bin

Once the above variables are set, you need to install the pyspark package.

pip install pyspark

Note: You need to change the above directory paths based on your local system. Along with that, you need to place the winutils.exe file inside the spark\bin folder. If anything throws an error, try rebooting your machine or search for the solution on Google ;)
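A quick way to confirm the PySpark installation (a minimal check, assuming the environment variables above are set correctly) is to print the installed version from a Python shell:

# Quick sanity check that the PySpark package is importable
import pyspark
print(pyspark.__version__)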

Building a Session

When creating a session in PySpark or Snowpark, the syntax is almost the same: you import SparkSession from pyspark.sql, while for Snowpark you import Session from snowflake.snowpark.

The Snowpark local testing framework allows you to create and operate on Snowpark Python DataFrames locally without connecting to a Snowflake account.
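As a minimal sketch (the connection parameters below are placeholders, not real credentials), the two builders look very similar:

# PySpark: local session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# Snowpark: local testing session, no Snowflake account needed
from snowflake.snowpark import Session
snowpark_session = Session.builder.config("local_testing", True).create()

# Snowpark: regular session against a Snowflake account (placeholder values)
connection_parameters = {
    "account": "<your_account>",
    "user": "<your_user>",
    "password": "<your_password>",
    "role": "<your_role>",
    "warehouse": "<your_warehouse>",
    "database": "<your_database>",
    "schema": "<your_schema>",
}
# snowflake_session = Session.builder.configs(connection_parameters).create()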

Dataframe Playground and Transformations

You can create DataFrames using createDataFrame() in both Snowpark and PySpark. In my local tests, the Snowpark local testing framework showed great performance: while PySpark took a noticeable amount of time to create a DataFrame, Snowpark finished in less than a second.
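For example, continuing with the sessions created above (the sample data is made up), a small DataFrame can be created like this:

data = [(1, "Alice"), (2, "Bob")]
columns = ["ID", "NAME"]

# PySpark
pyspark_df = spark.createDataFrame(data, schema=columns)
pyspark_df.show()

# Snowpark (snake_case method name)
snowpark_df = snowpark_session.create_dataframe(data, schema=columns)
snowpark_df.show()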

Both PySpark and Snowpark allow you to select columns using the dot operator, [] indexing, and the col() function. In PySpark, the column names are case-sensitive.
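A quick sketch of the three selection styles, reusing the DataFrames from above:

from pyspark.sql.functions import col as spark_col
from snowflake.snowpark.functions import col as snowpark_col

# Dot operator, [] indexing, and col() in PySpark
pyspark_df.select(pyspark_df.NAME, pyspark_df["ID"], spark_col("NAME").alias("NAME_AGAIN")).show()

# The same three styles in Snowpark
snowpark_df.select(snowpark_df.NAME, snowpark_df["ID"], snowpark_col("NAME").alias("NAME_AGAIN")).show()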

You can import functions in Snowpark from snowflake.snowpark.functions, while for PySpark you need to use pyspark.sql.functions.

PySpark supports both withColumn() and withColumns(), but Snowpark currently does not support withColumns().
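For instance, continuing with the DataFrames above and the function modules just mentioned:

from pyspark.sql import functions as F
from snowflake.snowpark import functions as sf

# PySpark: withColumn() adds one column; withColumns() can add several at once
pyspark_df = pyspark_df.withColumn("NAME_UPPER", F.upper(F.col("NAME")))

# Snowpark: with_column() adds one column; there is no withColumns() equivalent yet
snowpark_df = snowpark_df.with_column("NAME_UPPER", sf.upper(sf.col("NAME")))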

Filters, sorting, and aggregations use essentially the same syntax in both PySpark and Snowpark.
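A small sketch of the same filter/sort/aggregate pipeline in both APIs, continuing from the snippet above (only the snake_case group_by differs):

# PySpark
(pyspark_df
    .filter(F.col("ID") > 1)
    .groupBy("NAME")
    .agg(F.count("ID").alias("CNT"))
    .sort("NAME")
    .show())

# Snowpark
(snowpark_df
    .filter(sf.col("ID") > 1)
    .group_by("NAME")
    .agg(sf.count("ID").alias("CNT"))
    .sort("NAME")
    .show())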

You can define a schema in both PySpark and Snowpark. All you need to do is import StructType, StructField, and the data type classes from snowflake.snowpark.types or pyspark.sql.types.
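A minimal schema definition in both, side by side (the PySpark classes are aliased here only so that both imports fit in one snippet):

from pyspark.sql.types import StructType as SparkStructType, StructField as SparkStructField, StringType as SparkStringType, IntegerType as SparkIntegerType
from snowflake.snowpark.types import StructType, StructField, StringType, IntegerType

# PySpark schema
pyspark_schema = SparkStructType([
    SparkStructField("ID", SparkIntegerType()),
    SparkStructField("NAME", SparkStringType()),
])

# Snowpark schema: same shape, different import path
snowpark_schema = StructType([
    StructField("ID", IntegerType()),
    StructField("NAME", StringType()),
])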

In Snowpark, the process of reading a locally available file (for instance, a CSV) is different. You need to load the file into a temporary stage using session.file.put() and then use that stage location path to read the file.

Also, Snowpark currently does not support schema inference in the local testing framework; however, you can use it when connected to a Snowflake account.
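Here is a minimal sketch of that flow, reusing the Snowpark schema defined above (the file path and stage name are placeholders; against a real Snowflake account you would create the stage first):

# Snowpark: upload the local CSV to a stage, then read it back with an explicit schema
snowpark_session.file.put("file://C:/data/employees.csv", "@my_stage", auto_compress=False)

csv_df = (snowpark_session.read
          .schema(snowpark_schema)   # explicit schema, since inference is not supported locally
          .csv("@my_stage/employees.csv"))
csv_df.show()

# PySpark: read the same file directly from the local filesystem
spark.read.option("header", True).option("inferSchema", True).csv("C:/data/employees.csv").show()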

Both Snowpark and PySpark offer the ability to create a temporary view of a DataFrame. However, Snowpark does not allow it in the local testing environment; you can still use it with a Snowflake account connection.
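For reference, a sketch of both calls (the view name is just an example, and the Snowpark calls assume a Snowflake account connection):

# PySpark
pyspark_df.createOrReplaceTempView("EMPLOYEES_VW")
spark.sql("SELECT * FROM EMPLOYEES_VW").show()

# Snowpark: works against a Snowflake account, not in the local testing framework
snowpark_df.create_or_replace_temp_view("EMPLOYEES_VW")
snowpark_session.sql("SELECT * FROM EMPLOYEES_VW").show()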

You can easily perform join and union operations on DataFrames in both Snowpark and PySpark. Snowpark does not allow you to join a DataFrame with itself, so you need to work around this by joining against a copy of the DataFrame, as shown below.
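A short sketch with made-up data, including the copy-based workaround for a Snowpark self-join:

import copy

employees = snowpark_session.create_dataframe([(1, "Alice"), (2, "Bob")], schema=["ID", "NAME"])
salaries = snowpark_session.create_dataframe([(1, 50000), (2, 60000)], schema=["ID", "SALARY"])

# Join and union look the same as in PySpark
joined = employees.join(salaries, "ID")
unioned = employees.union(employees)

# Self-join workaround: join against a copy of the DataFrame
employees_copy = copy.copy(employees)
self_joined = employees.join(employees_copy, employees["ID"] == employees_copy["ID"])
joined.show()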

Key Conclusions

These conclusions are based on my individual opinion and point of view; they may vary for different readers and environments.

In terms of Initial Setup:

Snowpark is straightforward and very easy to set up.

In terms of Performance:

Again, Snowpark was faster than PySpark when we tested both on a local setup.

In terms of Data Transformation Capabilities:

Both Snowpark and PySpark provide a wide range of functions for DataFrame transformations.

Snowflake Snowpark API provides a powerful and scalable framework for data engineering tasks, allowing you to process and transform large volumes of data efficiently. By mastering Snowpark, you can become proficient in handling big data and build robust data engineering pipelines.

Remember to consult the official Snowflake Snowpark documentation for detailed information, explore additional features and advanced techniques, and stay updated with the latest developments in the Snowpark ecosystem.

Additionally, consider joining online communities and forums and attending data engineering conferences to connect with fellow practitioners, learn from their experiences, and stay up to date with industry trends.

About Me:

Hi there! I am Divyansh Saxena

I am an experienced Cloud Data Engineer with a proven track record of success in Snowflake Data Cloud technology. Highly skilled in designing, implementing, and maintaining data pipelines, ETL workflows, and data warehousing solutions. Possessing advanced knowledge of Snowflake’s features and functionality, I am a Snowflake Data Superhero & Snowflake Snowpro Core SME. With a major career in Snowflake Data Cloud, I have a deep understanding of cloud-native data architecture and can leverage it to deliver high-performing, scalable, and secure data solutions.

Follow me on Medium for regular updates on Snowflake Best Practices and other trending topics:

Also, I am open to connecting with data enthusiasts across the globe on LinkedIn:

https://www.linkedin.com/in/divyanshsaxena/

New Ways To Stay Connected

I’ve been getting a lot of DMs asking for guidance, so I decided to take action on it.

I’m excited to help folks out and give back to the community via Topmate. Feel free to reach out if you have any questions or just want to say hi!

https://topmate.io/divyansh_saxena11
