
Snowpark in Snowflake

May 17, 2023


Some businesses are exploring Snowpark and Snowflake and recommending that their data engineers, scientists, and analysts move away from Databricks. To use Snowpark successfully, you must first understand its syntax.

Snowpark provides an easy-to-use API for accessing and processing data in a data pipeline. You may use the Snowpark library to create apps that handle data in Snowflake without transferring it to the machine where your application code runs. You may also automate data transformation and processing by creating stored procedures and scheduling them as Snowflake tasks.
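As a rough sketch of that workflow, the code below creates a Snowpark session, wraps a transformation in a stored procedure, and schedules it as a task. All connection values, table, stage, and warehouse names are placeholders, not anything from this article:

```python
from snowflake.snowpark import Session

# Connection values are placeholders; fill in your own account details.
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# A transformation that runs entirely inside Snowflake (table names are made up).
def nightly_transform(session: Session) -> str:
    (
        session.table("RAW_ORDERS")
        .filter("STATUS = 'SHIPPED'")
        .write.save_as_table("SHIPPED_ORDERS", mode="overwrite")
    )
    return "done"

# Register it as a permanent stored procedure on an existing stage...
session.sproc.register(
    nightly_transform,
    name="NIGHTLY_TRANSFORM",
    packages=["snowflake-snowpark-python"],
    is_permanent=True,
    stage_location="@sproc_stage",  # hypothetical stage
    replace=True,
)

# ...and schedule it as a Snowflake task (resume the task to activate it).
session.sql("""
    CREATE OR REPLACE TASK NIGHTLY_TRANSFORM_TASK
      WAREHOUSE = MY_WH
      SCHEDULE = 'USING CRON 0 2 * * * UTC'
    AS CALL NIGHTLY_TRANSFORM()
""").collect()
```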

Snowpark does not optimize or speed up execution beyond what you can already do with the Snowflake SQL engine; performance is similar. On a core-by-core comparison, however, the Snowflake SQL engine is far faster than Spark on regular ETL or query workloads: 2–10x is what we see on average on similarly sized compute.

Snowflake’s Streamlit integration will combine Streamlit’s ease of use and flexibility with Snowflake’s scalability, managed data governance, and security, allowing developers to create powerful applications without the usual complexities of building and deploying web apps.

The integration will let developers use Python to build applications on Snowflake data, deploy and run those applications on Snowflake’s managed platform, and share them with business teams so they can make use of the data and ML models.

Snowpark accomplishes two important tasks:

1. It enables you to carry out data engineering tasks using Python and dataframes without writing SQL in your code. Snowpark dataframe methods are so similar to PySpark that 80–90% of your code will stay unchanged if you decide to switch.

Snowpark dataframes run remotely on Snowflake’s serverless MPP compute clusters. This means the Python environment in which the code runs has no effect on actual execution performance: regardless of how much data is being processed or how small or slow the machine running the code is (a local laptop, a Jupyter notebook, a free cloud notebook like Colab), the operations run exactly the same, because Snowflake does all the computation. Snowpark achieves this by converting dataframe operations to ANSI SQL in a lazy execution paradigm and sending them to Snowflake for execution.
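Here is a minimal sketch of that lazy model, reusing the session from the earlier sketch; the table and column names are made up. The chained calls read almost exactly like PySpark, and the dataframe’s `queries` property lets you inspect the generated SQL before anything runs:

```python
from snowflake.snowpark.functions import col

# EVENTS and its columns are invented for illustration.
df = (
    session.table("EVENTS")
    .filter(col("EVENT_TYPE") == "click")
    .group_by("PAGE")
    .count()
)

# Nothing has executed yet: inspect the ANSI SQL Snowpark will send to Snowflake.
for query in df.queries["queries"]:
    print(query)

# Execution only happens on an action such as collect(), show(), or to_pandas().
rows = df.collect()
```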

You also have access to clusters that can start, stop, and scale up or down in seconds, and you pay for compute only for as long as your job runs. Snowflake does not suffer from slow cluster provisioning: it can bring up clusters with hundreds of nodes in about a second, run your dataframe operations, and auto-suspend a second after the code finishes processing, resulting in significant cost savings.
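Warehouse management is plain SQL issued through the same session. A hedged sketch, with made-up warehouse names and sizes:

```python
# AUTO_SUSPEND is in seconds; names and sizes are examples only.
session.sql("""
    CREATE WAREHOUSE IF NOT EXISTS ETL_WH
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE
      INITIALLY_SUSPENDED = TRUE
""").collect()

# Scale up just before a heavy job, then back down when it finishes;
# the warehouse suspends on its own once it goes idle.
session.sql("ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'XLARGE'").collect()
# ... run dataframe operations here ...
session.sql("ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'XSMALL'").collect()
```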

2. What happens when your dataframes need operations that SQL cannot express? As an example, consider Python code that calls the NLTK library to do sentiment analysis. In this scenario, Snowpark packages up your custom Python function code along with its third-party libraries, uploads them to Snowflake, and registers them as user-defined functions. Dataframes then use these functions as part of their operations, where the Python code runs directly on Snowflake compute clusters and is parallelized across all cores in a cluster. The larger the cluster, the faster it runs. Your code does not need to be configured, tweaked, or optimised.
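A minimal sketch of that registration pattern follows. To keep it self-contained, the scoring logic is a trivial word-list stand-in rather than real NLTK code (NLTK would be declared via the `packages` argument and its corpora staged separately); the table and column names are made up:

```python
from snowflake.snowpark.functions import udf, col

# The UDF body ships to Snowflake and runs there, not on your machine.
@udf(name="SENTIMENT_SCORE", replace=True, session=session)
def sentiment_score(text: str) -> float:
    positive = {"good", "great", "love"}
    negative = {"bad", "poor", "hate"}
    words = (text or "").lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return score / max(len(words), 1)

# The UDF executes inside Snowflake, parallelized across the warehouse's cores.
reviews = session.table("REVIEWS")
reviews.select(
    col("REVIEW_TEXT"),
    sentiment_score(col("REVIEW_TEXT")).alias("SCORE"),
).show()
```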

Snowpark can also make the Python functions it registers during dataframe operations permanent. In that case they are not dropped at the end of the session and can be used against any data set in the future by any SQL-savvy person or BI tool.
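A hedged sketch of a permanent registration; the stage, function, and table names are invented:

```python
# A plain Python function registered as a *permanent* UDF: it outlives the session.
def clean_text(text: str) -> str:
    # Placeholder logic standing in for whatever library code you need.
    return (text or "").strip().lower()

session.udf.register(
    clean_text,
    name="CLEAN_TEXT",
    is_permanent=True,
    stage_location="@udf_stage",  # an existing internal stage
    replace=True,
)

# Any SQL-savvy user or BI tool can now call it against any table:
session.sql("SELECT CLEAN_TEXT(REVIEW_TEXT) FROM REVIEWS LIMIT 10").collect()
```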

Python support has been introduced at both the query planner and execution engine levels, so the Snowflake query planner is fully aware of the nature of the UDF and how to optimise for it.

When dealing with Snowpark in Snowflake, keep the following points in mind:

  1. It was promoted as letting you do data-related work in Python inside Snowflake, giving you access to Snowflake compute and your data without needing to move data outside of Snowflake for data analysis and data science work.
  2. You cannot execute pure Python, pandas, NumPy, or any other library code in the Snowflake environment. If you want to run existing code on Snowflake, you must first rewrite it in Snowpark syntax.
  3. A Snowpark dataframe and a pandas DataFrame are not the same thing. Snowpark is handy for connecting to Snowflake, retrieving data, saving that data as a pandas DataFrame (which defeats the point of Snowpark), doing the work in pandas/Python, and then using Snowpark to push the resulting data and tables back to Snowflake if necessary (a minimal sketch of this round trip follows the list).
  4. When it comes to getting data out of Snowflake and putting data back into Snowflake, it is better and easier than SQLAlchemy.
  5. In Snowpark, a UDF is essentially a Snowflake function written in Python rather than SQL. It is not powerful: a parameterized UDF is inherently difficult to implement.
  6. They give a basic UDF as an “example”, but when you try to apply it in a real-world application, you will find it lacking. A UDF that requires passing in column names as input to compute a result, for example, is almost impossible to construct in Snowpark.
  7. Limitation: when you want to migrate to a different system, you cannot simply run your Snowpark code there the way you can run regular pandas or Python code on any Python system.
  8. Dependency: Snowpark syntax needs Snowflake to initialize and run, unlike pandas or other Python libraries, which you can execute without Snowflake.
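As referenced in point 3, here is a minimal sketch of that round trip, with made-up table and column names; `to_pandas()` pulls results into a local pandas DataFrame and `write_pandas()` pushes the result back:

```python
# Round trip: Snowflake -> pandas -> Snowflake.
features = session.table("CUSTOMER_FEATURES").to_pandas()

# Do whatever pandas/Python work you need locally...
features["SPEND_RATIO"] = features["TOTAL_SPEND"] / features["ORDER_COUNT"]

# ...then push the result back into Snowflake as a table.
session.write_pandas(
    features,
    table_name="CUSTOMER_FEATURES_ENRICHED",
    auto_create_table=True,
    overwrite=True,
)
```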

Snowpark is a query builder, not a replacement for Spark. It can only handle a few additional tasks that require third-party libraries.

Snowflake’s Python support still appears to fall short. Data engineers, scientists, and analysts wanted to run several ML models directly inside Snowflake, because the I/O cost of pulling data out of Snowflake was high, but it did not provide the libraries we needed.

Additional tips: consider whether your idea of data being secure looks like the following:

  1. You have to build the right IAM rules so no one can access the Parquet files on cloud storage directly, outside of the lakehouse platform.
  2. You have to configure additional encryption for storage using your own keys.
  3. You have to configure an additional Unity Catalog service.
  4. You have to apply RBAC rules on every piece of data.
  5. You have to make sure the clusters in use are the proper version and configured properly, so they don’t ignore RBAC rules and expose all data to everyone.
  6. You have to configure the platform so users cannot create or modify their own clusters, to avoid creating cluster types that do not support RBAC.

Written by Deep Arjun

Data Engineer | Python | PySpark | SQL | Kafka || AWS || GCP || Azure
