Snowpark Fast & Furious: Streamlining your Data Pipelines

Photo by Marc-Olivier Jodoin on Unsplash

Last week, Snowpark Scala and Java UDFs became generally available (GA). This is an incredible step forward for streamlining architectures, building scalable and optimized data pipelines, and enforcing security and governance across all workflows.

My colleague Mats Stellwall did an incredible job in his blog post Feature Engineering with Snowflake, documenting how to use Snowpark Scala. It was based on the book Reproducible Machine Learning for Credit Card Fraud Detection, which explains feature engineering techniques to enrich your data set and get better results from your ML algorithm. The book provides Python examples to create those transformations. Running that Python code in a Jupyter notebook to create the features takes a large amount of time, as the data needs to be collected, transformed, and then stored again in the database:
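
To make the pattern concrete, here is a minimal sketch of that collect-transform-store approach using the Snowflake Python connector. The table and column names (CUSTOMER_TRANSACTIONS, TX_AMOUNT, and so on) are hypothetical placeholders, not the ones used in the book:

```python
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Hypothetical connection details
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)

# 1. Collect: pull the whole transactions table into local memory
df = conn.cursor().execute(
    "SELECT * FROM CUSTOMER_TRANSACTIONS"
).fetch_pandas_all()

# 2. Transform: compute a feature locally in pandas, e.g. the average amount
#    of the current and previous transactions per customer
df = df.sort_values(["CUSTOMER_ID", "TX_DATETIME"])
df["AVG_AMOUNT_LAST_7_TX"] = (
    df.groupby("CUSTOMER_ID")["TX_AMOUNT"]
      .transform(lambda s: s.rolling(7, min_periods=1).mean())
)

# 3. Store: push everything back into Snowflake (target table assumed to exist)
write_pandas(conn, df, "CUSTOMER_TRANSACTIONS_FEATURES")
```

Every step here moves the full data set out of the database and back again, and the transformation itself runs single-threaded in the notebook, which is where most of the time goes.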

It would be very common to create a Spark cluster to perform those transformations. That would mean bringing your data out of the data warehouse, performing all the transformations to create the new features, and writing all the data back again. And before any of that, you need to create your Spark cluster, which requires some expertise to make the right choices. Here is one example of the options that need to be decided when creating a Spark cluster:

Typically the highest configuration will be chosen to avoid any errors, as troubleshooting would be a very painful and time-consuming effort. In the end, running those transformations with Spark clusters means moving data around and having more security controls to manage. This figure tries to illustrate that:

The Snowpark name may be confusing; many times I have been asked: “so, you now have a Spark cluster within Snowflake running those transformations?”. That is wrong. The beauty of Snowpark is that it gets rid of all that complexity. Snowpark does NOT use any Spark cluster. It simply pushes down all the transformations to the Snowflake platform. Therefore, there is nothing else to manage, and Snowflake governance and security are applied. It is incredibly efficient because data never leaves Snowflake, and it uses the agility, flexibility and power of Snowflake compute (virtual warehouses). In contrast to Spark clusters, which need a few minutes to be activated when needed, Snowflake warehouses resume in no time. Also, Snowflake warehouses can be suspended after one minute of inactivity, while Spark clusters typically require a minimum of 10 minutes.
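
A small sketch with the Snowpark Python API (again with hypothetical table and column names) makes the pushdown visible: the DataFrame operations only describe the work, and the queries property shows the SQL that Snowflake will actually execute:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, avg

# connection_parameters is a dict with account, user, password, warehouse, etc.
session = Session.builder.configs(connection_parameters).create()

# Build a transformation: nothing is executed yet, only a query plan is created
df = (
    session.table("CUSTOMER_TRANSACTIONS")
           .filter(col("TX_AMOUNT") > 0)
           .group_by("CUSTOMER_ID")
           .agg(avg("TX_AMOUNT").alias("AVG_TX_AMOUNT"))
)

# The generated SQL that will be pushed down and run inside Snowflake
print(df.queries["queries"][0])
```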

Going back to Mats' example, the whole transformation happens in just 10 seconds, as we can see in this figure. This is a huge improvement compared with the 202 seconds we had before. The key is using Snowflake analytic functions, such as window functions, within Snowpark, with all transformations happening inside the Snowflake platform.
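
As an illustration, a window-based feature in that spirit can be expressed directly in Snowpark Python (hypothetical names, reusing the session from the sketch above), and the whole computation is pushed down as a single windowed query:

```python
from snowflake.snowpark import Window
from snowflake.snowpark.functions import col, avg

# Average amount of the current and six preceding transactions per customer
window_7_tx = (
    Window.partition_by(col("CUSTOMER_ID"))
          .order_by(col("TX_DATETIME"))
          .rows_between(-6, Window.CURRENT_ROW)
)

features_df = session.table("CUSTOMER_TRANSACTIONS").with_column(
    "AVG_AMOUNT_LAST_7_TX", avg(col("TX_AMOUNT")).over(window_7_tx)
)
```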

Because of Snowpark's lazy evaluation, all the DataFrame definitions happen without executing anything in Snowflake. It is only when the last DataFrame is written into a table that Snowflake uses compute power to execute the query. This graph shows all the actions taken, where all the aggregations, window functions, joins, etc. happen:
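
A short sketch of that lazy behaviour, continuing with the hypothetical names from the previous sketches: every line below only extends the query plan, and nothing runs until the final write.

```python
from snowflake.snowpark.functions import col

# Still lazy: joins and filters only add to the plan, no compute is used yet
enriched_df = (
    features_df
        .join(session.table("TERMINAL_FEATURES"), "TERMINAL_ID")
        .filter(col("TX_AMOUNT") > 0)
)

# This action triggers execution: Snowflake runs the whole pipeline as one
# pushed-down query and materializes the result into a table
enriched_df.write.mode("overwrite").save_as_table("TRANSACTION_FEATURES")
```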

Before executing the transformation, the compute power needed can be selected, using the simplicity and elasticity of Snowflake to resize the warehouse in no time (from X-Small to 5X-Large) and paying only for the seconds it is being used. This is a complete game changer. Snowpark dramatically simplifies the architecture:
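
As a sketch of that elasticity (hypothetical warehouse and table names), the warehouse can be scaled up right before the heavy step and scaled back down immediately afterwards:

```python
# Resize the warehouse up just for the transformation
session.sql("ALTER WAREHOUSE FEATURE_WH SET WAREHOUSE_SIZE = 'XXLARGE'").collect()

# Execute the pipeline (see the previous sketch) on the larger warehouse
enriched_df.write.mode("overwrite").save_as_table("TRANSACTION_FEATURES")

# Scale back down, or simply let auto-suspend kick in after a minute of inactivity
session.sql("ALTER WAREHOUSE FEATURE_WH SET WAREHOUSE_SIZE = 'XSMALL'").collect()
```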

While Mats used Snowpark Scala, this latest example was built with Snowpark Python, which is now in private preview. Like Scala, Snowpark Python uses the DataFrame concept. This is another game changer because of Python's popularity for both Data Engineering and Machine Learning workloads. Snowpark Python UDFs also enable inference within Snowflake. Here is an example of how to write a permanent UDF within Snowflake using Python:
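
The snippet below is a minimal sketch of such a registration; the UDF name, stage location, and scoring logic are hypothetical placeholders, and it assumes the Snowpark session created earlier is active:

```python
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType

# Registers a permanent UDF in the current database/schema.
# PREDICT_FRAUD_SCORE and @ML_MODELS_STAGE are hypothetical names.
@udf(
    name="PREDICT_FRAUD_SCORE",
    is_permanent=True,
    stage_location="@ML_MODELS_STAGE",
    replace=True,
    return_type=FloatType(),
    input_types=[FloatType(), FloatType()],
)
def predict_fraud_score(avg_amount, tx_amount):
    # Placeholder logic; a real UDF would load a trained model and score the row
    return tx_amount / (avg_amount + 1.0)
```

Once registered, the UDF lives in Snowflake and can be called from any SQL query, for example SELECT PREDICT_FRAUD_SCORE(AVG_AMOUNT_LAST_7_TX, TX_AMOUNT) FROM TRANSACTION_FEATURES, which is what allows inference to run next to the data.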

Another demo from the past Snowday session, “Python on Snowflake”, shows Python in action, using the IPinfo dataset from the Data Marketplace to build features and Python UDFs to perform inference.

It is time to re-think and re-architect those legacy architectures and use the power of the Data Cloud to streamline them, build scalable and optimized data pipelines and enforce security and governance across all workflows.

Let's Snow!
