Building Data Engineering Pipelines with Snowpark for Python

Introduction

“Data engineers are focused primarily on building and maintaining data pipelines that transport data through different steps and put it into a usable state … The data engineering process encompasses the overall effort required to create data pipelines that automate the transfer of data from place to place and transform that data into a specific format for a certain type of analysis. In that sense, data engineering isn’t something you do once. It’s an ongoing practice that involves collecting, preparing, transforming, and delivering data. A data pipeline helps automate these tasks so they can be reliably repeated. It’s a practice more than a specific technology.” (From Cloud Data Engineering for Dummies, Snowflake Special Edition)

Data Engineering is a very broad discipline that has been around as long as we’ve had data to work with. Some today talk about data engineering as if it’s a relatively new thing or as if a certain type of data engineering (Python DataFrames against a legacy file-based data lake) is the best. I’ll have more to say about that in a future blog post. But the reality is that there are many different tools and techniques for doing data engineering.

The purpose of this blog post is to highlight the benefits of using Snowflake for Python data engineering as well as to introduce my new related Quickstart Data Engineering Pipelines with Snowpark Python. That Quickstart provides step-by-step instructions and detailed explanations for how to build data engineering pipelines on Snowflake with Snowpark Python.

Snowpark Python

Snowpark is a collection of Snowflake features that includes native language support for Java, Scala and Python along with a client-side DataFrame API (with 100% pushdown to Snowflake). Snowpark Python includes the following exciting capabilities:

  • Python (DataFrame) API
  • Python Stored Procedures
  • Python Scalar User Defined Functions (UDFs)
  • Python UDF Batch API (Vectorized UDFs)
  • Python Table Functions (UDTFs)
  • Integration with Anaconda

With Snowflake’s Snowpark Python capabilities, you no longer need to maintain, secure and pay for separate infrastructure/services to run Python code, as it can now run directly within Snowflake’s enterprise-grade data platform! In addition, customers moving from Spark to Snowpark consistently find that Snowpark is both faster and cheaper than Spark. For more details, check out the Snowpark Developer Guide for Python.
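
To give a flavor of the DataFrame API and its pushdown behavior, here is a minimal sketch. The connection parameters, table name, and column names below are placeholders for illustration, not objects from the Quickstart:

```python
# A minimal Snowpark DataFrame sketch; account credentials and the
# ORDERS table are hypothetical placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# Supply your own connection parameters here.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# Building the DataFrame is lazy: no data moves to the client.
orders = session.table("ORDERS")
daily_totals = (
    orders.filter(col("STATUS") == "SHIPPED")
          .group_by(col("ORDER_DATE"))
          .agg(sum_(col("AMOUNT")).alias("TOTAL_AMOUNT"))
)

# Only an action like show() generates SQL, which is pushed down and
# executed entirely inside Snowflake.
daily_totals.show()
```

The key point is that the chained transformations compile to a single SQL query, so all of the heavy lifting happens in a Snowflake warehouse rather than on your laptop.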

Benefits of Snowflake for Python Data Engineering

Snowflake provides many benefits for Python data engineering over running Spark against a legacy file-based data lake. During the Quickstart you will get hands-on experience with the following Snowflake features and tools and see how they benefit you as a data engineer:

  • Snowflake’s Table Format (more mature than recent formats)
  • Data ingestion with COPY (highly scalable and performant ingestion)
  • Schema inference (for loading Parquet and other types of data)
  • Data sharing/marketplace (instead of ETL)
  • Streams for incremental processing (CDC)
  • Streams on views (advanced CDC capability)
  • Python UDFs (with third-party packages)
  • Python Stored Procedures
  • Snowpark DataFrame API
  • Snowpark Python programmability (running natively in Snowflake)
  • Warehouse elasticity (dynamic scaling)
  • Visual Studio Code Snowflake native extension (PuPr, Git integration)
  • SnowCLI (PuPr)
  • Tasks (with Stream triggers)
  • Task observability (orchestration tool monitoring)
  • GitHub Actions (CI/CD) integration
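
To make one of those features concrete, streams let a pipeline process only the rows that changed since the last run. Here is a hedged sketch of that pattern using Snowpark Python to run the underlying SQL; the table, stream, and column names are hypothetical, not from the Quickstart:

```python
# Hypothetical sketch of stream-based incremental processing (CDC).
# RAW_ORDERS, ORDERS_STREAM and ORDERS_CLEAN are placeholder names.
from snowflake.snowpark import Session


def process_increment(session: Session) -> None:
    # A stream tracks changes (inserts/updates/deletes) on the source
    # table since the stream was last consumed.
    session.sql(
        "CREATE STREAM IF NOT EXISTS ORDERS_STREAM ON TABLE RAW_ORDERS"
    ).collect()

    # Consuming the stream in a DML statement processes only the new or
    # changed rows; the stream offset advances when the statement succeeds.
    session.sql("""
        MERGE INTO ORDERS_CLEAN t
        USING (SELECT * FROM ORDERS_STREAM) s
        ON t.ORDER_ID = s.ORDER_ID
        WHEN MATCHED THEN UPDATE SET t.AMOUNT = s.AMOUNT
        WHEN NOT MATCHED THEN INSERT (ORDER_ID, AMOUNT)
            VALUES (s.ORDER_ID, s.AMOUNT)
    """).collect()
```

Wrapped in a stored procedure and triggered by a task, this is the shape of the incremental pipeline you will build in the Quickstart.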

Before moving on to the Quickstart, let’s take a quick look at two exciting new developer tools for data engineers: the Snowflake Visual Studio Code Extension and the SnowCLI tool.

Introducing Native VS Code Extension

The Snowflake Visual Studio Code Extension is a Snowflake-built, native extension for Visual Studio Code (VS Code for short). VS Code is one of the most popular (if not the most popular) IDEs available, and is a free, open-source, cross-platform editor. It’s also my favorite IDE. The extension currently provides the following capabilities:

  • Snowflake SQL Intellisense
  • Accounts & Session Management
  • Database Explorer
  • Query Execution
  • Query Results & History

With this extension you can now do your development entirely within VS Code. In addition to saving you from flipping between applications, this also allows you to take advantage of VS Code’s native Git support and any other available extensions (like Microsoft’s excellent Python extension). Here’s a quick visual overview of the extension:

For more details, please check out the Snowflake Visual Studio Code Extension marketplace page.

Note — As of 2/10/2023, the Snowflake Visual Studio Code Extension is in preview.

Introducing SnowCLI

The SnowCLI tool is an exciting new command line tool for developers, and is executed as snow from the command line.

Note — Do not confuse this with the SnowSQL command line tool which is a client for connecting to Snowflake to execute SQL queries and perform all DDL and DML operations, and is executed as snowsql from the command line.

SnowCLI simplifies the development and deployment of the following Snowflake objects:

  • Snowpark Python UDFs
  • Snowpark Python Stored Procedures
  • Streamlit Applications

The Quickstart focuses on the first two. For Snowpark Python UDFs and stored procedures in particular, SnowCLI does all the heavy lifting of deploying the objects to Snowflake. Here’s a brief summary of the steps the SnowCLI deploy command performs for you:

  • Dealing with third-party packages
    - For packages that can be accessed directly from our Anaconda channel, it adds them to the PACKAGES list in the CREATE PROCEDURE or CREATE FUNCTION SQL command
    - For packages not currently available in our Anaconda channel, it downloads the code and includes it in the project zip file
  • Creating a zip file of everything in your project
  • Copying that project zip file to your Snowflake stage
  • Creating the Snowflake function or stored procedure object

This also allows you to develop and test your Python application locally, without having to worry about wrapping it in a corresponding Snowflake database object. It also provides a command line tool that can be used in a CI/CD pipeline to automate the deployment of your objects (which is covered in the Quickstart). Stay tuned for more to come from the SnowCLI tool!
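
For example, a function destined to become a Snowpark Python UDF is just plain Python on your machine, so you can unit test it locally before SnowCLI wraps it in the corresponding CREATE FUNCTION statement. The function below is a hypothetical illustration, not code from the Quickstart:

```python
# A hypothetical UDF handler, developed and tested locally before
# being deployed to Snowflake with SnowCLI.

def normalize_email(email: str) -> str:
    """Lowercase and trim an email address; return '' for missing values."""
    if email is None:
        return ""
    return email.strip().lower()


# Locally this is an ordinary Python function, so it can be tested
# without any Snowflake connection:
assert normalize_email("  Jane.Doe@Example.COM ") == "jane.doe@example.com"
assert normalize_email(None) == ""
```

Once the logic is verified, the deploy step zips the project, uploads it to a stage, and creates the UDF object in Snowflake, with no hand-written DDL required.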

Note — As of 2/10/2023 the SnowCLI tool is in preview.

All the Details

So, are you interested in unleashing the power of Snowflake and Snowpark Python to build data engineering pipelines? Well then, my new Data Engineering Pipelines with Snowpark Python Quickstart is for you! Unlike our previous Snowpark Quickstarts, this one focuses on building data engineering pipelines with Python, not on data science. For a great example of doing data science with Snowpark Python, please check out our Machine Learning with Snowpark Python: Credit Card Approval Prediction Quickstart.

The Quickstart will cover a lot of ground, and by the end you will have built a robust data engineering pipeline using Snowpark Python stored procedures. That pipeline will process data incrementally, be orchestrated with Snowflake tasks, and be deployed via a CI/CD pipeline. You’ll also learn how to use Snowflake’s new developer CLI tool and Visual Studio Code extension! Here’s a quick visual overview:

Have fun, and please share any cool learnings/examples you come up with!


Jeremiah Hansen
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

I’m currently a Field CTO Principal Architect at Snowflake. Opinions expressed are solely my own and do not represent the views or opinions of my employer.