Simplify data ingestion with Snowpark Python file access

Introduction

In my opinion, the most difficult part of data engineering is data ingestion: getting all your raw data into one place, in a usable format, for analytics and further processing. For many enterprise data sources we have great tools, like replication and ETL tools, that help make the job more manageable. But data is often captured in many different file formats, some of which can be challenging to extract and process.

Snowflake provides native support for ingesting many different file formats, including CSV, JSON, Parquet, Avro, ORC, and XML (see file format types for more details). But what happens if you’re trying to ingest data from a file format not natively supported by Snowflake? In the past, you would have had to host that file ingestion process in a compute service external to Snowflake (Amazon EMR, AWS Lambda, EC2, etc.). That approach came with added complexity, which generally led to slower development lifecycles, more time spent managing the various services, and higher costs.

Accessing files with Snowpark

Enter Snowpark file access! Snowpark file access enables secure, programmatic access to files in internal or external Snowflake stages from within Snowpark User-Defined Functions (UDFs), User-Defined Table Functions (UDTFs), and Stored Procedures. With this new Snowpark capability, you can process any type of staged file directly in Snowflake and take advantage of the Python or Java language and libraries to do so. And because the Python or Java code runs natively on Snowflake compute, you no longer need to build and maintain external ingestion processes for file formats not directly supported by Snowflake. This results in faster development lifecycles, less time spent managing various cloud provider services, and more time spent adding business value.
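
To give you a feel for the pattern before the full examples below, here’s a minimal sketch of a handler that reads a staged file with the SnowflakeFile class. The function name and the return value are just placeholders for illustration; a real handler would parse the bytes with a library that understands the file format (e.g., openpyxl for Excel).

from snowflake.snowpark.files import SnowflakeFile

def handler(file_path: str) -> int:
    # file_path is a scoped file URL generated with build_scoped_file_url()
    with SnowflakeFile.open(file_path, 'rb') as f:
        # Read the raw bytes and return the file size as a trivial example
        return len(f.read())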

File processing with Snowpark will enable many, many new use cases for Snowflake customers. Here are just a few of the nearly endless possibilities:

Processing specialized file formats:

  • Ingesting data from file formats that Snowflake doesn’t support natively, such as the Excel workbooks we’ll work through below

Processing unstructured data:

  • Geospatial analysis against raster image files (and other data formats)
  • Performing Optical Character Recognition (OCR) on documents in PDF and image formats
  • Performing Natural Language Processing (NLP) on text from documents and emails
  • Analyzing image and video files to identify the contents and extract metadata
  • Other AI & ML scenarios with unstructured data

So while Snowpark file access will be used in many different ways, the focus for the remainder of this blog post is on Snowpark Python file access, specifically for the data engineering effort of data ingestion.

Note: Accessing files in stages with Snowpark Python or Scala is currently in Public Preview (as of July 2023), and is Generally Available for Java.

Data Ingestion with Snowpark Python

Let’s cut to the chase here. The following code samples will show two different ways to ingest a specialized file format into a Snowflake table using Snowpark Python. For this blog post, I wanted to keep things simple yet still provide valuable, real-world examples for you. So the following examples will leverage Snowpark Python to ingest data from an Excel file into a Snowflake table. While there are many specialized file formats that customers need to ingest, Excel is one of the most common.

Here are a few things to point out about the two examples that follow:

  • You will need a simple Excel file named “sample.xlsx” to test with. Create one with a single worksheet containing two columns (“id” and “name”), include the column headers, and add a few rows of made-up data: integers (1, 2, 3, etc.) for “id” and strings for “name”. (If you’d rather script it, see the sketch after this list.)
  • You will need to create a Snowflake stage (internal is the easiest) named “DEMO_STAGE” and upload your simple Excel file there. Here’s some sample code:
CREATE OR REPLACE STAGE DEMO_STAGE;
-- Upload without automatic gzip compression so the file keeps its original name and format
PUT file:///your_file_path.xlsx @DEMO_STAGE AUTO_COMPRESS = FALSE;
  • The “SnowflakeFile” class is the key to reading files from a Snowflake stage
  • The “file_path” will be generated with the build_scoped_file_url() function and passed to the Stored Procedure or UDTF
  • We will use the “openpyxl” Python library, which is available in the Snowflake Anaconda channel, to process the Excel file. For more details on working with the “openpyxl” library please see the OpenPyXL Documentation.
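
If you’d prefer to generate the sample file with code instead of creating it by hand, here’s a minimal sketch that builds “sample.xlsx” locally with the same openpyxl library we’ll use later (the row values are just made-up placeholders):

from openpyxl import Workbook

# Build a workbook with a single worksheet and an "id"/"name" header row
workbook = Workbook()
sheet = workbook.active
sheet.append(["id", "name"])

# Add a few rows of made-up sample data
for row in [(1, "Alice"), (2, "Bob"), (3, "Charlie")]:
    sheet.append(row)

workbook.save("sample.xlsx")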

Ingesting an Excel File via a Stored Procedure

This first example will ingest an Excel file into a Snowflake table via a user-defined Stored Procedure.

CREATE OR REPLACE PROCEDURE parse_excel_sp(file_path string)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python', 'pandas', 'openpyxl')
HANDLER = 'main'
AS
$$
from snowflake.snowpark.files import SnowflakeFile
from openpyxl import load_workbook
import pandas as pd

def main(session, file_path):
    with SnowflakeFile.open(file_path, 'rb') as f:
        workbook = load_workbook(f)
        sheet = workbook.active
        data = sheet.values

        # Get the first line in file as a header line
        columns = next(data)[0:]
        # Create a DataFrame based on the second and subsequent lines of data
        df = pd.DataFrame(data, columns=columns)

        df2 = session.create_dataframe(df)
        df2.write.mode("overwrite").save_as_table("EXCEL_SP")

    return True
$$;

This very simple example reads the contents of the active worksheet into memory, converts it to a Pandas DataFrame, converts that to a Snowpark DataFrame, and finally saves it to a Snowflake table named “EXCEL_SP”. To test this Stored Procedure, simply run the following:

CALL PARSE_EXCEL_SP(build_scoped_file_url(@DEMO_STAGE, 'sample.xlsx'));
SELECT * FROM EXCEL_SP;

For more details on how to build Python Stored Procedures please see Writing Stored Procedures in Python.

Ingesting an Excel File via a UDTF

This second example will ingest an Excel file into a Snowflake table via a UDTF.

CREATE OR REPLACE FUNCTION PARSE_EXCEL_UDTF(file_path string)
RETURNS TABLE (id number, name string)
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python', 'openpyxl')
HANDLER = 'excelparser'
AS
$$
from snowflake.snowpark.files import SnowflakeFile
from openpyxl import load_workbook

class excelparser:
    def process(self, stagefile):
        with SnowflakeFile.open(stagefile, 'rb') as f:
            workbook = load_workbook(f)
            sheet = workbook.active
            for value in sheet.iter_rows(min_row=2, values_only=True):
                try:
                    yield (value[0], value[1])
                except:
                    pass
$$;

As you can see, the core Excel processing logic is essentially the same. What’s changed here is how we structure the code to implement a UDTF. Instead of reading the entire contents of the file into memory like with the Stored Procedure, with the UDTF we’re able to directly “stream” the results in a more efficient manner. For more details on how to build UDTFs please see Writing a UDTF in Python.

To test this UDTF, simply run the following:

SELECT * FROM TABLE(PARSE_EXCEL_UDTF(build_scoped_file_url(@DEMO_STAGE, 'sample.xlsx')));

If you created the sample Excel file as described above, that command will return the exact contents of the Excel spreadsheet. And to save the results to a Snowflake table you have a number of options depending on what you’re trying to do. Here are a few examples:

-- Option #1
CREATE OR REPLACE TABLE EXCEL_UDTF AS
SELECT * FROM TABLE(PARSE_EXCEL_UDTF(build_scoped_file_url(@DEMO_STAGE, 'sample.xlsx')));

-- Option #2
INSERT INTO EXCEL_UDTF
SELECT * FROM TABLE(PARSE_EXCEL_UDTF(build_scoped_file_url(@DEMO_STAGE, 'sample.xlsx')));

-- Option #3
MERGE INTO EXCEL_UDTF AS TARGET
USING TABLE(PARSE_EXCEL_UDTF(build_scoped_file_url(@DEMO_STAGE, 'sample.xlsx'))) AS SOURCE
ON (TARGET.ID = SOURCE.ID)
WHEN MATCHED THEN UPDATE SET TARGET.NAME = SOURCE.NAME
WHEN NOT MATCHED THEN INSERT (ID, NAME) VALUES (SOURCE.ID, SOURCE.NAME);

Stored Procedures vs UDTFs

At this point you’ve seen two different ways to ingest data from a specialized file format into a Snowflake table using Snowpark Python: with a Stored Procedure and with a UDTF. And you may be asking yourself when it’s best to use one method instead of the other. Here’s a simple comparison to get you started making that decision for data ingestion:

Summary
  • Stored Procedure: Executes arbitrary code. Think about it like executing a script. Very simple, but slightly less performant for loading data. Good where multiple tasks need to be done.
  • UDTF: Returns a table of data with a predefined schema. Slightly more complicated API, but more performant for loading data. Good where parallelism is needed.

Usage
  • Stored Procedure: Must be executed by itself with CALL; cannot be used directly in other SELECT or DML statements.
  • UDTF: Can be used directly in other SELECT or DML statements, just like a table.

Snowpark Session
  • Stored Procedure: Can access the Snowpark session object and run DataFrame or SQL queries against Snowflake.
  • UDTF: Cannot access the Snowpark session object, and cannot run DataFrame or SQL queries against Snowflake.

Performance
  • Stored Procedure: Each statement/query executes sequentially.
  • UDTF: Can be used in a query to process multiple files in parallel (see the example after this comparison).

File Schema
  • Stored Procedure: Can handle dynamic file schemas; no need to define them ahead of time.
  • UDTF: The schema of the returned table must be explicitly defined in the return type, which requires a strict mapping between files and their target schema/table.

Table Loading
  • Stored Procedure: Can load data into multiple tables, with different schemas, in the same procedure.
  • UDTF: Can only load data into a single schema/table, but can do so in parallel.
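
To illustrate the parallelism point above, here’s a sketch of how you might process every Excel file in the stage with a single query by joining the UDTF against the stage’s directory table. This assumes you enable the directory table on DEMO_STAGE and that you’ve already created the EXCEL_UDTF table with one of the options above.

-- Enable and refresh the directory table on the stage
ALTER STAGE DEMO_STAGE SET DIRECTORY = (ENABLE = TRUE);
ALTER STAGE DEMO_STAGE REFRESH;

-- Parse every .xlsx file in the stage in a single query
INSERT INTO EXCEL_UDTF
SELECT t.id, t.name
FROM DIRECTORY(@DEMO_STAGE) d,
     TABLE(PARSE_EXCEL_UDTF(build_scoped_file_url(@DEMO_STAGE, d.relative_path))) t
WHERE d.relative_path ILIKE '%.xlsx';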

Note: Accessing files with Snowpark Python also allows for processing files from within a UDF. A UDF differs from a UDTF in that it returns a scalar value and is generally used in the context of a single row/record. One example of using a UDF on files would be to pass row/record attributes to a data science model and return the result. But since the focus of this blog is on data engineering, and specifically data ingestion, more detail on UDF use cases will have to wait for a future blog post.
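
Just to make that scalar-vs-table distinction concrete before moving on, here’s a minimal sketch of a scalar UDF that reads a staged Excel file and returns a single value, in this case the number of data rows. The function name and the value it returns are just illustrative.

CREATE OR REPLACE FUNCTION EXCEL_ROW_COUNT(file_path string)
RETURNS NUMBER
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python', 'openpyxl')
HANDLER = 'row_count'
AS
$$
from snowflake.snowpark.files import SnowflakeFile
from openpyxl import load_workbook

def row_count(file_path):
    # Return a single scalar value: the number of rows excluding the header
    with SnowflakeFile.open(file_path, 'rb') as f:
        workbook = load_workbook(f)
        return workbook.active.max_row - 1
$$;

SELECT EXCEL_ROW_COUNT(build_scoped_file_url(@DEMO_STAGE, 'sample.xlsx'));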

Note: If you’re looking to ingest larger files, while maintaining the flexibility of the Stored Procedure, you might consider combining the two options here: move the file processing logic into a UDTF and then call the INSERT/MERGE DML from the Stored Procedure.
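
Here’s a rough sketch of what that combination might look like, reusing PARSE_EXCEL_UDTF and the EXCEL_UDTF table from above. The procedure name and the choice to pass just the file name are assumptions for illustration.

CREATE OR REPLACE PROCEDURE LOAD_EXCEL_SP(file_name string)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python')
HANDLER = 'main'
AS
$$
def main(session, file_name):
    # Delegate the file parsing to the UDTF and run the DML from the procedure,
    # leaving the procedure free to orchestrate multiple steps
    session.sql(f"""
        INSERT INTO EXCEL_UDTF
        SELECT * FROM TABLE(PARSE_EXCEL_UDTF(
            build_scoped_file_url(@DEMO_STAGE, '{file_name}')))
    """).collect()
    return True
$$;

CALL LOAD_EXCEL_SP('sample.xlsx');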

Conclusion

If you can’t tell, I’m really excited about processing files with Snowpark Python as it will enable many, many new use cases for Snowflake customers. And in particular, it will allow customers to ingest data from specialized file formats without needing to build and maintain external file ingestion processes. This results in faster development lifecycles, less time spent managing various cloud provider services, less cost, and more time spent adding business value.

If you found this interesting and want to try out other Snowpark file processing use cases, please check out the Snowflake Quickstarts that cover Snowpark file processing.

I hope this helps you get started, and please share any cool use cases you come up with that I haven’t covered here!

Jeremiah Hansen
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

I’m currently a Field CTO Principal Architect at Snowflake. Opinions expressed are solely my own and do not represent the views or opinions of my employer.