How to use Snowflake to join pdf/docx documents with other tables


Snowflake recently launched unstructured data support in public preview. With that functionality, it is now possible to store, govern, catalog, process, and share unstructured data directly within Snowflake.

One of the use cases I commonly hear from customers is analyzing PDF or Word documents, or enriching them by joining them with other datasets. More specifically, common use cases include:

  • Search for all documents related to a particular user_id.
  • Find the approval emails for a particular transaction.
  • Join the invoices dataset with the supplier table, and enrich the datasets.
  • and many more …

Often, all of these use cases require processing unstructured PDF files. Depending on the use case, the data engineer can pre-process the PDF files and store the results in a target table, or simply leave the documents as-is for on-the-fly processing.

You can easily implement either of these architectures using Snowflake. Let’s take the first approach. Imagine a simple scenario where a customer has many invoices (in docx format) stored in Snowflake. The customer wants to extract the invoice_id from the docx files and use that field to join with various other datasets. The invoice_id has the format BDT/XXX (for example, “BDT/056”).
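As a quick illustration of that pattern, the id can be pulled out of any text with a regular expression. This is a standalone sketch (class and method names are mine, not from the UDF below):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvoiceIdExtractor {
    // Invoice ids look like "BDT/" followed by three digits, e.g. BDT/056.
    private static final Pattern INVOICE_ID = Pattern.compile("BDT/\\d{3}");

    // Return the first invoice id found in the text, or null if none.
    public static String extract(String text) {
        Matcher m = INVOICE_ID.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("Invoice number: BDT/056, due soon")); // prints BDT/056
    }
}
```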

Customers can now join the docx files with other tables in two steps:

  1. Process the docx files once and extract the invoice_id by reading the files. Store the invoice_id in a result table.
  2. Join the result table with other tables based on the invoice_id field.

Step 1 can be done by writing a Java user-defined function (UDF) that processes a docx file and extracts the invoice_id.

Note: This functionality is in limited private preview right now. If you are interested in trying out, please reach out to your Snowflake account team.

Here is a simple Java function that uses Apache Tika to read the file and extract the invoice_id. Compile it into a jar:

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class DocumentParser {

    public static String FindInvoice(InputStream inputStream)
            throws IOException, TikaException, SAXException {
        ContentHandler handler = new ToXMLContentHandler();
        try {
            // Parse the docx stream; Tika writes the content into the handler as XHTML.
            AutoDetectParser parser = new AutoDetectParser();
            Metadata metadata = new Metadata();
            parser.parse(inputStream, handler, metadata);

            // Split the XHTML on paragraph tags and return the first
            // paragraph (with whitespace stripped) that contains "BDT/".
            String[] paragraphs = handler.toString().split("<p>|</p>");
            for (String p : paragraphs) {
                String stripped = p.replaceAll("[\\n\\t ]", "");
                if (stripped.contains("BDT/")) {
                    return stripped;
                }
            }
        } catch (SAXException exception) {
            System.out.println("Exception thrown: " + exception);
            throw exception;
        }
        return null;
    }
}
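To see what the loop over the handler output does, here is a dependency-free sketch that applies the same paragraph-splitting logic to a sample of the XHTML that Tika produces (the sample markup and class name are illustrative):

```java
public class ParagraphScan {
    // Mirrors the UDF's loop: split the XHTML on <p> tags, strip whitespace,
    // and return the first paragraph containing the "BDT/" prefix.
    public static String findInvoiceParagraph(String xhtml) {
        String[] paragraphs = xhtml.split("<p>|</p>");
        for (String p : paragraphs) {
            String stripped = p.replaceAll("[\\n\\t ]", "");
            if (stripped.contains("BDT/")) {
                return stripped;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String sample = "<body><p>Supplier: Acme</p><p>Invoice id: BDT/056</p></body>";
        System.out.println(findInvoiceParagraph(sample)); // prints Invoiceid:BDT/056
    }
}
```

Note that the function returns the whole whitespace-stripped paragraph, not just the id; a regular expression could narrow it down further if needed.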

On the Snowflake side, the SQL looks like this:

use database sshah;
use schema unstructured_data;
use warehouse sshah;
create stage sample_invoice_documents
encryption = (type = 'snowflake_sse')
directory = (enable = true);
-- run these commands in SnowSQL to upload files to the Snowflake stage:
--put file:///Users/sshah/Downloads/0_Inv55.docx @sample_invoice_documents auto_compress = false;
--put file:///Users/sshah/Downloads/1_Inv55.docx @sample_invoice_documents auto_compress = false;
--put file:///Users/sshah/Downloads/0_Inv56.docx @sample_invoice_documents auto_compress = false;
--put file:///Users/sshah/Downloads/1_Inv56.docx @sample_invoice_documents auto_compress = false;
--put file:///Users/sshah/Downloads/0_Inv57.docx @sample_invoice_documents auto_compress = false;
--put file:///Users/sshah/Downloads/0_Inv59.docx @sample_invoice_documents auto_compress = false;
--put file:///Users/sshah/Downloads/1_Inv60.docx @sample_invoice_documents auto_compress = false;
--put file:///Users/sshah/Downloads/0_Inv61.docx @sample_invoice_documents auto_compress = false;
-- end
alter stage sample_invoice_documents refresh;
select * from directory(@sample_invoice_documents);
-- Create java udf to find the invoice_id from the docx file
create or replace function find_invoice(file string)
returns string
language java
imports = ('@jars_stage/DocumentParser.jar', '@jars_stage/tika-app-2.0.0.jar')
HANDLER = 'DocumentParser.FindInvoice'
;
-- Test the function
select find_invoice('@sample_invoice_documents/0_Inv61.docx');
-- returns "BDT/061"
create or replace table invoice_ids as
select
  find_invoice('@sample_invoice_documents/' || relative_path) as invoice_id
  , file_url
from
  directory(@sample_invoice_documents);
-- Now join invoice_ids table with other tables
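For instance, assuming a hypothetical suppliers table keyed by invoice_id, the final join (step 2) could look like this (the suppliers table and its columns are illustrative, not from the original scenario):

```sql
-- Hypothetical: enrich the extracted invoice ids with supplier details.
select
  i.invoice_id
  , i.file_url
  , s.supplier_name
from invoice_ids i
join suppliers s
  on s.invoice_id = i.invoice_id;
```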

That’s it. Storing and processing unstructured data (like the docx files in this example) with Java user-defined functions is straightforward. Again, this functionality is in limited private preview, so please reach out to your Snowflake account team if you are interested.


Saurin Shah
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Product @ Snowflake. Passionate about building products to help customers derive more value out of their data