Accelerate DocumentAI Classification with Snowflake Cortex LLMs

Sorting documents for post-processing is easy with Snowflake Cortex LLMs!

Introduction

DocumentAI is a Snowflake feature that can extract structured data from unstructured objects. With DocumentAI, you can continuously process new documents of a specific type, like invoices or forms. Powerful stuff!

DocumentAI models are all built by you in your own account, and you can create multiple builds for multiple use cases. For example: one model may find the address, phone number and name on a check-in form, while another may pick out machine part IDs and shipping destinations from a manifest. This yields the best accuracy, as each model is highly specific to its task.

But if you have a mixed set of documents spanning multiple categories, how do you decide which DocumentAI model to apply to each one? Enter Snowflake Cortex.

In this post I will show you how you can use Snowflake Cortex LLMs to classify a document into a category, and then apply the most accurate DocumentAI model to that document for text extraction.

What is DocumentAI?

Document AI uses Arctic-TILT, a proprietary large language model, to provide zero-shot extraction. You can then fine-tune the model to improve results by training the model on the documents specific to your use case.

Working with DocumentAI involves two steps: preparing the model build and extracting information from documents. You prepare the model build through a GUI in Snowsight, which lets you create the build and upload documents to test and train the model. When the model build is done, you extract information from documents by running an extracting query, which you can then automate with streams and tasks.

A typical workflow for building and using a DocumentAI model is: create the model build in Snowsight, upload and train on sample documents, publish the build, then run (and automate) the extracting query against a stage.
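
The extracting query itself is just SQL. As a minimal sketch, assuming a published model build named MY_DOC_BUILD and a stage named MY_DOCS (both hypothetical names), it would look like this:

SELECT RELATIVE_PATH,
       MY_DOC_BUILD!PREDICT(GET_PRESIGNED_URL(@MY_DOCS, RELATIVE_PATH)) AS EXTRACTED
FROM DIRECTORY(@MY_DOCS);

We will build the real version of this query later in the post.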

What is Snowflake Cortex?

Snowflake Cortex is a suite of AI features that use LLMs to understand unstructured data, answer questions, and provide intelligent assistance. Today we’ll be focusing on Cortex LLM functions. Cortex LLM functions give you instant access to industry-leading LLMs trained by companies like Mistral, Reka, AI21 Labs and more. Since these LLMs are fully hosted and managed, using them requires no setup. Your data stays within Snowflake, giving you performance, scalability, and governance.

The particular LLM I will be using is jamba-instruct, built by AI21 Labs. It is optimized to offer a 256k token context window with low cost and latency, making it ideal for tasks like summarization, Q&A, and categorization on lengthy documents. This is the perfect LLM for our use case.
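
To show how simple this is to use, a Cortex LLM call is a single SQL function that takes a model name and a prompt (the prompt below is just a throwaway example):

SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'jamba-instruct',
    'In one sentence, explain what a shipping manifest is.' --placeholder prompt
) AS RESPONSE;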

Defining the Use Case

Say there is a pipeline that lands PDFs into an object storage location. There is a known number of document categories, but they are all mixed together in the bucket. The file names are similar; it is the content of each file that separates the categories from one another.

Today, it takes manual work to read each file, check the contents, sort it and then extract information from it. This costs time and effort, and over time leads to mislabeled documents and extraction errors. A human in the loop has to categorize each document by reading it and then move it to another pipeline. Then another human has to go through a data-entry process to turn the unstructured data into structured data. Imagine how many hours and valuable dollars could be saved by automating this process!

If we applied a single DocumentAI extraction model to the entire set, it wouldn’t be very accurate, as it would be too general. A model tuned for financial services documents would run terribly on a machine parts order form. Creating a distinct DocumentAI model for each category provides higher accuracy. But how can we categorize each file without involving a human to read through every one?

I propose using the new Cortex PARSE_DOCUMENT function to transcribe each PDF into raw text, leveraging jamba-instruct via Cortex LLM functions to categorize the text, and creating a separate stage for each category. We then use the COPY FILES command to move the files into their respective stages, train a specific DocumentAI model for each category, and apply that model for extraction into a final table. Finally, we automate the whole pipeline.

I know that sounds like a lot, but because we can do this all with simple SQL, it’s going to be a breeze! I’ve broken this down into three sections.

Creating the Staging Categories with Cortex

First, we need to transcribe these documents from PDFs into an LLM-readable format. We could use a Snowpark Python UDF with the PyPDF2 library, point it at a stage, and output a raw text table. Luckily, Snowflake has made this process simpler with the Cortex PARSE_DOCUMENT function.

We can use the following code block to create a table with the relative path, file name and fully transcribed text of every file in our landing stage:

--Assume a stage called DOCS_BASE that acts as our landing stage
--DOCS_BASE has to have SNOWFLAKE_SSE as the encryption option and a directory table enabled
CREATE OR REPLACE TABLE RAW_TEXT AS
SELECT
    RELATIVE_PATH
  , FILE_URL
  , SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@DOCS_BASE, RELATIVE_PATH) AS RAW_TEXT
FROM DIRECTORY(@DOCS_BASE);

Next, we use jamba-instruct to classify this raw text into a set of categories. This is where prompt engineering techniques, like providing good and bad examples, can produce accurate and consistent output. And because this is all serverless, plain SQL and natural language are all we need!

--Other models can be used in place of jamba-instruct as well
CREATE OR REPLACE TABLE CLASSIFIED_TEXT AS
SELECT
    RELATIVE_PATH
  , SNOWFLAKE.CORTEX.COMPLETE('jamba-instruct',
      CONCAT('You are an expert document classifier.
      Classify the following Document into one of the following Buckets:
      "snowflake use case description",
      "industrial repair manual",
      "financial report",
      Output the Bucket name with nothing else,
      Response should not include any explanation for the Bucket,
      Document: ', RAW_TEXT:content)) AS OUTPUT
FROM DOCUMENTS_CLASS.DOCS.RAW_TEXT;

The result is a new table with the file name and the category from jamba.

Sometimes the output contains extra spaces or punctuation, so we’ll use a simple regular expression to remove those. We also turn spaces between words into underscores, which improves readability later on.

CREATE OR REPLACE TABLE CLEANED_VALUES AS
SELECT
    RELATIVE_PATH
  , UPPER(REGEXP_REPLACE(
        REPLACE(LOWER(OUTPUT), ' ', '_'),
        '[^a-z0-9_]', -- Keep only alphanumeric characters and underscores
        ''
    )) AS CLEANED_COLUMN
FROM DOCUMENTS_CLASS.DOCS.CLASSIFIED_TEXT; --The table we created with jamba above

We now have a category for each document with no manual effort at all! The PDFs were parsed and categorized all with a few lines of SQL.
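
As an optional sanity check, a quick aggregation shows how many documents landed in each category:

SELECT CLEANED_COLUMN, COUNT(*) AS DOC_COUNT
FROM CLEANED_VALUES
GROUP BY CLEANED_COLUMN
ORDER BY DOC_COUNT DESC;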

Apply a DocumentAI Model for Each Category

We now need to create a separate stage for each file category, so we can train a distinct DocumentAI model for each one. We can predefine the stages or use a stored procedure to create them from the cleaned PDF categories table:

CREATE OR REPLACE PROCEDURE create_stages_for_cleaned_values()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    v_cleaned_value STRING;
    cur CURSOR FOR SELECT DISTINCT UPPER(cleaned_column) AS cleaned_column --Our cleaned categories from above
                   FROM cleaned_values;
BEGIN
    FOR rec IN cur DO
        v_cleaned_value := rec.cleaned_column; --For each category, a stage
        --The stages need a directory table and SSE encryption for the DocumentAI steps later on
        EXECUTE IMMEDIATE 'CREATE STAGE IF NOT EXISTS ' || v_cleaned_value ||
            ' DIRECTORY = (ENABLE = TRUE) ENCRYPTION = (TYPE = ''SNOWFLAKE_SSE'')';
    END FOR;
    RETURN 'Stages created';
END;
$$;
CALL create_stages_for_cleaned_values();
SHOW STAGES; --We should now have a stage per category created

Snowflake recently released the COPY FILES command, which enables you to move staged files between different Snowflake stages. In isolation, the command looks like this (the stage and file names here are just illustrative):
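
COPY FILES INTO @FINANCIAL_REPORT   --target category stage (illustrative)
    FROM @DOCS_BASE                 --our landing stage
    FILES = ('example_report.pdf'); --hypothetical file name

We can wrap this in a stored procedure that moves each file to its respective stage, based on matching the category name with the stage name: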

CREATE OR REPLACE PROCEDURE move_files_to_stages()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    file_name STRING;
    stage_name STRING;
    sql_command STRING;
    file_cursor CURSOR FOR
        SELECT relative_path, cleaned_column
        FROM cleaned_values;
BEGIN
    -- Loop through each classified file
    FOR rec IN file_cursor DO
        file_name := rec.relative_path;
        stage_name := rec.cleaned_column;

        sql_command := 'COPY FILES INTO @' || stage_name ||             --Move the file
            ' FROM @docs_base FILES = (''' || file_name || ''')';       --into its category stage

        EXECUTE IMMEDIATE sql_command;
    END FOR;
    RETURN 'Files moved';
END;
$$;
CALL move_files_to_stages(); --Each file should now be moved to its stage

I will not be going through DocumentAI training in this blog (this blog can’t be too long!), but you can refer to this quickstart if you’d like to learn more. You will create multiple model builds by uploading training files from your local directory, then training the models. Lastly, publish them.

Let’s say there are now multiple DocumentAI models, one for each file type. Each model is highly accurate because its extraction questions are tailored to a specific use case. Importantly, each DocumentAI model name matches its category name exactly, with just ‘_BUILD’ appended, which keeps the upcoming steps simple. Make sure you follow this naming convention!

Now, we create a new extracted-values table for each category, named after that category’s stage. Each table also includes the JSON output of the DocumentAI extraction.

Each DocumentAI build has an associated extracting query of the form <model_build_name>!PREDICT(), which is applied to a stage and returns a JSON object with the extracted values. The stored procedure below looks up the pre-trained model build for each category and runs PREDICT() against that category’s stage. In our earlier example, FINANCIAL_REPORT would be the stage and FINANCIAL_REPORT_BUILD the extracting model.
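
Run by hand for a single category, that extracting query would look something like this sketch:

SELECT RELATIVE_PATH,
       FINANCIAL_REPORT_BUILD!PREDICT(
           GET_PRESIGNED_URL(@FINANCIAL_REPORT, RELATIVE_PATH)) AS JSON_CONTENT
FROM DIRECTORY(@FINANCIAL_REPORT);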

Notice how the extracting query is run for each stage, landing its output in the ‘json_content’ column:

CREATE OR REPLACE PROCEDURE create_tables_for_all_stages()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    stage_name STRING;
    table_name STRING;
    build_name STRING;
    sql_command STRING;
    cur CURSOR FOR SELECT stage_name
                   FROM INFORMATION_SCHEMA.STAGES
                   WHERE stage_name != 'DOCS_BASE';
BEGIN
    FOR stage_record IN cur DO
        -- Get the stage name directly
        stage_name := UPPER(stage_record.stage_name);

        -- Create the table name based on the stage name
        table_name := UPPER(stage_record.stage_name) || '_TABLE';

        -- Create the build name based on the stage name
        build_name := UPPER(stage_record.stage_name) || '_BUILD';

        -- Construct the SQL command to create the table
        -- (RELATIVE_PATH is the file identifier exposed by the directory table)
        sql_command := 'CREATE TABLE IF NOT EXISTS ' || table_name || '
            AS (SELECT
                relative_path AS file_name,
                size AS file_size,
                last_modified,
                file_url,
                ' || build_name || '!PREDICT(GET_PRESIGNED_URL(@' || stage_name || ',
                    RELATIVE_PATH)) AS json_content
            FROM DIRECTORY(@' || stage_name || '));';

        -- Execute the SQL command
        EXECUTE IMMEDIATE sql_command;
    END FOR;
    RETURN 'Extraction tables created';
END;
$$;
CALL create_tables_for_all_stages();
SHOW TABLES;

To recap, we moved each file into a separate stage defined by its category, then applied the best-fitting DocumentAI model to it, leaving us with a very accurate extraction table per category. All with some SQL and a few UI actions in Snowsight.
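
To peek at the extracted values for one category, you can query the JSON output directly. The sketch below assumes an extraction question aliased invoice_total in the FINANCIAL_REPORT_BUILD model; the field name is hypothetical, and the exact JSON shape depends on the questions you defined in your build:

SELECT FILE_NAME,
       JSON_CONTENT:invoice_total[0]:value::STRING AS INVOICE_TOTAL --hypothetical extraction field
FROM FINANCIAL_REPORT_TABLE;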

Automating the Pattern for Scalability

Snowflake makes scaling pipelines easy with a variety of options. There are user-managed tasks, triggered tasks, dynamic tables, streams and more.

For this use case, this is what’s needed:

  • Stream on the landing stage to let our tasks know there is a new file.
  • The first task will transcribe the file and then categorize it with Cortex. This task will use the landing stage stream as its input and insert the new rows into the cleaned_values table we created earlier (a minimal sketch of this stream-and-task pair follows the list).
  • Stream on our cleaned_values table to let our stage-sorting task know that there is new data. This will include the stage_name for each new file.
  • The second task will sort the file into its appropriate category stage. This task will use the cleaned_values stream to find new files.
  • Streams on each of the category stages for the final processing tasks.
  • Tasks for each category that use the streams on the category stages to insert new data to the final extracted values tables from the stage.
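
As promised, here is a minimal sketch of the first stream and task, assuming a warehouse named DOCS_WH (hypothetical) and the objects created earlier; the classification prompt is abbreviated for space:

--Stream on the landing stage's directory table to detect newly arrived files
CREATE OR REPLACE STREAM DOCS_BASE_STREAM ON STAGE DOCS_BASE;

--First task: parse, classify and clean only the new files surfaced by the stream
CREATE OR REPLACE TASK CLASSIFY_NEW_DOCS
    WAREHOUSE = DOCS_WH --hypothetical warehouse name
    SCHEDULE = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('DOCS_BASE_STREAM')
AS
    INSERT INTO CLEANED_VALUES
    SELECT RELATIVE_PATH,
           UPPER(REGEXP_REPLACE(REPLACE(LOWER(
               SNOWFLAKE.CORTEX.COMPLETE('jamba-instruct',
                   CONCAT('You are an expert document classifier. ...', --same prompt as earlier, abbreviated here
                          ' Document: ',
                          SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@DOCS_BASE, RELATIVE_PATH):content))
           ), ' ', '_'), '[^a-z0-9_]', '')) AS CLEANED_COLUMN
    FROM DOCS_BASE_STREAM;

ALTER TASK CLASSIFY_NEW_DOCS RESUME; --tasks are created in a suspended state

The remaining streams and tasks follow the same pattern, each consuming its own stream.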

To visualize what’s happening: new files flow from the landing stage stream through the classification task, the cleaned_values stream feeds the sorting task, and the category stage streams feed the final extraction tasks.

And that’s it! Of course you can add more validation steps, historical monitoring and other components to this for more production readiness.

Closing Thoughts

In this blog I showed you how you can leverage two of Snowflake’s exciting AI features to eliminate manual processing, sorting and extraction of files. I hope this helps you see the power of the Snowflake Data Cloud and I’m excited to see more potential use cases that incorporate this framework.

Thank you so much for making it to the end and feel free to leave comments or additional thoughts on this project!

Disclaimer: The opinions expressed in this post are my own and not necessarily those of my employer (Snowflake).
