Document Extraction incl. OCR with GPU-Acceleration in Snowpark Container Services

Michael Gorkow
3 min read · Apr 14, 2024


Images created by AI — hence the funny typos

Be honest: how much of your data is buried somewhere in documents like PDFs and images?

Wouldn’t it be amazing if we could make these documents searchable and analyzable? And why stop at search? In the era of large language models, once this data is extracted, it can be integrated into an internal corporate knowledge base, which can then be accessed via a chatbot.

In this blog, I’ll show you how to build an extraction pipeline using open-source libraries and Snowpark Container Services; in the next article, we will build a RAG application on top of it.

Our pipeline will do the following:

  1. PDF Text Extraction with Optical Character Recognition (OCR)
  2. Visualize extractions

We will be using the following main libraries for this:

PyMuPDF

PyMuPDF is used to turn pages into images for the OCR library. While there are other libraries available (e.g. PyPDF, pdfplumber), PyMuPDF offers the most features and has the highest performance.

PaddleOCR

PaddleOCR is an open-source Optical Character Recognition (OCR) library developed by PaddlePaddle, which is Baidu’s artificial intelligence platform. It supports running on GPUs to achieve exceptional performance.
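For each page, PaddleOCR returns a list of `[bounding_box, (text, confidence)]` entries; this is the structure the SQL flattening later in this post relies on (`value[0]` is the box, `value[1][0]` the text, `value[1][1]` the confidence). A self-contained sketch of walking that structure, using a hand-made sample rather than real model output:

```python
# Hand-made sample in the shape PaddleOCR returns for one page:
# each entry is [bounding_box (4 points), (text, confidence)].
page_result = [
    [[[10, 10], [200, 10], [200, 40], [10, 40]], ("Invoice #42", 0.98)],
    [[[10, 60], [180, 60], [180, 90], [10, 90]], ("Total: 99 EUR", 0.91)],
]

# Flatten into line-level records, mirroring the SQL transformation below.
lines = []
for line_no, (bbox, (text, conf)) in enumerate(page_result):
    lines.append({"line": line_no, "bbox": bbox, "text": text, "confidence": conf})
```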

I also tried EasyOCR, but PaddleOCR delivered similar OCR quality at higher speed.

The final pipeline consists of two containers:

Container 1: Text Extraction with Optical Character Recognition (OCR)

This container runs the core part of the extraction pipeline:

  1. Load PDF File with PyMuPDF
  2. Convert every page into an image
  3. Run OCR on the image

Container 2: Visualize Extractions

This container uses the extractions to render PDF pages with annotations from the extraction pipeline. The visualizations can then be viewed in a Streamlit app or any other interface that can display base64-encoded images.
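A minimal sketch of such an annotation step, assuming Pillow for drawing; the blank page and the single bounding box are made up for illustration:

```python
import base64
import io

from PIL import Image, ImageDraw

# A blank white "page" stands in for the rendered PDF page.
img = Image.new("RGB", (300, 150), "white")
draw = ImageDraw.Draw(img)

# Hypothetical OCR boxes in PaddleOCR's 4-point format.
boxes = [[[10, 10], [200, 10], [200, 40], [10, 40]]]
for box in boxes:
    draw.polygon([tuple(p) for p in box], outline="red")

# Encode as base64 so Streamlit or any web UI can display it.
buf = io.BytesIO()
img.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode("ascii")
```

In Streamlit, such a string can be displayed for example via an `<img>` tag with a `data:image/png;base64,...` source.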

Final pipeline

You can find all the setup steps here, so I won’t repeat them. After the initial setup, the pipeline can be executed by simply calling your new function PYMUPDF_PADDLEOCR_EXTRACT():

USE ROLE CONTAINER_ROLE;
USE WAREHOUSE COMPUTE_WH;
USE SCHEMA OCR_DEMO.PUBLIC;

-- Create a table with extractions
CREATE OR REPLACE TABLE RAW_EXTRACTS AS (
    SELECT RELATIVE_PATH,
           PYMUPDF_PADDLEOCR_EXTRACT('@OCR_DEMO.PUBLIC.DOCUMENTS', RELATIVE_PATH) AS OCR_RESULTS
    FROM DIRECTORY('@OCR_DEMO.PUBLIC.DOCUMENTS')
);

-- Transform extractions into lines
CREATE OR REPLACE TABLE LINE_LEVEL_EXTRACTS AS (
    SELECT RELATIVE_PATH,
           ocr_data.index::INTEGER AS OCR_PAGE_NUMBER,
           page_level_data.index::INTEGER AS OCR_LINE_NUMBER,
           page_level_data.value[0]::ARRAY AS OCR_BBOX,
           page_level_data.value[1][0]::STRING AS OCR_TEXT,
           page_level_data.value[1][1]::FLOAT AS OCR_CONFIDENCE,
           page_rotations.value::INT AS PAGE_ROTATION
    FROM RAW_EXTRACTS,
         LATERAL FLATTEN(input => OCR_RESULTS['OCR_RESULTS']) ocr_data,
         LATERAL FLATTEN(input => ocr_data.value) page_level_data,
         LATERAL FLATTEN(input => OCR_RESULTS['PAGE_ROTATIONS']) page_rotations
    HAVING OCR_PAGE_NUMBER = page_rotations.index
);

-- Transform lines into pages
CREATE OR REPLACE TABLE PAGE_LEVEL_EXTRACTS AS (
    SELECT RELATIVE_PATH,
           OCR_PAGE_NUMBER,
           LISTAGG(OCR_TEXT, ' ') WITHIN GROUP (ORDER BY OCR_LINE_NUMBER ASC) AS OCR_PAGE_TEXT,
           AVG(PAGE_ROTATION) AS PAGE_ROTATION
    FROM LINE_LEVEL_EXTRACTS
    GROUP BY RELATIVE_PATH, OCR_PAGE_NUMBER
);

SELECT * FROM PAGE_LEVEL_EXTRACTS ORDER BY RELATIVE_PATH, OCR_PAGE_NUMBER;

As you can see, we create three different tables:

  • RAW_EXTRACTS
  • LINE_LEVEL_EXTRACTS
  • PAGE_LEVEL_EXTRACTS

Having the raw output allows us to adapt our post-processing pipeline at any time without having to run OCR again.

LINE_LEVEL_EXTRACTS is really useful if we want to add further features like filtering low-confidence rows or detecting headers and footers. I also use this table to visualize the outputs, since this is the level at which we have bounding-box information.
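As a sketch of the kind of post-processing this table enables, here is a confidence filter plus page re-aggregation in plain Python; the sample rows and the 0.8 threshold are made up for illustration:

```python
# Line-level rows as in LINE_LEVEL_EXTRACTS (hand-made sample).
rows = [
    {"page": 0, "line": 0, "text": "Invoice #42", "confidence": 0.98},
    {"page": 0, "line": 1, "text": "l1li|1l", "confidence": 0.42},  # OCR noise
    {"page": 0, "line": 2, "text": "Total: 99 EUR", "confidence": 0.91},
]

# Keep only confident lines, then rebuild the page text
# (mirrors the LISTAGG ... WITHIN GROUP step above).
MIN_CONF = 0.8  # arbitrary threshold for this sketch
kept = [r for r in rows if r["confidence"] >= MIN_CONF]
page_text = " ".join(r["text"] for r in sorted(kept, key=lambda r: r["line"]))

print(page_text)  # → Invoice #42 Total: 99 EUR
```

The same filter could of course be applied in SQL before the page-level aggregation.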

PAGE_LEVEL_EXTRACTS however becomes useful when we start building our RAG application.

Results

Here’s a short video showing the final process: uploading three PDF files to Snowflake, running the OCR pipeline, and finally visualizing the pages and their extractions.

Demo: Running OCR on PDF documents and visualizing results in Streamlit

All the code is located in this GitHub repository.

Michael Gorkow | Field CTO Datascience @ Snowflake
