RAG Document Parsers

How Do You Choose A Document Parser For Your RAG Application?

Published in EMAlpha · Jul 21, 2024 · 5 min read

Guidelines For Choosing The Right Text Parsing | Skanda Vivek

While RAG has become increasingly popular over the last few months, document parsing is a critical but less recognized area. Ultimately, you can use all sorts of specialized retrieval and generation methods — but the results returned are only as good as the documents. And if documents have issues like missing information or incorrect formatting, then further optimization of retrieval strategy, embedding models, LLMs, etc. cannot save you.

In this article, we look at three document extraction strategies that have become increasingly popular. As a running example, we parse a table from the Amazon Q1 2024 report.

Page 11 from Amazon Q1 2024 Financial Report

Text Parsers

Text parsers have been around for a while. They read a document and extract the text embedded in the file. Examples include PyPDF, PyMuPDF, and PDFMiner. Let’s use PyMuPDF to extract the text of the page above, and LlamaIndex’s SentenceSplitter and TextNode to chunk and wrap it. Here is the code:

import fitz  # PyMuPDF
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode

file_path = "/content/AMZN-Q1-2024-Earnings-Release.pdf"
doc = fitz.open(file_path)

# Split each page's text into chunks of at most 2048 tokens
text_parser = SentenceSplitter(chunk_size=2048)

text_chunks = []
for page in doc:
    page_text = page.get_text("text")  # plain-text extraction
    text_chunks.extend(text_parser.split_text(page_text))

# Wrap each chunk in a LlamaIndex TextNode
nodes = [TextNode(text=chunk) for chunk in text_chunks]

# Page 11 is index 10, assuming every page fits in a single chunk
print(nodes[10].text)

PyMuPDF extracts all of the text (below), but the table layout is lost: each cell lands on its own line, and labels are separated from their values. This could be a problem during generation if the LLM is unable to make out the document structure.

AMAZON.COM, INC.
Consolidated Statements of Comprehensive Income
(in millions)
(unaudited)
  
Three Months Ended
March 31,
 
2023
2024
Net income
$ 
3,172 $ 
10,431 
Other comprehensive income (loss):
Foreign currency translation adjustments, net of tax of $(10) and $30
 
386  
(1,096) 
Available-for-sale debt securities:
Change in net unrealized gains (losses), net of tax of $(29) and $(158)
 
95  
536 
Less: reclassification adjustment for losses (gains) included in “Other income 
(expense), net,” net of tax of $(10) and $0
 
33  
1 
Net change
 
128  
537 
Other, net of tax of $0 and $(1)
 
—  
1 
Total other comprehensive income (loss)
 
514  
(558) 
Comprehensive income
$ 
3,686 $ 
9,873
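To make the structure loss concrete, here is a minimal sketch that tries to stitch the flattened lines back into rows. It is a heuristic written for this illustration, not part of PyMuPDF or LlamaIndex: the name `regroup_rows` is hypothetical, and it assumes that a line made up only of numbers, dashes, and currency symbols holds values, while any other line is part of a label.

```python
import re

# A value cell: digits with commas, optionally in parentheses (accounting
# negatives), or the dash character (U+2014) the report uses for empty cells.
_VALUE = re.compile(r"^\(?\d[\d,]*\)?$|^\u2014$")


def regroup_rows(flat_text):
    """Group one-cell-per-line text into (label, values) rows (heuristic)."""
    rows = []
    label_parts, values = [], []

    def flush():
        if label_parts or values:
            rows.append((" ".join(label_parts), list(values)))
        label_parts.clear()
        values.clear()

    for raw in flat_text.splitlines():
        tokens = [t for t in raw.split() if t != "$"]
        if not tokens:
            continue  # blank line or a bare currency symbol
        if all(_VALUE.match(t) for t in tokens):
            values.extend(tokens)
            continue
        if values:
            flush()  # a new label starts after the previous row's values
        label_parts.append(raw.strip())
        if raw.strip().endswith(":"):
            flush()  # section header that carries no values
    flush()
    return rows


sample = """Net income
$
3,172 $
10,431
Other comprehensive income (loss):
Foreign currency translation adjustments, net of tax of $(10) and $30

386
(1,096)
Net change

128
537
"""

for label, cells in regroup_rows(sample):
    print(label, cells)
```

The heuristic is brittle by design. For example, the year lines "2023" and "2024" in the page header would be treated as values. That fragility is the point: once the layout is gone, any reconstruction rests on guesses about what the table looked like.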

Next, let’s look at the performance of OCR.

OCR For Document Parsing

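from pdf2image import convert_from_path
from PIL import Image
import pytesseract

# Render each PDF page to an image (pdf2image requires Poppler)
pages = convert_from_path(file_path)

i = 10  # page 11 of the report (zero-based index)
filename = f"page{i}.jpg"
pages[i].save(filename, "JPEG")

# Run Tesseract OCR on the saved page image
text = pytesseract.image_to_string(Image.open(filename))
text = text.replace("-\n", "")  # rejoin words hyphenated across line breaks

# Open with "w" so that reruns overwrite the file instead of appending to it
with open(f"page{i}_text.txt", "w") as f:
    f.write(text)

print(text)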

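OCR (below) does a better job of capturing the document structure: each table row stays on one line with its label and values. It is not error-free, though. In this output, “§$” appears in place of “$”, and the 2024 “Net change” reads 231 where the other two parsers give 537.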

AMAZON.COM, INC.
Consolidated Statements of Comprehensive Income
(in millions)

(unaudited)
Three Months Ended
March 31,
2023 2024
Net income $ 3,172 §$ 10,431
Other comprehensive income (loss):
Foreign currency translation adjustments, net of tax of $(10) and $30 386 (1,096)
Available-for-sale debt securities:
Change in net unrealized gains (losses), net of tax of $(29) and $(158) 95 536
Less: reclassification adjustment for losses (gains) included in “Other income
(expense), net,” net of tax of $(10) and $0 33 1
Net change 128 231
Other, net of tax of $0 and $(1) _— 1
Total other comprehensive income (loss) 514 (558)

Comprehensive income $ 3,686 $ 9,873
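Because this page is a financial statement, misread digits can often be caught with arithmetic: a subtotal should equal the sum of its components. The sketch below is one way to run that check. The helper names are hypothetical, and the mapping of components to subtotals is an assumption read off the statement's layout (for instance, “Net change” as the sum of the two available-for-sale lines above it).

```python
# Cells that stand for "no amount"; "_\u2014" is how OCR rendered the dash above.
NIL_CELLS = {"", "-", "\u2014", "_\u2014"}


def parse_amount(cell):
    """Convert an accounting-style cell such as '(1,096)' or '$ 3,172' to an int."""
    cleaned = cell.replace("$", "").replace(",", "").strip()
    if cleaned in NIL_CELLS:
        return 0
    if cleaned.startswith("(") and cleaned.endswith(")"):
        return -int(cleaned[1:-1])  # parentheses mark a negative amount
    return int(cleaned)


def subtotal_matches(components, subtotal):
    """True when the component cells add up to the subtotal cell."""
    return sum(parse_amount(c) for c in components) == parse_amount(subtotal)


# 2024 column: (row name, component cells, subtotal cell as read by OCR)
checks = [
    ("Net change", ["536", "1"], "231"),
    ("Comprehensive income", ["10,431", "(558)"], "9,873"),
]

for name, components, subtotal in checks:
    status = "ok" if subtotal_matches(components, subtotal) else "MISMATCH"
    print(f"{name}: {status}")
```

A mismatch does not say which cell is wrong, only that the row deserves a second look, for example by comparing it against the text-parser output for the same page.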

Finally, let’s look at intelligent document parsing.

Intelligent Document Parsing (IDP)

Intelligent document parsing is a relatively new technique that aims to extract all relevant information from any document in a structured format. There are multiple IDP tools, including LlamaParse, DocSumo, Unstructured.io, Azure Doc Intelligence, and more.

Under the hood, these tools combine OCR, text extraction, multi-modal LLMs, and conversion to markdown. Let’s take a look at LlamaParse, released by LlamaIndex. You first need to register for a LlamaParse API key, because your documents are parsed through a hosted API.

import getpass
import os
from copy import deepcopy

import nest_asyncio
from llama_index.core.schema import TextNode
from llama_parse import LlamaParse

# Prompt for the API key instead of hard-coding it
os.environ["LLAMA_CLOUD_API_KEY"] = getpass.getpass()

# Let LlamaParse's async calls run inside a notebook event loop
nest_asyncio.apply()

documents = LlamaParse(result_type="markdown").load_data(file_path)


def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page nodes on the page separator."""
    nodes = []
    for doc in docs:
        for doc_chunk in doc.text.split(separator):
            nodes.append(TextNode(text=doc_chunk, metadata=deepcopy(doc.metadata)))
    return nodes


nodes_lp = get_page_nodes(documents)
print(nodes_lp[10].text)

The format below is structured in markdown, and seems to be the best representation of structure so far.

# AMAZON.COM, INC.

# Consolidated Statements of Comprehensive Income

| |Three Months Ended March 31, 2023|Three Months Ended March 31, 2024|
|---|---|---|
|Net income|$3,172|$10,431|
|Other comprehensive income (loss):| | |
|Foreign currency translation adjustments, net of tax of $(10) and $30|386|(1,096)|
|Available-for-sale debt securities:| | |
|Change in net unrealized gains (losses), net of tax of $(29) and $(158)|95|536|
|Less: reclassification adjustment for losses (gains) included in “Other income (expense), net,” net of tax of $(10) and $0|33|1|
|Net change|128|537|
|Other, net of tax of $0 and $(1)|—|1|
|Total other comprehensive income (loss)|514|(558)|
|Comprehensive income|$3,686|$9,873|
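One practical benefit of the markdown form is that rows can be recovered with plain string handling. The following is a minimal sketch, not a full markdown parser: `parse_markdown_table` is a hypothetical helper, and it assumes simple pipe tables like the one above, with no escaped pipes inside cells.

```python
def parse_markdown_table(markdown):
    """Return (header, rows) built from the pipe-table lines in a markdown string."""
    table = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # headings and prose are not part of the table
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        if all(cell and set(cell) <= {"-", ":"} for cell in cells):
            continue  # skip the |---|---| separator row
        table.append(cells)
    if not table:
        return [], []
    return table[0], table[1:]


page = """# Consolidated Statements of Comprehensive Income

| |Three Months Ended March 31, 2023|Three Months Ended March 31, 2024|
|---|---|---|
|Net income|$3,172|$10,431|
|Other comprehensive income (loss):| | |
|Net change|128|537|
"""

header, rows = parse_markdown_table(page)

# Index each row by its label, then by column header
by_label = {row[0]: dict(zip(header[1:], row[1:])) for row in rows}
print(by_label["Net income"]["Three Months Ended March 31, 2024"])
```

With the text-parser output, the same lookup would first require guessing which lines belong to which row.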

However, the LlamaParse output is missing some important context. The parsed page no longer mentions “(in millions)” or “(unaudited)”, so a generator LLM has to guess the units, which makes hallucination more likely.
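One possible mitigation is to re-attach the dropped context to the chunk before indexing it. The sketch below assumes you already know which lines to restore, for example because the text-parser output for the same page still contains them. `restore_context` is a hypothetical helper, not a LlamaParse or LlamaIndex feature.

```python
def restore_context(chunk, context_lines):
    """Prepend any context line that is not already present in the chunk."""
    missing = [line for line in context_lines if line not in chunk]
    if not missing:
        return chunk  # nothing to add, so the call is safe to repeat
    return "\n".join(missing) + "\n\n" + chunk


markdown_page = """# Consolidated Statements of Comprehensive Income

|Net income|$3,172|$10,431|
"""

# Context lines taken from the text-parser output of the same page
fixed = restore_context(markdown_page, ["(in millions)", "(unaudited)"])
print(fixed)
```

The same information could instead be stored in each node's metadata. Which option works better depends on how your pipeline builds the prompt.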

Takeaways

To optimize your RAG application, you must carefully select the right document parser. As you’ve seen, each parsing strategy offers distinct advantages and challenges:

Text Parsers: Tools like PyPDF or PyMuPDF extract text efficiently. However, you may lose document structure, potentially confusing your LLM during generation.
OCR: OCR tools like pytesseract capture both text and structure more effectively, preserving the original layout better than basic text parsers. However, OCR can misread individual characters, often comes with high latency, and its effectiveness depends heavily on your use case. You’ll need to evaluate whether the improved structure justifies the increased processing time for your application.
Intelligent Document Parsing (IDP): Advanced IDP methods like LlamaParse combine OCR, text extraction, and multi-modal LLMs to convert documents into well-structured markdown. However, you might occasionally lose critical context, such as units of measurement. IDP is also a less mature technology, currently facing scalability challenges and high latency. Plan for these limitations and for potential bottlenecks in your system.

Ultimately, your choice is use-case dependent. The best way to know for sure is to evaluate your application with different parsers, and choose the one that satisfies all your criteria. You may even find that a combination of approaches works best for your specific needs. Keep experimenting and refining your approach to achieve the best results for your RAG application.
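As a starting point for such an evaluation, the sketch below runs several parsers on the same input and reports how many expected strings each output contains, along with the wall-clock time taken. Everything here is illustrative: the function names are hypothetical, and the three "parsers" are stubs that return shortened versions of the outputs shown earlier in place of real PyMuPDF, pytesseract, or LlamaParse calls.

```python
import time


def coverage(parsed_text, expected_facts):
    """Fraction of expected strings found verbatim in the parsed text."""
    if not expected_facts:
        return 1.0
    hits = sum(1 for fact in expected_facts if fact in parsed_text)
    return hits / len(expected_facts)


def compare_parsers(parsers, source, expected_facts):
    """Run each parser callable on source; report coverage and elapsed seconds."""
    report = {}
    for name, parse in parsers.items():
        start = time.perf_counter()
        text = parse(source)
        elapsed = time.perf_counter() - start
        report[name] = {
            "coverage": coverage(text, expected_facts),
            "seconds": elapsed,
        }
    return report


# Stub parsers: each ignores its input and returns a fixed excerpt
text_out = "(in millions)\nNet income\n3,172\n10,431\nNet change\n128\n537"
ocr_out = "(in millions)\nNet income $ 3,172 §$ 10,431\nNet change 128 231"
idp_out = "|Net income|$3,172|$10,431|\n|Net change|128|537|"

parsers = {
    "text": lambda _source: text_out,
    "ocr": lambda _source: ocr_out,
    "idp": lambda _source: idp_out,
}

expected = ["(in millions)", "10,431", "537"]
report = compare_parsers(parsers, "report.pdf", expected)
for name, result in report.items():
    print(name, round(result["coverage"], 2), f'{result["seconds"]:.6f}s')
```

Verbatim string matching only measures whether facts survive parsing. It says nothing about structure, so a fuller evaluation would also score end-to-end answers from your RAG pipeline.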

Check out the GitHub tutorial for a detailed walkthrough:

GitHub - skandavivek/RAG-Doc-Parsers: This repository demonstrates different document parsing strategies for Retrieval-Augmented Generation (RAG)… (github.com)

If you like this post, follow EMAlpha — where we dive into the intersections of AI, finance, and data.