The Missing Link in RAG Systems: Bridging Text and Visuals in PDFs for Better Results

Daniel Gwerzman
Google for Developers EMEA
6 min read · Sep 23, 2024

Introduction

Handling PDFs in Retrieval-Augmented Generation (RAG) systems presents unique challenges due to the diverse nature of PDF content. Many documents combine text with images, diagrams, charts, and other non-text elements, which can be critical to fully understanding the material. Traditional RAG systems typically focus on extracting and indexing text using tools like OCR or PyPDFLoader. While this works for text-heavy documents, these systems struggle with images and other non-textual data, leading to incomplete or less meaningful responses.

I propose a method that leverages Gemini 1.5 Flash, a multimodal model capable of processing text, images, and other types of media. By marking pages containing non-text elements, embedding the text and images, and storing Base64-encoded versions of the entire PDF page in a vector database, this method preserves the full context of both text and visuals. This results in more accurate responses for tasks like document summarization, Q&A, and data extraction — especially when the content contains important visual elements.

Challenges with Traditional Text-Only RAG in PDF Handling

  • Text-Only Focus: Current RAG systems extract and index only text from PDFs, leaving out essential non-textual elements like images, charts, and illustrations. This is a significant limitation in documents where these elements are integral to conveying meaning. For example, in scientific papers, visual aids such as graphs and diagrams often hold the key to understanding complex processes or results, and text extraction alone fails to capture this.
  • Loss of Spatial Relationships: The positioning of text and images on a page frequently holds crucial meaning. In technical manuals or research papers, a diagram placed beside a paragraph often explains or complements the text. When text and visuals are stored separately in the RAG pipeline, this spatial relationship — and the additional meaning it provides — gets lost. As a result, the system cannot fully understand the context, leading to less coherent or useful responses.
  • Limited Context for Non-Textual Data: When RAG systems index only the text, the model lacks access to the complete context of the document. Visual elements often carry information that text alone cannot fully convey. Ignoring these elements can result in incorrect or incomplete answers, particularly in cases where visual data is essential for understanding the document.

Marking and Handling PDF Pages with Non-Text Content

To address these challenges, it’s important to detect and mark PDF pages that contain non-text content, such as images, illustrations, or charts.

  • Step 1: Identifying Non-Text Elements: The first step is to identify which pages contain visual elements. This can be done by combining text extraction tools like PyPDFLoader for the text content with additional image recognition models that scan for non-textual elements. Depending on the nature of the document, a multi-model approach can be employed to classify and segment these pages (a minimal sketch follows this list).
  • Step 2: Using Multi-Model Approaches: Because different types of data require different methods, flexibility is key. A variety of tools can be used to detect non-text content, including models that process images and text together. The specific approach depends on the data and its structure. For some documents, the entire page may need to be indexed as a single unit, while others might require breaking the content into smaller chunks. Testing different techniques on your dataset is essential for finding the optimal approach.
  • Step 3: Marking for Metadata Storage: Once identified, the pages that contain non-text elements are marked for special treatment during the indexing phase. This involves embedding the content (both text and visual) and preparing it for enhanced storage in the vector database.
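As a minimal sketch of Step 1, the snippet below uses PyMuPDF (one possible choice; any library that reports per-page images would work) to flag pages that contain embedded images. The mark_pages_with_visuals helper and its output format are illustrative assumptions rather than a fixed API, and documents whose charts are vector drawings may need additional checks.

import fitz  # PyMuPDF; any PDF library that reports per-page images would work

def mark_pages_with_visuals(pdf_path):
    """Return one record per page, flagging pages that contain embedded images."""
    doc = fitz.open(pdf_path)
    pages = []
    for page_number, page in enumerate(doc, start=1):
        images = page.get_images(full=True)   # embedded raster images on this page
        pages.append({
            "page_number": page_number,
            "text": page.get_text(),           # plain text, used later for embedding
            "has_non_text": len(images) > 0,   # mark for special treatment at indexing time
        })
    doc.close()
    return pages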

Storing Non-Text Data in the Vector Database

Once the non-text elements are identified and marked, they need to be stored in a way that preserves both text and visual content.

  • Embedding Text and Visual Elements: As in traditional RAG systems, text is embedded and indexed in the vector database. If the document contains both text and visuals, visual elements are either embedded directly (if the model supports it) or referenced as additional context. These embeddings allow the system to retrieve relevant content when queried.
  • Storing Base64-Encoded Pages as Metadata: In addition to the standard text embeddings, this method adds an extra layer by storing the entire PDF page as a Base64-encoded string in the vector database’s metadata. This ensures that when a page is retrieved, the system has access to both the visual and textual content in its original layout. Storing the entire page as Base64 allows the system to maintain the full visual context, including diagrams, images, and the spatial relationship between elements. A sketch of this indexing step follows this list.
  • Alternative: Storing Individual PDF Pages Directly: If Base64-encoded storage becomes a constraint due to database space or resource limitations, a more efficient approach is to store each individual PDF page as a file, rather than encoding it. The vector database would then store the location of the file rather than the actual Base64 data. This approach reduces the strain on database capacity while still allowing the system to retrieve full pages and display them when needed.
  • Database Space Considerations: Storing non-text data alongside traditional embeddings can consume significant storage space. It’s essential to assess whether the vector database can handle large volumes of Base64-encoded data or references to external PDF files. If space is a concern, storing the pages as external files and keeping only the file paths in the vector database offers a more scalable solution.
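Below is a minimal sketch of this indexing step. It assumes PyMuPDF for page extraction and ChromaDB as the vector database (any store that accepts per-record metadata would do); the collection name, IDs, and metadata keys are illustrative choices, not part of the method itself. Many vector stores cap metadata size, in which case storing a file path instead of the Base64 string, as described above, is the safer option.

import base64
import fitz      # PyMuPDF, as in the earlier sketch
import chromadb  # one possible vector database; any store with per-record metadata works

client = chromadb.Client()
collection = client.get_or_create_collection("pdf_pages")

doc = fitz.open("<FILE-LOCATION>")
for page_number, page in enumerate(doc, start=1):
    # Extract this page as a standalone single-page PDF so the full layout
    # (text plus visuals) can be re-sent to the model later.
    single_page = fitz.open()
    single_page.insert_pdf(doc, from_page=page_number - 1, to_page=page_number - 1)
    page_base64 = base64.b64encode(single_page.tobytes()).decode("utf-8")

    # Embed the page text as usual (Chroma's default embedding function here),
    # and keep the Base64-encoded page alongside it as metadata.
    collection.add(
        ids=[f"page-{page_number}"],
        documents=[page.get_text()],
        metadatas=[{
            "page_number": page_number,
            "page_base64": page_base64,  # or a file path, if storage space is a concern
            "has_non_text": len(page.get_images(full=True)) > 0,
        }],
    )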

Using Non-Text Data to Enhance Model Responses

Once the data is indexed and stored with both text and non-textual content, it can be used to improve the quality of responses during inference.

  • Leveraging Gemini 1.5 Flash’s Multimodal Capabilities: Gemini 1.5 Flash is a multimodal model, meaning it can process both text and images together, understanding them in context. During inference, when the system retrieves a document or page, instead of sending only the OCR-extracted text to the model, it can send the entire Base64-encoded page or the individual PDF page. This allows the model to understand the document as a whole, including the relationship between the text and visual elements.
  • Here’s an example of how this could work in practice:
import base64
import vertexai
from vertexai.generative_models import GenerativeModel, Part, GenerationConfig

# When running on Colab
from google.colab import auth
auth.authenticate_user()

PROJECT_ID = "<YOUR-PROJECT-ID>"
REGION = "<YOUR-REGION>"
vertexai.init(project=PROJECT_ID, location=REGION)

prompt = """
Explain the illustration on the page.
"""

# Read the stored PDF page and Base64-encode it, as it would be kept in the
# vector database's metadata.
file_url = "<FILE-LOCATION>"
with open(file_url, "rb") as pdf_file:
    pdf_data = pdf_file.read()
pdf_data_base64 = base64.b64encode(pdf_data).decode("utf-8")

# Decode the Base64 string back to bytes and pass the whole page to the model,
# so it sees the text and visuals in their original layout.
document = Part.from_data(
    data=base64.b64decode(pdf_data_base64),
    mime_type="application/pdf",
)

model = GenerativeModel("gemini-1.5-flash-001")

generation_config = GenerationConfig(
    max_output_tokens=8192,
    temperature=0,
    top_p=0.95,
)

responses = model.generate_content(
    [document, prompt],
    generation_config=generation_config,
)

print(responses.text)
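If the page was indexed as in the storage sketch above, the Base64 string would come out of the vector database's metadata at query time rather than from a local file. Assuming the same chromadb collection from that sketch, the retrieval side might look like this; from there, the generate_content call is unchanged:

results = collection.query(query_texts=["Explain the illustration on the page."], n_results=1)
page_base64 = results["metadatas"][0][0]["page_base64"]
document = Part.from_data(
    data=base64.b64decode(page_base64),
    mime_type="application/pdf",
)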
  • Improved Answer Quality: By providing the model with the complete document context (both text and visuals), the system is capable of generating richer, more meaningful responses. Whether summarizing a technical document, answering a complex question, or performing entity extraction, the inclusion of visual data allows the model to make more informed decisions. For example, a model can reference both a chart and its accompanying paragraph to give a more accurate and insightful explanation.

Practical Considerations

  • Page Chunking vs. Full-Page Storage: Depending on the type of document, you may need to break up the content into smaller, more manageable chunks. If chunking is necessary, storing the full Base64-encoded page in each chunk ensures that the visual context is retained even when the text is split (see the sketch after this list). This approach is particularly useful in large, image-heavy documents where images and text are closely related.
  • Database Space Management: Storing Base64-encoded data or external PDF pages can quickly consume storage space, so it’s critical to evaluate whether your vector database can handle the volume. If space is limited, consider storing files externally and referencing them in the database metadata. This balances efficient storage with the ability to retrieve full-page data when needed.
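Below is a minimal sketch of the chunking pattern described above, in plain Python. The chunk size, overlap, and chunk dictionary shape are arbitrary illustrative choices; the point is simply that every chunk carries the full Base64-encoded page (or a path to it) in its metadata.

def chunk_page_text(text, page_base64, page_number, chunk_size=1000, overlap=100):
    """Split page text into overlapping chunks, each carrying the full page as context."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "id": f"page-{page_number}-chunk-{len(chunks)}",
            "text": text[start:end],
            "metadata": {
                "page_number": page_number,
                "page_base64": page_base64,  # or a file path to the stored page
            },
        })
        if end == len(text):
            break
        start = end - overlap  # overlap keeps sentences from being cut at chunk edges
    return chunks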

Conclusion

By enhancing the RAG pipeline to capture both text and non-text content in PDFs and leveraging Gemini 1.5 Flash’s multimodal capabilities, this method significantly improves the quality of document understanding. Storing full-page data in Base64-encoded form or as individual pages in the vector database preserves the relationship between text and images, leading to more contextually rich and accurate model responses. Whether the goal is document summarization, Q&A, or entity extraction, this approach ensures that the model has access to the full scope of the document, improving the overall quality of its outputs.

* I would like to thank Sascha Heyer for his insightful article on “Multimodal Document Processing.” His work provided a valuable foundation for the code sample used in this article, and his approach inspired much of the technical implementation presented here.


Daniel Gwerzman
Google for Developers EMEA

Bridging technology and humans. Google Developer Expert, Google for Startups Accelerator mentor, tech-strategy consultant, entrepreneur, and Maker.