Smarter RAG with LLMSherpa’s smart chunking: A Streamlit Chatbot for querying PDF documents

Ioana Dragan
12 min read · Jul 7, 2024

The development of Retrieval-Augmented Generation (RAG) systems has revolutionized our ability to interact with large volumes of information. However, when I began experimenting with building a RAG system for my documents, particularly PDFs, I encountered a significant challenge: finding an effective way to chunk text that preserves related information and maintains context.

The limitations of traditional text splitters

Extracting meaningful content from PDFs for use in natural language processing applications presents unique challenges. The complex layouts and multi-level sections typical of PDF documents often prove too intricate for traditional text-splitting methods. This complexity can lead to fragmented context, lost hierarchical structure, and ultimately, diminished performance in downstream tasks such as question-answering systems.

Popular frameworks like LlamaIndex and LangChain offer various text splitting methods, such as character-based, token-based, and sentence-based splitters. While these approaches work well for simple text documents, they often fall short when processing PDFs due to several key limitations:

  1. Lack of structural awareness: Traditional splitters typically treat the document as a continuous stream of text, ignoring the inherent structure of PDFs. This can lead to chunks that cut across different sections or topics, losing important context.
  2. Fixed chunk sizes: Many splitters use a fixed character or token count to determine chunk boundaries. This rigid approach doesn’t account for the natural divisions within a PDF, such as paragraphs, sections, or chapters.
  3. Inability to handle complex layouts: PDFs often have multi-column layouts, sidebars, or text boxes. Traditional splitters may jumble the reading order, mixing content from different parts of the page.
  4. Loss of hierarchical information: Headings and subheadings, which provide crucial context and structure, are often not distinguished from body text by these splitters.
  5. Difficulty with non-textual elements: Charts, tables, and images in PDFs are typically ignored or poorly handled by text-based splitters, leading to loss of valuable information.
  6. Inconsistent handling of formatting: Text formatting like bold, italic, or different font sizes, which can indicate importance or structure in a PDF, is usually stripped away by basic text extraction methods.
  7. Page boundary issues: Traditional splitters may create chunks that end abruptly at page boundaries, potentially splitting sentences or paragraphs mid-thought.

These limitations can significantly impact the performance of RAG systems. When chunks lack proper context or structure, the retrieval process becomes less accurate, and the generated responses may be inconsistent or irrelevant.

For example, consider a technical manual in PDF format. A traditional splitter might create a chunk that begins with the end of one section, includes an unrelated image caption, and ends with the start of a new section. This chunk would be nearly useless for accurate information retrieval and could lead to confused or misleading responses from the chatbot.

LLMSherpa’s smart chunking

In my search for a more effective PDF parsing solution, I discovered LayoutPDFReader (presented in this article), a context-aware PDF parser from the LLMSherpa library designed specifically for building efficient RAG pipelines.

LayoutPDFReader performs “context-aware” chunking: it parses PDFs while retaining hierarchical layout information, such as:

  1. Identifying sections and subsections, along with their respective hierarchy levels.
  2. Merging lines into coherent paragraphs.
  3. Establishing connections between sections and paragraphs.
  4. Recognizing tables and associating them with their corresponding sections.
  5. Handling lists and nested list structures with precision.

LayoutPDFReader creates smart chunks from your PDF documents. It is designed to recognize and preserve a document’s inherent structure, identifying sections, subsections, paragraphs, lists, tables, and other structural elements so that chunks maintain logical coherence.

Instead of arbitrarily splitting text at page boundaries, smart chunking ensures that logical sections are kept together, even if they span multiple pages. This preserves the continuity of ideas and concepts.
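
For orientation, here is a minimal sketch of parsing a PDF with LayoutPDFReader. The parser URL is the public endpoint referenced in the llmsherpa README (a self-hosted nlm-ingestor backend can be used instead), and the file path is just a placeholder.

from llmsherpa.readers import LayoutPDFReader

# Public parser endpoint from the llmsherpa README; swap in a self-hosted
# nlm-ingestor instance if preferred
llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)

doc = pdf_reader.read_pdf("path/to/document.pdf")  # placeholder path

# each chunk carries its parent headings, keeping the text context-aware
for chunk in doc.chunks():
    print(chunk.to_context_text())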

A visual representation of the smart chunking approach for sections and subsections:

Screenshot from “Using Document Layout Structure for Efficient RAG”

The same chunking is done for lists and tables. All list items are grouped into a single chunk, regardless of page breaks. For tables, the chunking preserves both the header and data in a single unit, enabling various forms of LLM-based data analysis on the table contents.

Each chunk (e.g. section, paragraph, table, list) maintains a reference to its parent (e.g. parent section) and inherits contextual information from its parent chain. For example, a paragraph chunk may include the title of the document, the heading of its parent section, and the subheading of its immediate parent subsection.

You can read more about the layout-aware chunking method employed by the LayoutPDFReader in this article.

A structured approach to document chunking

When examining a well-structured PDF document, with its hierarchy of chapters, sections, and subsections, it’s intuitive to want to split the document along these natural divisions. However, the question arises: which sections are optimal for chunking?

The LLMSherpa layout reader assigns hierarchical levels to document sections, starting with the root at level -1, then progressing through levels 0, 1, 2, and so on. Document structures vary widely across different types. Consumer product manuals, like those for ovens or washing machines, often have a flat structure with few nested headings. In contrast, enterprise technical documentation, such as Amazon AWS user guides, can have deeply nested structures with six or more levels of headings.

This diversity poses a challenge: where do we draw the line for chunking? Which sections are too large or too small? Selecting all sections of a predefined level isn’t practical due to the variability in document structures. Large sections are inefficient for embedding and comparison with user queries, while very small sections may lack sufficient context.
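
To get a feel for where those boundaries fall in a given document, a small sketch like the one below (assuming a doc object returned by LayoutPDFReader) prints the section hierarchy along with its levels:

# print the section hierarchy of a parsed document; indentation reflects the level
for section in doc.sections():
    indent = "  " * max(section.level, 0)
    print(f"{indent}[level {section.level}] {section.title}")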

As highlighted in this LlamaIndex blog article, there’s no universally optimal chunk size; it’s a matter of evaluating different sizes against your specific data. The article summarizes the trade-offs as follows:

Why Chunk Size Matters

Choosing the right chunk_size is a critical decision that can influence the efficiency and accuracy of a RAG system in several ways:

Relevance and Granularity: A small chunk_size, like 128, yields more granular chunks. This granularity, however, presents a risk: vital information might not be among the top retrieved chunks, especially if the similarity_top_k setting is as restrictive as 2. Conversely, a chunk size of 512 is likely to encompass all necessary information within the top chunks, ensuring that answers to queries are readily available. To navigate this, we employ the Faithfulness and Relevancy metrics. These measure the absence of ‘hallucinations’ and the ‘relevancy’ of responses based on the query and the retrieved contexts respectively.

Response Generation Time: As the chunk_size increases, so does the volume of information directed into the LLM to generate an answer. While this can ensure a more comprehensive context, it might also slow down the system. Ensuring that the added depth doesn't compromise the system's responsiveness is crucial.

In essence, determining the optimal chunk_size is about striking a balance: capturing all essential information without sacrificing speed. It's vital to undergo thorough testing with various sizes to find a configuration that suits the specific use case and dataset.

In my case, after experimenting with different setups, I chose a fixed chunk size of 2048 characters and recursively split sections larger than this limit into subsections that fit within it. All of the resulting chunks are sections of different levels, each roughly 2048 characters in size. This approach maintains the integrity of the document structure, potentially improving the relevance of retriever results by returning entire sections rather than arbitrary slices of text.

CHUNK_SIZE = 2048

def _split_section_to_text(self, section, chunk_size=CHUNK_SIZE):
    # `Section` here refers to the section class from llmsherpa's layout reader
    sub_sections_as_text = []

    section_text = ''
    for child in section.children:
        child_text = child.to_text(include_children=True, recurse=True)

        # recursively split a child section if it is too large,
        # otherwise append it as its own chunk
        if isinstance(child, Section):
            # flush any accumulated non-section content first
            if section_text:
                sub_sections_as_text.append(section.parent_text() + "\n" + section.title + "\n" + section_text)
                section_text = ''

            if len(child_text) > chunk_size:
                sub_sections_as_text.extend(self._split_section_to_text(child, chunk_size))
            else:
                sub_sections_as_text.append(child.parent_text() + "\n" + child_text)
        else:
            # group together paragraphs, tables, etc.: everything that is not a section
            section_text += ("\n" if section_text else '') + child_text

    if section_text:
        sub_sections_as_text.append(section.parent_text() + "\n" + section.title + "\n" + section_text)

    return sub_sections_as_text

The method for section splitting above is then used to chunk the entire document, starting from the main level 0 sections:

chunks = []
main_sections = [section for section in doc.sections() if section.level == 0]
for section in main_sections:
    chunks.extend(self._split_section_to_text(section, chunk_size=chunk_size))

Note: LLMSherpa allows sections to be converted to text or HTML. To embed sections as HTML, you simply call the to_html method:

child_html = child.to_html(include_children=True, recurse=True)

Building a query engine with LlamaIndex

After preparing our text chunks, we can build a RAG pipeline using LlamaIndex. The process involves several key steps:

  1. Creating Document nodes

We start by creating a list of Document nodes from our text chunks. Each node can include custom metadata such as document title, author, or other information in the extra_info dictionary.
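
The snippets in this section come from a wrapper class, so imports are omitted. For reference, with the llama-index 0.10+ package layout they would look roughly like this (a sketch; adjust to your installed version):

from llama_index.core import Document, VectorStoreIndex, get_response_synthesizer
from llama_index.core.extractors import SummaryExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever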

# split the document into chunks of text
self.doc_chunks = self._split_document_to_text(doc, chunk_size=chunk_size, first_n_chunks=first_n_chunks)

nodes = [Document(text=chunk_text, extra_info={}) for chunk_text in self.doc_chunks]

2. Ingestion Pipeline and Metadata Extraction

The IngestionPipeline processes and transforms the input nodes. We can add metadata extractors at this stage:

# add summary metadata to each chunk
if add_summary:
    metadata_extractors = [SummaryExtractor(summaries=["self"])]
    pipeline = IngestionPipeline(transformations=metadata_extractors)
    nodes = pipeline.run(nodes=nodes, in_place=False, num_workers=2, show_progress=True)

LlamaIndex offers useful extractors like the QuestionsAnsweredExtractor, which generates question/answer pairs from a piece of text, and the SummaryExtractor, which creates summaries of the current text and adjacent texts. I used a SummaryExtractor for the section-level chunks.
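
As an illustration, both extractors can be combined in a single pipeline. The sketch below shows one possible configuration, not the exact setup used in this project:

# combine a summary extractor with a questions-answered extractor
# (illustrative configuration, not the one used in this article)
from llama_index.core.extractors import QuestionsAnsweredExtractor, SummaryExtractor
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[
        SummaryExtractor(summaries=["self"]),
        QuestionsAnsweredExtractor(questions=3),
    ]
)
nodes = pipeline.run(nodes=nodes, show_progress=True)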

3. Creating the Index and Retriever

We then create a VectorStoreIndex and a retriever, which form the core building blocks of a RAG system:

# create index and retriever
index = VectorStoreIndex(nodes)

self._retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=retrieve_top_k,
)

The VectorStoreIndex stores everything in memory by default unless we specify a storage context for it. The similarity_top_k parameter controls the number of returned results (the default is 2 for VectorIndexRetriever).

For embeddings, the default model used by LlamaIndex is OpenAI’s text-embedding-ada-002. The default LLM is gpt-3.5-turbo.
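
If other models are preferred, the defaults can be overridden globally through the Settings object; the model names below are illustrative choices, not the setup used here:

# override the default embedding model and LLM (llama-index 0.10+);
# model names are illustrative
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini")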

4. Building the Query Engine

The query engine is built on top of the retriever:

self._response_synthesizer = get_response_synthesizer()

# assemble query engine
self._query_engine = RetrieverQueryEngine(
    retriever=self._retriever,
    response_synthesizer=self._response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=similarity_threshold)],
)

We use a response synthesizer to generate LLM responses based on the user query and retrieved context chunks. The default mode (ResponseMode.COMPACT) efficiently fits as many chunks as possible into the context window. More about the response modes here.
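
If a different synthesis strategy is needed, the response mode can be set explicitly; a brief sketch:

# explicitly selecting the response mode (COMPACT is already the default)
from llama_index.core import get_response_synthesizer
from llama_index.core.response_synthesizers import ResponseMode

response_synthesizer = get_response_synthesizer(response_mode=ResponseMode.COMPACT)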

5. Node Post-processing

Post-processing allows for customization of retrieved nodes before they’re passed to the LLM. The SimilarityPostprocessor filters out nodes below a certain similarity score:

node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=similarity_threshold)]

This can be used to fine-tune the relevance of included context. Higher similarity_cutoff values improve precision but may reduce context breadth.

Multiple postprocessors can be combined for more sophisticated filtering. For example, you could add a KeywordNodePostprocessor to ensure critical terms are always present, regardless of similarity score. The list of all the available node postprocessors can be found here.
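
A sketch of such a combination is shown below; the cutoff and keyword values are illustrative:

# chain two postprocessors: drop low-similarity nodes, then keep only nodes
# containing a required keyword (values are illustrative)
from llama_index.core.postprocessor import KeywordNodePostprocessor, SimilarityPostprocessor

node_postprocessors = [
    SimilarityPostprocessor(similarity_cutoff=0.8),
    KeywordNodePostprocessor(required_keywords=["CodeWhisperer"]),
]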

By carefully configuring these components, you can create a RAG system tailored to your specific document structure and query needs, balancing between precision and context richness in responses.

Streamlit for the user interface

With our query engine ready, we can create a user-friendly interface using Streamlit. Our demo application allows users to:

  1. Input an OpenAI API key
  2. Upload a PDF document
  3. Ask questions about the document
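
A minimal sketch of the Streamlit wiring is shown below; the widget labels and the build_query_engine helper are illustrative placeholders rather than the exact code from the repository.

import streamlit as st

st.title("PDF RAG Chatbot")

# sidebar inputs for the API key and the PDF to index
openai_api_key = st.sidebar.text_input("OpenAI API key", type="password")
uploaded_pdf = st.sidebar.file_uploader("Upload a PDF document", type="pdf")

if openai_api_key and uploaded_pdf:
    # build_query_engine stands in for the chunking/indexing code shown earlier;
    # in practice the engine should be cached, e.g. in st.session_state
    query_engine = build_query_engine(uploaded_pdf, openai_api_key)

    question = st.text_input("Ask a question about the document")
    if question:
        response = query_engine.query(question)
        st.write(str(response))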

Testing Setup:

  • Document: Amazon CodeWhisperer user guide (209 pages)
  • Resulting chunks: 246 (each representing a document section)
  • Indexing time: Approximately 1 minute (may vary based on system)

Note: Indexing the entire document can take a while because every chunk needs to be embedded and summarized before being added to the index. For quicker experimentation, consider using a smaller document or adjusting the num_workers value for parallel execution in the IngestionPipeline.

Streamlit RAG chatbot

Key Features of the Streamlit RAG Chatbot:

  1. Configurable Parameters:
  • Retrieval top-k: Number of chunks to retrieve (1–5, default 3)
  • Retrieval similarity score: Minimum similarity threshold (default 0.8)

  2. Context Visibility: Users can view the retrieved context used by the LLM, including:
  • Similarity score for each chunk
  • Text content of each chunk

User query and retrieved context

For the question “Which IDEs are compatible with CodeWhisperer?”, the screenshot above shows the top 3 retrieved chunks. The third chunk, which has the lowest node score of the three, is actually the section that describes all of the IDEs that can be used with CodeWhisperer (CodeWhisperer > Setting up > Choosing your IDE).

This example illustrates an important point: the chunks with the highest similarity scores don’t always provide the most relevant information. To mitigate this, we can adjust the ‘Retrieval top-k’ parameter. Increasing this value retrieves more chunks, thereby improving the chances of including the most pertinent information in the context provided to the LLM.

In another example, shown below, the section containing the query’s answer is found in the second retrieved chunk: section CodeWhisperer > Features > Language support in Amazon CodeWhisperer.

User query and retrieved context

Note: The chatbot doesn’t draw upon its general knowledge to formulate responses; it relies solely on the uploaded document. If the retriever doesn’t find any matching nodes, the LlamaIndex response synthesizer literally returns the text “Empty Response”. For user clarity, this has been translated to the friendlier message: “I’m sorry, I can’t find the information in the provided document.”

Conclusions and further directions

The application presented in this article is only a proof of concept, demonstrating a structured RAG approach for complex documents like PDFs. Our method of preserving document structure during chunking has shown promising results in maintaining context and improving retrieval relevance. By recursively splitting sections to a target size while respecting document hierarchy, we’ve achieved a balance between granularity and coherence that enhances the overall performance of the system.

While there are numerous directions for further improvement, we will focus on two that seem particularly promising for future work:

  1. Advanced Metadata Extraction

Our current implementation relies on basic metadata extraction, primarily focused on summaries. While this provides some context, there’s significant room for enhancement. We propose leveraging Large Language Models (LLMs) to generate rich, contextual metadata for each chunk. This could include:

  • Key concepts and terms: Identifying and categorizing the main ideas and specialized vocabulary within each chunk.
  • Question and Answer pairs: Generating potential questions that the chunk answers, creating a more interactive knowledge base.
  • Relationship to other sections: Mapping the interconnections between different parts of the document, enhancing navigation and understanding.
  • Relevance to specific use cases or scenarios: Tagging content with its applicability to various real-world situations or industry contexts.

By implementing these advanced metadata extraction techniques, we can create a more nuanced and contextually aware system, capable of understanding and retrieving information at a deeper level.

2. Hybrid Retrieval Methods

Currently, our system primarily relies on vector-based retrieval using similarity scores. While effective, this method can sometimes miss relevant information that doesn’t closely match in vector space. To address this limitation, we propose developing a hybrid approach:

  • Combining vector similarity search with traditional keyword-based search: This would allow us to capture both semantic similarity and exact matches, providing a more comprehensive retrieval mechanism.
  • Weighting the importance of keyword matches vs. vector similarity: This could be dynamically adjusted based on query type or user preferences, allowing for more flexible and personalized retrieval.
  • Implementing semantic parsing of user queries: By better understanding the intent and structure of the question, we can guide the retrieval process more effectively, possibly by constructing more complex, multi-faceted queries.

These hybrid methods would enable our system to handle a wider range of query types and to find relevant information even when it’s not immediately apparent through vector similarity alone.
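
As one possible starting point for the first of these ideas, LlamaIndex already provides building blocks for retrieval fusion. The sketch below combines the existing vector retriever with a BM25 keyword retriever; it assumes the separate llama-index-retrievers-bm25 package and is not part of the current implementation.

# fuse vector similarity and BM25 keyword scores with reciprocal-rank fusion
# (requires the llama-index-retrievers-bm25 package; not part of the current code)
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=3)

hybrid_retriever = QueryFusionRetriever(
    [self._retriever, bm25_retriever],
    similarity_top_k=3,
    num_queries=1,  # disable automatic query generation
    mode="reciprocal_rerank",
)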

By focusing on these two areas — advanced metadata extraction and hybrid retrieval methods — we can significantly enhance the capabilities of our RAG system. The richer context provided by advanced metadata will improve both retrieval accuracy and response generation. Meanwhile, hybrid retrieval methods will increase the system’s flexibility and robustness across diverse query types and document structures.

The source code is available on GitHub.

References:

https://github.com/nlmatics/llmsherpa

https://github.com/run-llama/llama_index
