Optimizing Chunking Strategies for Retrieval-Augmented Generation (RAG) in the Context of Generative AI

thallyscostalat
Jan 14, 2024

When ingesting external documents, the first step is to break them into smaller segments, which are then embedded to capture their semantics. However, embedding segments that are too large or too small yields poor retrieval results. It is therefore crucial to determine the optimal segment size for the documents in the corpus to ensure the accuracy and relevance of what is retrieved.

Selecting an appropriate chunking strategy requires weighing several factors: the nature of the indexed content, the embedding model and the chunk size it performs best with, the expected length and complexity of user queries, and how the retrieved results are used by the application. This short article introduces the key chunking strategies: fixed-size splitting based on character counts, recursive approaches that balance fixed sizes with natural language structure, and more advanced techniques that track semantic topic changes.

1. Fixed-size (in characters) Overlapping Sliding Window.

This method divides text into fixed-size chunks based on character count. It is simple to implement, and the overlap between consecutive chunks helps avoid splitting a sentence or thought across a hard boundary. Its limitations include imprecise control over context size, the risk of cutting words or sentences mid-way, and a complete lack of semantic awareness. It is suitable for exploratory analysis but not recommended for tasks requiring deep semantic understanding.

Example using LangChain:

from langchain.text_splitter import CharacterTextSplitter

text = "..."  # your text
text_splitter = CharacterTextSplitter(
    chunk_size=256,    # maximum chunk length in characters
    chunk_overlap=20   # characters shared between consecutive chunks
)
docs = text_splitter.create_documents([text])
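
For intuition, the core sliding-window mechanics can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not LangChain's exact implementation:

def sliding_window(text, chunk_size=256, chunk_overlap=20):
    # each window starts chunk_size - chunk_overlap characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = sliding_window(text)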

2. Recursive Structure Aware Splitting.

A hybrid method combining the fixed-size sliding window with structure-aware splitting. It attempts to balance fixed chunk sizes with linguistic boundaries, offering precise context control. Implementation is more complex, and chunk sizes may vary. It is effective for tasks requiring granularity and semantic integrity, but not recommended for quick tasks or documents without clear structural divisions.

Example using LangChain:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "..."  # your text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=20,
    separators=["\n\n", "\n"]  # tried in order: paragraphs first, then lines
)
docs = text_splitter.create_documents([text])
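
The separators list is tried in order: the splitter first attempts to break on "\n\n" (paragraph boundaries) and falls back to "\n" (single lines) only for pieces that are still larger than chunk_size. LangChain's default list additionally falls back to " " and "", which guarantees that no chunk exceeds the size limit.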

3. Structure Aware Splitting (by Sentence, Paragraph).

This method follows the natural structure of the text, dividing it at sentence, paragraph, section, or chapter boundaries. Respecting linguistic boundaries preserves semantic integrity, but results vary with the structural complexity of the text. It is effective for tasks requiring context and semantics, but unsuitable for texts lacking clear structural divisions.

Example:

text = "..."  # your text
docs = text.split(".")  # naive split on periods; see the more robust alternative below
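
A plain split on "." breaks on abbreviations and decimal numbers. For more reliable sentence boundaries, NLTK's Punkt tokenizer is a common alternative (it requires a one-time download of the punkt model):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the sentence-boundary model
docs = sent_tokenize(text)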

4. Content-Aware Splitting (Markdown, LaTeX, HTML).

This method exploits the content type and structure of formatted documents such as Markdown, LaTeX, or HTML. It keeps different content types from being mixed within a chunk, preserving their integrity. The main challenge is that the splitter must understand the specific syntax of each format. It is useful for structured documents but not applicable to unstructured content.

Example for Markdown texts using LangChain:

from langchain.text_splitter import MarkdownTextSplitter

markdown_text = "..."  # your Markdown text
markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])

Example for LaTeX texts using LangChain:

from langchain.text_splitter import LatexTextSplitter

latex_text = "..."  # your LaTeX text
latex_splitter = LatexTextSplitter(chunk_size=100, chunk_overlap=0)
docs = latex_splitter.create_documents([latex_text])
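
The heading also mentions HTML; recent LangChain versions ship an HTMLHeaderTextSplitter that splits on heading tags and attaches them as metadata. The snippet below assumes the late-2023 API, so check your installed version for availability:

from langchain.text_splitter import HTMLHeaderTextSplitter

html_text = "..."  # your HTML text
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
docs = html_splitter.split_text(html_text)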

5. NLP Chunking: Tracking Topic Changes.

A more sophisticated approach based on semantic understanding: text is divided into chunks at points where a significant topic shift is detected. This keeps each chunk semantically consistent, but it demands more advanced NLP techniques. It is effective for tasks requiring semantic context and topic continuity, but unsuited to texts with heavy topic overlap or to simple chunking tasks.

Example using LangChain's NLTK-based splitter:

from langchain.text_splitter import NLTKTextSplitter

text = "..."  # your text
text_splitter = NLTKTextSplitter()  # splits on sentence boundaries via NLTK's Punkt tokenizer
docs = text_splitter.split_text(text)
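
Note that NLTKTextSplitter itself only splits on sentence boundaries; the topic-shift detection described above requires embeddings. Below is a minimal sketch of the idea, assuming a hypothetical embed() function that maps a sentence to a vector (any sentence-embedding model would do): a new chunk starts whenever consecutive sentences fall below a similarity threshold.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_on_topic_shift(sentences, embed, threshold=0.7):
    # embed() is a placeholder for your sentence-embedding model
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:  # topic shift detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks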

In conclusion, handling external documents effectively requires a thoughtful chunking strategy that accounts for the nature of the indexed content, the embedding model, expected user queries, and application-specific requirements. The strategies presented here offer a spectrum of approaches, each with its own strengths and limitations.
