RAG: Part 2: Chunking

Mehul Jain
7 min read · Apr 5, 2024


Information is endless, but our resources to digest it are limited. Similarly, a domain chatbot may sit on a ton of documents that can't all be passed to the LLM as supporting context because of the limited context window size. Chunking the documents is the solution.

Photo by Vardan Papikyan on Unsplash

In this blog, I will cover various chunking techniques, but before that, let's briefly discuss what chunking is, along with its pros and cons.

Chunking

It is a crucial step in Retrieval-Augmented Generation (RAG) as it breaks down long documents into smaller, more manageable units. These units, called chunks or passages, are then used for efficient retrieval and provide more focused context for the LLM during response generation.

Advantages of Chunking:

  • Improved Retrieval Efficiency: Smaller units allow for faster and more focused similarity searches in the vector database.
  • Enhanced LLM Context: Chunks provide the LLM with more specific information for generating relevant and informative responses.
  • Flexibility in Retrieval: Different chunks can be retrieved depending on the specific query, leading to more nuanced responses.

Disadvantages of Chunking:

  • Information Loss: Chunking can lead to some information loss at boundaries.
  • Increased Computational Cost: Chunking adds an additional processing step that may impact overall system speed.
  • Tuning for Optimal Performance: Choosing the right chunking technique and parameters requires careful experimentation.

Chunking Techniques:

1. Fixed-Size Chunking:

The text is chopped up into chunks of a predetermined size (e.g., 512 characters or tokens). Each chunk becomes a unit for processing within the RAG model.
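A minimal sketch of character-based fixed-size chunking (the function name and sizes here are illustrative, not from any particular library):

```python
def fixed_size_chunks(text: str, chunk_size: int = 512) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = fixed_size_chunks("The quick brown fox jumps over the lazy dog. " * 30,
                           chunk_size=100)
print(len(chunks), len(chunks[0]))
```

Note that the last chunk will usually be shorter than `chunk_size`, and boundaries fall wherever the counter happens to land, including mid-word.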

Advantages:

  • Simplicity: Fixed-size chunking is very easy to implement. You just define the desired chunk size and split the text accordingly.
  • Efficiency: This method is computationally efficient because it doesn’t require complex analysis of the text structure.

Disadvantages:

  • Loss of Context: Since it relies solely on character count, fixed-size chunking can break sentences or paragraphs in half, potentially disrupting the meaning and flow of the text.
  • Ignores Structure: This approach disregards the inherent structure of the text, such as sentences or paragraphs. This can be problematic for tasks where capturing semantic relationships is crucial.

2. Sentence-Based Chunking:

The text is divided into individual sentences using full stops (.), question marks (?), and exclamation marks (!) as delimiters. Each sentence becomes a separate chunk for processing within the RAG model. It is typically used where understanding the meaning within each sentence is important.
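A naive regex-based sketch of this idea (real pipelines often use a proper sentence tokenizer, since this version will break on abbreviations like "Dr."):

```python
import re

def sentence_chunks(text: str) -> list[str]:
    # Split after ., ?, or ! followed by whitespace; the lookbehind
    # keeps the delimiter attached to its sentence.
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    return [s for s in sentences if s]

print(sentence_chunks("Where is it? It is here. Great!"))
```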

Advantages:

  • Preserves Meaning: Sentence-based chunking helps retain the semantic sense of the text by keeping sentences intact. This ensures the model receives contextually relevant units for information retrieval and generation.
  • Improved Efficiency: Like fixed-size chunking, it’s a relatively simple method to implement, making it computationally efficient.

Disadvantages:

  • Limited Scope: While it maintains sentence-level meaning, sentence-based chunking might not capture broader relationships between sentences that span multiple chunks.
  • Potential Redundancy: Depending on the nature of the text, there could be redundancy across consecutive sentences within a chunk.

3. Sliding Window Chunking:

Sliding window chunking is a technique used to segment data streams or text into overlapping chunks. It combines elements of fixed-size chunking and sentence-based chunking, offering some advantages over both. Unlike fixed-size chunking with clean cuts, the window in sliding window chunking overlaps with the previous window by a certain amount. This overlap ensures that context is preserved across chunks.
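A character-level sketch of a sliding window (the window and overlap values are illustrative; a token-level version works the same way):

```python
def sliding_window_chunks(text: str, window: int = 200, overlap: int = 50) -> list[str]:
    """Emit windows of `window` chars, each overlapping the previous by `overlap`."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + window])
        if i + window >= len(text):  # last window reached the end of the text
            break
    return chunks
```

Because consecutive chunks share `overlap` characters, a sentence cut off at the end of one chunk reappears intact at the start of the next.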

Advantages:

  • Preserve Context: By incorporating overlap, sliding window chunking avoids the issue of breaking sentences or important information at chunk boundaries, which can happen in fixed-size chunking.
  • Flexibility: The window size and overlap size can be adjusted based on the specific needs of the task. For instance, a larger window with more overlap might be useful for capturing long-range dependencies in the data, while a smaller window with less overlap might be better for tasks requiring more granular analysis.

Disadvantages:

  • Increased Complexity: Compared to fixed-size chunking or sentence-based chunking, implementing a sliding window requires handling overlaps and potential redundancies across chunks, making it slightly more complex.
  • Computational Overhead: The overlapping nature can lead to some redundancy in the processed data, potentially increasing computational demands.

Deciding the right chunk size and overlap is purely a matter of experimentation. Please refer to this wonderful blog post to see how you can arrive at suitable chunk parameters for your use case.

4. Recursive Character Text Splitter:

As the name suggests, it recursively tries to split the data. The splitter has a set of pre-defined characters it uses to attempt to split the text. These characters typically include double newline characters (\n\n), single newline characters (\n), spaces ( ), and tabs (\t).

It prioritizes larger chunks. It starts by trying to split the text using the first character on the list (e.g., double newline). If the resulting chunks are still larger than a specified chunk size, it moves on to the next character type (e.g., newline) and attempts to split again. This continues until the chunks are all smaller than the chunk size or it runs out of character types to try.
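The loop above can be sketched in plain Python as follows. This is a simplified illustration: the widely used implementation (e.g., LangChain's RecursiveCharacterTextSplitter) also merges small pieces back up toward the target size, which this version skips.

```python
def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", " ", "")) -> list[str]:
    """Recursively split text until every piece fits within chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard cut at chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for part in (p for p in text.split(sep) if p):
        if len(part) <= chunk_size:
            chunks.append(part)
        else:
            # Piece is still too big: retry with the next separator.
            chunks.extend(recursive_split(part, chunk_size, rest))
    return chunks
```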

Advantages:

  • Preserves Meaning: By prioritizing splits at natural text boundaries like paragraphs and sentences (using newlines and spaces), the Recursive Character Text Splitter aims to keep semantically related pieces of text together. This helps maintain the overall flow and meaning of the text.
  • Flexibility: It can be customized with different character sets to prioritize specific splitting behaviours. For instance, you could add punctuation marks to the list if you want to ensure sentences aren’t broken up.

Disadvantages:

  • Computational Cost: Recursion can be computationally expensive for very large texts.
  • Potential Oversplitting: Depending on the character set and chunk size, there’s a possibility of oversplitting the text, creating very small chunks that might not be ideal for all applications.

5. Embedding-Based Chunking:

The first step involves dividing the text into smaller segments using one of the chunking techniques we have covered so far.

Once the text is chunked, each chunk is fed into an embedding model. This model transforms the chunk of text into a dense vector representation.

Next, calculate the semantic similarity (e.g., cosine similarity) of adjacent chunks. If the similarity is above a specific threshold, merge the two chunks.
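A runnable sketch of the merge step. The `toy_embed` function here is a hypothetical stand-in for a real embedding model (e.g., a sentence-transformers model); only the merge logic is the point:

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def merge_similar_chunks(chunks: list[str], threshold: float = 0.5,
                         embed=toy_embed) -> list[str]:
    """Merge each chunk into the previous one when their embeddings are similar."""
    if not chunks:
        return []
    merged = [chunks[0]]
    for chunk in chunks[1:]:
        if cosine(embed(merged[-1]), embed(chunk)) >= threshold:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged
```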

Advantages:

  • Enhanced Text Understanding: By considering meaning, semantic chunking can capture complex relationships within the text that might be missed by simpler chunking methods. This deeper understanding can be valuable for various NLP tasks.
  • Improved Task Performance: In tasks like question answering, text summarization, or sentiment analysis, semantic chunking can help the model focus on the relevant parts of the text and improve the overall performance.

Disadvantages:

  • Computational Complexity: The reliance on advanced NLP techniques can make semantic chunking computationally expensive, especially for large amounts of text.
  • Data Dependency: The effectiveness of semantic chunking can be highly dependent on the quality of the training data used for the NLP models involved.

6. NSP-Based Chunking:

While next sentence prediction (NSP) isn’t a direct chunking method in RAG models, it can be an interesting concept to explore for informing chunking decisions. Here’s how it might work:

Idea:

  • Train a next-sentence prediction model on a massive dataset of text and sentence pairs, or use pre-trained BERT models.
  • For a given text, the model would predict how likely a potential split between sentences is to be a natural continuation point.
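The two steps above could be wired together roughly as follows. The NSP model is represented by a pluggable `nsp_score` callable; `toy_score` is a hypothetical stand-in (a real implementation would use, e.g., BERT's next-sentence-prediction head):

```python
def nsp_chunks(sentences: list[str], nsp_score, threshold: float = 0.5) -> list[str]:
    """Start a new chunk wherever the next-sentence score drops below threshold.

    nsp_score(a, b) should return the model's probability that sentence b
    is a natural continuation of sentence a.
    """
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if nsp_score(prev, cur) >= threshold:
            chunks[-1].append(cur)   # likely continuation: keep together
        else:
            chunks.append([cur])     # unlikely continuation: split here
    return [" ".join(c) for c in chunks]

def toy_score(a: str, b: str) -> float:
    # Hypothetical scorer: treat sentences sharing any word as continuations.
    return 1.0 if set(a.lower().split()) & set(b.lower().split()) else 0.0
```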

Potential Benefits:

  • Improved Context Awareness: By leveraging NSP, this approach could go beyond basic chunking methods and consider the semantic flow between sentences.
  • Data-Driven Approach: The NSP model can be trained on various text data, potentially adapting to different writing styles and incorporating domain-specific knowledge.

Disadvantages:

  • NSP Model Training: Training a high-quality NSP model requires a large amount of text data, which can be resource-intensive.
  • Computational Cost: Running the NSP model for every candidate split can add computational overhead to the chunking process.
  • Trade-offs: There might be a trade-off between accuracy and efficiency. A more complex NSP model might provide better predictions but require more resources.

7. Content-Aware Chunking:

It is specifically designed for structured documents written in languages like Markdown, LaTeX, and HTML. It focuses on splitting the text based on its inherent structure and content type, ensuring chunks don’t contain mixed elements. It leverages the markup syntax of these languages to identify structural elements like headings, code blocks, tables, and lists.
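For Markdown, a minimal version of this idea splits at heading boundaries (libraries such as LangChain ship richer variants of this, e.g., header-aware Markdown splitters; this sketch handles only ATX-style `#` headings):

```python
import re

def markdown_section_chunks(md: str) -> list[str]:
    """Split a Markdown document into one chunk per heading-delimited section."""
    chunks, current = [], []
    for line in md.splitlines():
        # A new ATX heading (1-6 hashes + space) closes the current section.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```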

Advantages:

  • Preserve Meaning: By respecting the document’s structure, Content-Aware Splitting helps maintain the intended meaning and context of the text within each chunk. This is crucial for tasks where structure is important, like information retrieval or summarization.
  • Improved Processing: When processing the chunks further (e.g., sentiment analysis, information extraction), the model can focus on a specific content type within each chunk, potentially leading to better results.

Disadvantages:

  • Limited Scope: This method is primarily suited for structured documents with well-defined markup. It might not be as effective for plain text documents lacking clear structural elements.
  • Potential Overheads: Depending on the complexity of the markup language and the chunking implementation, there could be some overhead associated with parsing the structure before splitting the text.

Conclusion

Chunking is a vital component of RAG, enabling efficient information retrieval and context-aware response generation. The choice of technique depends on various factors, and experimentation is key to finding the best approach for your specific use case.

Thanks for spending your time on this blog. I am open to suggestions and improvements. Please let me know if I missed any details in this article.
