Breaking Down Text: Exploring Multiple Chunking Methods for RAG and LLM

Shweta Gargade
7 min read · Mar 28, 2024

In the realm of natural language processing (NLP), approaches like Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) have revolutionized the way we interact with and extract insights from textual data. These models excel at understanding and generating human-like responses, making them indispensable tools for various NLP tasks. However, behind their remarkable capabilities lies a fundamental preprocessing step known as chunking, which plays a crucial role in their effectiveness.

Why is Chunking Necessary?

Imagine trying to comprehend an entire book in a single glance — it’s an overwhelming task, to say the least. RAGs and LLMs have a limited context window, meaning they can only process a certain amount of text at a time. This is where chunking comes into play. Chunking ensures each segment contains relevant information, minimizing noise for easier processing and understanding by LLMs.

Chunking involves breaking down large documents or texts into smaller, manageable chunks or segments. These chunks are typically of a fixed size or based on predefined criteria such as paragraphs, sentences, or even arbitrary divisions. By segmenting the text, chunking enables models to focus on smaller units at a time, making it easier to process and analyze the content effectively.

For instance, let’s take the opening paragraph of a paper on climate change:

“Climate change is one of the most pressing issues facing our planet today. Its far-reaching impacts extend to various aspects of life, including the environment, economy, and human health.”

Instead of processing the entire paper at once, we chunk it into smaller segments, like so:

  1. Chunk 1: “Climate change is one of the most pressing issues facing our planet today.”
  2. Chunk 2: “Its far-reaching impacts extend to various aspects of life, including the environment, economy, and human health.”

Discovering the ideal chunk size or best chunking strategy is a challenging task. Experimentation and evaluation play a crucial role here, as varying chunk sizes can yield distinct retrieved results, even with identical queries.

In this blog post, we’ll delve into a comparison between various rule-based chunking strategies and semantic clustering approaches. At their core, rule-based methods rely on explicit separators like spaces, \n, or \n\n, or on tools such as regex, NLTK, or LangChain, to segment text into chunks. In contrast, semantic clustering methods use machine learning models to decipher context and identify natural divisions within the text, leveraging its inherent meaning for chunking.
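
Since the rest of this post focuses on the rule-based side, here is a minimal sketch of what a semantic approach can look like: split the text into sentences, embed them, and start a new chunk whenever the similarity between consecutive sentences drops. The model name, the 0.5 threshold, and the semantic_chunk helper are illustrative assumptions, not a prescribed implementation, and the snippet assumes the sentence-transformers, scikit-learn, and NLTK packages are installed.

import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')  # sentence tokenizer models

# Any sentence-embedding model works; this one is small and fast (illustrative choice)
model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_chunk(text, threshold=0.5):
    # Split into sentences and embed each one
    sentences = sent_tokenize(text)
    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare each sentence with the one before it
        similarity = cosine_similarity([embeddings[i - 1]], [embeddings[i]])[0][0]
        if similarity < threshold:
            # Low similarity suggests a topic shift, so close the current chunk
            chunks.append(' '.join(current_chunk))
            current_chunk = []
        current_chunk.append(sentences[i])

    chunks.append(' '.join(current_chunk))
    return chunks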

1. Baseline Chunking Strategy (Content-aware)

Paragraph-level chunking in Python involves splitting a text document into segments based on paragraphs. You can achieve this with various techniques, such as regex pattern matching, libraries like NLTK (Natural Language Toolkit), or simply Python’s built-in .split() method on \n\n.

Below is an example of a baseline chunking approach that splits an entire document into paragraph-level chunks using Python:

# Example document
document = """
Our Mission:
Vision is to Equip the Students with the Knowledge and Practice of the Technologies to prepare them for the Emerging Jobs of the IT Industry

Who We Are:
VisionNLP is a trending AI/ML e-learning platform. We help our participants to upgrade their skills. VisionNLP helps you to deploy Real-World AI solutions to solve your business problems using skills that you're going to learn with us. We build next generation Data Scientist's from Beginners to advanced levels. We understood students need and provide training for Statistics and mathematics, Python,Machine learning, Data Science, Deep learning, NLP.

We help college students, company employees to become professional Data Scientists and provide corporate training to companies to get trained, acquire certifications, and upskill their employees. We have highly qualified trainers/experts, with them we provide online classes, Self-learning platform, project work, and 24/7 teaching assistance.

Become a mastermind in your career.
"""
# Split the document into paragraphs
paragraphs = document.split('\n\n') # Assuming paragraphs are separated by double newline characters

# Process each paragraph and create chunks
chunks = []
for paragraph in paragraphs:
    # Skip empty or whitespace-only paragraphs
    if not paragraph.strip():
        continue

    # Append the cleaned paragraph as a chunk
    chunks.append(paragraph.strip())

# Inspect the resulting chunks
chunks

Have you ever wondered if our chunking strategy can handle large paragraphs effectively? Let’s explore further.

Imagine you’re dealing with lengthy paragraphs in your text. Will it still be feasible to process them directly through our language model? Considering the token limit constraints of large language models (LLMs), this becomes a crucial question to address. How can we adapt our strategy to handle such scenarios efficiently?

One essential parameter to consider is the chunk size. By breaking down the text into smaller, manageable chunks, we can ensure that each segment stays within the token limit of the LLM. But how do we determine the optimal chunk size for our strategy? This parameter plays a vital role in the effectiveness of our chunking approach.

def split_large_text(large_text, chunk_size):
    # Tokenize the text by splitting it into words
    words = large_text.split()

    chunks = []
    current_chunk = []
    current_length = 0

    for word in words:
        current_chunk.append(word)
        current_length += 1

        # If the current chunk reaches the maximum number of tokens, add it to chunks
        if current_length >= chunk_size:
            # Combine the words in the current chunk into a string and remove trailing punctuation
            chunk_str = ' '.join(current_chunk).rstrip(' .,;')
            chunks.append(chunk_str)
            current_chunk = []
            current_length = 0

    # Add the remaining chunk if it's not empty
    if current_chunk:
        chunk_str = ' '.join(current_chunk).rstrip(' .,;')
        chunks.append(chunk_str)

    return chunks

# Example usage
chunk_size = 5  # Maximum number of tokens per chunk

chunks = split_large_text(document, chunk_size)
chunks

Now that we’ve explored implementing the baseline chunking strategy, let’s consider the real-world scenario where paragraphs in our data are often extensive. Determining the ideal chunk size for splitting large texts can be quite a puzzle, don’t you think? How do we strike the balance between chunk sizes to ensure readability without losing critical information?

Shorter chunks provide detailed information but may lack context, leading to potential ambiguity. On the other hand, larger chunks offer broader context, enhancing coherence, but they may also introduce noise or irrelevant information. The optimal chunk size depends on the use case and the desired outcome of the system.

Limitations:

  1. Fixed Size: A single chunk size rarely suits every document, so chunks may cut across natural boundaries and lose coherence.
  2. Lack of Contextual Understanding: The baseline chunking strategy treats each chunk as an independent unit without considering the contextual relationships between adjacent chunks.
  3. Insensitivity to Document Structure: The baseline chunking strategy does not take into account structural elements of the document, such as headings and subheadings.

2. Fixed Size Chunking

This is the most common and straightforward approach to chunking: we simply decide the number of tokens (words) per chunk and, optionally, whether there should be any overlap between them. In general, we want to keep some overlap between chunks so that semantic context doesn’t get lost at chunk boundaries. Fixed-size chunking is the best path in most common cases: compared to other forms of chunking, it is computationally cheap and simple to use, since it doesn’t require any NLP libraries.

First, we count the number of tokens in the entire document (for simplicity, the number of words and symbols) and then split the document into chunks of a fixed number of tokens.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, needed once per environment

def chunk_text(text, chunk_size):
    tokenized_text = word_tokenize(text)
    num_tokens = len(tokenized_text)

    chunks = []
    start_index = 0

    while start_index < num_tokens:
        end_index = min(start_index + chunk_size, num_tokens)
        chunk_tokens = tokenized_text[start_index:end_index]
        chunk = ' '.join(chunk_tokens)
        chunks.append(chunk)
        start_index = end_index

    return chunks

# Example usage
chunk_size = 20
chunks = chunk_text(document, chunk_size)
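
The description above also mentions keeping some overlap between chunks, which chunk_text does not do. Below is a minimal sketch of an overlapping variant; the chunk_text_with_overlap name and the overlap value of 5 are illustrative choices rather than part of the original approach.

def chunk_text_with_overlap(text, chunk_size, chunk_overlap):
    tokenized_text = word_tokenize(text)
    num_tokens = len(tokenized_text)
    # Each new chunk starts chunk_size - chunk_overlap tokens after the previous one
    step = max(1, chunk_size - chunk_overlap)

    chunks = []
    for start_index in range(0, num_tokens, step):
        end_index = min(start_index + chunk_size, num_tokens)
        chunks.append(' '.join(tokenized_text[start_index:end_index]))
        if end_index == num_tokens:
            break

    return chunks

# Example usage: consecutive chunks share 5 tokens
chunks_with_overlap = chunk_text_with_overlap(document, chunk_size=20, chunk_overlap=5)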

3. Langchain Chunking

LangChain offers various chunking approaches. Let’s explore each one step by step. This method is quite easy to implement and quite popular for RAG use cases.

1. Character Text Splitter (Fixed-Size Chunking)

This is the simplest method. It splits on a single separator (by default "\n\n") and measures chunk length by the number of characters. This method is also known as fixed-size chunking with overlap.

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

# Create the chunks
texts = text_splitter.create_documents([document])

2. Recursive Character Text Splitter

This text splitter is highly recommended for processing generic text. It operates based on a list of characters, attempting to split the text accordingly until the chunks reach an optimal size. By default, it prioritizes splitting at paragraph breaks (“\n\n”), followed by individual lines (“\n”), spaces (“ “), and finally, empty strings (“”). This approach aims to preserve the semantic relationship of the text, keeping paragraphs, sentences, and words together as much as possible.

  • Text Splitting Method: By a specified list of characters.
  • Chunk Size Measurement: Based on the number of characters.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=256,
    chunk_overlap=20,
)

docs = text_splitter.create_documents([document])
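
To see what the splitter produced, you can inspect the page_content of each returned Document; assuming document is the sample text defined earlier, something like this prints a quick summary:

# Each element of docs is a LangChain Document; the text lives in .page_content
for i, doc in enumerate(docs, start=1):
    print(f"Chunk {i} ({len(doc.page_content)} chars): {doc.page_content[:80]}...")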

Figuring out the best chunk size for your application

Looking to optimize your chunk size but unsure where to start? Here are some key steps to guide you in finding the perfect fit for your data:

How do I start optimizing my chunk size?
Begin by preprocessing your data to ensure its quality. This might involve tasks like removing HTML tags or noisy elements, especially if your data comes from the web.

What’s the next step after preprocessing?
Select a range of potential chunk sizes to test. Consider factors like the nature of your content and the capabilities of your embedding model. Aim for a balance between capturing semantic information and maintaining context.

What are the common chunk sizes?
Start by exploring a variety of chunk sizes, including smaller chunks (e.g., 128 or 256 tokens) for capturing more granular semantic information and larger chunks (e.g., 512 or 1024 tokens) for retaining more context.

How do I evaluate the performance of different chunk sizes?
You can use multiple indices or a single index with multiple namespaces to test various chunk sizes. Generate embeddings for each size and save them in your index. Then, run queries to evaluate quality and compare performance across different chunk sizes.
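
As a rough sketch of that comparison loop, the snippet below chunks the document at a few candidate sizes, embeds the chunks and a sample query with sentence-transformers, and prints the best-matching chunk per size. The query text and model name are assumptions for demonstration only, and a real evaluation would use a labelled query set rather than a single query.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative embedding model and query; replace with your own
model = SentenceTransformer('all-MiniLM-L6-v2')
query = "What kind of training does VisionNLP provide?"
query_embedding = model.encode([query])

for size in [128, 256, 512]:
    # Reuse the fixed-size chunk_text function from earlier
    candidate_chunks = chunk_text(document, size)
    chunk_embeddings = model.encode(candidate_chunks)
    scores = cosine_similarity(query_embedding, chunk_embeddings)[0]
    best = scores.argmax()
    print(f"chunk_size={size}: best score {scores[best]:.3f} -> {candidate_chunks[best][:60]}...")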

Remember, it’s an iterative process. Keep testing until you find the chunk size that works best for your content and expected queries.


Shweta Gargade

Senior Data Scientist | NLP & Speech Researcher | Helping Freshers | LinkedIn:@shwetagargade