Document Chunking for AI RAG Applications

David Richards
13 min read · Dec 18, 2023

When developing a RAG application, it is important to have a well-established document chunking pattern for ingesting content. Many libraries can do this for you, but it is still worth understanding the underlying mechanics, because chunking is the bedrock of your RAG AI application.

Implications for AI Performance

Better retrieval performance is one of the primary benefits of document chunking. A RAG system without chunking can struggle to retrieve the most pertinent information, especially when processing long documents that mix relevant and irrelevant content. Chunking not only makes data more manageable but also enables more accurate matching between the information request and the retrieved document sections.

Moreover, document chunking allows for parallel processing of chunks, leading to improved scalability and quicker response times. It also enables models to focus on the most relevant parts of documents, thereby improving the overall precision and quality of the generated content.
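To make the matching claim concrete, here is a minimal sketch that ranks chunks against a query. It uses bag-of-words cosine similarity as a crude stand-in for the embedding model a real RAG pipeline would use, and the function names (bag_of_words, cosine, rank_chunks) and sample sentences are assumptions made for illustration only.

```python
import re
from collections import Counter
from math import sqrt
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_chunks(query: str, chunks: List[str]) -> List[Tuple[float, str]]:
    """Return (score, chunk) pairs sorted from best to worst match."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Illustrative sample data (not the output of any splitter in this article).
chunks = [
    "Pricing is based on the volume of input tokens and output tokens.",
    "To sign up for an AWS account, follow the online instructions.",
]
query = "How is pricing calculated for tokens?"

for score, chunk in rank_chunks(query, chunks):
    print(f"{score:.2f}  {chunk}")
```

In this toy setup the pricing sentence ranks first, and it also scores higher on its own than the two sentences joined into one block of text, because the unrelated sign-up sentence dilutes the match.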

Methodologies for Document Chunking

  1. Fixed-Length Chunking: This simple strategy involves splitting documents into chunks of a predetermined size. While straightforward, this method might not always align chunks with the logical structure of the documents, potentially splitting a sentence or idea across two chunks.
  2. Context-Aware Chunking: This more sophisticated approach takes into account the logical and semantic structure of documents. It aims to preserve coherent sections like paragraphs and sections, using natural breaks in the text to define chunk boundaries.
  3. Sliding Window Chunking: In this method, chunks have some overlap, ensuring that information at the edges of chunks is not lost. Sliding window chunking provides a balance between fixed-length and context-aware techniques.
  4. Adaptive Chunking: Leveraging machine learning, adaptive chunking dynamically alters chunk sizes and boundaries based on the context and the retrieval task at hand. This method is particularly effective but also the most computationally intensive.
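The first three strategies above can be sketched in a few lines of plain Python. This is a simplified illustration that measures size in characters and treats blank lines as paragraph breaks; the function names and parameters are hypothetical, and adaptive chunking is left out because it depends on a trained model.

```python
from typing import List


def fixed_length_chunks(text: str, size: int) -> List[str]:
    """Fixed-length chunking: cut every `size` characters, ignoring structure."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def sliding_window_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Sliding window chunking: consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def paragraph_chunks(text: str, max_chars: int) -> List[str]:
    """Context-aware chunking (simplified): pack whole paragraphs up to max_chars."""
    chunks, current = [], ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate  # an oversized single paragraph is kept whole
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


print(fixed_length_chunks("abcdefghij", 4))
print(sliding_window_chunks("abcdefghij", 4, 2))
print(paragraph_chunks("A one.\n\nB two.\n\nC three is longer.", 14))
```

The trade-off is visible even at this scale: the fixed-length version cuts wherever the counter lands, the sliding window repeats text so that nothing at a boundary is lost, and the paragraph version never cuts inside a paragraph at the cost of uneven chunk sizes.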

Test Document

For our test document, we will use a large PDF from the Amazon Bedrock documentation to compare how each chunking approach behaves.

https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf

Let's try some of the document chunking libraries we have available and see the results.
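When comparing splitters, it can help to look at a few summary numbers alongside the printed chunks. The helper below is a hypothetical addition rather than part of any library used in this article: it takes a plain list of strings and reports how many chunks there are, how long they are in characters, and how many end on sentence punctuation.

```python
from statistics import mean
from typing import Dict, List


def summarize_chunks(chunks: List[str]) -> Dict[str, float]:
    """Summarize chunk sizes (in characters) and how many end on . ? or !"""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min_chars": 0, "max_chars": 0,
                "mean_chars": 0.0, "complete_endings": 0}
    complete = sum(1 for c in chunks if c.rstrip().endswith((".", "?", "!")))
    return {
        "count": len(lengths),
        "min_chars": min(lengths),
        "max_chars": max(lengths),
        "mean_chars": mean(lengths),
        "complete_endings": complete,
    }


print(summarize_chunks(["Done.", "cut off mid", "Really?"]))
```

For splitters that return document objects instead of strings, pass the text field, for example [d.page_content for d in texts] for the LangChain documents shown below.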

LangChain

RecursiveCharacterTextSplitter

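from langchain.text_splitter import RecursiveCharacterTextSplitter

# fullDoc holds the extracted text of the test PDF as a single string.
# chunk_overlap is not set here, so the library default applies;
# that is why consecutive chunks in the output below share text.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([fullDoc])
for i, text in enumerate(texts):
    print(f'doc: #{i}', text)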

# doc: #0 page_content='Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top foundation models for your use cases, privately customize them with your data using techniques'
# doc: #1 page_content='privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top foundation models for your use cases, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using'
# doc: #2 page_content="foundation models for your use cases, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. With Amazon Bedrock's serverless experience, you can get"

TokenTextSplitter

from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(fullDoc)
for i, text in enumerate(texts):
    print(f'doc: #{i}', text)

# doc: #0 Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top foundation models for your use cases, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. With Amazon Bedrock's serverless
# doc: #1 experience, you can get started quickly, privately customize foundation models with your own data, and easily and securely integrate and deploy them into your applications using AWS tools without having to manage any infrastructure. Topics •Features of Amazon Bedrock •Supported models in Amazon Bedrock •Supported Regions •Amazon Bedrock pricing Features of Amazon Bedrock Take advantage of Amazon Bedrock foundation models to explore the following capabilities: •Experiment with prompts and configurations – Run model inference by sending prompts
# doc: #2 using different configurations and foundation models to generate responses. You can use the API or the text, image, and chat playgrounds in the console to experiment in a graphical interface. When you're ready, set up your application to make requests to the InvokeModel APIs. •Augment response generation with information from your data sources – Create knowledge bases by uploading data sources to be queried in order to augment a foundation model's generation
# doc: #3 of responses. •Create applications that reason through how to help a customer – Build agents that use foundation models, make API calls, and (optionally) query knowledge bases in order to reason through and carry out tasks for your customers. Features of Amazon Bedrock 1Amazon Bedrock User Guide •Adapt models to specific tasks and domains with training data – Customize an Amazon Bedrock foundation model by providing training data for fine-tuning or continued

SpacyTextSplitter

from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=400)
texts = text_splitter.split_text(fullDoc)
for i, text in enumerate(texts):
    print(f'doc: #{i}', text)

# doc: #0 Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
# doc: #1 Using Amazon Bedrock, you can easily experiment with and evaluate top foundation models for your use cases, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.
# doc: #2 With Amazon Bedrock's serverless experience, you can get started quickly, privately customize foundation models with your own data, and easily and securely integrate and deploy them into your applications using AWS tools without having to manage any infrastructure.

NLTKTextSplitter

from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=400)
texts = text_splitter.split_text(fullDoc)
for i, text in enumerate(texts):
    print(f'doc: #{i}', text)
# doc: #0 Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
# doc: #1 Using Amazon Bedrock, you can easily experiment with and evaluate top foundation models for your use cases, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.
# doc: #2 With Amazon Bedrock's serverless experience, you can get started quickly, privately customize foundation models with your own data, and easily and securely integrate and deploy them into your applications using AWS tools without having to manage any infrastructure.

Custom Solutions

Token + Sentence Splitter

In this solution we prioritize keeping sentences together so that each chunk is cohesive enough to be used in isolation. To keep the chunks usable with our AI models, we also cap each one at an overall token limit.

import tiktoken

def chunk_tokens(document: str, token_limit: int = 100):
    enc = tiktoken.encoding_for_model('gpt-3.5-turbo')
    chunks = []
    tokens = enc.encode(document, disallowed_special=())

    while tokens:
        # Take up to token_limit tokens and decode them back to text.
        chunk = tokens[:token_limit]
        chunk_text = enc.decode(chunk)
        # Find the last sentence-ending character (or newline) in the window.
        last_punctuation = max(
            chunk_text.rfind("."),
            chunk_text.rfind("?"),
            chunk_text.rfind("!"),
            chunk_text.rfind("\n"),
        )
        # Trim back to that boundary unless this is the final window.
        if last_punctuation != -1 and len(tokens) > token_limit:
            chunk_text = chunk_text[: last_punctuation + 1]
        cleaned_text = chunk_text.replace("\n", " ").strip()
        if cleaned_text:
            chunks.append(cleaned_text)
        # Advance past the tokens consumed; always move at least one token
        # forward so the loop cannot stall.
        consumed = len(enc.encode(chunk_text, disallowed_special=()))
        tokens = tokens[max(consumed, 1):]

    return chunks

texts = chunk_tokens(fullDoc)
for i, text in enumerate(texts):
    print(f'doc: #{i}', text)

# doc: #0 Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top foundation models for your use cases, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.
# doc: #1 With Amazon Bedrock's serverless experience, you can get started quickly, privately customize foundation models with your own data, and easily and securely integrate and deploy them into your applications using AWS tools without having to manage any infrastructure.
# doc: #2 Topics •Features of Amazon Bedrock •Supported models in Amazon Bedrock •Supported Regions •Amazon Bedrock pricing Features of Amazon Bedrock Take advantage of Amazon Bedrock foundation models to explore the following capabilities: •Experiment with prompts and configurations – Run model inference by sending prompts using different configurations and foundation models to generate responses.
# doc: #3 You can use the API or the text, image, and chat playgrounds in the console to experiment in a graphical interface. When you're ready, set up your application to make requests to the InvokeModel APIs. •Augment response generation with information from your data sources – Create knowledge bases by uploading data sources to be queried in order to augment a foundation model's generation of responses.

Haystack

Word Splitter

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

preprocessor = DocumentSplitter(split_by="word", split_length=100, split_overlap=0)
doc = Document(content=fullDoc)
docs = preprocessor.run([doc])
for i, doc in enumerate(docs.get('documents')):
    print(f'doc: #{i}', doc.content)

# doc: #0 Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top foundation models for your use cases, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. With Amazon Bedrock's serverless experience, you can get started quickly, privately customize foundation models with your own data, and easily and securely integrate and deploy them into your applications
# doc: #1 using AWS tools without having to manage any infrastructure. Topics •Features of Amazon Bedrock •Supported models in Amazon Bedrock •Supported Regions •Amazon Bedrock pricing Features of Amazon Bedrock Take advantage of Amazon Bedrock foundation models to explore the following capabilities: •Experiment with prompts and configurations – Run model inference by sending prompts using different configurations and foundation models to generate responses. You can use the API or the text, image, and chat playgrounds in the console to experiment in a graphical interface. When you're ready, set up your application to make requests to the InvokeModel APIs. •Augment
# doc: #2 response generation with information from your data sources – Create knowledge bases by uploading data sources to be queried in order to augment a foundation model's generation of responses. •Create applications that reason through how to help a customer – Build agents that use foundation models, make API calls, and (optionally) query knowledge bases in order to reason through and carry out tasks for your customers. Features of Amazon Bedrock 1Amazon Bedrock User Guide •Adapt models to specific tasks and domains with training data – Customize an Amazon Bedrock foundation model by providing training data

Sentence Splitter

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

preprocessor = DocumentSplitter(split_by="sentence", split_length=5, split_overlap=0)
doc = Document(content=fullDoc)
docs = preprocessor.run([doc])
for i, doc in enumerate(docs.get('documents')):
    print(f'doc: #{i}', doc.content)


# doc: #0 Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top foundation models for your use cases, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. With Amazon Bedrock's serverless experience, you can get started quickly, privately customize foundation models with your own data, and easily and securely integrate and deploy them into your applications using AWS tools without having to manage any infrastructure. Topics •Features of Amazon Bedrock •Supported models in Amazon Bedrock •Supported Regions •Amazon Bedrock pricing Features of Amazon Bedrock Take advantage of Amazon Bedrock foundation models to explore the following capabilities: •Experiment with prompts and configurations – Run model inference by sending prompts using different configurations and foundation models to generate responses. You can use the API or the text, image, and chat playgrounds in the console to experiment in a graphical interface.
# doc: #1 When you're ready, set up your application to make requests to the InvokeModel APIs. •Augment response generation with information from your data sources – Create knowledge bases by uploading data sources to be queried in order to augment a foundation model's generation of responses. •Create applications that reason through how to help a customer – Build agents that use foundation models, make API calls, and (optionally) query knowledge bases in order to reason through and carry out tasks for your customers. Features of Amazon Bedrock 1Amazon Bedrock User Guide •Adapt models to specific tasks and domains with training data – Customize an Amazon Bedrock foundation model by providing training data for fine-tuning or continued-pretraining in order to adjust a model's parameters and improve its performance on specific tasks or in certain domains. •Improve your FM-based application's efficiency and output – Purchase Provisioned Throughput for a foundation model in order to run inference on models more efficiently and at discounted rates.
# doc: #2 •Determine the best model for your use case – Evaluate outputs of different models with built-in or custom prompt datasets to determine the model that is best suited for your application. Note Model evaluation is in preview release for Amazon Bedrock and is subject to change. To use model evaluation jobs, you must be in either US East (N. Virginia) Region or US West (Oregon) Region. •Prevent inappropriate or unwanted content – Use Guardrails for Amazon Bedrock to implement safeguards for your generative AI applications.
# doc: #3 Note Guardrails for Amazon Bedrock is in limited preview release. To request access, contact your AWS account manager. Supported models in Amazon Bedrock For information about the models that Amazon Bedrock supports, see Supported foundation models in Amazon Bedrock. Important Before you can use any of the foundation models, you must request access to that model. If you try to use the model (with the API or within the console) before you have requested access to it, you will receive an error message.

Passage Splitter

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

preprocessor = DocumentSplitter(split_by="passage", split_length=1, split_overlap=0)
doc = Document(content=fullDoc)
docs = preprocessor.run([doc])
for i, doc in enumerate(docs.get('documents')):
    print(f'doc: #{i}', doc.content)

# doc: #0 Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top foundation models for your use cases, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. With Amazon Bedrock's serverless experience, you can get started quickly, privately customize foundation models with your own data, and easily and securely integrate and deploy them into your applications using AWS tools without having to manage any infrastructure. Topics •Features of Amazon Bedrock •Supported models in Amazon Bedrock •Supported Regions •Amazon Bedrock pricing Features of Amazon Bedrock Take advantage of Amazon Bedrock foundation models to explore the following capabilities: •Experiment with prompts and configurations – Run model inference by sending prompts using different configurations and foundation models to generate responses. You can use the API or the text, image, and chat playgrounds in the console to experiment in a graphical interface. When you're ready, set up your application to make requests to the InvokeModel APIs. •Augment response generation with information from your data sources – Create knowledge bases by uploading data sources to be queried in order to augment a foundation model's generation of responses. •Create applications that reason through how to help a customer – Build agents that use foundation models, make API calls, and (optionally) query knowledge bases in order to reason through and carry out tasks for your customers. 
Features of Amazon Bedrock 1Amazon Bedrock User Guide •Adapt models to specific tasks and domains with training data – Customize an Amazon Bedrock foundation model by providing training data for fine-tuning or continued-pretraining in order to adjust a model's parameters and improve its performance on specific tasks or in certain domains. •Improve your FM-based application's efficiency and output – Purchase Provisioned Throughput for a foundation model in order to run inference on models more efficiently and at discounted rates. •Determine the best model for your use case – Evaluate outputs of different models with built-in or custom prompt datasets to determine the model that is best suited for your application. Note Model evaluation is in preview release for Amazon Bedrock and is subject to change. To use model evaluation jobs, you must be in either US East (N. Virginia) Region or US West (Oregon) Region. •Prevent inappropriate or unwanted content – Use Guardrails for Amazon Bedrock to implement safeguards for your generative AI applications. Note Guardrails for Amazon Bedrock is in limited preview release. To request access, contact your AWS account manager. Supported models in Amazon Bedrock For information about the models that Amazon Bedrock supports, see Supported foundation models in Amazon Bedrock. Important Before you can use any of the foundation models, you must request access to that model. If you try to use the model (with the API or within the console) before you have requested access to it, you will receive an error message. For more information, see Model access. Supported models in Amazon Bedrock 2Amazon Bedrock User Guide Supported Regions For information about the Regions that Amazon Bedrock supports, see Amazon Bedrock endpoints and quotas . The following features are only available in US East (N. Virginia) and US West (Oregon). 
•Model evaluation •Agents for Amazon Bedrock •Knowledge base for Amazon Bedrock Amazon Bedrock pricing When you sign up for AWS, your AWS account is automatically signed up for all services in AWS, including Amazon Bedrock. However, you are charged only for the services that you use. To see your bill, go to the Billing and Cost Management Dashboard in the AWS Billing and Cost Management console. To learn more about AWS account billing, see the AWS Billing User Guide. If you have questions concerning AWS billing and AWS accounts, contact AWS Support. With Amazon Bedrock, you pay to run inference on any of the third-party foundation models. Pricing is based on the volume of input tokens and output tokens, and on whether you have purchased provisioned throughput for the model. For more information, see the Model providers page in the Amazon Bedrock console. For each model, pricing is listed following the model version. For more information about purchasing Provisioned Throughput, see Provisioned Throughput. For more information, see Amazon Bedrock Pricing. Supported Regions 3Amazon Bedrock User Guide Set up Amazon Bedrock Before you use Amazon Bedrock for the first time, complete the following tasks. Once you have set up your account and requested model access in the console, you can set up the API. Important Before you can use any of the foundation models, you must request access to that model. If you try to use the model (with the API or within the console) before you have requested access to it, you will receive an error message. For more information, see Model access. Setup tasks •Sign up for an AWS account •Create an administrative user •Grant programmatic access •Console access •Model access •Set up the Amazon Bedrock API Sign up for an AWS account If you do not have an AWS account, complete the following steps to create one. To sign up for an AWS account 1. Open https://portal.aws.amazon.com/billing/signup. 2. Follow the online instructions. 
Part of the sign-up procedure involves receiving a phone call and entering a verification code on the phone keypad. When you sign up for an AWS account, an AWS account root user is created. The root user has access to all AWS services and resources in the account. As a security best practice, assign administrative access to an administrative user, and use only the root user to perform tasks that require root user access.

References

LangChain — https://python.langchain.com/docs/get_started/introduction

Haystack documentation — https://docs.haystack.deepset.ai/docs

David Richards

Founder @ parallellabs.app // davidrichards.tech // Principal Software Engineer. 10+ years working in Bay Area tech, most recently TikTok and Salesforce.