Document Chunking for AI RAG Applications

Dec 18, 2023

When developing a RAG application, it is important to have a well-established document chunking pattern for ingesting content. Many libraries can do this for you, but it is still worth understanding the underlying mechanics, because chunking is the bedrock of your RAG AI application.

Implications for AI Performance

Improved performance is one of the primary benefits of document chunking. Without chunking, models can struggle to retrieve the most pertinent information, especially from long documents that mix relevant and irrelevant content. Chunking not only makes data more manageable but also enables more accurate matching between the information request and the retrieved document sections.

Moreover, document chunking allows for parallel processing of chunks, leading to improved scalability and quicker response times. It also enables models to focus on the most relevant parts of documents, thereby improving the overall precision and quality of the generated content.
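To make the matching idea concrete, here is a toy sketch that ranks chunks against a query. The scoring function is a deliberately crude word-overlap measure standing in for the embedding similarity a real RAG system would use, and the function names and sample chunks are illustrative assumptions rather than part of any library.

```python
def score(query, chunk):
    """Fraction of query words found in the chunk (crude stand-in for embedding similarity)."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def top_chunks(query, chunks, k=1):
    """Return the k chunks with the highest score for the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]


# Illustrative chunks, loosely based on sentences from the test document
chunks = [
    "Pricing is based on the volume of input tokens and output tokens.",
    "To sign up for an AWS account, follow the online instructions.",
    "Model evaluation is in preview release and is subject to change.",
]
best = top_chunks("how is pricing based on tokens", chunks)
print(best[0])
```

Because each chunk covers one topic, the pricing question matches the pricing chunk strongly and the others weakly. Scored as one unchunked document, the same text would match every query about equally.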

Methodologies for Document Chunking

Fixed-Length Chunking: This simple strategy splits documents into chunks of a predetermined size. While straightforward, it does not necessarily align chunks with the logical structure of the document, so a sentence or idea can be split across two chunks.
Context-Aware Chunking: This more sophisticated approach takes into account the logical and semantic structure of documents. It aims to preserve coherent sections like paragraphs and sections, using natural breaks in the text to define chunk boundaries.
Sliding Window Chunking: In this method, chunks have some overlap, ensuring that information at the edges of chunks is not lost. Sliding window chunking provides a balance between fixed-length and context-aware techniques.
Adaptive Chunking: Leveraging machine learning, adaptive chunking dynamically alters chunk sizes and boundaries based on the context and the retrieval task at hand. This method is particularly effective but also the most computationally intensive.
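The first three methodologies can be sketched in a few lines of plain Python. This is a minimal illustration measured in characters, with hypothetical function names; the context-aware version assumes paragraphs are separated by blank lines. Adaptive chunking depends on a trained model and is not sketched here.

```python
def fixed_length_chunks(text, size):
    """Fixed-length chunking: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sliding_window_chunks(text, size, overlap):
    """Sliding window chunking: consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window has reached the end of the text
        if start + size >= len(text):
            break
    return chunks


def paragraph_chunks(text):
    """Context-aware chunking in its simplest form: one chunk per paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


print(fixed_length_chunks("abcdefghij", 4))
print(sliding_window_chunks("abcdefghij", 4, 2))
print(paragraph_chunks("First paragraph.\n\nSecond paragraph.\n\n"))
```

The overlap in the sliding window version is what keeps text near a boundary available in both neighbouring chunks, at the cost of storing some content twice.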

Test Document

For our test document, we will use a large PDF, the Amazon Bedrock User Guide, to compare the output of each chunking approach.

https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf
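The snippets in this article assume the document text is already available in a variable named fullDoc; how it was produced is not shown. One possible way to build it is sketched below. The pypdf package, the helper names, and the local filename bedrock-ug.pdf are all assumptions, and the extracted text may differ in whitespace from the outputs shown later.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page text into one string, skipping pages with no extractable text."""
    return "\n".join(t for t in page_texts if t)


def load_pdf_text(path):
    """Extract the text of every page of a PDF into a single string."""
    # Assumption: pypdf is installed; the article does not name its PDF library
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    return join_pages(page.extract_text() for page in reader.pages)


# Hypothetical local filename for the downloaded user guide
pdf_path = Path("bedrock-ug.pdf")
if pdf_path.exists():
    fullDoc = load_pdf_text(pdf_path)
    print(len(fullDoc), "characters extracted")
```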

Let's try some of the available document chunking libraries and see the results.

LangChain

RecursiveCharacterTextSplitter

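from langchain.text_splitter import RecursiveCharacterTextSplitter

# fullDoc holds the full text extracted from the test PDF
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,  # maximum chunk length, as measured by length_function
    length_function=len,  # count characters
    is_separator_regex=False,
    # chunk_overlap is not set, so the library default (200 characters) applies,
    # which is why consecutive chunks in the output below share most of their text
)
texts = text_splitter.create_documents([fullDoc])
for i, text in enumerate(texts):
    print(f'doc: #{i}', text)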

#doc: #0 page_content='Amazon Bedrock also oﬀers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and  evaluate top foundation models for your use cases, privately customize them with your data using  techniques'
#doc: #1 page_content='privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and  evaluate top foundation models for your use cases, privately customize them with your data using  techniques such as ﬁne-tuning and Retrieval Augmented Generation (RAG), and build agents that  execute tasks using'
#doc: #2 page_content="foundation models for your use cases, privately customize them with your data using  techniques such as ﬁne-tuning and Retrieval Augmented Generation (RAG), and build agents that  execute tasks using your enterprise systems and data sources. With Amazon Bedrock's serverless experience, you can get"
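To make the mechanics behind a recursive character splitter more concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation: it ignores chunk overlap and keeps only the core idea of trying coarse separators first (paragraph breaks, then line breaks, then spaces) and falling back to a hard cut. The function name is hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc\n\nddd eee", 7))
```

In this example the first paragraph is too long for one chunk and gets split on spaces, while the second paragraph fits and is kept whole.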

TokenTextSplitter

from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(fullDoc)
for i, text in enumerate(texts):
    print(f'doc: #{i}', text)

# doc: #0 Amazon Bedrock also oﬀers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and  evaluate top foundation models for your use cases, privately customize them with your data using  techniques such as ﬁne-tuning and Retrieval Augmented Generation (RAG), and build agents that  execute tasks using your enterprise systems and data sources. With Amazon Bedrock's serverless
# doc: #1 experience, you can get started quickly, privately customize  foundation models with your own data, and easily and securely integrate and deploy them into  your applications using AWS tools without having to manage any infrastructure. Topics •Features of Amazon Bedrock •Supported models in Amazon Bedrock •Supported Regions •Amazon Bedrock pricing Features of Amazon Bedrock Take advantage of Amazon Bedrock foundation models to explore the following capabilities: •Experiment with prompts and conﬁgurations – Run model inference by sending prompts
# doc: #2 using  diﬀerent conﬁgurations and foundation models to generate responses. You can use the API or  the text, image, and chat playgrounds in the console to experiment in a graphical interface.  When you're ready, set up your application to make requests to the InvokeModel APIs. •Augment response generation with information from your data sources – Create knowledge  bases by uploading data sources to be queried in order to augment a foundation model's  generation
# doc: #3 of responses. •Create applications that reason through how to help a customer – Build agents that use  foundation models, make API calls, and (optionally) query knowledge bases in order to reason  through and carry out tasks for your customers. Features of Amazon Bedrock 1Amazon Bedrock User Guide •Adapt models to speciﬁc tasks and domains with training data – Customize an Amazon  Bedrock foundation model by providing training data for ﬁne-tuning or continued

SpacyTextSplitter

from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=400)
texts = text_splitter.split_text(fullDoc)
for i, text in enumerate(texts):
    print(f'doc: #{i}', text)

# doc: #0 Amazon Bedrock also oﬀers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
# doc: #1 Using Amazon Bedrock, you can easily experiment with and  evaluate top foundation models for your use cases, privately customize them with your data using  techniques such as ﬁne-tuning and Retrieval Augmented Generation (RAG), and build agents that  execute tasks using your enterprise systems and data sources.
# doc: #2 With Amazon Bedrock's serverless experience, you can get started quickly, privately customize  foundation models with your own data, and easily and securely integrate and deploy them into  your applications using AWS tools without having to manage any infrastructure.

NLTKTextSplitter

from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=400)
texts = text_splitter.split_text(fullDoc)
for i, text in enumerate(texts):
    print(f'doc: #{i}', text)
# doc: #0 Amazon Bedrock also oﬀers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
# doc: #1 Using Amazon Bedrock, you can easily experiment with and  evaluate top foundation models for your use cases, privately customize them with your data using  techniques such as ﬁne-tuning and Retrieval Augmented Generation (RAG), and build agents that  execute tasks using your enterprise systems and data sources.
# doc: #2 With Amazon Bedrock's serverless experience, you can get started quickly, privately customize  foundation models with your own data, and easily and securely integrate and deploy them into  your applications using AWS tools without having to manage any infrastructure.
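Both sentence-based splitters follow the same broad pattern: detect sentence boundaries, then pack whole sentences into chunks up to the size limit. The sketch below illustrates that pattern under stated assumptions. A naive regular expression stands in for spaCy's or NLTK's sentence detection, sentences are joined with a single space, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence detection: split after '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def pack_sentences(sentences, chunk_size):
    """Greedily merge whole sentences into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this sentence would exceed the limit, so start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sentences = split_sentences("One. Two two. Three three three.")
print(pack_sentences(sentences, chunk_size=13))
```

A sentence longer than chunk_size is kept whole rather than cut, so a chunk can exceed the limit in that case. This also explains the outputs above: with a 400-character limit, the first two sentences of the test document do not fit together, so each lands in its own chunk.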

Custom Solutions

Token + Sentence Splitter

In this solution, we prioritize keeping sentences together so that each chunk is a cohesive piece of text that can be used in isolation. To keep each chunk compatible with our AI models, we also cap it at an overall token limit so that it fits within the model's context window.

import tiktoken


def chunk_tokens(document: str, token_limit: int = 100):
    # Tokenize the whole document once with the target model's encoding
    enc = tiktoken.encoding_for_model('gpt-3.5-turbo')
    chunks = []
    tokens = enc.encode(document, disallowed_special=())

    while tokens:
        # Take up to token_limit tokens and decode them back to text
        chunk = tokens[:token_limit]
        chunk_text = enc.decode(chunk)
        # Find the last sentence-ending character (or newline) in this window
        last_punctuation = max(
            chunk_text.rfind("."),
            chunk_text.rfind("?"),
            chunk_text.rfind("!"),
            chunk_text.rfind("\n"),
        )
        # Unless this is the final window, cut the chunk back to that boundary
        if last_punctuation != -1 and len(tokens) > token_limit:
            chunk_text = chunk_text[: last_punctuation + 1]
        cleaned_text = chunk_text.replace("\n", " ").strip()
        if cleaned_text:
            chunks.append(cleaned_text)
        # Re-encode the kept text to count the tokens it used, then drop them
        tokens = tokens[len(enc.encode(chunk_text, disallowed_special=())):]

    return chunks


texts = chunk_tokens(fullDoc)
for i, text in enumerate(texts):
    print(f'doc: #{i}', text)

# doc: #0 Amazon Bedrock also oﬀers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and  evaluate top foundation models for your use cases, privately customize them with your data using  techniques such as ﬁne-tuning and Retrieval Augmented Generation (RAG), and build agents that  execute tasks using your enterprise systems and data sources.
# doc: #1 With Amazon Bedrock's serverless experience, you can get started quickly, privately customize  foundation models with your own data, and easily and securely integrate and deploy them into  your applications using AWS tools without having to manage any infrastructure.
# doc: #2 Topics •Features of Amazon Bedrock •Supported models in Amazon Bedrock •Supported Regions •Amazon Bedrock pricing Features of Amazon Bedrock Take advantage of Amazon Bedrock foundation models to explore the following capabilities: •Experiment with prompts and conﬁgurations – Run model inference by sending prompts using  diﬀerent conﬁgurations and foundation models to generate responses.
# doc: #3 You can use the API or  the text, image, and chat playgrounds in the console to experiment in a graphical interface.  When you're ready, set up your application to make requests to the InvokeModel APIs. •Augment response generation with information from your data sources – Create knowledge  bases by uploading data sources to be queried in order to augment a foundation model's  generation of responses.

Haystack

Word Splitter

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

# Split into chunks of 100 words with no overlap between chunks
preprocessor = DocumentSplitter(split_by="word", split_length=100, split_overlap=0)
doc = Document(content=fullDoc)
docs = preprocessor.run([doc])
for i, doc in enumerate(docs.get('documents')):
    print(f'doc: #{i}', doc.content)

# doc: #0 Amazon Bedrock also oﬀers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and  evaluate top foundation models for your use cases, privately customize them with your data using  techniques such as ﬁne-tuning and Retrieval Augmented Generation (RAG), and build agents that  execute tasks using your enterprise systems and data sources. With Amazon Bedrock's serverless experience, you can get started quickly, privately customize  foundation models with your own data, and easily and securely integrate and deploy them into  your applications 
# doc: #1 using AWS tools without having to manage any infrastructure. Topics •Features of Amazon Bedrock •Supported models in Amazon Bedrock •Supported Regions •Amazon Bedrock pricing Features of Amazon Bedrock Take advantage of Amazon Bedrock foundation models to explore the following capabilities: •Experiment with prompts and conﬁgurations – Run model inference by sending prompts using  diﬀerent conﬁgurations and foundation models to generate responses. You can use the API or  the text, image, and chat playgrounds in the console to experiment in a graphical interface.  When you're ready, set up your application to make requests to the InvokeModel APIs. •Augment 
# doc: #2 response generation with information from your data sources – Create knowledge  bases by uploading data sources to be queried in order to augment a foundation model's  generation of responses. •Create applications that reason through how to help a customer – Build agents that use  foundation models, make API calls, and (optionally) query knowledge bases in order to reason  through and carry out tasks for your customers. Features of Amazon Bedrock 1Amazon Bedrock User Guide •Adapt models to speciﬁc tasks and domains with training data – Customize an Amazon  Bedrock foundation model by providing training data
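The word-based split with split_length and split_overlap can be illustrated in plain Python. This is a sketch of the idea only, not Haystack's implementation: the function name is hypothetical, and it splits on any whitespace and rejoins with single spaces, so it does not preserve the original spacing.

```python
def split_by_words(text, split_length=100, split_overlap=0):
    """Chunk text into groups of split_length words, repeating split_overlap words between chunks."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be non-negative and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a chunk has reached the last word
        if start + split_length >= len(words):
            break
    return chunks


print(split_by_words("a b c d e f g", split_length=3, split_overlap=0))
print(split_by_words("a b c d e f g", split_length=3, split_overlap=1))
```

As the Haystack output above shows, counting words alone pays no attention to sentence boundaries, so chunks can start or end mid-sentence.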

Sentence Splitter

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

# Split into chunks of 5 sentences with no overlap between chunks
preprocessor = DocumentSplitter(split_by="sentence", split_length=5, split_overlap=0)
doc = Document(content=fullDoc)
docs = preprocessor.run([doc])
for i, doc in enumerate(docs.get('documents')):
    print(f'doc: #{i}', doc.content)


# doc: #0 Amazon Bedrock also oﬀers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and  evaluate top foundation models for your use cases, privately customize them with your data using  techniques such as ﬁne-tuning and Retrieval Augmented Generation (RAG), and build agents that  execute tasks using your enterprise systems and data sources. With Amazon Bedrock's serverless experience, you can get started quickly, privately customize  foundation models with your own data, and easily and securely integrate and deploy them into  your applications using AWS tools without having to manage any infrastructure. Topics •Features of Amazon Bedrock •Supported models in Amazon Bedrock •Supported Regions •Amazon Bedrock pricing Features of Amazon Bedrock Take advantage of Amazon Bedrock foundation models to explore the following capabilities: •Experiment with prompts and conﬁgurations – Run model inference by sending prompts using  diﬀerent conﬁgurations and foundation models to generate responses. You can use the API or  the text, image, and chat playgrounds in the console to experiment in a graphical interface.
# doc: #1   When you're ready, set up your application to make requests to the InvokeModel APIs. •Augment response generation with information from your data sources – Create knowledge  bases by uploading data sources to be queried in order to augment a foundation model's  generation of responses. •Create applications that reason through how to help a customer – Build agents that use  foundation models, make API calls, and (optionally) query knowledge bases in order to reason  through and carry out tasks for your customers. Features of Amazon Bedrock 1Amazon Bedrock User Guide •Adapt models to speciﬁc tasks and domains with training data – Customize an Amazon  Bedrock foundation model by providing training data for ﬁne-tuning or continued-pretraining in  order to adjust a model's parameters and improve its performance on speciﬁc tasks or in certain  domains. •Improve your FM-based application's eﬃciency and output – Purchase Provisioned Throughput for a foundation model in order to run inference on models more eﬃciently and at discounted  rates.
# doc: #2  •Determine the best model for your use case – Evaluate outputs of diﬀerent models with built-in  or custom prompt datasets to determine the model that is best suited for your application. Note Model evaluation is in preview release for Amazon Bedrock and is subject to change. To  use model evaluation jobs, you must be in either US East (N. Virginia) Region or US West  (Oregon) Region. •Prevent inappropriate or unwanted content – Use Guardrails for Amazon Bedrock to  implement safeguards for your generative AI applications.
# doc: #3  Note Guardrails for Amazon Bedrock is in limited preview release. To request access, contact  your AWS account manager. Supported models in Amazon Bedrock For information about the models that Amazon Bedrock supports, see Supported foundation  models in Amazon Bedrock. Important Before you can use any of the foundation models, you must request access to that model.  If you try to use the model (with the API or within the console) before you have requested  access to it, you will receive an error message.

Passage Splitter

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

# Split into chunks of 1 passage (paragraph) with no overlap between chunks
preprocessor = DocumentSplitter(split_by="passage", split_length=1, split_overlap=0)
doc = Document(content=fullDoc)
docs = preprocessor.run([doc])
for i, doc in enumerate(docs.get('documents')):
    print(f'doc: #{i}', doc.content)

# doc: #0 Amazon Bedrock also oﬀers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and  evaluate top foundation models for your use cases, privately customize them with your data using  techniques such as ﬁne-tuning and Retrieval Augmented Generation (RAG), and build agents that  execute tasks using your enterprise systems and data sources. With Amazon Bedrock's serverless experience, you can get started quickly, privately customize  foundation models with your own data, and easily and securely integrate and deploy them into  your applications using AWS tools without having to manage any infrastructure. Topics •Features of Amazon Bedrock •Supported models in Amazon Bedrock •Supported Regions •Amazon Bedrock pricing Features of Amazon Bedrock Take advantage of Amazon Bedrock foundation models to explore the following capabilities: •Experiment with prompts and conﬁgurations – Run model inference by sending prompts using  diﬀerent conﬁgurations and foundation models to generate responses. You can use the API or  the text, image, and chat playgrounds in the console to experiment in a graphical interface.  When you're ready, set up your application to make requests to the InvokeModel APIs. •Augment response generation with information from your data sources – Create knowledge  bases by uploading data sources to be queried in order to augment a foundation model's  generation of responses. •Create applications that reason through how to help a customer – Build agents that use  foundation models, make API calls, and (optionally) query knowledge bases in order to reason  through and carry out tasks for your customers. 
Features of Amazon Bedrock 1Amazon Bedrock User Guide •Adapt models to speciﬁc tasks and domains with training data – Customize an Amazon  Bedrock foundation model by providing training data for ﬁne-tuning or continued-pretraining in  order to adjust a model's parameters and improve its performance on speciﬁc tasks or in certain  domains. •Improve your FM-based application's eﬃciency and output – Purchase Provisioned Throughput for a foundation model in order to run inference on models more eﬃciently and at discounted  rates. •Determine the best model for your use case – Evaluate outputs of diﬀerent models with built-in  or custom prompt datasets to determine the model that is best suited for your application. Note Model evaluation is in preview release for Amazon Bedrock and is subject to change. To  use model evaluation jobs, you must be in either US East (N. Virginia) Region or US West  (Oregon) Region. •Prevent inappropriate or unwanted content – Use Guardrails for Amazon Bedrock to  implement safeguards for your generative AI applications. Note Guardrails for Amazon Bedrock is in limited preview release. To request access, contact  your AWS account manager. Supported models in Amazon Bedrock For information about the models that Amazon Bedrock supports, see Supported foundation  models in Amazon Bedrock. Important Before you can use any of the foundation models, you must request access to that model.  If you try to use the model (with the API or within the console) before you have requested  access to it, you will receive an error message. For more information, see Model access. Supported models in Amazon Bedrock 2Amazon Bedrock User Guide Supported Regions For information about the Regions that Amazon Bedrock supports, see Amazon Bedrock endpoints  and quotas . The following features are only available in US East (N. Virginia) and US West (Oregon). 
•Model evaluation •Agents for Amazon Bedrock •Knowledge base for Amazon Bedrock Amazon Bedrock pricing When you sign up for AWS, your AWS account is automatically signed up for all services in AWS,  including Amazon Bedrock. However, you are charged only for the services that you use. To see your bill, go to the Billing and Cost Management Dashboard in the AWS Billing and Cost  Management console. To learn more about AWS account billing, see the AWS Billing User Guide. If  you have questions concerning AWS billing and AWS accounts, contact AWS Support. With Amazon Bedrock, you pay to run inference on any of the third-party foundation models.  Pricing is based on the volume of input tokens and output tokens, and on whether you have  purchased provisioned throughput for the model. For more information, see the Model providers page in the Amazon Bedrock console. For each model, pricing is listed following the model version.  For more information about purchasing Provisioned Throughput, see Provisioned Throughput. For more information, see Amazon Bedrock Pricing. Supported Regions 3Amazon Bedrock User Guide Set up Amazon Bedrock Before you use Amazon Bedrock for the ﬁrst time, complete the following tasks. Once you have set  up your account and requested model access in the console, you can set up the API. Important Before you can use any of the foundation models, you must request access to that model.  If you try to use the model (with the API or within the console) before you have requested  access to it, you will receive an error message. For more information, see Model access. Setup tasks •Sign up for an AWS account •Create an administrative user •Grant programmatic access •Console access •Model access •Set up the Amazon Bedrock API Sign up for an AWS account If you do not have an AWS account, complete the following steps to create one. To sign up for an AWS account 1. Open https://portal.aws.amazon.com/billing/signup. 2. Follow the online instructions. 
Part of the sign-up procedure involves receiving a phone call and entering a veriﬁcation code  on the phone keypad. When you sign up for an AWS account, an AWS account root user is created. The root user  has access to all AWS services and resources in the account. As a security best practice, assign  administrative access to an administrative user, and use only the root user to perform tasks  that require root user access.

References

LangChain — https://python.langchain.com/docs/get_started/introduction

Haystack documentation — https://docs.haystack.deepset.ai/docs