LLM-based context splitter for large documents

Ayham Boucher
6 min read · Nov 20, 2023


Background

When dealing with text too large to fit in the context window of an LLM, it’s necessary to split the text into smaller chunks. This is a common practice when building RAG (Retrieval Augmented Generation) applications. However, existing text splitters may not preserve the semantic relationship between chunks. In this post, we will explore the limitations of the RecursiveCharacterTextSplitter and introduce a new solution, the LLM-Based Context Splitter, which uses a Large Language Model to maintain context and improve retrieval accuracy.

Figure: RAG system overview

Current methods

LangChain provides multiple state-of-the-art text splitters. Ideally, these splitters should keep semantically related pieces of text together. What “semantically related” means can depend on the type of text.

While some of LangChain’s text splitters do a good job of this for Markdown docs or code files, its most popular splitter, the RecursiveCharacterTextSplitter, fails to do so in many cases.

RecursiveCharacterTextSplitter

The text splitter LangChain recommends by default is the RecursiveCharacterTextSplitter. This text splitter takes a list of characters. It tries to create chunks by splitting on the first character, but if any chunks are too large, it moves on to the next character, and so forth. By default, the characters it tries to split on are ["\n\n", "\n", " ", ""].
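To make the fallback behaviour concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration, not LangChain’s actual implementation: the function name recursive_split is made up for this post, and overlap handling is left out.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
    chunk_size: int = 100,
) -> List[str]:
    """Simplified sketch: split on the first separator, fall back to the next."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard slices of chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    # The empty separator means "split into single characters".
    pieces = list(text) if sep == "" else [p for p in text.split(sep) if p]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too large: retry with the next separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks
```

Note that nothing in this procedure looks at meaning: chunk boundaries are decided only by separators and length.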

In addition to controlling which characters you can split on, you can also control a few other things:

length_function: how the length of chunks is calculated. Defaults to just counting number of characters, but it’s pretty common to pass a token counter here.

chunk_size: the maximum size of your chunks (as measured by the length function).

chunk_overlap: the maximum overlap between chunks. It can be nice to have some overlap to maintain some continuity between chunks (e.g. do a sliding window).

add_start_index: whether to include the starting position of each chunk within the original document in the metadata.
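The options above map onto constructor arguments. The sketch below shows one way to set them; the helper names build_splitter and split_document are made up for this post, and the import path is the one used by LangChain releases around the time of writing (newer releases expose the same class from the langchain_text_splitters package).

```python
from typing import List, Optional, Tuple


def build_splitter(chunk_size: int = 100, chunk_overlap: int = 0):
    """Return a RecursiveCharacterTextSplitter configured with the options above."""
    # Imported lazily so this module still loads where LangChain is not installed.
    # Import path assumed from late-2023 LangChain releases.
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    return RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", " ", ""],  # the default list, spelled out
        chunk_size=chunk_size,               # measured by length_function
        chunk_overlap=chunk_overlap,
        length_function=len,                 # count characters; a token counter also works
        add_start_index=True,                # store each chunk's offset in its metadata
    )


def split_document(text: str, **kwargs) -> List[Tuple[Optional[int], str]]:
    """Split text and return (start_index, chunk) pairs."""
    splitter = build_splitter(**kwargs)
    docs = splitter.create_documents([text])
    return [(doc.metadata.get("start_index"), doc.page_content) for doc in docs]
```

Calling split_document(text, chunk_size=100, chunk_overlap=0) would then return each chunk together with its starting position in the original text.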

Here’s an example input text to test how well the RecursiveCharacterTextSplitter does. Note that this is a small document so that it’s easier to follow, but the same concept applies to larger texts:

The Amanita phalloides has a large and imposing epigeous (above ground) fruiting body (basidiocrap).
A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all white.
AA. Phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.
Gala apples are a popular variety known for their sweet flavor and crisp texture.
They have a distinctive reddish-orange skin with yellow striping, making them visually appealing in fruit displays.
Originally developed in New Zealand in the 1930s, they have since become a favorite in many countries and are widely cultivated for consumption.
Their versatility makes them perfect for both eating fresh and using in various culinary dishes.
Radishes are small, root vegetables with a sharp, peppery flavor that can range from mild to spicy.
They are usually round or cylindrical in shape and can come in various colors, including red, white, purple, and black.
Rich in vitamins and minerals, radishes are often consumed raw in salads, but can also be cooked or pickled for different culinary applications.
Their crunchy texture and vibrant color make them a popular addition to dishes, adding both taste and aesthetic appeal.

The above text covers three different topics: mushrooms, apples, and radishes. For this run, we’re going to use chunk_size = 100 and chunk_overlap = 0. Below are the results:

The Amanita phalloides has a large and imposing epigeous (above ground) fruiting body (basidiocrap).
-----
A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all white.
-----
AA. Phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.
-----
Gala apples are a popular variety known for their sweet flavor and crisp texture.
-----
They have a distinctive reddish-orange skin with yellow striping, making them visually appealing in
-----
fruit displays.
-----
Originally developed in New Zealand in the 1930s, they have since become a favorite in many
-----
countries and are widely cultivated for consumption.
-----
Their versatility makes them perfect for both eating fresh and using in various culinary dishes.
-----
Radishes are small, root vegetables with a sharp, peppery flavor that can range from mild to spicy.
-----
They are usually round or cylindrical in shape and can come in various colors, including red,
-----
white, purple, and black.
-----
Rich in vitamins and minerals, radishes are often consumed raw in salads, but can also be cooked or
-----
pickled for different culinary applications.
-----
Their crunchy texture and vibrant color make them a popular addition to dishes, adding both taste
-----
and aesthetic appeal.
-----

As you can see, the RecursiveCharacterTextSplitter split purely on length and line breaks, so it failed to preserve context. This matters most for chunk 3, which states that the Death Cap is one of the most poisonous of all known mushrooms: chunks 1 and 2 describe the same mushroom, so you want to make sure they are never separated from chunk 3.

New Text Splitter: LLM-Based Context Splitter

To address the limitations of the RecursiveCharacterTextSplitter, I propose the LLM-Based Context Splitter. This new text splitter utilizes the power of LLMs during indexing to split the text into chunks while preserving context.

Algorithm

The LLM-Based Context Splitter first uses the RecursiveCharacterTextSplitter to split the content into small chunks. It then asks an LLM to compare each chunk with the next one and return a similarity index, a score between 0 and 1 (the prompt is in the appendix). If the index exceeds a defined threshold, the two chunks are considered part of the same context and are merged. This process continues in a sliding-window fashion until the similarity index falls below the threshold or the merged chunk nears the maximum desired chunk size.
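The merging step can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the exact implementation: merge_by_context and score_fn are hypothetical names, score_fn stands in for the LLM call that returns the similarity index, and the threshold of 0.5, the size limit of 1000 characters, and the single-space joiner are placeholder values.

```python
from typing import Callable, List


def merge_by_context(
    chunks: List[str],
    score_fn: Callable[[str, str], float],  # hypothetical: wraps the LLM similarity call
    threshold: float = 0.5,                 # assumed value, tune for your data
    max_chunk_size: int = 1000,             # assumed value, in characters
    joiner: str = " ",
) -> List[str]:
    """Merge consecutive chunks while the similarity index stays above threshold."""
    merged: List[str] = []
    current = ""   # the context chunk being built
    previous = ""  # the last small chunk added to `current`
    for chunk in chunks:
        if not current:
            current, previous = chunk, chunk
            continue
        fits = len(current) + len(joiner) + len(chunk) <= max_chunk_size
        # Check the size first so the LLM is not called when merging is impossible.
        if fits and score_fn(previous, chunk) > threshold:
            current = current + joiner + chunk
        else:
            merged.append(current)  # context boundary (or size limit) reached
            current = chunk
        previous = chunk
    if current:
        merged.append(current)
    return merged
```

Comparing only the previous small chunk with the next one keeps each LLM call short; comparing the whole merged chunk instead is an equally plausible reading of the algorithm and costs more tokens per call.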

Here are the results of applying this algorithm to the original input text:

The Amanita phalloides has a large and imposing epigeous (above ground) fruiting body (basidiocrap).A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all white. AA. Phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.
-----
Gala apples are a popular variety known for their sweet flavor and crisp texture.They have a distinctive reddish-orange skin with yellow striping, making them visually appealing in fruit displays. Originally developed in New Zealand in the 1930s, they have since become a favorite in many countries and are widely cultivated for consumption. Their versatility makes them perfect for both eating fresh and using in various culinary dishes.
-----
Radishes are small, root vegetables with a sharp, peppery flavor that can range from mild to spicy.They are usually round or cylindrical in shape and can come in various colors, including red, white, purple, and black. Rich in vitamins and minerals, radishes are often consumed raw in salads, but can also be cooked or pickled for different culinary applications. Their crunchy texture and vibrant color make them a popular addition to dishes, adding both taste and aesthetic appeal.
-----

In this example, the LLM detects the topic boundaries and keeps each topic in a single chunk. Because every chunk now carries its full context, the retrieval step of the RAG pipeline, such as Vector Similarity Search, has a better chance of returning complete and relevant passages.

Appendix

Similarity Index Prompt

I want you to compare two texts below and tell me if the second text completes the first text’s context or if they can be split apart.

I want you to return a score between 0 and 1, where 0 means they don’t belong to the same context and should be split, and 1 means they definitely belong to the same context.

Please only tell me the score value and nothing else. For example “0.75”

text 1: {text1}

text 2: {text2}

Think about your answer: does it only contain a float value and nothing else? If not, please correct that and only return the float value.
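Here is a rough sketch of how the prompt above could be filled in and its reply turned into a number. The names build_prompt, parse_score, and similarity_index are made up for this post, the llm argument is a hypothetical callable that takes a prompt string and returns the model’s reply as a string, and the prompt text is a lightly tidied copy of the one above.

```python
import re
from typing import Callable

PROMPT_TEMPLATE = """I want you to compare two texts below and tell me if the second text completes the first text's context or if they can be split apart.

I want you to return a score between 0 and 1, where 0 means they don't belong to the same context and should be split, and 1 means they definitely belong to the same context.

Please only tell me the score value and nothing else. For example "0.75"

text 1: {text1}

text 2: {text2}

Think about your answer: does it only contain a float value and nothing else? If not, please correct that and only return the float value."""

_SCORE_RE = re.compile(r"\d*\.?\d+")


def build_prompt(text1: str, text2: str) -> str:
    """Fill the two chunks into the similarity prompt."""
    return PROMPT_TEMPLATE.format(text1=text1, text2=text2)


def parse_score(reply: str) -> float:
    """Extract the first number from the reply and clamp it to [0, 1]."""
    match = _SCORE_RE.search(reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))


def similarity_index(text1: str, text2: str, llm: Callable[[str], str]) -> float:
    """Ask the (hypothetical) llm callable for a score and parse it."""
    return parse_score(llm(build_prompt(text1, text2)))
```

Parsing defensively is worthwhile because, even with the final reminder in the prompt, a model may occasionally wrap the number in quotes or add a few words around it.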

My plan is to add this new splitter to LangChain. Please reach out if you would like to help!


Ayham Boucher

Head of AI Innovations - IT Strategy at Cornell University