Document Chunking Guided Journey: A Greedy Start

Ian Fukushima
Blue Orange Digital

--

Exploring Strategies for Document Segmentation with Python

With the popularization of LLM APIs, interest in document search has risen. Frameworks that orchestrate LLMs provide connectors and interfaces for vector databases, yet there is still a gap when it comes to algorithms that effectively chunk long documents.

Such algorithms are important since they improve downstream tasks like summarization and document search, and they are also necessary for optimizing “chat with your data” applications. These applications require the retrieval of specific document excerpts that are concise enough to fit into prompts and sufficiently informative to answer user queries.

The term “chunking” is widely used nowadays, but the scientific literature more often speaks of “text/document segmentation” and “topic modeling.” Over this series we will explore three pragmatic approaches, each illustrated with Python code; this first article covers a greedy, length-based baseline.

Employing a labeled dataset, we will also explore a method to evaluate the efficacy of chunking strategies.

Wiki 50 and Wiki 727K datasets

We will use a labeled dataset that is common in the text segmentation literature, called Wiki 50. It is a test version of the larger Wiki 727K dataset. They consist of 50 and 727,000 Wikipedia articles, respectively, and each article section is considered a chunk. The dataset is described and provided by Omri Koshorek et al. and is available here.
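Each article file marks section boundaries with delimiter lines of the form ========,<level>,<section title>, which is exactly what the regular expression in the preprocessing code below matches. An illustrative, made-up excerpt (not a verbatim sample from the dataset) might look like this:

========,1,preface.
Some introductory sentence about the article's subject.
Another sentence in the same section.
========,2,History.
A sentence that opens the History section.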

The code below processes our documents.

import re
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# Extend the cl100k_base encoding with two special tokens that mark
# section boundaries and sentence endings
enc = tiktoken.Encoding(
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|new_section|>": 100264,
        "<|end_sentence|>": 100265,
    },
)

def load_and_preprocess(wiki_doc_filename):

    with open(wiki_doc_filename, "r") as f:
        wiki_doc = f.read()

    # Replace the section delimiter lines with the "<|new_section|>" special token
    wiki_doc_sep = re.sub(r'========,\d+,[^\n]+\n', '<|new_section|>', wiki_doc)

    # Collapse consecutive section tokens (empty sections) into a single one
    wiki_doc_sep = re.sub(
        r'(\<\|new_section\|\>)+',
        '<|new_section|>',
        wiki_doc_sep
    )

    # Count sections
    n_sections = wiki_doc_sep.count("<|new_section|>")

    # Mark sentence endings with the special end-of-sentence token
    wiki_doc_sep = wiki_doc_sep.replace("\n", "<|end_sentence|>")

    # Position of the first token of each chunk (excluding the special
    # new-section tokens themselves)
    doc_enc = enc.encode(wiki_doc_sep, allowed_special="all")
    chunk_start_positions = []
    for i, token in enumerate(doc_enc):
        if token == 100264:
            chunk_start_positions.append(i - len(chunk_start_positions))

    doc_len = len(doc_enc) - len(chunk_start_positions)

    # Position of the sentences that start a new section
    sentences = [s for s in wiki_doc_sep.split("<|end_sentence|>") if s != ""]
    chunk_start_sentence_position = [
        i for i, s in enumerate(sentences)
        if s.startswith("<|new_section|>")
    ]

    return (
        wiki_doc_filename,
        wiki_doc_sep.replace("<|new_section|>", ""),
        wiki_doc,
        chunk_start_positions,
        chunk_start_sentence_position,
        doc_len,
        n_sections,
    )

We store the data into train and test pandas DataFrames, with the following columns:

  • original: the original article text;
  • text: the article text without the section headers — this is what we expect to see in practice when chunking;
  • chunk_start_position (THIS IS OUR TARGET): a list with the positions of the first token of each chunk;
  • chunk_start_sentences_position: a list with the position of the first sentence of each chunk;
  • doc_len: number of tokens in the article;
  • n_sections: number of sections/chunks in the article.
The dataset used to train and test our chunking algorithms
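As a minimal sketch of how these DataFrames can be assembled (assuming the article files have been extracted to a hypothetical local folder wiki_50/; the train/test split is omitted):

import glob

import pandas as pd

# Hypothetical path to the extracted Wiki 50 article files
wiki_files = glob.glob("wiki_50/*")

df = pd.DataFrame(
    [load_and_preprocess(f) for f in wiki_files],
    columns=[
        "filename",
        "text",
        "original",
        "chunk_start_position",
        "chunk_start_sentences_position",
        "doc_len",
        "n_sections",
    ],
)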

Chunking Algorithm: Greedy Approximation of Token Length

As a baseline, the greedy approximation of token length, which is semantically agnostic, offers a quick and straightforward strategy for initial prototyping or applications where rapid deployment takes precedence.

Greedy algorithm illustration

Objective: Divide a document into text chunks, where each chunk’s token length approximates a predefined desired length, with new chunks starting immediately after a specified end-token.

Definitions:

“desired_len”: The target number of tokens per chunk.

“chunk_end_token”: The token indicating potential starting points for new chunks.

Algorithm Steps:

1. Tokenize the Document

  • Input the raw document text.
  • Utilize the specified token encoding method to tokenize the document.

2. Identify Possible Chunk Start Points

  • Scan through the tokenized document.
  • Every position following a chunk_end_token is logged as a potential starting point for a new chunk.

3. Initialize the First Chunk

The first chunk always starts with the first token of the document.

4. Greedy Chunk Formation

For each potential chunk starting point:

  • Calculate the ‘error’ for both the current candidate and the next one: the absolute difference between the desired_len and the resulting chunk length.
  • If the current candidate provides a chunk size closer to desired_len than the next candidate, finalize the current chunk and prepare to begin a new one from the current candidate point.
  • Otherwise, consider the next candidate in the subsequent iteration.

5. Output the Chunks

Return the actual starting points of the determined chunks.

Python implementation:

from typing import List

def greedy_predict_chunk_startpoints(
    document: str,
    desired_len: int,
    encoder: tiktoken.core.Encoding,
    chunk_end_token: int
) -> List[int]:
    """
    Returns a list of chunk start points (position of the first token of
    each chunk) based on "desired_len". The "document" string is encoded
    using "encoder". The possible starting points are inferred from
    "chunk_end_token": the token following an end token may be the start
    of a chunk.

    The document is split based on "desired_len", the desired number of
    tokens in each chunk, which won't be achieved exactly, only
    approximated.
    """
    # List all possible startpoints (position of the token following a
    # "chunk_end_token")
    possible_startpoints = [
        i + 1
        for i, token in enumerate(encoder.encode(
            document, allowed_special="all"))
        if token == chunk_end_token
    ]

    # Start at the beginning of the document (first startpoint is zero)
    startpoints_pred = [0]

    # Choose the startpoints that make each chunk's length closest to the
    # desired length
    startpoints_zip = zip(possible_startpoints[:-1], possible_startpoints[1:])
    for candidate, next_candidate in startpoints_zip:

        candidate_error = abs((candidate - startpoints_pred[-1]) - desired_len)
        next_candidate_error = abs(
            (next_candidate - startpoints_pred[-1]) -
            desired_len
        )

        # If the next candidate is worse, all subsequent ones will be too
        if candidate_error < next_candidate_error:
            startpoints_pred.append(candidate)

    return startpoints_pred
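For example, using the extended encoder enc defined earlier and the <|end_sentence|> token (id 100265) as the chunk end token, a document from the hypothetical df built above could be chunked as follows (the 512-token target is an arbitrary choice for illustration):

doc_text = df["text"].iloc[0]

predicted_startpoints = greedy_predict_chunk_startpoints(
    document=doc_text,
    desired_len=512,          # arbitrary target chunk length, in tokens
    encoder=enc,
    chunk_end_token=100265,   # "<|end_sentence|>"
)

print(predicted_startpoints[:5])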

Chunking visualization

Here is an example output. The color changes represent the source text’s actual sections, while each paragraph is a predicted chunk. In this particular case, we can notice that the generated chunks are smaller than the source text’s sections.

Example output: color changes represent the original source text’s actual sections, while each paragraph is a chunk generated by the greedy algorithm

Quantitative Evaluation

Error Definition

The error is defined as the cumulative difference between the actual and predicted positions of the starting token of each chunk. For instance, consider a document with actual token starting positions actual = [0, 50, 100, 200] and predicted positions pred = [0, 49, 105, 180]. The error, in this case, would be computed as |0 - 0| + |50 - 49| + |100 - 105| + |200 - 180| = 26. Should the arrays of predicted and actual starting positions differ in length, the last value of the shorter array is repeated to equalize them before calculating the error. For example, if pred = [0, 49, 150], the error becomes |0 - 0| + |50 - 49| + |100 - 150| + |200 - 150| = 101.
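A minimal implementation of this error, following the padding rule described above (not necessarily the exact code behind the numbers reported later), could look like this:

from typing import List

def chunking_error(actual: List[int], pred: List[int]) -> int:
    """Cumulative absolute difference between actual and predicted chunk
    start positions, padding the shorter list with its last value."""
    n = max(len(actual), len(pred))
    actual_padded = actual + [actual[-1]] * (n - len(actual))
    pred_padded = pred + [pred[-1]] * (n - len(pred))
    return sum(abs(a - p) for a, p in zip(actual_padded, pred_padded))

chunking_error([0, 50, 100, 200], [0, 49, 105, 180])  # 26
chunking_error([0, 50, 100, 200], [0, 49, 150])       # 101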

Evaluation Metric

Subsequently, evaluating an algorithm’s overall performance becomes feasible by employing metrics like the Root Mean Squared Error, Mean Absolute Error, etc.
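Continuing with the hypothetical df and the chunking_error sketch above (and the same arbitrary 512-token target), per-document errors can be aggregated into such metrics, for instance:

import numpy as np

errors = [
    chunking_error(
        actual,
        greedy_predict_chunk_startpoints(text, 512, enc, 100265),
    )
    for text, actual in zip(df["text"], df["chunk_start_position"])
]

mae = np.mean(errors)
rmse = np.sqrt(np.mean(np.square(errors)))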

Here is the error distribution for the greedy approach:

Statistics for the error produced by the greedy algorithm

At the end of our journey, we will quantitatively compare the error rates of all three algorithms.

A Note of Caution

It is important to recognize the specificity and constraints of our train and test sets: they define chunks as literal sections and are reduced in size to speed up the demonstrations. Treat the quantitative results as an illustrative exercise, and experiment with these approaches on your own data to make sure the selected strategy works in your application.

Also, our first algorithm focuses on the length of each chunk, ignoring the semantic intricacies within. It offers a quick solution for prototyping, and the forthcoming algorithms in this series will provide more sophisticated chunking solutions.

--


Ian Fukushima
Blue Orange Digital

Machine Learning Engineer and Data Scientist. MSc. in Applied Economics.