Harnessing Transformers to Chunk Documents

Exploring an Open Source Model for Context-Aware Document Chunking

Ian Fukushima
Blue Orange Digital
5 min read · Feb 28, 2024


The performance of Retrieval Augmented Generation (RAG) applications depends heavily on the system’s ability to provide good context for Large Language Models (LLMs) to answer queries. Document chunking is a crucial part of these applications: documents are usually too large to fit entirely into an LLM’s context window, and doing so would be inefficient anyway. We often want to provide diverse pieces of information for a nuanced and complete answer. This is the motivation for splitting documents into well-sized, self-contained chunks that can be retrieved individually and passed to the LLM as context.

This article is a practical guide on how to use a pre-trained transformer model hosted on Hugging Face to chunk a document.

Transformer Architecture

Before jumping into the practical code, let’s quickly discuss the transformer model we will be using. It is based on the work of Michal Lukasik, Boris Dadachev, Gonçalo Simões and Kishore Papineni in their paper “Text Segmentation by Cross Segment Attention”. They propose and compare three neural network architectures designed for text segmentation (document chunking) and benchmark them on several datasets.

Our article will use a transformer model based on their “Cross-Segment BERT” architecture, implemented by fine-tuning DistilBERT on 40,000 Wikipedia articles. Blue Orange Digital trained and published the model on Hugging Face. You can find further descriptions and the model files at https://huggingface.co/BlueOrangeDigital/distilbert-cross-segment-document-chunking.

Chunking a Document

For our example, we will chunk an arbitrary Wikipedia article. The process involves classifying each pair of consecutive sentences with our transformer, which returns a numeric label for each input pair. A positive label (1) means that the two sentences do not belong to the same chunk, i.e., the pair marks a chunk boundary.
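Conceptually, the loop is simple. The sketch below assumes a hypothetical classify_pair function standing in for the model call we build later in the article: whenever a pair is labeled 1, the right-hand sentence starts a new chunk.

def chunk_sentences(sentences, classify_pair):
    # classify_pair(left, right) should return 1 at a chunk boundary, 0 otherwise.
    chunks, current = [], [sentences[0]]
    for left, right in zip(sentences, sentences[1:]):
        if classify_pair(left, right) == 1:
            chunks.append(current)
            current = []
        current.append(right)
    chunks.append(current)
    return chunks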

The article text in our example is defined below (one sentence per line) and split into sentences.

article = """
Ane Mihailovich is a retired Yugoslavian-American soccer player.
He spent at least four seasons in the American Soccer League, four in the North American Soccer League and one in the Major Indoor Soccer League.
He also earned five caps with the United States men's national soccer team in 1977.
Mihailovich spent the 1973 and 1974 seasons with the Cleveland Stars in the American Soccer League (ASL).
In 1976, Mihailovich signed with the expansion Los Angeles Skyhawks.
The Skyhawks went to the ASL title game where the game was tied 1-1 until the Skyhawks’ Steve Ralbovsky was tripped in the penalty area.
Mihailovich converted the penalty, beating New York Apollo goalkeeper Gerard Joseph in the lower left hand corner, and the Skyhawks won the game, 2-1.
In 1977, Mihailovich jumped the first division Los Angeles Aztecs of the North American Soccer League (NASL).
The Aztecs traded him to the Washington Diplomats six games into the 1978 season.
The move to the Dips brought a move from forward to defense for Mihailovich.
At the end of the 1979 season, the Dips sent Mihailovich to the San Jose Earthquakes in exchange for the Earthquakes first round draft pick in 1982. for the 1980 season, his last in the NASL.
While Mihailovich had played as a forward with the Skyhawks, he moved between the midfield and defense in the NASL.
He also earned five caps with the U.S. national team in 1977.
His first cap was a 3-1 loss to Guatemala on September 18, 1977.
His last cap came less than a month later in a 1-0 win over China on October 10, 1977.
Ane is now retired and living in Michigan with his wife Patricia.
They have 2 children, daughter Nicole and son Sasha.
They have 6 grandchildren.
Ane is currently coaching Crestwood High School's Boys Varsity Soccer.
He won coach of the year in 2013.
"""


# Strip the surrounding newlines before splitting, so we don't pick up empty "sentences".
ordered_sentences = article.strip().split('\n')

The transformer model takes as input the concatenation of two sentences and outputs 0 if the two sentences come from the same section/chunk, and 1 if they come from different ones. So our sentences need a tiny bit of preprocessing:

  • Truncate each sentence so that a concatenated pair does not exceed the token limit of our transformer (DistilBERT’s input is capped at 512 tokens);
  • Concatenate consecutive sentences into pairs, separated by the special [SEP] token.

from transformers import DistilBertTokenizer

model_name = "BlueOrangeDigital/distilbert-cross-segment-document-chunking"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

def right_truncate_sentence(sentence, tokenizer, max_len):
    # Encode without the special [CLS]/[SEP] tokens and keep the first max_len tokens.
    tokenized = tokenizer.encode(sentence)[1:-1]
    if len(tokenized) > max_len:
        print("cut")
    return tokenizer.decode(tokenized[:max_len])


def left_truncate_sentence(sentence, tokenizer, max_len):
    # Same idea, but keep the last max_len tokens.
    tokenized = tokenizer.encode(sentence)[1:-1]
    if len(tokenized) > max_len:
        print("cut")
    return tokenizer.decode(tokenized[-max_len:])


def bucket_pair(left_sentence, right_sentence, tokenizer, max_len):
    # Keep the end of the left sentence and the start of the right one,
    # joined by the special [SEP] token.
    return left_truncate_sentence(left_sentence, tokenizer, max_len) + " [SEP] " + \
        right_truncate_sentence(right_sentence, tokenizer, max_len)


MAX_LEN = 255
pairs = [
    bucket_pair(ordered_sentences[i], ordered_sentences[i + 1], tokenizer, MAX_LEN)
    for i in range(0, len(ordered_sentences) - 1)
]

The `pairs` list can be passed directly to a Hugging Face text classification pipeline:

from transformers import (
    AutoModelForSequenceClassification,
    TextClassificationPipeline
)

model_name = "BlueOrangeDigital/distilbert-cross-segment-document-chunking"

id2label = {0: "SAME", 1: "DIFFERENT"}
label2id = {"SAME": 0, "DIFFERENT": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)

predictions = pipe(pairs)
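
Each element of `predictions` is a dictionary with a label and a confidence score. The scores below are invented for illustration; only the structure matters:

[
    {"label": "SAME", "score": 0.98},
    {"label": "DIFFERENT", "score": 0.91},
    ...
]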

As previously mentioned, the `predictions` output indicates, for each sentence pair in `pairs`, whether the two sentences belong to the same chunk or not. To actually arrive at the chunks, we go back to the `ordered_sentences` list and join sentences based on these predictions.

n = len(ordered_sentences)

# Indices of the sentences that start a new chunk.
chunk_breaks = [
    i + 1
    for i, pred in enumerate(predictions)
    if pred["label"] != "SAME"
]

chunks = [
    "\n".join(ordered_sentences[i:j])
    for i, j in zip([0] + chunk_breaks, chunk_breaks + [n])
]

print(f"Document split into {len(chunks)} chunks.")
print(chunks)

Which outputs:

Document split into 7 chunks.
[
"Ane Mihailovich is a retired Yugoslavian-American soccer player.\nHe spent at least four season in the American Soccer League, four in the North American Soccer League and one in the Major Indoor Soccer League.\nHe also earned five caps with the United States men's national soccer team in 1977.",
...
"Ane is now retired and living in Michigan with his wife Patricia.\nThey have 2 children, daughter Nicole and son Sasha.\nThey have 6 grandchildren.",
"Ane is currently coaching Crestwood High School's Boys Varsity Soccer.\nHe won coach of the year in 2013."
]

Note that the `chunk_breaks` elements are integers indicating the indices of the sentences that start a new chunk.
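
To make the slicing concrete, here is a small hypothetical example (the break positions are made up, not taken from the article above): with six sentences and boundaries detected at pairs 1 and 3, the zip produces the start/end index of each chunk.

n = 6
chunk_breaks = [2, 4]  # hypothetical: pairs 1 and 3 were labeled DIFFERENT
print(list(zip([0] + chunk_breaks, chunk_breaks + [n])))
# [(0, 2), (2, 4), (4, 6)] -> sentences 0-1, 2-3 and 4-5 form the three chunks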

Closing Thoughts

This article has demonstrated how to chunk documents using an open source transformer model. It is important to highlight that the model we used was trained specifically on Wikipedia data, so applying it to other kinds of documents should be approached with careful consideration.

If you are interested in other context-aware approaches to document chunking, take a look at our articles linked below, which not only delve into alternative approaches, but also discuss a way to quantitatively evaluate each strategy.

Thank you for reading!



Ian Fukushima
Blue Orange Digital

Machine Learning Engineer and Data Scientist. MSc in Applied Economics.