Document Chunking Guided Journey: Final Ascent with Supervised Learning

Exploring Strategies for Document Segmentation with Python

Ian Fukushima
Blue Orange Digital
5 min read · Nov 2, 2023

Having ventured through greedy and TextTiling algorithms, we now make the final ascent to an approach that leverages transformer text embeddings, combined with similarity scores and supervised learning.

We continue to use the Wiki 50 dataset by Omri Koshorek et al.; you can refer to the first article if you need details on how to preprocess the data.

Note: This is the final article of a three-part series. You can check out the algorithms previously presented in the first two stories of the series.

Chunking Algorithm: Embedding Similarity Scores + Supervised Classification

In the TextTiling approach, we compared blocks of sentences. Here we make a more granular comparison by calculating similarity scores at the sentence level: not only between subsequent sentences, but between every pair of sentences within a distance “k” of each other.

These similarity scores can then be used as features in a binary classification problem: identifying whether a sentence is the first sentence of a chunk or not.
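To make the idea concrete, here is a minimal illustration of the feature row for a single sentence with K = 2. The embeddings are random stand-ins, so the values themselves are meaningless:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

embeddings = np.random.rand(7, 768)  # stand-in for 7 sentence embeddings
K, i = 2, 3

# Similarity of sentence i with the window [i - K, i + K]:
# [sim(i, i-2), sim(i, i-1), sim(i, i), sim(i, i+1), sim(i, i+2)]
window = cosine_similarity([embeddings[i]], embeddings[i - K:i + K + 1])[0]

# Label + similarity features for the classifier
feature_row = [False] + window.tolist()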

With the features and labels in hand, we can train any classifier and apply standard ML practices. Below is a simplified implementation of the feature creation and model training steps:

from typing import List, Union

import numpy as np
import pandas as pd
import tiktoken
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

from sentence_transformers import SentenceTransformer


def get_features(
    document: str,
    filename: str,
    chunk_sentence_start_positions: List[int],
    encoder: tiktoken.core.Encoding,
    model: SentenceTransformer,
    sentence_end_string: str,
    K: int,
    train: bool = True,
):
    """
    Parameters:
    - document: The text to be divided into chunks based on sentence
      boundaries.
    - filename: Identifier for the document, which works as a unique ID.
    - chunk_sentence_start_positions: Positions of the sentences that
      initiate a new chunk. For example, if the positions are [0, 4], then
      the first and fifth sentences start a new chunk.
    - encoder: Tokenizer used to convert text into tokens.
    - model: Model used to generate embeddings for sentences.
    - sentence_end_string: A special string that indicates the end of a
      sentence.
    - K: Number of adjacent sentences to the left and right of a given
      sentence to calculate similarity with.
    - train: Flag to indicate whether the function is being used for
      training (True) or testing/prediction (False). Defaults to True.

    Returns:
    - List[List[Union[bool, float, int, str]]]: A list of features for each
      sentence in the document. Each entry in the list contains:
        - A boolean indicating if the sentence starts a new chunk.
        - 2K + 1 similarity scores with the K preceding sentences, the
          sentence itself, and the K subsequent sentences.
        - The number of sentences in the current chunk (if in train mode).
        - The start position of the sentence in the document.
        - The filename (ID) of the document.
    """
    # The sentence delimiter must map to a single special token
    sentence_end_token = encoder.encode(
        sentence_end_string, allowed_special={sentence_end_string}
    )
    assert len(sentence_end_token) == 1

    # Split the document into sentences
    sentences = [
        s + sentence_end_string
        for s in document.split(sentence_end_string)
        if s != ""
    ]
    sentence_token_len = [
        len(encoder.encode(s, allowed_special="all"))
        for s in sentences
    ]
    sentence_start_positions = [0] + np.cumsum(sentence_token_len)[:-1].tolist()
    n_sentences = len(sentences)

    # Embed every sentence
    embeddings = model.encode(sentences)

    # Initialize an empty list to store the feature rows
    results = []

    # We also create a feature for the number of sentences in the
    # current chunk. It is initialized as K, since we start looking
    # at the (K+1)th sentence
    chunk_sentences = K
    for i in range(K, n_sentences - K):
        # Calculate the similarity scores with the K preceding and K
        # subsequent sentences (plus the sentence itself)
        similarity_matrix = cosine_similarity(
            [embeddings[i]], embeddings[i - K:i + K + 1]
        )[0]

        # In train mode, count the sentence; in test/predict mode, this
        # feature must be computed iteratively outside the feature creation
        if train:
            chunk_sentences += 1
        else:
            chunk_sentences = None

        results.append(
            [i in chunk_sentence_start_positions]
            + list(similarity_matrix)
            + [chunk_sentences, sentence_start_positions[i], filename]
        )

        # If the sentence is the first of a chunk, restart the count of
        # sentences in the current chunk
        if i in chunk_sentence_start_positions:
            chunk_sentences = -1

    return results


K = 3

model_name = "sentence-transformers/all-mpnet-base-v2"
model = SentenceTransformer(model_name)

# `train_df` and the tiktoken encoder `enc` (which registers the
# "<|end_sentence|>" special token) are built in the first article
similarity_features = train_df.apply(
    lambda x: get_features(
        x.text,
        x.filename,
        x.chunk_start_sentence_position,
        enc,
        model,
        "<|end_sentence|>",
        K,
        True,
    ),
    axis=1,
)

# Flatten the per-article lists of feature rows into a single list
list_of_similarity_features = [
    chunk_features
    for article_features_list in similarity_features
    for chunk_features in article_features_list
]

sl_train_df = pd.DataFrame(
    list_of_similarity_features,
    columns=["y"]
    + [f"sim_lag_{k}" for k in range(K, 0, -1)]
    + [f"sim_lead_{k}" for k in range(K + 1)]
    + ["chunk_sentences", "sentence_start_positions", "filename"],
)


# Train the classifier. Here we keep only the chunk length feature; the
# similarity lags and leads can be added as extra features, e.g.:
# features = [f"sim_lag_{k}" for k in range(K, 0, -1)] + \
#     [f"sim_lead_{k}" for k in range(2, K + 1)] + \
#     ["chunk_sentences"]
features = ["chunk_sentences"]

parameters = {
    "n_estimators": [3, 4, 5, 10],
    "max_depth": [2, 3, 4],
    "class_weight": ["balanced_subsample", "balanced"],
}

rfc = RandomForestClassifier()
clf = GridSearchCV(rfc, parameters, cv=5, return_train_score=True)

clf.fit(X=sl_train_df[features], y=sl_train_df["y"])

clf.best_estimator_

Note that prediction requires an iterative approach, given that the chunk_sentences feature depends on preceding predictions. We can also add more features to the classifier, such as the leads and lags of the similarity scores.
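As a sketch of what that iterative loop could look like, assuming a hypothetical `sl_test_df` holding the feature rows of a single document built with get_features(..., train=False) and ordered by sentence position (this is not the exact inference code used for the results below):

chunk_sentences = K  # same initialization as in get_features
predicted_start_positions = []

for _, row in sl_test_df.iterrows():
    chunk_sentences += 1
    row = row.copy()
    row["chunk_sentences"] = chunk_sentences

    # Predict whether this sentence starts a new chunk
    is_start = clf.predict(row[features].to_frame().T)[0]

    if is_start:
        predicted_start_positions.append(row["sentence_start_positions"])
        chunk_sentences = -1  # mirror the reset inside get_features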

Comparative Analysis: Quantitative Evaluation of the Approaches

The statistics of the error, defined as the cumulative difference between the actual and predicted positions of the starting token of a chunk, are presented below:

Error statistics for different document chunking algorithms

Refer to the first article for a detailed definition of the error calculation.
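For completeness, here is a minimal sketch of the kind of calculation involved: summing, for each actual chunk-start token position, the absolute distance to the nearest predicted one. Treat it as an approximation of the metric, not the exact formula from the first article:

import numpy as np

def chunk_position_error(actual_starts, predicted_starts):
    """Cumulative absolute distance between each actual chunk-start token
    position and the closest predicted one (an approximation of the error
    defined in the first article)."""
    predicted = np.asarray(predicted_starts)
    if predicted.size == 0:
        return float("inf")  # no predictions: treat as maximal error
    return sum(np.abs(predicted - a).min() for a in actual_starts)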

A Note of Caution

The aforementioned results provide a baseline understanding of one way to quantify the quality of chunks. It’s imperative to recognize the specificity and constraints of our train and test sets. Furthermore, the simplification in the supervised learning approach warrants consideration. Thus, I encourage you to perceive the results as an illustrative exercise, and not a table to choose an algorithm from.

It is prudent to engage in experimentation with these approaches on your data, ensuring the selected strategy works in your application.

Other approaches
