Document Chunking Guided Journey: Final Ascent with Supervised Learning

Exploring Strategies for Document Segmentation with Python

Ian Fukushima
Blue Orange Digital
5 min read · Nov 2, 2023

Having ventured through greedy and TextTiling algorithms, we now make the final ascent to an approach that leverages transformer text embeddings, combined with similarity scores and supervised learning.

We continue to use the Wiki 50 dataset by Omri Koshorek et al.; you can refer to the first article if you need details on how to preprocess the data.

Note: This is the final article of a three-part series. You can check out the algorithms previously presented in the first two stories of the series.

Chunking Algorithm: Embedding Similarity Scores + Supervised Classification

In the TextTiling approach, we compared blocks of sentences. Here we make a more granular comparison by calculating similarity scores at the sentence level: not only between subsequent sentences, but between every pair of sentences within a distance “k” of each other.

These similarity scores can then be used as features in a binary classification problem: identifying whether a sentence is the first sentence of a chunk or not.
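To make the idea concrete, here is a minimal illustration of the feature row for a single sentence with K = 2. The embeddings are random stand-ins, so the values themselves are meaningless:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

embeddings = np.random.rand(7, 768)  # stand-in for 7 sentence embeddings
K, i = 2, 3

# Similarity of sentence i with the window [i - K, i + K]:
# [sim(i, i-2), sim(i, i-1), sim(i, i), sim(i, i+1), sim(i, i+2)]
window = cosine_similarity([embeddings[i]], embeddings[i - K:i + K + 1])[0]

# Label + similarity features for the classifier
feature_row = [False] + window.tolist()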

With the features and labels in hand, we can train any classifier and apply standard ML practices. Below is a simplified implementation of the feature creation and model training steps:

from typing import List, Union

import numpy as np
import pandas as pd
import tiktoken
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

from sentence_transformers import SentenceTransformer


def get_features(
    document: str,
    filename: str,
    chunk_sentence_start_positions: List[int],
    encoder: tiktoken.core.Encoding,
    model: SentenceTransformer,
    sentence_end_string: str,
    K: int,
    train: bool = True,
):
    """
    Parameters:
    - document: The text to be divided into chunks based on sentence
      boundaries.
    - filename: Identifier for the document, which works as a unique ID.
    - chunk_sentence_start_positions: Positions of the sentences that
      initiate a new chunk. For example, if the positions are [0, 4], then
      the first and fifth sentences start a new chunk.
    - encoder: Tokenizer used to convert text into tokens.
    - model: Model used to generate embeddings for sentences.
    - sentence_end_string: A special string that indicates the end of a
      sentence.
    - K: Number of adjacent sentences to the left and right of a given
      sentence to calculate similarity with.
    - train: Flag to indicate whether the function is being used for
      training (True) or testing/prediction (False). Defaults to True.

    Returns:
    - List[List[Union[bool, float, int, str]]]: A list of features for each
      sentence in the document. Each entry in the list contains:
        - A boolean indicating if the sentence starts a new chunk.
        - 2K + 1 similarity scores with the K preceding sentences, the
          sentence itself, and the K subsequent sentences.
        - The number of sentences in the current chunk (if in train mode).
        - The start position of the sentence in the document.
        - The filename (ID) of the document.
    """
    # The sentence delimiter must map to a single special token
    sentence_end_token = encoder.encode(
        sentence_end_string, allowed_special={sentence_end_string}
    )
    assert len(sentence_end_token) == 1

    # Split the document into sentences
    sentences = [
        s + sentence_end_string
        for s in document.split(sentence_end_string)
        if s != ""
    ]
    sentence_token_len = [
        len(encoder.encode(s, allowed_special="all"))
        for s in sentences
    ]
    sentence_start_positions = [0] + np.cumsum(sentence_token_len)[:-1].tolist()
    n_sentences = len(sentences)

    # Embed every sentence
    embeddings = model.encode(sentences)

    # Initialize an empty list to store the feature rows
    results = []

    # We also create a feature for the number of sentences in the
    # current chunk. It is initialized as K, since we start looking
    # at the (K+1)th sentence
    chunk_sentences = K
    for i in range(K, n_sentences - K):
        # Calculate the similarity scores with the K preceding and K
        # subsequent sentences (plus the sentence itself)
        similarity_matrix = cosine_similarity(
            [embeddings[i]], embeddings[i - K:i + K + 1]
        )[0]

        # In train mode, count the sentence; in test/predict mode, this
        # feature must be computed iteratively outside the feature creation
        if train:
            chunk_sentences += 1
        else:
            chunk_sentences = None

        results.append(
            [i in chunk_sentence_start_positions]
            + list(similarity_matrix)
            + [chunk_sentences, sentence_start_positions[i], filename]
        )

        # If the sentence is the first of a chunk, restart the count of
        # sentences in the current chunk
        if i in chunk_sentence_start_positions:
            chunk_sentences = -1

    return results


K = 3

model_name = "sentence-transformers/all-mpnet-base-v2"
model = SentenceTransformer(model_name)

# `train_df` and the tiktoken encoder `enc` (which registers the
# "<|end_sentence|>" special token) are built in the first article
similarity_features = train_df.apply(
    lambda x: get_features(
        x.text,
        x.filename,
        x.chunk_start_sentence_position,
        enc,
        model,
        "<|end_sentence|>",
        K,
        True,
    ),
    axis=1,
)

# Flatten the per-article lists of feature rows into a single list
list_of_similarity_features = [
    chunk_features
    for article_features_list in similarity_features
    for chunk_features in article_features_list
]

sl_train_df = pd.DataFrame(
    list_of_similarity_features,
    columns=["y"]
    + [f"sim_lag_{k}" for k in range(K, 0, -1)]
    + [f"sim_lead_{k}" for k in range(K + 1)]
    + ["chunk_sentences", "sentence_start_positions", "filename"],
)


# Train the classifier. Here we keep only the chunk length feature; the
# similarity lags and leads can be added as extra features, e.g.:
# features = [f"sim_lag_{k}" for k in range(K, 0, -1)] + \
#     [f"sim_lead_{k}" for k in range(2, K + 1)] + \
#     ["chunk_sentences"]
features = ["chunk_sentences"]

parameters = {
    "n_estimators": [3, 4, 5, 10],
    "max_depth": [2, 3, 4],
    "class_weight": ["balanced_subsample", "balanced"],
}

rfc = RandomForestClassifier()
clf = GridSearchCV(rfc, parameters, cv=5, return_train_score=True)

clf.fit(X=sl_train_df[features], y=sl_train_df["y"])

clf.best_estimator_

Note that prediction requires an iterative approach, given that the chunk_sentences feature depends on preceding predictions. We can also add more features to the classifier, such as the leads and lags of the similarity scores.
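As a sketch of what that iterative loop could look like, assuming a hypothetical `sl_test_df` holding the feature rows of a single document built with get_features(..., train=False) and ordered by sentence position (this is not the exact inference code used for the results below):

chunk_sentences = K  # same initialization as in get_features
predicted_start_positions = []

for _, row in sl_test_df.iterrows():
    chunk_sentences += 1
    row = row.copy()
    row["chunk_sentences"] = chunk_sentences

    # Predict whether this sentence starts a new chunk
    is_start = clf.predict(row[features].to_frame().T)[0]

    if is_start:
        predicted_start_positions.append(row["sentence_start_positions"])
        chunk_sentences = -1  # mirror the reset inside get_features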

Comparative Analysis: Quantitative Evaluation of the Approaches

The statistics of the error, defined as the cumulative difference between the actual and predicted positions of the starting token of a chunk, are presented below:

Error statistics for different document chunking algorithms

Refer to the first article for a detailed definition of the error calculation.
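For completeness, here is a minimal sketch of the kind of calculation involved: summing, for each actual chunk-start token position, the absolute distance to the nearest predicted one. Treat it as an approximation of the metric, not the exact formula from the first article:

import numpy as np

def chunk_position_error(actual_starts, predicted_starts):
    """Cumulative absolute distance between each actual chunk-start token
    position and the closest predicted one (an approximation of the error
    defined in the first article)."""
    predicted = np.asarray(predicted_starts)
    if predicted.size == 0:
        return float("inf")  # no predictions: treat as maximal error
    return sum(np.abs(predicted - a).min() for a in actual_starts)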

A Note of Caution

The aforementioned results provide a baseline understanding of one way to quantify the quality of chunks. It’s imperative to recognize the specificity and constraints of our train and test sets. Furthermore, the simplification in the supervised learning approach warrants consideration. Thus, I encourage you to perceive the results as an illustrative exercise, and not a table to choose an algorithm from.

It is prudent to engage in experimentation with these approaches on your data, ensuring the selected strategy works in your application.

Other approaches
