How Large Language Models and Retrieval-Augmented Generation Answer Questions over Documents — Text Splitting and Parsing

M. Baddar · Published in BetaFlow · 3 min read · Aug 13, 2023
Image source: https://www.expertreviews.co.uk/kitchen/1408597/best-chopping-board

As we mentioned in our overview article, the Question-Answering over Documents (Docs-QA) system can be summarized in the following diagram:

Figure 1 : Docs-QA Overview

The first component is the “Text Splitter”. You can think of it as chopping your food before swallowing and digesting it, or as chunking text while reading before understanding it.

Before getting deeper into text splitting for NLP, especially for Docs-QA, let’s get back to basics: what is the natural split hierarchy of any text? It is characters, words, sentences, and paragraphs. Let’s call each of them a “Text-Element”. The function of a “Text-Splitter” is to combine Text-Elements, based on some criteria and at some level of granularity, into a set of “chunks”.
To make things clearer, let’s list a set of techniques for text splitting:

1. Fixed width: chunks are built from N Text-Elements (characters, words, or sentences). Usually words or sentences are used.
Figure 2: Fixed Width Text-Splitting
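As an illustration, here is a minimal Python sketch of fixed-width splitting over words (the function name and the default chunk size are illustrative choices, not from the article):

```python
# Fixed-width splitting: every chunk holds at most `chunk_size` words.
# A minimal sketch assuming simple whitespace tokenization.
def fixed_width_chunks(text: str, chunk_size: int = 100) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Note that chunk boundaries fall wherever the word count runs out, which is exactly the “breaking the context” problem the next technique addresses.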

2. Overlapped with Fixed width

This is an extension of the fixed-width approach that avoids “breaking the context”. The chunks are still fixed-width, but each one is augmented with a pre- and post-overlap, so that neighbouring chunks share some text. For more info about chunk size and overlap, check here.

Figure 3: Fixed Width with overlap
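A minimal sketch of the same idea with overlap; the chunk_size and overlap defaults are illustrative, not values from the article:

```python
# Overlapped fixed-width splitting: consecutive chunks share `overlap`
# words, so text cut at one chunk boundary survives intact in a neighbour.
def overlapping_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

In practice, libraries such as LangChain expose the same pattern through text splitters with chunk_size and chunk_overlap parameters.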

3. Semantic Segmentation

This is the most sophisticated approach, and typically the best-performing one. The main idea is to group text chunks into “similar” groups based on some learnt model. By “similar” here we mean covering similar topics: we “encode” the text chunks (i.e., create embeddings for them), then choose a threshold that decides how the chunks are grouped together. One way to classify “Semantic Text Segmentation” methods is as follows:

i) Supervised approach: this approach relies on the concept of a “boundary” or “closing” sentence. For example, say we have a chunk of N sentences; if the article is well written, there is usually a closing sentence that wraps up the current topic before the article shifts its “context” to the next one.

The supervised approach uses a labelled data set to train a model (usually a recurrent model, such as an LSTM) as a classifier that marks each sentence as an opening, middle, or closing sentence.
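A minimal sketch of this framing, assuming per-sentence embeddings are already computed; the class name, dimensions, and three-way labels are illustrative, not from the article:

```python
# Sentence-role classifier: labels each sentence in a document as
# opening (0), middle (1), or closing (2); predicted closing sentences
# become chunk boundaries. A sketch, not a trained model.
import torch
import torch.nn as nn

class SentenceRoleClassifier(nn.Module):
    def __init__(self, emb_dim: int = 384, hidden: int = 128):
        super().__init__()
        # A bidirectional LSTM reads each sentence embedding in the
        # context of its neighbours.
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 3)  # one logit per role

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, n_sentences, emb_dim)
        out, _ = self.lstm(sent_embs)
        return self.head(out)  # (batch, n_sentences, 3) logits

# Shape check with fake embeddings: 10 sentences, 384 dimensions.
model = SentenceRoleClassifier()
print(model(torch.randn(1, 10, 384)).shape)  # torch.Size([1, 10, 3])
```

Trained with cross-entropy loss against labelled sentence roles, the positions predicted as “closing” mark where one chunk ends and the next begins.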

ii) Unsupervised approach: this approach relies on calculating the similarity between consecutive sentences and cutting wherever that similarity drops below some threshold. The threshold can be treated as a hyper-parameter to be tuned.
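Here is a minimal sketch of the unsupervised approach, assuming the sentence-transformers library for embeddings; the model name and the 0.5 threshold are illustrative choices, not values from the article:

```python
# Unsupervised semantic segmentation: embed each sentence, then cut
# wherever the cosine similarity between consecutive sentences drops
# below a tunable threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_segments(sentences: list[str], threshold: float = 0.5) -> list[list[str]]:
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embs = model.encode(sentences)  # (n_sentences, dim) numpy array
    segments, current = [], [sentences[0]]
    for prev, nxt, sent in zip(embs, embs[1:], sentences[1:]):
        sim = np.dot(prev, nxt) / (np.linalg.norm(prev) * np.linalg.norm(nxt))
        if sim < threshold:  # low similarity -> likely topic shift, cut here
            segments.append(current)
            current = []
        current.append(sent)
    segments.append(current)
    return segments
```

Sweeping the threshold over a validation set is the natural way to tune this hyper-parameter.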

For more information about “Text Segmentation”, check this excellent article.

In AnswerMe, our LLM-powered documents question-answering API, we apply the second approach (overlapped fixed width) for its efficiency and simplicity. You can follow this 5-minute tutorial to try it on your own PDF docs and see text splitting working in action alongside all the other Docs-QA components.

In the next article, we will introduce the second Docs-QA component: Text-Embedding.

