LLM | AI | HAYSTACK | DEEPSET | TEXT ANNOTATION |

Converting data into SQuAD format for fine-tuning LLM models

Introduction to the Haystack annotation tool and its implementation

Chinmay Bhalerao

Image credits: official repo of Haystack

SQuAD

SQuAD (Stanford Question Answering Dataset) is a popular dataset, and its JSON layout has become a common format for training and evaluating language models, including LLMs (large language models), on question-answering tasks.

The SQuAD format consists of context passages, each accompanied by a list of questions and the corresponding answer(s) for each question. The context passage and the question are provided as strings, while each answer is represented as a text span: the answer text plus answer_start, the zero-based character offset at which that text begins in the context. The following is an example of SQuAD-format data.

{
  "data": [
    {
      "paragraphs": [
        {
          "context": "The quick brown fox jumps over the lazy dog.",
          "qas": [
            {
              "question": "What does the fox jump over?",
              "id": "q1",
              "answers": [
                {
                  "text": "the lazy dog",
                  "answer_start": 31
                }
              ]
            }
          ]
        }
      ],
      "title": "Example"
    }
  ],
  "version": "2.0"
}
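Because answer_start is a character offset, it is easy to get it wrong by one when editing a file by hand. Here is a minimal sketch of a sanity check, assuming the file follows the layout shown above. The function name find_misaligned_answers and the sample sentence are my own, used only for illustration.

```python
def find_misaligned_answers(squad: dict) -> list[str]:
    """Return the ids of questions whose answer_start does not point at the answer text."""
    bad_ids = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    # The slice of the context must reproduce the answer text exactly.
                    if context[start:end] != answer["text"]:
                        bad_ids.append(qa["id"])
    return bad_ids


# Illustrative data: q1 has the correct offset, q2 is off by one on purpose.
context = "Haystack is an open-source framework built by deepset."
good_start = context.index("deepset")
example = {
    "version": "2.0",
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "question": "Who built Haystack?",
                            "id": "q1",
                            "answers": [{"text": "deepset", "answer_start": good_start}],
                        },
                        {
                            "question": "Who built Haystack?",
                            "id": "q2",
                            "answers": [{"text": "deepset", "answer_start": good_start + 1}],
                        },
                    ],
                }
            ],
        }
    ],
}

print(find_misaligned_answers(example))  # prints ['q2']
```

Running a check like this before fine-tuning catches shifted offsets early, since a model trained on misaligned spans learns the wrong answer boundaries.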

In most cases, our data comes in the form of paragraphs and documents, the kind of text we can find on any website or anywhere else data is produced. Traditional datasets are usually a series of documents, either text files or Word files. The problem is that we cannot feed them directly to an LLM for fine-tuning, because the model requires data in a specific format, and only a few pipelines can take raw text files directly. SQuAD is one of the formats that works well with many LLMs.
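To make the target format concrete, here is a small sketch of how one paragraph, one question, and one answer could be wrapped into a SQuAD-style record by hand. It is not how the Haystack annotation tool works internally; make_squad_entry is a hypothetical helper, and it assumes the answer appears verbatim in the context (it uses the first occurrence if the text repeats).

```python
import json


def make_squad_entry(title: str, context: str, question: str,
                     answer_text: str, qa_id: str) -> dict:
    """Build one SQuAD-style article holding a single paragraph and question."""
    # str.find returns the zero-based offset of the first match, or -1 if absent.
    start = context.find(answer_text)
    if start == -1:
        raise ValueError(f"Answer {answer_text!r} not found in the context")
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "question": question,
                        "id": qa_id,
                        "answers": [{"text": answer_text, "answer_start": start}],
                    }
                ],
            }
        ],
    }


context = "The quick brown fox jumps over the lazy dog."
entry = make_squad_entry(
    title="Example",
    context=context,
    question="What does the fox jump over?",
    answer_text="the lazy dog",
    qa_id="q1",
)

# Wrap the article in the top-level structure and print it as JSON.
dataset = {"version": "2.0", "data": [entry]}
print(json.dumps(dataset, indent=2))
```

Writing records this way is fine for a handful of examples, but for real documents an annotation tool is far less error-prone, which is why the steps below use one.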

Text annotation

Text annotation is the process of adding structured information to unstructured text data in order to make it more understandable and useful for downstream applications. In the context of LLMs (large language models), text annotation is often used to train these models and improve their accuracy on specific natural language processing tasks, such as named entity recognition, part-of-speech tagging, and sentiment analysis.

There are several types of text annotation that are commonly used with LLMs, including:

Named Entity Recognition (NER): NER involves identifying and classifying named entities in text, such as people, organizations, locations, and dates.

Part-of-Speech (POS) Tagging: POS tagging involves labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, or adverb.

Dependency Parsing: Dependency parsing involves identifying the syntactic relationships between words in a sentence, such as subject-verb or object-preposition.

Sentiment Analysis: Sentiment analysis involves identifying the polarity (positive, negative, or neutral) of a sentence or document.

Coreference Resolution: Coreference resolution involves identifying when two or more words or phrases in a text refer to the same entity.

Text annotation is typically performed manually by human annotators, although there are also some automated tools that can assist with certain types of annotation. The resulting annotated text can then be used to train LLMs and improve their accuracy on specific natural language processing tasks.
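Many of the annotation types listed above boil down to the same idea: a label attached to a span of characters in the text. Here is a minimal sketch of that representation, assuming a simple start/end/label layout. The Span class and the span_for helper are illustrative names of my own, not part of any annotation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # zero-based offset of the first character
    end: int    # offset one past the last character
    label: str  # e.g. an entity type such as ORG or DATE


def span_for(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of phrase in text and label it."""
    start = text.index(phrase)  # raises ValueError if the phrase is missing
    return Span(start, start + len(phrase), label)


text = "Stanford released SQuAD in 2016."
annotations = [
    span_for(text, "Stanford", "ORG"),
    span_for(text, "2016", "DATE"),
]

for span in annotations:
    # Slicing the text with the stored offsets recovers the labeled phrase.
    print(text[span.start:span.end], "->", span.label)
```

A SQuAD answer is the same kind of span, except that the label is replaced by the question it answers.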

Let's convert our raw data into SQuAD format. There are many ways to do this conversion, but I will walk you through the one I have explored.

Step 1:

Finalize your dataset

Finalize the text documents that you want to use for fine-tuning the LLM, then bring them all into a common folder.
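If your documents are scattered across several folders, a short script can gather them for you. This is only a sketch under a few assumptions: the files are plain .txt files, the function name collect_text_files is my own, and the folder names in the demo are made up. The demo runs inside a temporary directory so it does not touch your real files.

```python
import shutil
import tempfile
from pathlib import Path


def collect_text_files(source_dirs, target_dir, pattern="*.txt"):
    """Copy every file matching pattern from source_dirs into one folder."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in source_dirs:
        # sorted() gives a stable order and lists the files before any copying starts.
        for path in sorted(Path(source).rglob(pattern)):
            if not path.is_file():
                continue
            destination = target / path.name
            counter = 1
            # Add a numeric suffix instead of overwriting a file with the same name.
            while destination.exists():
                destination = target / f"{path.stem}_{counter}{path.suffix}"
                counter += 1
            shutil.copy2(path, destination)
            copied.append(destination)
    return copied


# Demo with made-up folders and files inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "site_a").mkdir()
    (root / "site_b").mkdir()
    (root / "site_a" / "notes.txt").write_text("First document.", encoding="utf-8")
    (root / "site_b" / "notes.txt").write_text("Second document.", encoding="utf-8")

    copied = collect_text_files([root / "site_a", root / "site_b"], root / "dataset")
    copied_names = [p.name for p in copied]
    print(copied_names)  # prints ['notes.txt', 'notes_1.txt']
```

Keep the target folder outside the source folders, otherwise files that were already copied could be picked up again on a later pass.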

Step 2:

Go to the Haystack annotation tool and register yourself

link: Haystack annotation tool

The interface of the Haystack annotation tool [Image by author]

You will get a window like this. You can see projects and other options.

In the right corner, you will see a window like this.

Image by author

Create a new project and give a proper name to it.

Step 3:

From Actions, go inside the project, then click on Import.

Image by author

Drag your documents here.

Image by author

Click on add custom questions.

Image by author

Select the paragraph that contains the answer to your question, then select the answer category: short, concise, YES, or NO. [Haystack added these options recently.]

Step 4:

After creating all questions, export them in your required format [Excel, CSV, or SQuAD].

We will export it in SQuAD.

Fig: Initial text file [Image by author]
SQuAD format converted file [Image by author]

As displayed, the initial file is a plain text file [unstructured], and below it you can see the converted SQuAD-format JSON file [structured]. Basically, we are assigning labels to text, as we do in supervised learning. After this, you can feed this file to question-answering models such as BERT.
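Most training code wants one row per question rather than the nested JSON. Below is a minimal sketch that flattens a SQuAD-style file into such rows. It assumes the export follows the standard layout shown at the top of this article; the actual Haystack export may carry extra fields, which this code simply ignores. The function name flatten_squad and the sample JSON are my own.

```python
import json


def flatten_squad(squad: dict) -> list[dict]:
    """Turn nested SQuAD JSON into one row per question."""
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                rows.append(
                    {
                        "id": qa["id"],
                        "question": qa["question"],
                        "context": paragraph["context"],
                        "answers": {
                            "text": [a["text"] for a in answers],
                            "answer_start": [a["answer_start"] for a in answers],
                        },
                    }
                )
    return rows


# Stand-in for the exported file. With a real export you would instead do:
#   with open("your_export.json", encoding="utf-8") as f:  # hypothetical file name
#       squad = json.load(f)
squad = json.loads("""
{
  "version": "2.0",
  "data": [
    {
      "title": "Example",
      "paragraphs": [
        {
          "context": "Haystack is an open-source framework built by deepset.",
          "qas": [
            {"question": "Who built Haystack?", "id": "q1",
             "answers": [{"text": "deepset", "answer_start": 46}]},
            {"question": "What kind of framework is Haystack?", "id": "q2",
             "answers": [{"text": "open-source", "answer_start": 15}]}
          ]
        }
      ]
    }
  ]
}
""")

rows = flatten_squad(squad)
for row in rows:
    print(row["id"], row["question"], row["answers"]["text"])
```

Each row now holds a question, its context, and the answer span, which is the shape most question-answering fine-tuning scripts expect as input.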
