How to Label Text Data Using LLMs?

Yashowardhan Shinde

In recent years, the field of machine learning has seen exponential growth in the amount of available data. With the rise of big data, many organizations are collecting massive amounts of unstructured data such as text, images, and videos. One of the biggest challenges in utilizing this data is labeling it, which involves manually annotating the data with tags or categories that can be used for training machine learning models.

However, manual labeling is a time-consuming and expensive process, and it often introduces errors and inconsistencies. This is where the power of natural language processing (NLP) and language models like ChatGPT can be harnessed.

In this blog post, we will explore how to use LLMs (Large Language Models) for labeling text data. We will cover two simple approaches to text labeling and walk through sample code for each.

Contents:

  1. Zero-Shot Approach
  2. Few-Shot Approach
  3. Code Implementation

Zero-Shot Approach

In this approach, you directly ask the model to classify the sample. For a task like sentiment analysis, this method can be very efficient with almost any LLM. The prompts for this approach look like the ones shown below.

Sample Prompt for Zero-Shot Approach — Sentiment Classification
Sample Prompt for Zero-Shot Approach — General Text Classification
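For illustration, the two prompts could look something like this (the exact wording can vary):

I am happy today; Classify this sentence as Positive, Negative or Neutral in one word.

The parliament passed the new budget bill today.
Classify the above sentence into one of the following categories: Sports, Politics.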

In the second example, an additional prompt is added after the main sentence to provide a problem definition, which helps the LLM understand the human intention and the criteria for the required answer.

Few-Shot Approach

Few-shot learning is a method where the model is shown a small set of high-quality examples, each consisting of an input and the desired output for the target task. This helps the model better understand the human intention and the criteria for the required answers, so few-shot learning often outperforms zero-shot learning. Its limitations are that it consumes more tokens and may exceed the model's context length limit when the input and output texts are long. In short, few-shot learning offers improved accuracy, but at the cost of increased token consumption and the risk of hitting context length limits.

Sample Prompt for Few-Shot Approach — Sentiment Classification
Sample Prompt for Few-Shot Approach — General Text Classification
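For illustration, the two few-shot prompts could look something like this (the example sentences are made up):

Text: I am happy today
Sentiment: Positive
Text: The food was awful
Sentiment: Negative
Text: The movie was fantastic
Sentiment:

Text: The home team clinched the title last night
Category: Sports
Text: The senate passed the new bill
Category: Politics
Text: The defending champions lost the opening match
Category: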

Here, in the second example, the context samples given to the model cover two categories, Sports and Politics, which is enough for the LLM to understand the scope of the classification. This can be made more informative by adding a problem-definition prompt such as: "With context to the above examples, classify the following example into the Sports or Politics category."

Many studies looked into how to construct in-context examples to maximize the performance and observed that choice of prompt format, training examples, and the order of the examples can lead to dramatically different performance, from near random guess to near SoTA. — link

Since the few-shot approach is known to perform well in most cases where the given samples have good quality and diversity, it raises the question of whether this approach can help us avoid model fine-tuning for simple to moderately complex classification tasks, where a limited number of good-quality samples is enough for an LLM to understand the context of the task.

To get deeper insight into how the prompt can be engineered and how to select a diverse set of samples for the few-shot approach, you can refer to this really informative article by Lilian Weng.

Code Implementation

To label text with an LLM, you can use the following code. We will use the OpenAI API with the gpt-3.5-turbo model, and Google's Flan-T5 model via Hugging Face.

OpenAI API (gpt-3.5-turbo):

import openai

# Get the OpenAI API key by signing up on OpenAI.
openai.api_key = ""

# Zero-Shot Approach
# Directly asking the model to label the text.
def generate_text_labels(text, categories):
    labels = []
    text_label_mapping = {}

    # String of categories in which you want to classify the text.
    category_str = ", ".join(map(str, categories))

    # Sample Prompt
    # Example: I am happy today; Classify this sentence as Positive, Negative or Neutral in one word.
    for i in range(len(text)):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "user", "content": f"{text[i]}; Classify this sentence as {category_str} in one word."},
                # OR - For simple sentiment classification:
                # {"role": "user", "content": f"Text: {text[i]} \nSentiment in one word:"},
            ]
        )
        label = response.choices[0]["message"]["content"].strip(".")
        labels.append(label)
        text_label_mapping[text[i]] = label

    return labels, text_label_mapping

# Few-Shot Learning
# context = [("Sentence", "Category/Sentiment")]
def generate_text_labels_context(text, context):
    labels = []
    text_label_mapping = {}

    # Examples to help the model understand the task and context.
    context_string = ""
    for i in range(len(context)):  # Iterate over the context examples, not the unlabeled text.
        context_string += f"Text: {context[i][0]}\nSentiment: {context[i][1]}\n"
        # OR
        # context_string += f"Text: {context[i][0]}\nCategory: {context[i][1]}\n"

    # Sample Prompt
    """
    Text: A
    Category/Sentiment: X
    Text: B
    Category/Sentiment: Y
    Text: C
    Category/Sentiment:
    """
    for i in range(len(text)):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "user", "content": f"{context_string} \nText: {text[i]} \nSentiment:"},
                # OR
                # {"role": "user", "content": f"{context_string} \nText: {text[i]} \nCategory:"},
            ]
        )
        label = response.choices[0]["message"]["content"].strip(".")
        labels.append(label)
        text_label_mapping[text[i]] = label

    return labels, text_label_mapping
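A minimal usage sketch for these two functions (the sentences and labels below are made up for illustration):

# Zero-Shot: provide the unlabeled sentences and the target categories.
texts = ["I am happy today", "The service was terrible"]
labels, mapping = generate_text_labels(texts, ["Positive", "Negative", "Neutral"])

# Few-Shot: provide a handful of labeled examples as context.
context = [("I loved the food", "Positive"), ("The room was dirty", "Negative")]
labels, mapping = generate_text_labels_context(texts, context)
print(mapping)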

Hugging Face — Google Flan-T5:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

# Zero-Shot Approach
# Directly asking the model to label the text.
def generate_text_labels(text, categories):
    labels = []
    text_label_mapping = {}

    # String of categories in which you want to classify the text.
    category_str = ", ".join(map(str, categories))

    # Sample Prompt
    # Example: I am happy today; Classify this sentence as Positive, Negative or Neutral in one word.
    for i in range(len(text)):
        input_text = f"{text[i]}; Classify this sentence as {category_str} in one word."
        input_ids = tokenizer(input_text, return_tensors="pt").input_ids

        outputs = model.generate(input_ids)
        # skip_special_tokens=True drops <pad> and </s> tokens from the decoded label.
        label = tokenizer.decode(outputs[0], skip_special_tokens=True)
        labels.append(label)
        text_label_mapping[text[i]] = label

    return labels, text_label_mapping

# Few-Shot Learning
# context = [("Sentence", "Category/Sentiment")]
def generate_text_labels_context(text, context):
    labels = []
    text_label_mapping = {}

    # Examples to help the model understand the task and context.
    context_string = ""
    for i in range(len(context)):  # Iterate over the context examples, not the unlabeled text.
        context_string += f"Text: {context[i][0]}\nSentiment: {context[i][1]}\n"
        # OR
        # context_string += f"Text: {context[i][0]}\nCategory: {context[i][1]}\n"

    # Sample Prompt
    """
    Text: A
    Category/Sentiment: X
    Text: B
    Category/Sentiment: Y
    Text: C
    Category/Sentiment:
    """
    for i in range(len(text)):
        input_text = f"{context_string} \nBased on the above examples determine the sentiment of the following sentence. \nText: {text[i]} \nSentiment:"
        # OR
        # input_text = f"{context_string} \nBased on the above examples determine the category of the following sentence. \nText: {text[i]} \nCategory:"
        input_ids = tokenizer(input_text, return_tensors="pt").input_ids

        outputs = model.generate(input_ids)
        # skip_special_tokens=True drops <pad> and </s> tokens from the decoded label.
        label = tokenizer.decode(outputs[0], skip_special_tokens=True)
        labels.append(label)
        text_label_mapping[text[i]] = label

    return labels, text_label_mapping
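The call pattern is the same as for the GPT version, except that inference runs locally and no API key is needed (again, the sentences are made up):

texts = ["I am happy today", "The service was terrible"]
labels, mapping = generate_text_labels(texts, ["Positive", "Negative", "Neutral"])
print(mapping)  # Inference runs on the local Flan-T5 model.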

Based on the output of the LLM, the prompt can be modified to get more accurate results. In some cases, the model's output might be correct but not in the required format; in that case, you might have to do some basic output post-processing to get the results in the required format.
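As a rough sketch of such post-processing (the helper below is hypothetical, not part of any library), you could map the model's free-form output onto your fixed label set:

def normalize_label(raw_output, categories):
    # Lowercase and strip punctuation/whitespace before matching.
    cleaned = raw_output.strip().strip(".").lower()
    for category in categories:
        # Accepts outputs like "Positive", "positive.", or "The sentiment is positive".
        if category.lower() in cleaned:
            return category
    # Sentinel for outputs that match no category; review these manually.
    return "UNKNOWN"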

Conclusion

In conclusion, LLMs have proven to be effective tools for labeling text data. The two main approaches for using LLMs for text labeling are zero-shot and few-shot learning. The article also covered the implementation of both approaches using Hugging Face's transformers library and OpenAI's chat completions API. We can leverage these tools to quickly and accurately label large amounts of text data, saving valuable time and resources.

I hope this blog helps you streamline and automate many text data labeling tasks. With the availability of pre-trained models and easy-to-use APIs, we can apply these techniques to a wide range of applications (text summarization, NER, etc.) and domains, making them an essential tool for any data scientist or machine learning practitioner. If you have enjoyed reading this blog, do consider following me on Medium for similar content and connecting with me on LinkedIn!

Check out my other articles here!
