Automatic Labeling of Text for NLP

Abhishek Pawar
AlgoAnalytics
Dec 3, 2021

Text Annotation for NLP is a Complex and Tedious Task

With the recent explosion of social media, news, blog posts, online forums, and internet content in general, huge amounts of data are generated daily. Given its Velocity, Volume, and Variety (the 3 Vs of Big Data), enterprises want to use this data to generate profit, enhance user experience, and make smarter decisions.

Text is one of the fastest-growing data types on the internet, and this has driven significant developments in Natural Language Processing (NLP) over the past few years. However, most of the time the text data is not labeled for the business use case at hand. Usually, a human annotator (or Subject Matter Expert) is required to label the text data before it can be used in machine learning algorithms.

This blog describes the challenges in labeling text data and compares two potential approaches. To explain both, let us use the context of one specific project we completed for a client: the problem of “detecting confusion” in answers given by users on an e-learning platform (binary classification). On this platform, users subscribe to a program and submit their answers to questions. The questions are open-ended, allowing users to think and respond. A few examples are:

What makes you more productive?

Think of the most effective people you know. Why are they so good at work?

Naïve Approach: Keyword-Based Labeling

Sometimes the “signal” you want to capture from the data can be detected with specific words or phrases, which can help you label the dataset. For example, in our case of confusion detection, we can simply think of words such as confused, unclear, misunderstood, uncertain, or phrases like I don’t know, I can’t understand this, etc.

With such keywords and their synonyms, we can simply label as “Confused” any sentence that contains these keywords or phrases, as the sketch below illustrates. However, what if a sentence says “I am not confused at the moment”? Even though it clearly says the person is not confused, it will be labeled as Confused by the keyword-based strategy! Keyword-based labeling does not consider the context of a sentence, and that is the biggest limitation of this approach.
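
Here is a minimal sketch of keyword-based labeling (the keyword list and helper function below are illustrative assumptions, not the exact project code):

```python
# Minimal sketch of keyword-based labeling; the keyword list is illustrative, not exhaustive.
CONFUSION_KEYWORDS = [
    "confused", "unclear", "misunderstood", "uncertain",
    "i don't know", "can't understand",
]

def keyword_label(answer: str) -> str:
    """Label an answer as Confused if it contains any keyword, ignoring context."""
    text = answer.lower()
    return "Confused" if any(kw in text for kw in CONFUSION_KEYWORDS) else "Not Confused"

print(keyword_label("I am really unclear about what is expected here."))  # Confused
print(keyword_label("I am not confused at the moment."))  # also Confused -- wrong, context is ignored
```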

Smarter Approach: Zero-Shot Topic Classification!


Zero-Shot Learning aims to assign a label to the data irrespective of the domain the model has been trained on! The idea is that the model takes a sentence and a hypothesis as input, then decides whether the hypothesis follows from (is entailed by) the sentence.

For example, consider the sentence “I love this food!”. The candidate hypotheses would be: the sentence is positive, the sentence is negative, or the sentence is neutral. The classifier model then scores the relationship between the sentence and each hypothesis. The scores for the hypotheses are “softmaxed” [1] to find the most relevant class for the input sentence, as sketched below.
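
As a rough sketch of this entailment idea, each hypothesis can be scored against the sentence with an off-the-shelf NLI checkpoint (the facebook/bart-large-mnli model and the hypothesis template below are assumptions for illustration):

```python
# Sketch: score each candidate label as an entailment hypothesis, then softmax the scores.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "facebook/bart-large-mnli"  # an NLI model commonly used for zero-shot classification
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "I love this food!"
candidate_labels = ["positive", "negative", "neutral"]

entailment_scores = []
with torch.no_grad():
    for label in candidate_labels:
        hypothesis = f"This sentence is {label}."
        inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        logits = model(**inputs).logits  # order for this model: [contradiction, neutral, entailment]
        entailment_scores.append(logits[0, 2])

# Softmax over the entailment scores gives a probability for each candidate label.
probs = torch.softmax(torch.stack(entailment_scores), dim=0)
for label, p in zip(candidate_labels, probs):
    print(f"{label}: {p.item():.3f}")
```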

You can refer to the Paper [2], Model Hub of Hugging Face [3], and an amazing free demo [4] for more detailed information.

Zero-Shot Classification in Action!

The code for Zero-Shot Classification is very intuitive and easy to use (thanks to the Hugging Face team). The following example uses the BART-Large model for classification. You can see the magic in just a few lines of code!
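
In its simplest form, a sketch with the pipeline API looks like this (the candidate labels are assumptions for our confusion-detection task):

```python
# Sketch: zero-shot classification with the Hugging Face pipeline and the BART-Large MNLI checkpoint.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "I don't really get what this question is asking me to do.",
    candidate_labels=["confused", "not confused"],
)
print(result["labels"])  # labels sorted by confidence
print(result["scores"])  # corresponding confidence scores
```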

The following code labels the first five rows of the data frame and prints the confidence scores for the input labels on the terminal.
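
A sketch of that loop, assuming a pandas DataFrame loaded from a hypothetical answers.csv with an "answer" column:

```python
# Sketch: label the first five answers and print the confidence score for each candidate label.
import pandas as pd
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
candidate_labels = ["confused", "not confused"]

df = pd.read_csv("answers.csv")  # hypothetical input file with an "answer" column

for answer in df["answer"].head(5):
    result = classifier(answer, candidate_labels=candidate_labels)
    scores = ", ".join(
        f"{label}: {score:.3f}" for label, score in zip(result["labels"], result["scores"])
    )
    print(f"{scores}  |  {answer}")
```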

Once you have a confidence score for your labels, you can apply a decision threshold (like 0.5) to assign the final label to each user's answer.
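
Continuing from the DataFrame sketch above (classifier and df already defined; column names are assumptions), thresholding could look like this:

```python
# Sketch: assign a final label by thresholding the zero-shot score for "confused".
THRESHOLD = 0.5  # decision threshold; tune for your problem

def confused_score(answer: str) -> float:
    """Return the zero-shot confidence for the 'confused' label."""
    result = classifier(answer, candidate_labels=["confused", "not confused"])
    return dict(zip(result["labels"], result["scores"]))["confused"]

df["confused_score"] = df["answer"].apply(confused_score)
df["label"] = (df["confused_score"] >= THRESHOLD).map(
    {True: "Confused", False: "Not Confused"}
)
```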

It takes ~2 hours to label ~100,000 sentences on an Nvidia GeForce GTX 1060 GPU.

Verifying Accuracy of Labeling: How do you know the generated labels are correct?

You could randomly take 100–150 samples of labeled data and check whether the labels are assigned correctly. If you find many False Positives (Not Confused answers assigned as Confused) in the dataset, you can increase the decision threshold from 0.5 to 0.75 (or tune it for the problem at hand).
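
A small sketch of pulling a review sample and re-labeling with a stricter threshold (sample size and column names are assumptions, continuing from the snippets above):

```python
# Sketch: export a random sample for manual review, then re-label with a stricter threshold if needed.
review_sample = df.sample(n=150, random_state=42)[["answer", "confused_score", "label"]]
review_sample.to_csv("manual_review_sample.csv", index=False)

# If the review shows too many false positives, raise the threshold and re-assign labels.
df["label"] = (df["confused_score"] >= 0.75).map({True: "Confused", False: "Not Confused"})
```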

How do you know the label distributions are correct?

We shared our findings and data with the stakeholders to verify that the label distribution makes sense (based on their past experience). With the right questions asked of the team, we can be confident that the model has approximately captured the correct “signal” for the labels. There will be some proportion of noisy labels in the labeled dataset, but as long as that proportion is small, it won’t affect the model much.
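
A quick way to eyeball the distribution before sharing it (assuming the "label" column from the sketches above):

```python
# Sketch: proportion of Confused vs. Not Confused labels in the auto-labeled dataset.
print(df["label"].value_counts(normalize=True))
```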

How do you evaluate the classification model?

Since both the training and testing sets carry labels generated by the BART zero-shot classifier, the measured performance of the model will look very high! So it is a good idea to build a Golden Test Set, which consists of a small set of human-labeled data.

The Golden Test Set will contain unseen data points that are used to calculate model metrics like the F1 Score, Confusion Matrix, etc. Assuming it takes ~30 seconds to label a sentence as Confused or Not Confused, you could easily label 100+ sentences per hour. With a small team to help build the dataset, you could easily have 1,000 data points in the Golden Test Set (but be careful of human bias in labeling!).
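
A sketch of evaluating against the Golden Test Set with scikit-learn (the file name, column names, and trained_model object are assumptions):

```python
# Sketch: compare the downstream classifier's predictions with human labels on the Golden Test Set.
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

golden = pd.read_csv("golden_test_set.csv")        # hypothetical human-labeled file
y_true = golden["human_label"]                     # "Confused" / "Not Confused" from annotators
y_pred = trained_model.predict(golden["answer"])   # hypothetical trained downstream classifier

print(confusion_matrix(y_true, y_pred, labels=["Confused", "Not Confused"]))
print(classification_report(y_true, y_pred, digits=3))
```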

Closing thoughts

The objective of this blog was to introduce the power of zero-shot classification for labeling a dataset smartly. This is a very powerful and effective technique in Natural Language Processing. We are working on this and several other cutting-edge AI techniques, delivering value to our customers.

We at AlgoAnalytics take pride in being the One Stop AI Shop®. We innovate and build scalable ML solutions. To check out our amazing demos, please visit https://onestop.ai, and for further information, please contact info@algoanalytics.com.

References

  1. Softmax Function: https://en.wikipedia.org/wiki/Softmax_function
  2. Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach: https://arxiv.org/abs/1909.00161
  3. Hugging Face Model Library: https://huggingface.co/models?filter=zero-shot-classification
  4. Zero Shot Demo: https://huggingface.co/zero-shot/
