NLP - Text Preprocessing - Parts of Speech (POS) Tags (Part 4)

Chandu Aki
Published in The Deep Hub
7 min read · Feb 17, 2024

Introduction to Linguistic Concepts in NLP:

In the vast realm of Natural Language Processing (NLP), understanding the structure and meaning of language is paramount. Before delving into the intricacies of Part of Speech (POS) tagging, let’s embark on a journey through some foundational linguistic concepts that lay the groundwork for comprehending the nuances of language.

Morphology:

  • Morphology explores the internal structure of words, focusing on how they are formed and modified through affixes, roots, and stems.
  • Example: In the word “unhappiness,” “un-” is a prefix indicating negation, and “-ness” is a suffix denoting a state or quality.
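As a rough illustration of affix analysis, the sketch below peels one known prefix and one known suffix off a word. The affix lists and the split_affixes helper are made up for this example. A real morphological analyzer would also handle spelling changes (such as "happy" becoming "happi") and far more affixes.

```python
# Toy affix splitter; PREFIXES, SUFFIXES and split_affixes are illustrative only.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ness", "ment", "ly")

def split_affixes(word):
    """Return (prefix, stem, suffix); missing parts are empty strings."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

print(split_affixes("unhappiness"))  # ('un', 'happi', 'ness')
```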

Syntax:

  • Syntax deals with the arrangement and relationships of words to form grammatically correct sentences.
  • Example: In the sentence “The cat chased the mouse,” the syntax dictates the order in which the words are arranged to convey meaning.

Semantics:

  • Semantics explores the meaning of words and how they combine to form meaningful sentences.
  • Example: The word “run” has a different meaning when used in the context of “She likes to run a marathon” versus “She will run an errand.”

Now, let’s venture into the fascinating world of Part of Speech (POS) tagging:

Part of Speech (POS) Tagging:

  • POS tagging is the process of assigning grammatical categories or “tags” to each word in a sentence based on its syntactic role.
  • The development of POS tagging can be traced back to linguistic theories and the need for machines to understand the grammatical structure of language.
  • POS tags are short codes representing different grammatical categories assigned to words.

Common tags include:

  • Noun (NN): Represents a person, place, thing, or idea.
  • Verb (VB): Denotes an action or state of being.
  • Adjective (JJ): Describes or modifies a noun.
  • Adverb (RB): Modifies a verb, an adjective, or another adverb.

Example 1:

sentence = "The delicious meal"

output = [('The', 'DT'), ('delicious', 'JJ'), ('meal', 'NN')]

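Understanding the abbreviations:

[Table: 30 commonly used POS tags and their abbreviations]

These 30 commonly used POS tags cover a wide range of grammatical categories, providing a foundation for understanding the roles of words in sentences during natural language processing. The tags used in this article (DT, JJ, NN, VBD, and so on) come from the Penn Treebank tag set, which NLTK’s default English tagger produces.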
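To make the abbreviations easier to read, here is a small lookup sketch covering only a handful of Penn Treebank tags. The PENN_TAGS dictionary and the describe helper are names invented for this example, and the list is far from the full tag set.

```python
# Descriptions follow the Penn Treebank tag set; only a handful are listed here.
PENN_TAGS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "RB": "adverb",
}

def describe(tagged):
    """Pair each (word, tag) tuple with a readable description of the tag."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown tag")) for word, tag in tagged]

for row in describe([("The", "DT"), ("delicious", "JJ"), ("meal", "NN")]):
    print(row)
```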

Example 2:

sentence = "The curious cat explored the mysterious garden."

output = [('The', 'DT'), ('curious', 'JJ'), ('cat', 'NN'), ('explored', 'VBD'), ('the', 'DT'), ('mysterious', 'JJ'), ('garden', 'NN'), ('.', '.')]

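Do I need to learn the parts of speech in English to generate these tags?

No. A pre-trained tagger assigns them for you, although knowing the categories helps when you interpret the output.

How are these part-of-speech (POS) tags generated for a given sentence?

NLTK (the Natural Language Toolkit) provides the pos_tag() function for this.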

Explanation of pos_tag() Function:

In the realm of Natural Language Processing (NLP), the pos_tag() function is a powerful tool provided by the NLTK (Natural Language Toolkit) library. This function is specifically designed to perform Part of Speech (POS) tagging on a sequence of words, typically a sentence.

How pos_tag() Works:

  1. Tokenization: pos_tag() expects a list of tokens, so the input text is first tokenized (for example with word_tokenize()), which means breaking the text into individual words or tokens.
  2. POS Tagging: The pos_tag() function assigns a POS tag to each token in the input sequence, indicating the grammatical category or role of the word in the sentence.
  3. Output: The result is a list of tuples, where each tuple contains a word from the input text along with its corresponding POS tag.

Now, let’s see the pos_tag() function in action with Python code using NLTK:

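import nltk

# One-time download of the tokenizer and tagger data.
# Resource names vary by NLTK version (newer releases use "punkt_tab"
# and "averaged_perceptron_tagger_eng").
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

# Sample sentence
sentence = "The curious cat explored the mysterious garden."
# Tokenization
tokens = nltk.word_tokenize(sentence)
# POS tagging
pos_tags = nltk.pos_tag(tokens)
# Display the result
print(pos_tags)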

Output:

[('The', 'DT'), ('curious', 'JJ'), ('cat', 'NN'), ('explored', 'VBD'), ('the', 'DT'), ('mysterious', 'JJ'), ('garden', 'NN'), ('.', '.')]
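Once you have this list of tuples, a common next step is to filter or group words by tag. The sketch below works directly on the output shown above, so it does not need NLTK. It relies on the Penn Treebank convention that noun tags start with NN, verb tags with VB, and adjective tags with JJ. The group_by_prefix helper is a name made up for this example.

```python
from collections import defaultdict

# Tagged output copied from the example above.
pos_tags = [('The', 'DT'), ('curious', 'JJ'), ('cat', 'NN'), ('explored', 'VBD'),
            ('the', 'DT'), ('mysterious', 'JJ'), ('garden', 'NN'), ('.', '.')]

def group_by_prefix(tagged, prefixes=("NN", "VB", "JJ")):
    """Group words under the first matching tag prefix (NN covers NN, NNS, NNP, ...)."""
    groups = defaultdict(list)
    for word, tag in tagged:
        for prefix in prefixes:
            if tag.startswith(prefix):
                groups[prefix].append(word)
                break
    return dict(groups)

print(group_by_prefix(pos_tags))
# {'JJ': ['curious', 'mysterious'], 'NN': ['cat', 'garden'], 'VB': ['explored']}
```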

Although it seems easy, identifying part-of-speech tags is much more complicated than simply mapping each word to a single tag.

Why is it difficult?

Words often have more than one possible POS tag. Let’s understand this with an easy example.

In the phrases below, focus on the word “back”:

  • “the back door”: adjective (JJ)
  • “on my back”: noun (NN)
  • “win the voters back”: adverb (RB)
  • “promised to back the bill”: verb (VB)

The relationship of “back” to the adjacent and related words in a phrase, sentence, or paragraph changes its POS tag.

It is quite possible for a single word to have a different part of speech tag in different sentences based on different contexts. That is why it is very difficult to have a generic mapping for POS tags.
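To see why context matters, consider the word "back" in four short phrases. The sketch below uses a few hand-written rules for that single word. The tag_back function and the DETERMINERS set are hypothetical and exist only for illustration. This is not how NLTK's tagger works, since real taggers learn such patterns from annotated data.

```python
# Hand-written rules for a single word, purely to illustrate context dependence.
# tag_back and DETERMINERS are hypothetical; real taggers learn this from data.
DETERMINERS = {"the", "a", "an", "my", "your", "his", "her", "our", "their"}

def tag_back(tokens, i):
    """Guess the Penn Treebank tag of tokens[i] == 'back' from its neighbours."""
    prev_word = tokens[i - 1].lower() if i > 0 else ""
    has_next = i + 1 < len(tokens)
    if prev_word == "to":
        return "VB"                        # "promised to back the bill"
    if prev_word in DETERMINERS:
        return "JJ" if has_next else "NN"  # "the back door" vs "on my back"
    return "RB"                            # "win the voters back"

examples = [
    "the back door",
    "on my back",
    "win the voters back",
    "promised to back the bill",
]
for text in examples:
    tokens = text.split()
    print(text, "->", tag_back(tokens, tokens.index("back")))
```

A plain word-to-tag dictionary could store only one of these four tags for "back", which is exactly the limitation described above.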

Applications / Industry Use Cases:

  • Information Retrieval: POS tagging helps search engines understand the relationships between words, improving the accuracy of search results.
  • Grammar Checking: In language processing tools, POS tagging aids in identifying grammatical errors and suggesting corrections.
  • Machine Translation: Understanding the syntactic role of words enhances the accuracy of translating sentences between languages.
  • Syntactic Analysis: POS tagging is crucial for understanding the grammatical structure of a sentence, enabling syntactic analysis. It helps identify the subject, verb, object, and other syntactic elements.
  • Semantic Analysis: POS tags contribute to understanding the meaning of words in context. For example, distinguishing between a noun and a verb can significantly impact the interpretation of a sentence.
  • Named Entity Recognition (NER): POS tags play a role in named entity recognition by providing information about the grammatical category of words. For example, recognizing that “New York” is a proper noun.
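As one small example of how POS tags feed a downstream task, the sketch below joins consecutive proper-noun tokens (NNP or NNPS) into candidate entity strings. The input is tagged by hand so the example does not depend on any tagger, and proper_noun_spans is a name made up for this illustration. A real NER system does much more than this.

```python
def proper_noun_spans(tagged):
    """Join consecutive NNP/NNPS tokens into candidate entity strings."""
    spans, current = [], []
    for word, tag in tagged:
        if tag in ("NNP", "NNPS"):
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# Hand-tagged input so the example does not depend on any tagger.
tagged = [("She", "PRP"), ("moved", "VBD"), ("to", "TO"), ("New", "NNP"),
          ("York", "NNP"), ("in", "IN"), ("May", "NNP"), (".", ".")]
print(proper_noun_spans(tagged))  # ['New York', 'May']
```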

Workflow of POS Tagging in NLP

A typical POS tagging workflow in NLP involves the following steps:

  • Tokenization: Divide the input text into discrete tokens, which are usually units of words or subwords. The first stage in NLP tasks is tokenization.
  • Loading Language Models: To utilize a library such as NLTK or SpaCy, be sure to load the relevant language model. These models offer a foundation for comprehending a language’s grammatical structure since they have been trained on a vast amount of linguistic data.
  • Text Processing: If required, preprocess the text to handle special characters or eliminate superfluous information. Clean text helps accurate POS tagging, but convert to lowercase with care, since capitalization is a useful cue for proper nouns.
  • Linguistic Analysis: To determine the text’s grammatical structure, use linguistic analysis. This entails understanding each word’s purpose inside the sentence, including whether it is an adjective, verb, noun, or other.
  • Part-of-Speech Tagging: Assign a grammatical category (noun, verb, adjective, and so on) to each token, using the context established by the linguistic analysis.
  • Results Analysis: Verify the accuracy and consistency of the PoS tagging findings with the source text. Determine and correct any possible problems or mistagging.
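The steps above can be sketched end to end in plain Python. This is a minimal sketch under strong assumptions: a tiny hand-made LEXICON stands in for a trained language model, and tokenize, tag, and accuracy are helper names invented here. In practice NLTK or SpaCy would handle the tokenizing and tagging.

```python
import re

# Tiny hand-made lexicon standing in for a trained model (hypothetical).
LEXICON = {"the": "DT", "curious": "JJ", "cat": "NN", "explored": "VBD",
           "mysterious": "JJ", "garden": "NN", ".": "."}

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens, default="NN"):
    """Look each token up case-insensitively; unknown words get the default tag."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the reference tag."""
    correct = sum(p[1] == g[1] for p, g in zip(predicted, gold))
    return correct / len(gold)

tokens = tokenize("The curious cat explored the mysterious garden.")
predicted = tag(tokens)
gold = [("The", "DT"), ("curious", "JJ"), ("cat", "NN"), ("explored", "VBD"),
        ("the", "DT"), ("mysterious", "JJ"), ("garden", "NN"), (".", ".")]
print(predicted)
print("accuracy:", accuracy(predicted, gold))
```

The final comparison against hand-checked reference tags corresponds to the results analysis step: any mismatch points at a token worth inspecting.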

Types of POS tagging

Each type of POS tagging method, such as rule-based, statistical (stochastic), and transformation-based tagging, has its own set of characteristics and is suited to different scenarios in natural language processing.
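One of the simplest statistical approaches is a unigram tagger, which gives every word the tag it carried most often in the training data. The sketch below trains one on a tiny made-up corpus. The names train_unigram and unigram_tag are invented for this example, and a real tagger would be trained on a large annotated corpus.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def unigram_tag(tokens, model, default="NN"):
    """Tag tokens with the learned mapping; unseen words fall back to default."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

# Tiny made-up training set; a real tagger would use a large annotated corpus.
training = [
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("my", "PRP$"), ("back", "NN"), ("hurts", "VBZ")],
    [("his", "PRP$"), ("back", "NN"), ("aches", "VBZ")],
]
model = train_unigram(training)
print(unigram_tag(["the", "back", "door"], model))
# [('the', 'DT'), ('back', 'NN'), ('door', 'NN')]
```

Note that "back" comes out as NN even in "the back door", because a unigram tagger ignores context. Methods that look at neighbouring words or tags exist to fix exactly this kind of error.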

Advantages of POS Tagging:

  • Syntactic Analysis: POS tagging facilitates syntactic analysis by providing information about the grammatical roles of words in sentences.
  • Improved Information Retrieval: Enhances the accuracy of information retrieval systems by considering the syntactic structure of queries and documents.
  • Grammar Checking: Aids in grammar checking tools by identifying and correcting grammatical errors in text.
  • Machine Translation: Improves the accuracy of machine translation systems by considering the syntactic roles of words in source and target languages.
  • Text Summarization: Supports text summarization algorithms by identifying and extracting important information based on grammatical roles.
  • Named Entity Recognition (NER): POS tagging is a crucial step in NER, helping identify and classify named entities such as names, locations, and organizations.
  • Part of NLP Pipelines: Integrates seamlessly into broader NLP pipelines, contributing to tasks such as sentiment analysis, information extraction, and more.

Disadvantages of POS Tagging:

  • Ambiguity: Words often have multiple meanings, leading to ambiguity in POS tagging. A word may serve as different parts of speech in different contexts.
  • Out-of-Vocabulary Words: Struggles with words not present in the training data, leading to challenges when dealing with newly coined terms or evolving language.
  • Domain Specificity: Performance may vary across different domains, and POS taggers trained on general corpora may not perform optimally for specific industries or topics.
  • Languages with Limited Resources: For languages with limited linguistic resources or non-standard grammar, building accurate POS taggers can be challenging.
  • Complexity in Morphologically Rich Languages: Morphologically rich languages with complex word forms can pose challenges for POS tagging due to variations in inflections and word forms.
  • Training Data Dependency: Performance heavily depends on the quality and representativeness of the training data. Inadequate or biased training data may lead to inaccurate tagging.
  • Parsing Errors: Errors in POS tagging can propagate to downstream parsing tasks, impacting the overall accuracy of NLP applications.
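To make the out-of-vocabulary problem concrete, here is one common workaround sketched in plain Python: when a word is missing from the lexicon, guess its tag from its ending. The SUFFIX_RULES list and the guess_tag helper are illustrative assumptions, not an exhaustive or authoritative rule set, and such guesses will sometimes be wrong.

```python
# Suffix heuristics for unseen words; the rule list is illustrative, not exhaustive.
SUFFIX_RULES = [("ing", "VBG"), ("ed", "VBD"), ("ly", "RB"), ("ness", "NN"), ("s", "NNS")]

def guess_tag(word, lexicon, default="NN"):
    """Use the lexicon when possible, otherwise guess from the word ending."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key]
    for suffix, tag in SUFFIX_RULES:
        if key.endswith(suffix):
            return tag
    return default

lexicon = {"the": "DT", "cat": "NN"}
for word in ["the", "googling", "blogged", "cats", "selfie"]:
    print(word, guess_tag(word, lexicon))
```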
