Text-PreProcessing — What is Lemmatization in NLP?

TejasH MistrY
3 min read · Apr 5, 2024

Delve into the concept of lemmatization in Natural Language Processing (NLP) and its significance in simplifying text analysis. Learn how lemmatization reduces words to their base forms and aids in various NLP tasks.

What is Lemmatization in NLP?

Lemmatization is the process of reducing words to their base or canonical form, known as the lemma. NLTK provides a ready-made tool for it.

Unlike stemming, which simply chops off affixes from words, lemmatization takes into account the morphological analysis of the words and returns a valid lemma.
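
To make the contrast concrete, here is a minimal toy sketch in plain Python. It is not how NLTK works internally: `naive_stem`, `toy_lemmatize`, and the `TOY_LEMMAS` dictionary are hypothetical names invented for this illustration, and the suffix list is deliberately tiny.

```python
# Toy illustration only: real stemmers and lemmatizers are far more complete.

def naive_stem(word: str) -> str:
    """Chop a common suffix off the end of a word, with no dictionary check."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
TOY_LEMMAS = {"studies": "study", "running": "run", "went": "go"}

def toy_lemmatize(word: str) -> str:
    """Look the word up and return its lemma, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ["studies", "running", "went"]:
    print(f"{w}\t stem: {naive_stem(w)}\t lemma: {toy_lemmatize(w)}")
```

The stems come out as "stud", "runn", and "went", none of which is the dictionary form, while the lookup returns the valid lemmas "study", "run", and "go".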

NLTK provides a WordNetLemmatizer class, which is a thin wrapper around the WordNet corpus. This class uses the morphy() function of the WordNetCorpusReader class to find a lemma. If the word cannot be found in WordNet, it is returned unchanged.

We can use the WordNet lemmatizer in applications such as question answering, chatbots, and text summarization.

Morphological analysis is the process of breaking down words into their constituent parts, known as morphemes, and analyzing their structure, form, and meaning within a language. Morphemes are the smallest units of meaning in a language, and they can be divided into two main types: roots and affixes.

Here’s a simplified explanation of morphological analysis:

  1. Roots: Roots are the core or base components of words that carry the main meaning. For example, in the word “playful,” the root is “play,” which conveys the idea of engaging in an activity for enjoyment.
  2. Affixes: Affixes are added to roots to modify their meaning or grammatical function. There are two types of affixes: prefixes (added at the beginning of a word) and suffixes (added at the end of a word). For example, in the word “unhappy,” the prefix “un-” negates the meaning of the root “happy,” while in the word “running,” the suffix “-ing” indicates ongoing action.

During morphological analysis, linguists or language processing algorithms examine words to identify their roots, prefixes, suffixes, and any other morphological components. This process helps in understanding how words are formed, their grammatical functions, and their relationships with other words in a language.
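
As a rough sketch of the idea, the toy function below splits a word into prefix, root, and suffix using two short, hand-picked affix lists. The name `split_morphemes` and both lists are assumptions made up for this example. A real morphological analyzer relies on dictionaries and language-specific rules.

```python
# Hand-picked affix lists, for illustration only.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ful", "ing", "ness", "ly")

def split_morphemes(word: str) -> dict:
    """Naively split a word into prefix, root, and suffix."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s)), ""
    )
    root = rest[: len(rest) - len(suffix)] if suffix else rest
    return {"prefix": prefix, "root": root, "suffix": suffix}

for w in ["playful", "unhappy", "unhelpful"]:
    print(w, split_morphemes(w))
```

This works for "playful" (root "play") and "unhappy" (root "happy"), but it returns the root "runn" for "running" because it knows nothing about doubled consonants. Gaps like this are why practical tools lean on a lexicon such as WordNet.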

In natural language processing (NLP), morphological analysis underpins preprocessing steps such as tokenization, stemming, and lemmatization. It helps reduce words to their base forms, which in turn supports text normalization, information retrieval, and machine translation.

Before we move to the code, let's first understand what verbs, adjectives, and adverbs are.

1. Verbs: Verbs are action words that describe what someone or something is doing. They represent actions, events, or states of being. Verbs answer the questions “What is happening?” or “What did someone do?”

For example:

  • Action verbs: run, jump, eat, sleep
  • State-of-being verbs: is, am, are, was, were

2. Adjectives: Adjectives are words that describe or modify nouns (people, places, things, or ideas) by providing additional information about their qualities or characteristics. Adjectives answer the question “What kind?” or “Which one?”

For example:

  • Descriptive adjectives: happy, tall, blue, delicious
  • Demonstrative adjectives: this, that, these, those

3. Adverbs: Adverbs are words that modify verbs, adjectives, or other adverbs by providing information about how, when, where, or to what extent an action or quality occurs. Adverbs answer questions like “How?” “When?” “Where?” or “To what extent?”

For example:

  • Adverbs of manner: quickly, slowly, carefully, loudly
  • Adverbs of time: now, later, yesterday, soon
  • Adverbs of place: here, there, nearby, outside
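
NLTK's lemmatizer takes the part of speech as a single-letter code: 'v' for verbs, 'a' for adjectives, 'r' for adverbs, and 'n' for nouns. If your words already carry Penn Treebank-style tags from a part-of-speech tagger (an assumption here, since no tagger is used in this article), a small helper like the hypothetical `to_wordnet_pos` below could translate them.

```python
def to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank-style tag (assumed input) to a WordNet POS code."""
    if tag.startswith("V"):   # VB, VBD, VBG, ...
        return "v"
    if tag.startswith("J"):   # JJ, JJR, JJS
        return "a"
    if tag.startswith("R"):   # RB, RBR, RBS
        return "r"
    return "n"                # treat everything else as a noun

for tag in ["VBG", "JJR", "RB", "NNS"]:
    print(tag, "->", to_wordnet_pos(tag))
```

Falling back to 'n' mirrors the lemmatizer's own default, which treats a word as a noun when no part of speech is given.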

The following code lemmatizes words categorized as verbs, adjectives, and adverbs:

import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet data the lemmatizer depends on (needed only once)
nltk.download("wordnet", quiet=True)

# Create an instance of WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Examples of words categorized as verbs
verbs = ["running", "went", "eating"]

# Examples of words categorized as adjectives
adjectives = ["better", "worst", "faster"]

# Examples of words categorized as adverbs
adverbs = ["quickly", "slowly", "hardly"]

# Lemmatize verbs
print("Lemmatized Verbs:")
for verb in verbs:
    lemma = lemmatizer.lemmatize(verb, pos='v')  # 'v' indicates verb
    print(f"Original: {verb}\t Lemma: {lemma}")

# Lemmatize adjectives
print("\nLemmatized Adjectives:")
for adjective in adjectives:
    lemma = lemmatizer.lemmatize(adjective, pos='a')  # 'a' indicates adjective
    print(f"Original: {adjective}\t Lemma: {lemma}")

# Lemmatize adverbs
print("\nLemmatized Adverbs:")
for adverb in adverbs:
    lemma = lemmatizer.lemmatize(adverb, pos='r')  # 'r' indicates adverb
    print(f"Original: {adverb}\t Lemma: {lemma}")

Output:

Lemmatized Verbs:
Original: running Lemma: run
Original: went Lemma: go
Original: eating Lemma: eat

Lemmatized Adjectives:
Original: better Lemma: good
Original: worst Lemma: bad
Original: faster Lemma: fast

Lemmatized Adverbs:
Original: quickly Lemma: quickly
Original: slowly Lemma: slowly
Original: hardly Lemma: hardly
