Stemming vs Lemmatization

The ultimate resource for learning the fundamentals of stemming and lemmatization.

4 min read · May 24, 2022

You are probably familiar with Natural Language Processing (NLP); if not, you may want to read my article “Understanding Natural Language Processing (NLP)” first. This article walks through where stemming and lemmatization fit in and how to use them.

NLP is commonly used to analyze text data. A machine learning model cannot work with raw text directly, so the text must first be converted into vectors (numerical representations) that the model can process. Converting text into vectors is therefore a critical step.

Before doing so, we usually reduce each word to a simpler base form, using techniques such as stemming and lemmatization. These base forms are then transformed into vectors. Search engines and chatbots use stemming and lemmatization to treat different forms of a word as the same term. Normalizing words in this way is therefore one of the first steps in an NLP pipeline.
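As a rough illustration of what "transformed into vectors" can mean, here is a minimal bag-of-words sketch in plain Python. The function name bag_of_words and the sample documents are made up for this example, and real projects would normally use a library vectorizer instead.

```python
from collections import Counter

def bag_of_words(docs):
    """Turn tokenized documents into count vectors over a shared vocabulary."""
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocab = sorted({token for doc in docs for token in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # One position per vocabulary entry; missing tokens count as 0.
        vectors.append([counts[token] for token in vocab])
    return vocab, vectors

# Hypothetical documents whose words were already reduced to base forms.
docs = [["eat", "apple"], ["eat", "eat", "pear"]]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['apple', 'eat', 'pear']
print(vectors)  # [[1, 1, 0], [0, 2, 1]]
```

Because "eating", "eats", and "eat" would all have been reduced to "eat" beforehand, they share a single position in the vector instead of three.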

In short, both stemming and lemmatizing try to reduce words to their simplest form.

What is Stemming?

Stemming is the process of removing affixes from a word to obtain its base form, much like pruning a tree’s branches down to the trunk. For example, eating and eats both reduce to the stem eat. Ideally eaten would map there as well, although rule-based stemmers do not always handle irregular forms.

Search engines use stemming when indexing words. Rather than storing every form of a word, a search engine can store only the stems. This reduces the size of the index and lets a query match documents that use a different form of the same word, which typically improves recall.
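The indexing idea can be sketched with a tiny inverted index keyed by stems. Everything here is illustrative: TOY_STEMS is a hypothetical lookup table standing in for a real stemmer, and build_index is not how any particular search engine works.

```python
from collections import defaultdict

# Hypothetical stem table standing in for a real stemmer.
TOY_STEMS = {"eating": "eat", "eats": "eat", "eaten": "eat"}

def toy_stem(word):
    """Return the stem from the toy table, or the word itself if unknown."""
    return TOY_STEMS.get(word, word)

def build_index(docs):
    """Map each stem to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[toy_stem(word)].add(doc_id)
    return {stem: sorted(ids) for stem, ids in index.items()}

docs = {1: "She eats fish", 2: "He was eating", 3: "Fish swim"}
index = build_index(docs)

# A query for "eaten" is stemmed the same way and finds both documents.
print(index[toy_stem("eaten")])  # [1, 2]
```

The index holds one entry, "eat", instead of separate entries for "eats" and "eating", and a query in any of those forms reaches the same documents.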

To further grasp this, consider the following example.

To begin, we must import the natural language toolkit (nltk).

import nltk

Import the PorterStemmer class to implement the Porter Stemmer algorithm.

from nltk.stem import PorterStemmer

Then, as shown below, create an instance of the Porter Stemmer class.

ps = PorterStemmer()

Now pass in the word or words you want to stem.

words = ["active", "actives", "activate", "activated", "activating"]

for w in words:
    print(w, " : ", ps.stem(w))

# OUTPUT
active  :  activ
actives  :  activ
activate  :  activ
activated  :  activ
activating  :  activ

All five words, “active”, “actives”, “activate”, “activated”, and “activating”, are reduced to the same stem, “activ”. The stem “activ” is not a real word, and that is expected: the Porter algorithm works by stripping suffixes according to a fixed set of rules, without checking whether the result appears in a dictionary.
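To see why suffix stripping can leave a non-word behind, here is a deliberately simplified sketch. It is not the Porter algorithm: the SUFFIXES list and the min_stem threshold are assumptions chosen only so that this toy handles the five example words.

```python
# Hypothetical suffix list, longest first so "ating" is tried before "ate".
SUFFIXES = ["ating", "ated", "ate", "es", "e", "s"]

def toy_suffix_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no rule applied

words = ["active", "actives", "activate", "activated", "activating"]
for w in words:
    print(w, " : ", toy_suffix_stem(w))
# Every word prints the stem "activ".
```

The function never consults a dictionary, so nothing stops it from returning "activ". Real stemmers use far more careful rules, but they share this property.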

This is the process of stemming. Let’s jump into Lemmatization.

What is Lemmatization?

Lemmatization is similar to stemming, but whereas stemming does not always produce a meaningful word, lemmatization returns a valid dictionary form (the lemma) that is easy to interpret.

To further grasp this, consider the following example.

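To begin, we must import the natural language toolkit (nltk). The WordNet lemmatizer also needs the WordNet data, which can be downloaded once with nltk.download.

import nltk
nltk.download("wordnet")  # one-time download of the WordNet data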

Import the WordNetLemmatizer class to implement the lemmatizer algorithm.

from nltk.stem import WordNetLemmatizer

Create an instance of the lemmatizer class.

lemmatizer = WordNetLemmatizer()

Now pass in the word or words you want to lemmatize.

print("active", " : ", lemmatizer.lemmatize("active"))
print("actives", " : ", lemmatizer.lemmatize("actives"))
print("activate", " : ", lemmatizer.lemmatize("activate"))
print("activated", " : ", lemmatizer.lemmatize("activated"))
print("activating", " : ", lemmatizer.lemmatize("activating"))

# OUTPUT
active  :  active
actives  :  active
activate  :  activate
activated  :  activated
activating  :  activating

These results are real words, which makes them easier to interpret than stems. Note that lemmatize treats every word as a noun by default, which is why only “actives” changed; passing the part of speech, for example pos="v" for verbs, lets it reduce “activated” and “activating” to “activate”. In essence, lemmatization maps words that share a meaning but differ in form to one underlying form. This shrinks the vocabulary of the text, which helps produce cleaner features for machine learning and reduces memory and computation. The cleaner the data, the more accurate your machine learning model is likely to be. But wait, why are we doing this?

Why are stemming and lemmatization required?

When analyzing texts or sentences, it helps to identify the base form of each word. The base forms give you a sense of what the text expresses, for example when classifying restaurant reviews as good or bad, or deciding whether an email is spam.

Assume you’re working on a project that requires you to recognize the sentiment of movie reviews. You have the reviews and need to decide whether each one is favorable or unfavorable.

To do so, you first reduce each word to its root; the root words then help you analyze the sentiment. Without stemming or lemmatization, different forms of the same word are treated as unrelated features, which makes the task harder.
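A small sketch can show why normalization helps here. The LEXICON of sentiment words and the BASE_FORMS table below are hypothetical stand-ins for a real sentiment lexicon and a real stemmer or lemmatizer, and summing word scores is only one very simple way to estimate sentiment.

```python
# Hypothetical sentiment lexicon keyed by base forms only.
LEXICON = {"love": 1, "enjoy": 1, "hate": -1, "bore": -1}

# Hypothetical normalization table standing in for a stemmer or lemmatizer.
BASE_FORMS = {
    "loved": "love", "loves": "love", "enjoyed": "enjoy",
    "hated": "hate", "boring": "bore", "bored": "bore",
}

def sentiment_score(review):
    """Sum lexicon scores after reducing each token to its base form."""
    score = 0
    for token in review.lower().split():
        word = token.strip(".,!?")          # drop surrounding punctuation
        base = BASE_FORMS.get(word, word)   # normalize to the base form
        score += LEXICON.get(base, 0)       # unknown words contribute 0
    return score

print(sentiment_score("I loved it and enjoyed every scene!"))  # 2
print(sentiment_score("Boring. I hated it."))                  # -2
```

Without the normalization step, "loved" and "hated" would not match the lexicon entries "love" and "hate", and both reviews would score 0.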

Why is Lemmatization preferable to Stemming?

Lemmatization is the more powerful procedure because it takes the morphological analysis of words into account. It maps every inflected form of a word to its base form, the lemma. This requires considerable linguistic knowledge, in the form of dictionaries that record the correct form of each word. Stemming is a crude, rule-based process, whereas lemmatization looks up the correct form in a dictionary. As a result, lemmatization tends to produce better features for machine learning.
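The dictionary lookup can be illustrated with a toy lemmatizer. TOY_LEMMAS is a hypothetical miniature dictionary written for this example; a real lemmatizer such as NLTK’s relies on WordNet and morphological rules instead of a hand-written table.

```python
# Hypothetical miniature dictionary: (surface form, part of speech) -> lemma.
TOY_LEMMAS = {
    ("activated", "v"): "activate",
    ("activating", "v"): "activate",
    ("activates", "v"): "activate",
    ("actives", "n"): "active",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if not listed."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("activated", pos="v"))  # activate
print(toy_lemmatize("activated"))           # activated (no noun entry)
print(toy_lemmatize("better", pos="a"))     # good
```

Two things stand out: the result is always a real word or the input itself, and the part of speech matters, since the same surface form can have a lemma as a verb but none as a noun.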

Conclusion

The Porter stemmer reduces both “activate” and “activated” to the same stem, “activ”, whereas the NLTK lemmatizer, called with its default settings, returns each token unchanged: “activate” for “activate” and “activated” for “activated”. The lemmatizer returns real words rather than truncated stems, so if we need to create a feature set to train a model, lemmatization is often the better choice, particularly when part-of-speech information is available.

Thank you for reading! I would appreciate it if you follow me or share this article with someone and keep an eye out for more fascinating stories. Best wishes.

Your support would be awesome❤️

Written by Dhruval Patel