Text Pre-Processing — What is Stemming in NLP? And Types of Stemming Techniques

TejasH MistrY
3 min read · Apr 5, 2024


Dive into the world of Natural Language Processing (NLP) and explore the concept of stemming, an essential technique in text pre-processing. Learn about different stemming techniques and how they help simplify text analysis tasks.

Text Pre-Processing — Stemming in NLP

What is stemming in NLP?

In natural language processing (NLP), stemming is a technique used to reduce words to their root form, known as the stem. The goal of stemming is to normalize words by stripping off prefixes and suffixes, so that different variations of the same word are treated as the same word.

For example, consider the words “running” and “runs”. After stemming, both are reduced to the same stem, “run”. This allows NLP algorithms to treat these variations of the word “run” as equivalent, which can simplify text analysis tasks such as search, classification, and sentiment analysis.

Stemming reduces the complexity of words by converting them to their basic form, so that variations of the same word are recognized as identical. It is like trimming off the unnecessary parts of a word to focus on its essential meaning, making it easier for computers to understand and process natural language text.
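To make this concrete, here is a minimal sketch of how stemming typically slots into a text pipeline: tokenize the text, then stem each token. The sentence and variable names are just illustrative, and the stemmer used here (NLTK’s Porter Stemmer) is covered in detail below.

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time download of the tokenizer model used by word_tokenize
# (newer NLTK versions may ask for "punkt_tab" instead).
nltk.download("punkt")

sentence = "She was running and runs every day."

stemmer = PorterStemmer()

# Tokenize the sentence, then reduce each token to its stem.
tokens = word_tokenize(sentence)
stems = [stemmer.stem(token) for token in tokens]

print(stems)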

Types of Stemmers in NLTK

  1. Porter Stemmer
  2. Snowball Stemmer
  3. RegexpStemmer

1. Porter Stemmer

The Porter Stemmer, developed by Martin Porter, is one of the most widely used stemming algorithms.

It applies a series of heuristic rules to strip common suffixes from words, aiming to reduce them to their base or root form.

The Porter Stemmer is known for its simplicity and speed, making it suitable for various text-processing tasks.

To use the Porter Stemmer in NLTK, you first need to import the PorterStemmer class from the nltk.stem module.

from nltk.stem import PorterStemmer

words = ["eating","eats","eaten","writing","writes","programming","programs","history","running", "cats", "jumped", "faster", "quickly"]

# Create a Porter Stemmer instance and stem each word in the list.
stemmer = PorterStemmer()

for word in words:
    print(word + " ----> " + stemmer.stem(word))
Output:

eating ----> eat
eats ----> eat
eaten ----> eaten
writing ----> write
writes ----> write
programming ----> program
programs ----> program
history ----> histori
running ----> run
cats ----> cat
jumped ----> jump
faster ----> faster
quickly ----> quickli

2. Snowball Stemmer

The Snowball Stemmer, also known as the Porter2 Stemmer, is an improved version of the original Porter Stemmer. It supports stemming for multiple languages and provides better performance and accuracy compared to the Porter Stemmer.

To use the Snowball Stemmer in NLTK, you need to import the SnowballStemmer class from the nltk.stem module.

from nltk.stem import SnowballStemmer

# Create an English Snowball Stemmer and stem each word in the list.
stemmer = SnowballStemmer("english")

words = ["eating","eats","eaten","writing","writes","programming","programs","history","running", "cats", "jumped", "faster", "quickly"]

for word in words:
    print(word + " ---> " + stemmer.stem(word))
Output:
eating ---> eat
eats ---> eat
eaten ---> eaten
writing ---> write
writes ---> write
programming ---> program
programs ---> program
history ---> histori
running ---> run
cats ---> cat
jumped ---> jump
faster ---> faster
quickly ---> quick
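As mentioned above, the Snowball Stemmer also supports languages other than English. A minimal sketch of how that looks in NLTK: the class exposes the languages it ships with, and you select one by name when constructing the stemmer (the French word is just an illustration).

from nltk.stem import SnowballStemmer

# Languages bundled with NLTK's Snowball implementation.
print(SnowballStemmer.languages)

# Construct a stemmer for a specific language by name.
french_stemmer = SnowballStemmer("french")
print(french_stemmer.stem("continuellement"))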

Difference Between Porter Stemmer and Snowball Stemmer

The two stemmers agree on most English words, but the Snowball Stemmer refines several of the Porter rules and supports many languages besides English. You can see one difference in the outputs above: the Porter Stemmer reduces “quickly” to “quickli”, while the Snowball Stemmer reduces it to “quick”. Note that a stem does not have to be a dictionary word; both stemmers turn “history” into “histori”.
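A quick way to see the difference is to run both stemmers over the same words side by side; a minimal sketch:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Words where the two algorithms can disagree.
for word in ["history", "quickly", "fairly", "running"]:
    print(word + " | Porter: " + porter.stem(word) + " | Snowball: " + snowball.stem(word))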

3. RegexpStemmer

The RegexpStemmer class is a stemmer that lets you define your own stemming rules using regular expressions.

Regular expressions are like patterns we can use to find and change parts of words.

For example, if we want to change words ending in “ing” to just their root form, like changing “running” to “run,” we can create a rule for that.

Define stemming rules using regular expressions:

pattern = r"(ing$|s$|ed$|er$|est$|ly$)"

For example:

  • If you give it “running,” and you’ve made a rule to remove “ing,” it will change it to “run.”
  • If you give it “cats,” and you’ve made a rule to remove “s,” it will change it to “cat.”

To use the RegexpStemmer in NLTK, you need to define a regular expression pattern that specifies the affixes or parts of words you want to remove during stemming.

from nltk.stem import RegexpStemmer

words = ["eating","eats","eaten","writing","writes","programming","programs","history","running", "cats", "jumped", "faster", "quickly"]

# Define stemming rules using regular expressions
pattern = r"(ing$|s$|ed$|er$|est$|ly$)"

regexp_stemmer = RegexpStemmer(pattern)

for word in words:
    print(word + " ---> " + regexp_stemmer.stem(word))
Output:

eating ---> eat
eats ---> eat
eaten ---> eaten
writing ---> writ
writes ---> write
programming ---> programm
programs ---> program
history ---> history
running ---> runn
cats ---> cat
jumped ---> jump
faster ---> fast
quickly ---> quick
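One thing to watch out for with hand-written patterns is over-stemming of very short words (a bare “s$” rule would turn “is” into “i”). The RegexpStemmer constructor also takes a min argument, the minimum word length the rules are applied to, which helps avoid that; a minimal sketch:

from nltk.stem import RegexpStemmer

# Only apply the suffix rules to words that are at least 4 characters long.
short_safe_stemmer = RegexpStemmer(r"(ing$|s$|ed$)", min=4)

for word in ["is", "as", "cats", "running", "jumped"]:
    print(word + " ---> " + short_safe_stemmer.stem(word))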


TejasH MistrY

Machine learning enthusiast breaking down complex ML/AI concepts and exploring their real-world impact.