Text Pre-Processing: What Is Stemming in NLP, and What Are the Types of Stemming Techniques?
Dive into the world of Natural Language Processing (NLP) and explore the concept of stemming, an essential technique in text pre-processing. Learn about different stemming techniques and how they help simplify text analysis tasks.
What is stemming in NLP?
In natural language processing (NLP), stemming is a technique used to reduce words to their root form, known as the stem. The goal of stemming is to normalize words by stripping off prefixes and suffixes, so that different variations of the same word are treated as the same word.
For example, consider the words “running”, “runs”, and “runner”. Stemming reduces “running” and “runs” to the stem “run” (an ideal stemmer would reduce “runner” as well, though in practice many algorithms leave it unchanged). This allows NLP algorithms to treat variations of the word “run” as equivalent, which simplifies text analysis tasks such as search, classification, and sentiment analysis.
Stemming reduces the complexity of words by converting them to a basic form, so that variations of the same word are recognized as identical. It is like trimming off the unnecessary parts of a word to focus on its essential meaning, making it easier for computers to process natural language text.
Types of Stemmer in NLTK
- Porter Stemmer
- Snowball Stemmer
- RegexpStemmer
1. Porter Stemmer
The Porter Stemmer, developed by Martin Porter, is one of the most widely used stemming algorithms.
It applies a series of heuristic rules to remove common prefixes and suffixes from words, aiming to reduce them to their base or root form.
The Porter Stemmer is known for its simplicity and speed, making it suitable for various text-processing tasks.
To use the Porter Stemmer in NLTK, first import the PorterStemmer class from the nltk.stem module.
from nltk.stem import PorterStemmer

words = ["eating", "eats", "eaten", "writing", "writes", "programming", "programs", "history", "running", "cats", "jumped", "faster", "quickly"]
stemming = PorterStemmer()
for word in words:
    print(word + " ----> " + stemming.stem(word))
Output:
eating ----> eat
eats ----> eat
eaten ----> eaten
writing ----> write
writes ----> write
programming ----> program
programs ----> program
history ----> histori
running ----> run
cats ----> cat
jumped ----> jump
faster ----> faster
quickly ----> quickli
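In practice, stemming is usually applied after tokenization, stemming each token of a sentence in turn. A minimal sketch of that workflow (using a simple whitespace split here rather than a full NLTK tokenizer, and only words whose stems appear in the list above):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sentence = "running cats jumped quickly and kept eating"

# Split on whitespace for simplicity; a real pipeline would use a proper tokenizer
tokens = sentence.split()

# Stem each token individually
stems = [stemmer.stem(token) for token in tokens]
print(" ".join(stems))  # run cat jump quickli and kept eat
```

Note that words without a recognized suffix, such as “and” and “kept”, pass through unchanged.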
2. Snowball Stemmer
The Snowball Stemmer, also known as the Porter2 Stemmer, is an improved version of the original Porter Stemmer. It supports stemming for multiple languages and provides better performance and accuracy compared to the Porter Stemmer.
To use the Snowball Stemmer in NLTK, import the SnowballStemmer class from the nltk.stem module and specify the language.
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["eating", "eats", "eaten", "writing", "writes", "programming", "programs", "history", "running", "cats", "jumped", "faster", "quickly"]
for word in words:
    print(word + " ---> " + stemmer.stem(word))
Output:
eating ---> eat
eats ---> eat
eaten ---> eaten
writing ---> write
writes ---> write
programming ---> program
programs ---> program
history ---> histori
running ---> run
cats ---> cat
jumped ---> jump
faster ---> faster
quickly ---> quick
Difference Between Porter Stemmer and Snowball Stemmer
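The two stemmers agree on most words, but the Snowball Stemmer handles certain suffixes more cleanly, as the output lists above show: the Porter Stemmer leaves “quickly” as “quickli”, while the Snowball Stemmer reduces it to “quick”. A small side-by-side comparison (the word “fairly” is an extra illustration, not from the lists above):

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Words where the two stemmers agree, and words where they differ
for word in ["running", "history", "quickly", "fairly"]:
    print(f"{word}: porter={porter.stem(word)}, snowball={snowball.stem(word)}")
```

Both produce “run” and “histori”, but for the adverbs the Snowball Stemmer strips the “-ly” suffix entirely, which is one reason it is generally preferred for English text when multiple-language support or slightly better accuracy matters.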
3. RegexpStemmer
The RegexpStemmer class is a stemmer that lets you define custom stemming rules using regular expressions.
Regular expressions are like patterns we can use to find and change parts of words.
For example, if we want to change words ending in “ing” to just their root form, like changing “running” to “run,” we can create a rule for that.
Define stemming rules using regular expressions:
pattern = r"(ing$|s$|ed$|er$|est$|ly$)"
For example:
- If you give it “running,” and you’ve made a rule to remove “ing,” it will change it to “run.”
- If you give it “cats,” and you’ve made a rule to remove “s,” it will change it to “cat.”
To use the RegexpStemmer in NLTK, define a regular expression pattern that specifies the affixes or parts of words you want to remove during stemming.
from nltk.stem import RegexpStemmer
words = ["eating","eats","eaten","writing","writes","programming","programs","history","running", "cats", "jumped", "faster", "quickly"]
# Define stemming rules using regular expressions
pattern = r"(ing$|s$|ed$|er$|est$|ly$)"
regexp_stemmer = RegexpStemmer(pattern)
for word in words:
    print(word + " ---> " + regexp_stemmer.stem(word))
Output:
eating ---> eat
eats ---> eat
eaten ---> eaten
writing ---> writ
writes ---> write
programming ---> programm
programs ---> program
history ---> history
running ---> runn
cats ---> cat
jumped ---> jump
faster ---> fast
quickly ---> quick
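One risk with a simple suffix pattern is over-stemming very short words (for example, the “s$” rule would turn “was” into “wa”). The RegexpStemmer constructor also accepts a min argument: words shorter than min characters are returned unchanged. A short sketch:

```python
from nltk.stem import RegexpStemmer

# min=4: words shorter than 4 characters are returned unchanged
stemmer = RegexpStemmer(r"(ing$|s$|ed$|er$|est$|ly$)", min=4)

for word in ["was", "is", "cats", "running"]:
    print(word + " ---> " + stemmer.stem(word))
```

Here “was” and “is” pass through untouched, while “cats” still becomes “cat” and “running” becomes “runn”, just as in the output above.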