NLP — Text PreProcessing — Part 3 (Stemming & Lemmatization)
In the previous article (NLP — Text PreProcessing — Part 2), we delved into the world of tokenization. Now, in this sequel, our quest continues as we unravel more enchanting techniques of text preprocessing:
- Stemming
- Lemmatization
Stemming
Stemming is a technique used in Natural Language Processing (NLP) that involves reducing words to their base or root form, called stems. It plays a vital role in various NLP tasks, such as text analysis, information retrieval, and sentiment analysis.
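Why do we use it? Reducing words to a common base lets different inflections of the same word be counted and matched as a single term, which simplifies text analysis and information retrieval.
How does it work? A stemming algorithm strips suffixes by rule, so words like “running” and “runs” are reduced to the common base “run.” Derived forms are not always merged: the Porter Stemmer, for example, leaves “runner” unchanged.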
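To make the idea concrete, here is a minimal sketch of rule-based suffix stripping in plain Python. The function name toy_stem and its three-suffix rule list are illustrative assumptions, not the Porter algorithm or any NLTK API; real stemmers apply many more rules.

```python
# Toy suffix-stripping stemmer (illustrative only, not the Porter algorithm).
SUFFIXES = ("ing", "ed", "s")  # assumed rule list, checked in this order


def toy_stem(word):
    """Strip one known suffix, then undo consonant doubling ("runn" -> "run")."""
    for suffix in SUFFIXES:
        # Require at least three characters to remain so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant, except l, s, z ("fall", "miss").
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


print([toy_stem(w) for w in ["running", "runs", "jumped", "ran"]])
# ['run', 'run', 'jump', 'ran']
```

Note that the irregular form “ran” passes through untouched: suffix rules alone cannot relate it to “run.”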
Applications & Use Cases:
- Information retrieval
- Search engines
- Spam filtering
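As a rough illustration of the retrieval use case, the sketch below builds a tiny inverted index keyed on stems, so that a query for one word form finds documents containing another. The helper names (crude_stem, build_index, search) and the sample documents are made up for this example; in practice a real stemmer such as NLTK's PorterStemmer would likely replace crude_stem.

```python
import re
from collections import defaultdict


def crude_stem(word):
    """Stand-in stemmer (an assumption for this sketch): strip a few suffixes."""
    return re.sub(r"(ing|ed|s)$", "", word)


def build_index(docs, stem=crude_stem):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[stem(token)].add(doc_id)
    return index


def search(index, query, stem=crude_stem):
    """Return ids of documents whose stems include every stemmed query term."""
    terms = [stem(t) for t in re.findall(r"[a-z]+", query.lower())]
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


docs = {
    1: "She walks to work",
    2: "They walked home",
    3: "Cooking dinner",
}
index = build_index(docs)
print(sorted(search(index, "walking")))  # [1, 2]
```

The query “walking” appears in neither document verbatim, yet both are found because “walks,” “walked,” and “walking” share the stem “walk.”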
Types of Stemming:
Porter Stemmer:
- One of the oldest and most widely used stemming algorithms.
- Relatively gentle: it strips suffixes in a series of rule-based steps, although the resulting stems are not always real words.
Python Code and Output:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["running", "runner", "ran"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
Output: ['run', 'runner', 'ran']
Lancaster Stemmer:
- More aggressive than the Porter Stemmer, often resulting in shorter stems.
- Can be useful for certain tasks but may sacrifice accuracy.
Python Code and Output:
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()
words = ["running", "runner", "ran"]
stemmed_words = [ls.stem(word) for word in words]
print(stemmed_words)
Output: ['run', 'run', 'ran']
Regex Stemmer:
- Allows for customized rules using regular expressions.
- Useful when the standard stemmers might not cover specific cases.
Python Code and Output:
from nltk.stem import RegexpStemmer
rs = RegexpStemmer('ing$|s$|ed$')
words = ["running", "runner", "ran"]
stemmed_words = [rs.stem(word) for word in words]
print(stemmed_words)
Output: ['runn', 'runner', 'ran']
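The pattern passed to RegexpStemmer behaves much like deleting a regular-expression match from the end of the word. The sketch below reproduces that with the standard re module and adds a minimum-length guard; regex_stem and min_length are names chosen for this illustration (NLTK's RegexpStemmer offers a comparable min argument).

```python
import re

SUFFIX_PATTERN = re.compile(r"ing$|s$|ed$")  # same pattern as the example above


def regex_stem(word, min_length=0):
    """Delete the suffix match; leave words shorter than min_length alone."""
    if len(word) < min_length:
        return word
    return SUFFIX_PATTERN.sub("", word, count=1)


print([regex_stem(w) for w in ["running", "runner", "ran"]])
# ['runn', 'runner', 'ran']
print(regex_stem("is"), regex_stem("is", min_length=4))
# i is
```

The second call shows why a length guard matters: without it, a short word like “is” loses its final “s.”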
Applications & Use Cases — Choosing the Right Stemmer:
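Choose the stemming algorithm based on the specific needs of your NLP task, balancing accuracy and computational efficiency. The Porter Stemmer is a common general-purpose default, the Lancaster Stemmer suits tasks that tolerate shorter, more heavily truncated stems, and a regex stemmer fits narrow, domain-specific rules.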
While stemming is a powerful technique for reducing words to their base or root form, it does come with some limitations. Let’s explore the problems in stemming using examples:
1. Overstemming:
- Issue: Overstemming occurs when the stemming algorithm is too aggressive and removes too many characters, so the stem loses meaning or distinct words collapse into the same stem.
- Example: The word “changing” is stemmed to “chang,” which is not a dictionary word, rather than to the base word “change.”
Python Code and Output:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
word = "changing"
stemmed_word = ps.stem(word)
print(stemmed_word)
Output: chang
2. Understemming:
- Issue: Understemming occurs when the stemming algorithm is too conservative, so related forms of the same word end up with different stems.
- Example: The Porter Stemmer reduces “running” to “run” but leaves the irregular form “ran” untouched, so two forms of the same verb are not matched.
Python Code and Output:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["running", "ran"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
Output: ['run', 'ran']
3. Loss of Meaning:
- Issue: Stemming works on spelling alone, so it cannot recover relationships between words that are semantic rather than a matter of shared suffixes.
- Example: The word “better” is left as “better”; the stemmer has no way of knowing that it is the comparative form of “good.”
Python Code and Output:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
word = "better"
stemmed_word = ps.stem(word)
print(stemmed_word)
Output: better
4. Ambiguity:
- Issue: Stemming can lead to ambiguous results when a single stem corresponds to multiple base words or senses.
- Example: The word “flies” is stemmed to “fli,” which is not a real word and does not say whether the original was the verb “flies” (from “to fly”) or the plural noun “flies” (the insects).
Python Code and Output:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
word = "flies"
stemmed_word = ps.stem(word)
print(stemmed_word)
Output: fli
While stemming is a valuable preprocessing step, it’s crucial to be aware of its limitations, especially in cases where the loss of meaning can impact the desired analysis or understanding of the text. In such situations, more advanced techniques like lemmatization might be considered for preserving semantic meaning.
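One way to spot these problems on your own vocabulary is to group words by the stem they receive. The sketch below is a hedged illustration: group_by_stem and the truncation-based truncate_stem are hypothetical helpers written for this example, and any callable (for instance PorterStemmer().stem) could be passed in instead.

```python
from collections import defaultdict


def truncate_stem(word, keep=7):
    """Deliberately crude stand-in stemmer: keep only the first `keep` letters."""
    return word[:keep]


def group_by_stem(words, stem=truncate_stem):
    """Return {stem: [words...]} so collisions and splits are easy to inspect."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


words = ["universe", "university", "universal", "run", "ran"]
for stem, members in group_by_stem(words).items():
    print(stem, members)
# univers ['universe', 'university', 'universal']
# run ['run']
# ran ['ran']
```

A group that mixes distinct words, such as the “univers” group here, points to overstemming, while related forms landing in separate groups, such as “run” and “ran,” point to understemming.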
Is there a solution to these problems?
Yes, there are alternative solutions to address the limitations of stemming and the issues related to overstemming, understemming, loss of meaning, and ambiguity. One such alternative is lemmatization.
Lemmatization:
Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma.
How does it work? Unlike stemming, lemmatization looks words up in a vocabulary and uses their part of speech (and, in some tools, the surrounding context) to return a meaningful base form.
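One common way for a lemmatizer to find a meaningful base form is a dictionary lookup keyed on the word and its part of speech. The sketch below shows that idea with a small table. Everything in it is a toy assumption: the LEMMAS table, the toy_lemmatize function, and the single-letter tags ("n" noun, "v" verb, "a" adjective) are invented for illustration, whereas a real lemmatizer relies on a full lexical database such as WordNet.

```python
# Toy lemma table keyed on (word, part of speech); entries are illustrative.
LEMMAS = {
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
    ("changing", "v"): "change",
    ("studying", "v"): "study",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when it is unknown."""
    return LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("better", pos="a"))    # good
print(toy_lemmatize("changing", pos="v"))  # change
print(toy_lemmatize("changing"))           # changing (no noun entry)
```

The last call shows how much the part-of-speech tag matters: with the wrong tag, the lookup misses and the word is returned unchanged.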
Applications & Use Cases:
- Suitable for tasks where preserving the semantic meaning of words is crucial.
- Beneficial in applications like question-answering systems, language translation, and sentiment analysis.
Python Code and Output:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["changing", "studying", "better", "flies"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
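Output: ['changing', 'studying', 'better', 'fly']
Note that lemmatize() treats every word as a noun unless a pos argument is given, which is why “changing” and “studying” come back unchanged. Calling lemmatizer.lemmatize(word, pos="v") returns “change” and “study,” and pos="a" maps “better” to “good.”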
Advantages of Lemmatization:
- Preservation of Meaning: Lemmatization returns real dictionary words, so the semantic meaning of the original word is preserved.
- Reduced Ambiguity: By providing more accurate base forms, lemmatization reduces ambiguity compared to stemming.
Considerations:
- While lemmatization is a powerful technique, it might be computationally more intensive than stemming. The choice between stemming and lemmatization depends on the specific requirements of the NLP task.
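Because real text repeats the same words many times, one common way to soften the extra cost is to cache results per distinct token. The sketch below shows the pattern with functools.lru_cache; slow_lemmatize is a hypothetical stand-in for whatever lemmatizer is in use, and its tiny lookup table is invented for the example.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive function really runs


def slow_lemmatize(word):
    """Hypothetical stand-in for an expensive lemmatizer call."""
    CALLS["count"] += 1
    return {"flies": "fly", "mice": "mouse"}.get(word, word)


@lru_cache(maxsize=None)
def cached_lemmatize(word):
    """Memoized wrapper: each distinct word is lemmatized only once."""
    return slow_lemmatize(word)


tokens = ["flies", "mice", "flies", "flies", "mice"]
print([cached_lemmatize(t) for t in tokens])  # ['fly', 'mouse', 'fly', 'fly', 'mouse']
print(CALLS["count"])                         # 2
```

Five tokens trigger only two real lemmatizer calls, one per distinct word.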
Conclusion:
- In situations where maintaining the exact meaning of words is critical, lemmatization serves as an effective alternative to address the limitations associated with stemming. However, it’s essential to consider the trade-offs and choose the appropriate technique based on the nature of the text data and the objectives of the analysis.