Top 3 Stemming Techniques & Their Performance Metrics in Python

Ajay Khanna
6 min read · Feb 9, 2022

Stemming is a natural language processing technique that lowers inflection in words to their root forms, hence aiding in the preprocessing of text, words, and documents for text normalization.

[Figure: classification of stemming algorithms into Truncating, Statistical and Mixed. Source: https://devopedia.org/]

Why Stemming?

The English language has several variants of a single term. The presence of these variants in a text corpus creates data redundancy when developing NLP or machine learning models, which can make such models less effective.

To build a robust model, it is essential to normalize text by removing repetition and transforming words to their base form through stemming.

Application of Stemming

  1. Stemming is used in information retrieval systems such as search engines, and to determine domain vocabularies in domain analysis. For instance, a Google search for "prediction" and "predicted" returns comparable results.
  2. Stemming is also employed in information retrieval, text mining, SEO, indexing, tagging systems, and word analysis.

Types of Stemmer in NLTK

As we can see from the above figure, the major algorithms are subdivided into three distinct classes: Truncating, Statistical and Mixed. Let's dive into each one and look at the well-known algorithms.

  1. Truncating:

Simplest of all, truncating algorithms just slice off the ends of words, converting them to fixed-length stems.
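As a toy illustration of the truncating idea (not any published algorithm), such a stemmer can be sketched in a few lines:

```python
# Naive truncating "stemmer": keep at most the first k characters of each word.
# This is only a toy sketch of the truncating idea, not a published algorithm.
def truncate_stem(word, k=5):
    return word[:k]

words = ["connection", "connecting", "connect"]
print([truncate_stem(w) for w in words])  # all three collapse to "conne"
```

Note how all three variants conflate to the same string, but the "stem" is not a linguistic unit, which is exactly the weakness of pure truncation.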

Lovins Stemmer: The Lovins stemmer follows a rule-based affix-elimination approach. It is a fairly simple algorithm that applies a finite set of rules to obtain stems: it removes the longest matching suffix and then restores a valid word from the stemmed form. It is extremely fast but has a high chance of over-stemming errors. In some cases it fails badly, but on the whole it is seen to work well with irregular plurals.

Find the documentation at https://pypi.org/project/stemming/1.0/

from stemming.lovins import stem as lovins_stemmer
lovins_stemmer("factionally")

Output - fact

Porter's Stemmer: This is a roughly 60-rule algorithm. It refines the basic suffix-stripping idea and is the most widely used stemmer. A series of steps is followed to obtain stems; each step considers the length of a suffix and how it should be replaced by a smaller suffix. This algorithm is slower than Lovins, even though the Lovins lookup table is more extensive.

Find the documentation at https://pypi.org/project/stemming/1.0/

from stemming.porter2 import stem
stem("factionally")

Output - faction

Dawson's Stemmer: An extension of the Lovins stemmer in which the lookup table is arranged in reversed order, indexed by suffix length. It is also a single-pass stemmer: it covers more suffixes than Lovins and is proven to be faster in execution.

Paice/Husk Stemmer: An iterative algorithm with a single table of about 120 rules indexed by the last letter of a suffix. Its advantage is its simple form: every iteration handles both deletion and replacement according to the rule applied. Its disadvantage is that it is a very heavy algorithm and over-stemming may occur. It allows custom stemming rule sets, so the paicehusk module also includes a PaiceHuskStemmer class that can be instantiated with custom rules.

Find full details at https://pypi.org/project/stemming/1.0/
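To make the idea concrete, here is a heavily simplified, hypothetical sketch of Paice/Husk-style iterative rule application. The rule table below is made up for illustration; the real stemmer uses about 120 rules plus acceptability conditions:

```python
# Hypothetical mini rule table in the Paice/Husk spirit: rules are indexed by the
# last letter of the word, and each rule names a suffix to delete or replace.
# The real algorithm has ~120 rules and acceptability checks; this is a toy.
RULES = {
    "g": [("ing", "")],            # connecting -> connect
    "s": [("ies", "y"), ("s", "")],
    "y": [("ally", "")],           # factionally -> faction
}

def mini_paice_husk(word):
    changed = True
    while changed:                 # iterate until no rule fires
        changed = False
        for suffix, repl in RULES.get(word[-1:], []):
            # only fire if a stem of at least 3 characters remains
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: len(word) - len(suffix)] + repl
                changed = True
                break
    return word

print(mini_paice_husk("connecting"))   # connect
print(mini_paice_husk("factionally"))  # faction
```

Because the table is indexed by the final letter, each iteration only scans a handful of candidate rules, which is what makes the real algorithm simple despite its size.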

2. Statistical

Statistical algorithms are based on an analytical approach: they apply mathematical approximations to derive the stem words.

N-gram: As the name suggests, this algorithm computes n-grams (usually n = 2 or 3), i.e., it breaks words into overlapping substrings of length n and then applies statistical analysis on top of them to identify related words. Its only disadvantage is that it requires substantial memory to run and to store the data.
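A minimal sketch of the n-gram decomposition step, plus the Dice coefficient, one common way to compare two words' n-gram sets (the clustering that a full n-gram stemmer performs on top of this is omitted):

```python
# Break a word into overlapping character n-grams (n = 2 here, i.e. bigrams).
def char_ngrams(word, n=2):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("faction"))  # ['fa', 'ac', 'ct', 'ti', 'io', 'on']

# N-gram stemmers cluster words whose n-gram sets overlap strongly; the Dice
# coefficient is a common similarity measure over those sets.
def dice(a, b):
    ga, gb = set(char_ngrams(a)), set(char_ngrams(b))
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(round(dice("faction", "factionally"), 2))  # 0.75
```

The memory cost mentioned above comes from storing these n-gram sets (or an n-gram index) for the whole vocabulary.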

3. Mixed

Krovetz Stemmer: This stemming algorithm converts plurals to singular form and past tense to present tense. It is often used together with another algorithm, since it cannot single-handedly manage large documents, and it is not seen to produce consistent results in terms of accuracy.

Xerox Stemmer: Equipped to handle large data and seen to produce valid words. The chance of over-stemming is high, and its dependence on a lexicon makes it language-dependent; hence its limitation is that it is language-specific.

YASS Stemmer: A corpus-based method that needs computational resources to work efficiently. Its advantage is that it is language-independent, but it is still seen as a complex method involving statistical analysis and a hierarchical clustering approach.

Other Techniques

Lancaster Stemmer — LancasterStemmer()

Lancaster Stemmer is straightforward, although it often produces results with excessive stemming. Over-stemming renders stems non-linguistic or meaningless.

LancasterStemmer() is a module in NLTK that implements the Lancaster stemming technique. Allow me to illustrate this with an example.

Example of LancasterStemmer()

In the example below, we construct an instance of LancasterStemmer() and then use the Lancaster algorithm to stem the list of words.

from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()
words = ['connecting','connect','factionally','faction',"consult","consulation"]
for word in words:
    print(word,"--->",lancaster.stem(word))

Output -
connecting ---> connect
connect ---> connect
factionally ---> fact
faction ---> fact
consult ---> consult
consulation ---> cons

Regexp Stemmer — RegexpStemmer()

Regex stemmer identifies morphological affixes using regular expressions. Substrings matching the regular expressions will be discarded.

RegexpStemmer() is a module in NLTK that implements the Regex stemming technique. Let us try to understand this with an example.

Example of RegexpStemmer()

In this example, we first construct an object of RegexpStemmer() and then use the Regex stemming method to stem the list of words.

from nltk.stem import RegexpStemmer

regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
words = ['connecting','connect','factionally','faction',"consult","consulation"]
for word in words:
    print(word,"--->",regexp.stem(word))

Output -
connecting ---> connect
connect ---> connect
factionally ---> factionally
faction ---> faction
consult ---> consult
consulation ---> consulation

Typical errors of stemming are the following:

  • Overstemming: Happens when too much is removed. For example, ‘wander’ becomes ‘wand’; ‘news’ becomes ‘new’; or ‘universal’, ‘universe’, ‘universities’, and ‘university’ are all reduced to ‘univers’. A better result would be ‘univers’ for the first two and ‘universi’ for the last two.
  • Understemming: Happens when words are from the same root but are not seen that way. For example, ‘knavish’ remains as ‘knavish’; ‘data’ and ‘datum’ become ‘dat’ and ‘datu’ although they’re from the same root.
  • Misstemming: Usually not a problem unless it leads to false conflations. For example, ‘relativity’ becomes ‘relative’.

Performance Metrics for Stemming

There are only a few ways of evaluating stemmed words from a large corpus of data. The following metrics can be helpful for evaluating the stemming of words:

  1. Index Compression Factor (ICF):

The index compression factor represents the percentage by which a collection of distinct words is reduced by stemming. The larger the number of words stemmed, the greater the strength of the stemmer. It is calculated as:

ICF = ((N − S) / N) × 100

where
N - number of distinct words before stemming
S - number of distinct stems after stemming
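The formula is straightforward to compute directly; in this sketch the stemmer output is a hypothetical stand-in:

```python
# Index Compression Factor: percent reduction of the distinct vocabulary.
def icf(words, stems):
    n = len(set(words))   # N: distinct words before stemming
    s = len(set(stems))   # S: distinct stems after stemming
    return (n - s) / n * 100

words = ["connect", "connecting", "connection", "faction", "factionally"]
stems = ["connect", "connect", "connect", "faction", "faction"]  # hypothetical stemmer output
print(icf(words, stems))  # (5 - 2) / 5 * 100 = 60.0
```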

2. Word Stemmed Factor (WSF):

It is the percentage of words that have been stemmed by the stemming process out of the total words in a sample. The greater the number of words stemmed, the greater the strength of the stemmer. The minimum threshold for this factor should be 50%.

WSF = (WS / TW) × 100

where
WS - number of words stemmed
TW - total number of words in the sample
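A sketch of WSF, taking WS as the number of words actually altered by the stemmer and TW as the total word count (my reading of the formula's symbols; the stemmer output below is hypothetical):

```python
# Word Stemmed Factor: percentage of words that the stemmer actually changed.
def wsf(words, stems):
    ws = sum(1 for w, s in zip(words, stems) if w != s)  # WS: words altered
    tw = len(words)                                      # TW: total words
    return ws / tw * 100

words = ["connect", "connecting", "connection", "faction"]
stems = ["connect", "connect", "connect", "faction"]  # hypothetical stemmer output
print(wsf(words, stems))  # 2 of 4 words changed -> 50.0
```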

3. Correctly Stemmed Words Factor (CSWF):

It is the percentage of correctly stemmed words out of the total number of words stemmed. The higher this factor, the higher the accuracy of the stemmer. The minimum threshold for this factor should be 50%.

CSWF = (CSW / WS) × 100

where
CSW - number of correctly stemmed words

4. Average Words Conflation Factor (AWCF):

This indicates the average number of variant words from different conflation groups that are stemmed correctly to their root words. To calculate AWCF, first find the number of unique words after conflation:

NWC = S − CW

where
CW - number of correct words not stemmed

The average words conflation factor is then obtained as:

AWCF = ((CSW − NWC) / CSW) × 100

The greater the AWCF percentage, the greater the accuracy of the stemmer.

Conclusion

  1. The correctly stemmed words factor is the maximum for the proposed stemmer. The higher the correctly stemmed words factor, the fewer the stemming errors; hence the proposed stemmer has fewer over-stemming and under-stemming errors.
  2. The average words conflation factor of the proposed stemmer is high. This higher value proves the efficiency of the stemmer.
  3. Higher values of the correctly stemmed words factor and the average conflation factor indicate lower over-stemming and under-stemming error rates.

References :

  1. https://www.ijrte.org/wp-content/uploads/papers/v8i4/C6200098319.pdf
  2. https://www.researchgate.net/publication/284038938_A_Comparative_Study_of_Stemming_Algorithms/link/56ef8c3008aed17d09f87abd/download


Ajay Khanna

Data Scientist with expertise in Text Mining, ML, DL and NLP