NLP: A quick guide to Stemming

Tushar Srivastava
5 min read · Aug 6, 2019

Stemming is, at its core, removing the suffix from a word to reduce it to its root word.

For example: “Flying” has the suffix “ing”; if we remove “ing” from “Flying”, we get the base or root word, “Fly”.

We use such suffixes to create new words from an original stem word.

Here is the link to the official NLTK docs on Stemming.

The stem of the verb wait is wait: it is the part that is common to all its inflected variants.

  1. wait (infinitive)
  2. wait (imperative)
  3. waits (present, 3rd person, singular)
  4. wait (present, other persons and/or plural)
  5. waited (simple past)
  6. waited (past participle)
  7. waiting (progressive)

Sometimes the spelling also changes when a new word is formed.

  1. beauty, duty + -ful → beautiful, dutiful (-y changes to i)
  2. heavy, ready + -ness → heaviness, readiness (-y changes to i)
  3. able, possible + -ity → ability, possibility (-le changes to il)
  4. permit, omit + -ion → permission, omission (-t changes to ss)

The next question that arises is: why do we need stemming in Natural Language Processing (NLP) or Natural Language Understanding (NLU)?

The main aim is to reduce the inflectional forms of each word to a common base, root, or stem word.

Inflection is a process of word formation in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and definiteness.

Stemming algorithms mainly exhibit two kinds of errors, described below.

OverStemming

Over-stemming is when two words with different stems are stemmed to the same root. This is also known as a false positive.

  • universal
  • university
  • universe

All three of the above words are stemmed to “univers”, which is incorrect behavior.

Though these three words are etymologically related, their modern meanings lie in widely different domains, so treating them as synonyms in NLP/NLU will likely reduce the relevance of search results.

UnderStemming

Under-stemming is when two words that should be stemmed to the same root are not. This is also known as a false negative. Below is an example.

  • alumnus
  • alumni
  • alumnae
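Both behaviors can be reproduced with NLTK’s Porter Stemmer (used here purely as a demonstration; the stemmer itself is introduced in the next section):

```python
# Demonstrating over- and under-stemming with NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: three semantically different words collapse to one stem.
over = {w: stemmer.stem(w) for w in ["universal", "university", "universe"]}
print(over)  # all three map to the stem "univers"

# Under-stemming: morphologically related words end up with different stems.
under = {w: stemmer.stem(w) for w in ["alumnus", "alumni", "alumnae"]}
print(under)  # the stems do not coincide
```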

There are many ways to stem a word. In this article we will focus on three stemming techniques that belong to the truncating family of stemming algorithms; we won’t discuss statistical or mixed stemming algorithms here.

Porter Stemmer

  • This is one of the most common and gentlest stemmers. It is fast, but not very precise.

Below is the implementation. You can use Jupyter Notebook to run the below code.
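A minimal sketch of such a notebook cell might look like this (the sample sentence is my own illustration):

```python
# Basic Porter Stemmer usage with NLTK (assumes nltk is installed: pip install nltk).
from nltk.stem import PorterStemmer

porter = PorterStemmer()

sentence = "He was flying kites and running happily"
for word in sentence.split():
    print(word, "->", porter.stem(word))
```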

If you face any issue with NLTK, you can refer to my previous article on Stop Words, where I covered the installation steps.

In the output we can see how the suffixes were removed.

Look at the input: we pass “was” and get “wa” as output. This is why Porter is considered a less precise algorithm. To increase precision, another algorithm was developed: the Snowball Stemmer.

Snowball Stemmer

  • The actual name of this stemmer is the English Stemmer or Porter2 Stemmer
  • It incorporates several improvements over the Porter Stemmer, which make it more precise over large datasets

Below is the implementation. You can use Jupyter Notebook to run the below code.
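A minimal sketch of that cell (the word list is illustrative):

```python
# Snowball (Porter2 / English) stemmer from NLTK.
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer("english")

for word in ["was", "flying", "kites", "generously"]:
    print(word, "->", snowball.stem(word))
```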

As this is an improvement over the Porter Stemmer, we can see in the results how gracefully it handles the “was” input. Many improvements were made in this algorithm, and it is currently one of my favorite algorithms to work with.

One more important feature added to this algorithm is the option to exclude stop words from stemming.

Due to this feature, we observe a difference for the “was” input.

Below is the implementation for the same.

You can also check the languages supported in SnowBall Stemmer.
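The supported languages are exposed as a class attribute:

```python
from nltk.stem import SnowballStemmer

# SnowballStemmer.languages is a tuple of supported language names.
print(SnowballStemmer.languages)
```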

Lancaster Stemmer

  • It is a very aggressive algorithm
  • It will hugely trim down your working set. This has both pros and cons: sometimes you may want this for your dataset, but most of the time you will want to avoid it.
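A quick sketch of the Lancaster Stemmer in action (the word list is illustrative):

```python
# Lancaster Stemmer: notably more aggressive than Porter or Snowball.
from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()

for word in ["caring", "flying", "university"]:
    print(word, "->", lancaster.stem(word))
```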

This aggressiveness can be observed with the input “Caring”: it is converted to “car”, which is an altogether different word in the English dictionary.

Conclusion

The Snowball Stemmer came out as the algorithm best suited to my needs, but this totally depends on your use case and dataset.

One more important question: should we do stemming first, or remove stop words first?

In my opinion, it depends:

  • If your dataset contains stop words in inflected forms, do stemming first and then stop-word removal.
  • If the stop words appear only in their base forms, do stop-word removal first and then stemming.

To know more about stopwords you can click here

I hope now Stemming is not a jargon to you.

In case you want to check out the Jupyter Notebook, here is the link to my GitHub repo.
