How and Why to Implement Stemming and Lemmatization from NLTK

In this article, we try to solve one of NLP’s problems by implementing Stemming and Lemmatization

Manmohan Singh
Apr 20 · 5 min read
Source: pixxabay.com

The English language has more than a million words in its vocabulary, of which around 170k are in current use. These words are grouped into sentences by following grammatical rules. For grammatical reasons, sentences use different forms of a word that are derived from one another, such as plays, played, and playing.

In Natural Language Processing (NLP) models and problems, these variant forms do not help much, because they carry the same core meaning. A common goal in NLP is to achieve the same result from fewer distinct words. Reducing the vocabulary this way saves a lot of processing time and disk space.

In this article, we try to solve this NLP problem by implementing Stemming and Lemmatization. Both methods convert derived words to their base words.
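To make the idea concrete, here is a minimal sketch (not part of the original walkthrough) that reduces the forms of ‘play’ mentioned above with NLTK’s Porter stemmer and counts how many distinct words remain. The variable names are only illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Different surface forms of the same base word.
words = ["plays", "played", "playing", "play"]
stems = [stemmer.stem(word) for word in words]

print(stems)
print(f"{len(set(words))} distinct forms -> {len(set(stems))} distinct stem(s)")
```

The smaller the set of distinct words, the smaller the vocabulary a model has to store and process.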

However, these two methods use different algorithms and are not the same. In this article, we go over these differences and the Natural Language Toolkit (NLTK) implementation of each method.

Stemming

Stemming reduces a word to a root form by chopping off its last letters (its suffix). These root forms are also known as stems. But a stem is not always a valid dictionary word, so a stemmed sentence can become meaningless. Stemming can also reduce the accuracy of a model.

There are different types of stemming algorithms. We use only Porter’s algorithm and the Snowball algorithm in this article, as these two are among the most widely used.
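The suffix-chopping idea can be sketched in a few lines of plain Python. This is a deliberately naive toy, not Porter’s algorithm: the suffix list and the minimum-length rule are assumptions made up for illustration. It shows why blindly cutting letters can produce stems that are not words.

```python
# Hypothetical suffix list, checked in this order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


for word in ["plays", "played", "playing", "this", "his"]:
    print(word, "->", naive_stem(word))
```

The first three words collapse to ‘play’, but ‘this’ and ‘his’ lose their final ‘s’ as well. Real stemmers add conditions to their rules to limit this kind of damage.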

NLTK implementation of Porter Stemmer.

import re
import nltk

porter_stemmer = nltk.PorterStemmer()
text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)
# Keep only word characters so trailing punctuation ("this.") does not block the stemmer.
stemmed_words = [porter_stemmer.stem(word) for word in re.findall(r"\w+", text)]
print(f"Original text: {text}\n")
print(f"Stemmed text : {' '.join(stemmed_words)}")

This method converts the words ‘ready’ and ‘this’ to ‘readi’ and ‘thi’, which makes the sentence meaningless. Also, converting the word ‘his’ to ‘hi’ changes the meaning of the sentence. I do not recommend this method for building any critical project. Use it for study purposes only.

Snowball Stemmer is an improved version of the Porter stemmer. It tends to be more precise over large datasets.

NLTK implementation of Snowball Stemmer.

import re
import nltk

snowball_stemmer = nltk.SnowballStemmer("english")
text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)
# Keep only word characters so trailing punctuation ("once.") does not block the stemmer.
stemmed_words = [snowball_stemmer.stem(word) for word in re.findall(r"\w+", text)]
print(f"Original text: {text}\n")
print(f"Stemmed text : {' '.join(stemmed_words)}")

The word ‘his’ is not converted to ‘hi’ by this method. Letters are properly chopped off from the words ‘cutting’, ‘claims’, and ‘rights’. We can say that there is an improvement. But the conversion of the words ‘once’ and ‘monastery’ to ‘onc’ and ‘monasteri’ still makes the sentence meaningless.
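A quick side-by-side check on a few individual words makes the difference between the two stemmers easier to see. This is a small sketch, and the word list is just a sample taken from the example sentence.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Print each word next to the stem produced by both algorithms.
for word in ["his", "this", "cutting", "once"]:
    print(f"{word:10} porter={porter.stem(word):10} snowball={snowball.stem(word)}")
```

Porter strips the final ‘s’ from ‘his’ and ‘this’, while Snowball leaves both words alone because its rule for a trailing ‘s’ is stricter.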

Lemmatization

Lemmatization analyzes the structure of words, the relationships between words, and the parts of words to identify the root word accurately. A part-of-speech tagger and a vocabulary help it return the dictionary form of a word. But this requires more processing time and disk space than stemming. In return, the accuracy of the NLP model is comparatively high with this method. The root word is known as a lemma.

NLTK implementation of Lemmatization.

import re
from nltk.stem import WordNetLemmatizer

# One-time setup: the lemmatizer needs the WordNet corpus, e.g. nltk.download("wordnet").
lemmatizer = WordNetLemmatizer()
text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)
# Keep only word characters so punctuation is not passed to the lemmatizer.
lemmatized_words = [lemmatizer.lemmatize(word) for word in re.findall(r"\w+", text)]
print(f"Original text: {text}\n")
print(f"Lemmatized text : {' '.join(lemmatized_words)}")

The lemmatization method converts the words ‘claims’ and ‘rights’ to ‘claim’ and ‘right’. Most other words are unaffected, so the meaning of the sentences stays intact.

Code to distinguish between Lemmatization and Stemming

import re
import nltk
from nltk.stem import WordNetLemmatizer

# One-time setup: the lemmatizer needs the WordNet corpus, e.g. nltk.download("wordnet").
lemmatizer = WordNetLemmatizer()
porter = nltk.PorterStemmer()
snowball = nltk.SnowballStemmer("english")
text = (
    "He determined to drop his litigation with the monastery, and relinquish"
    " his claims to the wood cutting and fishery rights at once. "
    "He was more ready to do this."
)
# Tokenize once; keeping only word characters drops punctuation.
words = re.findall(r"\w+", text)
porter_stem_text = [porter.stem(word) for word in words]
snowball_stem_text = [snowball.stem(word) for word in words]
lemmatized_text = [lemmatizer.lemmatize(word) for word in words]
print(f"Original text: {text}\n")
print(f"Porter Stemmed text : {' '.join(porter_stem_text)}\n")
print(f"Snowball Stemmed text : {' '.join(snowball_stem_text)}\n")
print(f"Lemmatized text : {' '.join(lemmatized_text)}\n")

Porter and Snowball stemming methods convert some words to non-dictionary words. This restricts their use to problems where the output does not have to be readable, such as search engines, n-gram context, and text classification.
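As a rough illustration of the search-engine case, the sketch below stems both a query and a couple of made-up documents and matches them on stems instead of exact words. The documents, the query, and the `stem_set` helper are hypothetical, and real search engines do far more than this.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_set(text):
    """Lower-case the text, split on whitespace, and stem every token."""
    return {stemmer.stem(token) for token in text.lower().split()}


documents = [
    "he relinquished his claims",
    "the monastery owns the wood",
]
query = "relinquish claim"

# A document matches when it contains every stem of the query.
query_stems = stem_set(query)
matches = [doc for doc in documents if query_stems <= stem_set(doc)]
print(matches)
```

The first document matches even though it contains neither ‘relinquish’ nor ‘claim’ in exactly that form. The stems never need to be shown to the user, so it does not matter that some of them are not dictionary words.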

Lemmatization can be used in paragraph/document summarization, word/sentence prediction, sentiment analysis, and others.

Conclusion

The choice between Stemming and Lemmatization depends on project requirements. Prefer Lemmatization for critical projects and for projects where sentence structure matters, such as language applications. Both Stemming and Lemmatization affect precision and recall. Stemming tends to reduce precision and increase recall.
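The precision side of this trade-off can be seen when a stemmer merges words that are not actually related. A minimal sketch with the Porter stemmer, using a word pair chosen only for illustration:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Two words with different meanings that end up with the same stem.
first, second = "university", "universe"
print(first, "->", stemmer.stem(first))
print(second, "->", stemmer.stem(second))
print("Same stem:", stemmer.stem(first) == stemmer.stem(second))
```

A search for one of these words would also return texts about the other, which raises recall but lowers precision.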

Hopefully, this article helps you with NLP models and problems.

Other Articles by Author

  1. First step in EDA : Descriptive Statistic Analysis
  2. Automate Sentiment Analysis Process for Reddit Post: TextBlob and VADER
  3. Discover the Sentiment of Reddit Subgroup using RoBERTa Model

Towards AI

The Best of Tech, Science, and Engineering.

Manmohan Singh

Written by

Know more about me on Linkedin: https://www.linkedin.com/in/manmohan-singh-9570758a/