Stemming and Lemmatization in NLP

Shreya Khandelwal
3 min read · Sep 19, 2023

Stemming

Stemming is the process of reducing a word to its stem (base form) by chopping letters off the end. When searching for a specific keyword, a stem-based index also returns variations of that word from the document.

For example, searching for the word ‘boat’ also returns boats, boating, boater, etc.

Here the stem is boat, and suffixes are removed until it is reached.
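To make the search example concrete, here is a minimal sketch (my addition, using NLTK’s PorterStemmer, which is introduced just below) showing how the variations collapse to the same stem:

```python
from nltk.stem.porter import PorterStemmer

# Each variation is clipped back to the stem 'boat', so a
# stem-based search for 'boat' would match all of them.
stemmer = PorterStemmer()
for word in ['boat', 'boats', 'boating']:
    print(word, '---->', stemmer.stem(word))
```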

Two common stemmers are:

  • Porter Stemmer
  • Snowball Stemmer

Porter Stemmer:

One of the most common and effective stemming algorithms, it applies five sequential phases of suffix-reduction rules.

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness']
for word in words:
    print(word + '---->' + p_stemmer.stem(word))

OUTPUT:
run---->run
runner---->runner
ran---->ran
runs---->run
easily---->easili
fairly---->fairli
fairness---->fair

Snowball Stemmer:

It is an improvement over the Porter stemmer in terms of both speed and accuracy.

from nltk.stem.snowball import SnowballStemmer

s_stemmer = SnowballStemmer(language='english')
words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness']
for word in words:
    print(word + '---->' + s_stemmer.stem(word))

OUTPUT:
run---->run
runner---->runner
ran---->ran
runs---->run
easily---->easili
fairly---->fair
fairness---->fair

Lemmatization

Lemmatization analyzes the structure of words and is more useful and informative than stemming. It looks at surrounding words to determine a word’s part of speech (POS) and returns its dictionary form (lemma).

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I am a runner running in a race because i love to run since i ran today')
for token in doc:
    print(f'{token.text:{12}} {token.pos:{6}} {token.lemma:<{22}} {token.lemma_}')


OUTPUT:
I 95 4690420944186131903 I
am 87 10382539506755952630 be
a 90 11901859001352538922 a
runner 92 12640964157389618806 runner
running 100 12767647472892411841 run
in 85 3002984154512732771 in
a 90 11901859001352538922 a
race 92 8048469955494714898 race
because 98 16950148841647037698 because
i 95 4690420944186131903 I
love 100 3702023516439754181 love
to 94 3791531372978436496 to
run 100 12767647472892411841 run
since 98 10066841407251338481 since
i 95 4690420944186131903 I
ran 100 12767647472892411841 run
today 92 11042482332948150395 today

Stop Words

Stop words are frequently occurring words that carry little standalone meaning and don’t require tagging, so they are filtered out of the text before processing. Examples include articles, pronouns, and auxiliary verbs such as is, the, and a.

Default stop words:
spaCy ships with a default set of stop words; we can print it and get its size with the len() function.

import spacy
nlp = spacy.load('en_core_web_sm')

# Print the set of spaCy's default stop words:
print(nlp.Defaults.stop_words)

print(f"\n Length of default stop words is: {len(nlp.Defaults.stop_words)}")

OUTPUT:
{"'d", "'ll", "'m", "'re", "'s", "'ve", 'a',.. '’ll', '’m', '’re', '’s', '’ve'}

Length of default stop words is: 326

See if a word is a stop word or not:
To check whether a particular word is a stop word, read the is_stop attribute on its vocab entry.

# Look up each word's vocab entry and read its is_stop flag
print(nlp.vocab['is'].is_stop)
print(nlp.vocab['mystery'].is_stop)

OUTPUT:
True
False

Add a stop word:
We can add a new stop word to the default set and flag it on the vocab.

nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True
print(nlp.vocab['btw'].is_stop)

OUTPUT:
True

Delete a stop word:
We can remove a word from the defaults and confirm it is no longer flagged.

nlp.Defaults.stop_words.remove('beyond')
nlp.vocab['beyond'].is_stop = False
print(nlp.vocab['beyond'].is_stop)

OUTPUT:
False


About Me

I’m Shreya Khandelwal, a Data Scientist. Feel free to connect with me on LinkedIn!

Follow me on Medium for regular updates on similar topics.
