Stemming and Lemmatization in NLP

Shreya Khandelwal
3 min read · Sep 19, 2023

Stemming

Stemming is the process of reducing a word to its stem (base form) by chopping letters off the end. When searching for a specific keyword, a stem-based index also returns variations of that word from the document.

For example, searching for the word ‘boat’ also returns boats, boating, boater, etc.

Here the stem is boat, and suffixes are removed until it is reached.
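To make the search example concrete, here is a minimal sketch (my addition, using NLTK’s PorterStemmer, which is introduced just below) showing how the variations collapse to the same stem:

```python
from nltk.stem.porter import PorterStemmer

# Each variation is clipped back to the stem 'boat', so a
# stem-based search for 'boat' would match all of them.
stemmer = PorterStemmer()
for word in ['boat', 'boats', 'boating']:
    print(word, '---->', stemmer.stem(word))
```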

Two common stemmers are:

  • Porter Stemmer
  • Snowball Stemmer

Porter Stemmer:

One of the most common and effective stemming algorithms, it applies five sequential phases of suffix-reduction rules.

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness']
for word in words:
    print(word + '---->' + p_stemmer.stem(word))

OUTPUT:
run---->run
runner---->runner
ran---->ran
runs---->run
easily---->easili
fairly---->fairli
fairness---->fair

Snowball Stemmer:

It is an improvement over the Porter stemmer in terms of both speed and accuracy.

from nltk.stem.snowball import SnowballStemmer

s_stemmer = SnowballStemmer(language='english')
words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness']
for word in words:
    print(word + '---->' + s_stemmer.stem(word))

OUTPUT:
run---->run
runner---->runner
ran---->ran
runs---->run
easily---->easili
fairly---->fair
fairness---->fair

Lemmatization

Lemmatization analyzes the structure of words and is more useful and informative than stemming. It looks at surrounding words to determine a word’s part of speech (POS) and returns its dictionary form (lemma).

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I am a runner running in a race because i love to run since i ran today')
for token in doc:
    print(f'{token.text:{12}} {token.pos:{6}} {token.lemma:<{22}} {token.lemma_}')


OUTPUT:
I 95 4690420944186131903 I
am 87 10382539506755952630 be
a 90 11901859001352538922 a
runner 92 12640964157389618806 runner
running 100 12767647472892411841 run
in 85 3002984154512732771 in
a 90 11901859001352538922 a
race 92 8048469955494714898 race
because 98 16950148841647037698 because
i 95 4690420944186131903 I
love 100 3702023516439754181 love
to 94 3791531372978436496 to
run 100 12767647472892411841 run
since 98 10066841407251338481 since
i 95 4690420944186131903 I
ran 100 12767647472892411841 run
today 92 11042482332948150395 today

Stop Words

Stop words are frequently occurring words that carry little standalone meaning and don’t require tagging, so they are filtered out of the text before processing. Examples include articles, pronouns, and auxiliary verbs such as is, the, and a.

Default stop words:
spaCy ships with a default set of stop words; we can print it and get its size with the len() function.

import spacy
nlp = spacy.load('en_core_web_sm')

# Print the set of spaCy's default stop words:
print(nlp.Defaults.stop_words)

print(f"\n Length of default stop words is: {len(nlp.Defaults.stop_words)}")

OUTPUT:
{"'d", "'ll", "'m", "'re", "'s", "'ve", 'a',.. '’ll', '’m', '’re', '’s', '’ve'}

Length of default stop words is: 326

See if a word is a stop word or not:
To check whether a particular word is a stop word, read the is_stop attribute on its vocab entry.

# Look up each word's vocab entry and read its is_stop flag
print(nlp.vocab['is'].is_stop)
print(nlp.vocab['mystery'].is_stop)

OUTPUT:
True
False

Add a stop word:
We can add a new stop word to the default set and flag it on the vocab.

nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True
print(nlp.vocab['btw'].is_stop)

OUTPUT:
True

Delete a stop word:
We can remove a word from the defaults and confirm it is no longer flagged.

nlp.Defaults.stop_words.remove('beyond')
nlp.vocab['beyond'].is_stop = False
print(nlp.vocab['beyond'].is_stop)

OUTPUT:
False


About Me

I’m Shreya Khandelwal, a Data Scientist. Feel free to connect with me on LinkedIn!

Follow me on Medium for regular updates on similar topics.
