Lemmatization and Stemming in NLP

Moaaz
4 min read · Aug 27, 2022

In this tutorial, I will go through some simple text preprocessing techniques and explain lemmatization and stemming in Natural Language Processing (NLP).

But first, let’s get some vocabulary straight:

Corpus: a collection of text. In this tutorial, it is a single paragraph made up of several sentences.

Document: a single unit of text within the corpus. In this tutorial, it is simply a single sentence.

Vocabulary: the set of all the unique words in our corpus.
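
To make these three terms concrete, here is a minimal sketch in plain Python. The two sentences are made up for illustration and are not the Wikipedia paragraph used later.

```python
# Toy corpus (an assumption for illustration): each string is one document.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Vocabulary: the unique words across every document in the corpus.
vocabulary = sorted({word for document in corpus for word in document.split()})

print(len(corpus))   # number of documents
print(vocabulary)    # sorted list of unique words
```

Note that "the", "sat" and "on" appear in both documents but only once in the vocabulary.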

IMPORTING LIBRARIES

With that out of the way, let’s import all libraries:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re

Let’s first load the data and do all of the text preprocessing. We will be using a paragraph from Wikipedia and storing it in a variable named corpus. Here’s the link:

TEXT PREPROCESSING

First of all, we see that there are lots of punctuation marks and brackets in the corpus. We will remove them with the help of re, Python’s built-in library for working with regular expressions.

import re
# Replace every character that is not a letter or a digit with a space.
cleaned_corpus = re.sub(r'[^a-zA-Z0-9]', ' ', corpus)

Let’s print the new corpus:

Cleaned corpus, all braces and punctuations are removed.
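
If you want to try the substitution without the Wikipedia paragraph, here is a small sketch on a made-up sentence. The sample string and the extra step that collapses repeated spaces are my own additions for illustration.

```python
import re

# Hypothetical sample text standing in for the Wikipedia paragraph.
sample = "NLP (natural language processing) began in the 1950s[1]."

# Replace every character that is not a letter or digit with a space.
cleaned = re.sub(r'[^a-zA-Z0-9]', ' ', sample)

# Optionally collapse the runs of spaces left behind by the substitution.
collapsed = re.sub(r'\s+', ' ', cleaned).strip()

print(collapsed)
```

The brackets and the full stop turn into spaces, so the citation marker is left behind as a stray "1" token. That is one reason the next step also filters tokens with isalpha.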

Now, we will lowercase the paragraph and then use the word_tokenize method of nltk to convert the corpus into individual words. We will also remove any stopwords. Stopwords are those words that don’t really contribute to our use case like “is”, “am”, “was” etc.

tokenized = word_tokenize(cleaned_corpus.lower())
# Keep tokens that are not stopwords and contain only letters.
words = [i for i in tokenized if i not in stopwords.words('english') and i.isalpha()]
words

The first line converts the corpus into words. The list comprehension loops through every word and adds it to words if it is not a stopword and consists only of alphabetic characters. The stopwords.words method returns a list of common stopwords in the given language, here English, such as “is”, “am” etc.

Output, a large list of words
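
Here is a minimal sketch of the same filtering idea that runs without downloading any NLTK data. It uses a tiny hand-written stopword set instead of NLTK's list and str.split instead of word_tokenize, so treat both as stand-ins rather than equivalents.

```python
# Tiny hand-picked stopword set (an assumption; NLTK's English list is much longer).
stop_words = {"is", "am", "was", "the", "a", "of", "in"}

# Made-up text standing in for the cleaned Wikipedia paragraph.
cleaned_corpus = "The history of NLP is a history of ideas in 2022"

# Lowercase first so that "The" matches the lowercase stopword "the".
tokens = cleaned_corpus.lower().split()

# Keep tokens that are not stopwords and contain only letters.
words = [t for t in tokens if t not in stop_words and t.isalpha()]

print(words)
```

Storing the stopwords in a set also makes each membership check fast, which helps on a long corpus.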

LEMMATIZATION AND STEMMING

The next step is lemmatization and stemming. Both are techniques that reduce a word to its base form. The difference is that lemmatization only returns a base form that is an actual dictionary word (the lemma). For instance, the words “going”, “goes”, and “gone” can all be reduced to the single word “go”.

On the other hand, stemming strips a word down to a stem by rule, regardless of whether the result is a real word or not. For instance, the words “history” and “histories” will both be reduced to “histori”, which is not a word.

Therefore, lemmatization is usually the better approach when you need real words as output, although stemming is still used in many use cases. The trade-off is that lemmatization is slower than stemming.

EXAMPLE OF STEMMING

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Use a separate name so the token list in words is not overwritten.
sample_words = ['history', 'histories']
reduced = [stemmer.stem(i) for i in sample_words]
reduced

We import PorterStemmer from nltk, instantiate an object of it, and use its method stem on our list of words. You can see in the output below that both words are reduced to “histori”.

Example of stemming
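
To get a feel for how the stemmer behaves on other kinds of words, here is a short sketch that runs it on a few extra samples. The extra words are my own picks for illustration and do not come from the corpus.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few sample words (chosen for illustration, not taken from the corpus).
samples = ["history", "histories", "running", "studies"]

# Map each word to its stem so the pairs are easy to inspect.
stems = {word: stemmer.stem(word) for word in samples}

for word, stem in stems.items():
    print(word, "->", stem)
```

Some stems, like the one for “running”, happen to be real words, while others are not. The stemmer only applies suffix rules and never checks a dictionary.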

EXAMPLE OF LEMMATIZATION

An example of lemmatization is shown below:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Use a separate name so the token list in words is not overwritten.
sample_words = ['history', 'histories']
reduced = [lemmatizer.lemmatize(i) for i in sample_words]
reduced

We import WordNetLemmatizer from nltk, instantiate an object of it, and use its method lemmatize on our list of words. You can see in the output below that both words are reduced to “history”.

An example of lemmatization

NOW LET’S GET BACK TO THE SUBJECT!

Now, let’s apply lemmatization to the words we extracted from our cleaned_corpus:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# words must be the token list built in the text preprocessing step.
lemmatized = [lemmatizer.lemmatize(i) for i in words]

Applying lemmatization on our cleaned corpus

You can see that some words like “summarized” were not lemmatized. That is because lemmatize treats every word as a noun unless told otherwise (its pos argument defaults to 'n'), and “summarized” is a verb form, so no noun lemma is found and the word is returned unchanged. I tried it individually and got the same output:

See, it is not reduced to base form!

Well, that’s about it for this tutorial. Hope you found it useful. If you did, give me a clap and follow me on Medium, for which I’ll be grateful.

Moaaz

A software engineering undergraduate who is highly enthusiastic about Data Science, Machine Learning, and Web Development.