In this tutorial, I will go through some simple text preprocessing techniques and explain lemmatization and stemming in Natural Language Processing.
But first, let’s get some vocabulary straight:
Corpus — a collection of text; in this tutorial, a paragraph of sentences. We refer to it as a corpus in NLP.
Document — a single unit of text; here, a single sentence.
Vocabulary — the set of all the unique words in our corpus.
IMPORTING LIBRARIES
With that out of the way, let’s import all libraries:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
Let’s first load the data and do all of the text preprocessing. We will be using a paragraph from Wikipedia and storing it in a variable corpus.
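The original paragraph did not survive the export, so any paragraph will do as a stand-in for the code below; here is a made-up placeholder (the text itself is just for illustration):

```python
# Placeholder corpus; the original tutorial used a paragraph from Wikipedia
corpus = ("Natural language processing (NLP) is a field of computer science. "
          "It is concerned with the interactions between computers and human language. "
          "Many NLP tasks have been summarized in surveys over the years!")
print(corpus)
```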
TEXT PREPROCESSING
First of all, we see that there is a lot of punctuation and there are brackets in the corpus. We will remove them with the help of re, Python’s built-in library for regular expressions.
cleaned_corpus = re.sub('[^a-zA-Z0-9]', ' ', corpus)
# The line above uses the sub method to replace everything except
# letters and digits in the corpus with a space.
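To see what this substitution does, here is a quick, self-contained sketch on a made-up sample string (the sample text is just for illustration):

```python
import re

# A made-up sample string with punctuation and brackets
sample = "NLP (Natural Language Processing) is fun, isn't it?"

# Replace every character that is not a letter or digit with a space
cleaned = re.sub('[^a-zA-Z0-9]', ' ', sample)
print(cleaned)
```

The brackets, commas, apostrophes, and the question mark all become spaces, so only letters and digits survive.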
Let’s print the new corpus:
Now, we will lowercase the paragraph and then use the word_tokenize method of nltk to convert the corpus into individual words. We will also remove any stopwords. Stopwords are words that don’t really contribute to our use case, like “is”, “am”, and “was”.
tokenized = word_tokenize(cleaned_corpus.lower())
words = [i for i in tokenized if i not in stopwords.words('english') and i.isalpha()]
words
The first line converts the corpus into words. The list comprehension loops through every word and adds it to words if it is not a stopword and consists only of letters (isalpha). The stopwords.words method returns a list of common stopwords in the English language like “is”, “am”, etc.
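The filtering idea can be sketched without nltk as well, using a tiny hand-picked stopword set in place of stopwords.words('english'); both the sample text and the stopword set below are made up for illustration:

```python
# Made-up cleaned corpus for illustration
cleaned_corpus = "The history of NLP is long and the field is growing"

# Tiny hand-picked stand-in for stopwords.words('english')
stop_words = {'the', 'is', 'and', 'of'}

# Lowercase, split into tokens, then keep only alphabetic non-stopwords
tokenized = cleaned_corpus.lower().split()
words = [w for w in tokenized if w not in stop_words and w.isalpha()]
print(words)  # ['history', 'nlp', 'long', 'field', 'growing']
```

word_tokenize handles punctuation and contractions better than split(), but the filtering step is the same.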
LEMMATIZATION AND STEMMING
The next step is lemmatization and stemming. Lemmatization and stemming are techniques that reduce a word to its base word. The difference is that lemmatization reduces a word to its dictionary base form (its lemma), which is always a real word. For instance, the words “going”, “goes”, and “gone” can all be reduced to the single word “go”.
On the other hand, stemming reduces a word to its base form regardless of whether the result is a real word or not. For instance, the words “history” and “histories” will both be reduced to “histori”, which is not a word.
Therefore, lemmatization generally gives better results, although stemming is still used in many use cases because lemmatization is slower than stemming.
EXAMPLE OF STEMMING
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ['history', 'histories']
reduced = [stemmer.stem(i) for i in words]
reduced
We import PorterStemmer from nltk, instantiate an object of it, and use its stem method on our list of words. Both words are reduced to histori.
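To see how aggressive the stemmer can be, here is a short sketch on a few more words (assuming nltk is installed; the word list is just for illustration):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Porter stemming chops suffixes, so the results are not always real words
for word in ['running', 'flies', 'histories']:
    print(word, '->', stemmer.stem(word))
```

Note that stemming needs no dictionary lookup, which is why it is fast.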
EXAMPLE OF LEMMATIZATION
An example of lemmatization is shown below:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ['history', 'histories']
reduced = [lemmatizer.lemmatize(i) for i in words]
reduced
We import WordNetLemmatizer from nltk, instantiate an object of it, and use its lemmatize method on our list of words. Both words are reduced to history.
NOW LET’S GET BACK TO THE SUBJECT!
Now, let’s apply lemmatization to the words we extracted from cleaned_corpus:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(i) for i in words]
You can see that some words like “summarized” were not lemmatized. That is because lemmatize treats every word as a noun by default, so verb forms like “summarized” pass through unchanged unless you tell it the part of speech. I tried it individually and got the same output.
Well, that’s about it for this tutorial. I hope you found it useful. If you did, give me a clap and follow me on Medium; I’ll be grateful.