Stemming vs Lemmatization

Yigit Can Turk · Published in Analytics Vidhya · Nov 30, 2021 · 4 min read

During any NLP (Natural Language Processing) work, you need to follow some common steps that do not differ much from task to task. One of these steps is building a representation of the text, which may be retrieved from blog posts, web pages, scientific papers, news, etc. But there is no single representation of a text. Each piece of text has multiple layers of representation, and each layer has its own analysis: lexical analysis, syntactic analysis, semantic analysis, and pragmatic analysis. For each of these analyses there is a common first step, which is tokenizing.

This illustration is from the Coursera lecture Text Mining and Analytics by ChengXiang Zhai.

What is tokenizing?

Simply put, tokenizing is the process of splitting a text word by word. But of course it is not just splitting the text; it also requires other processes before you can run any analysis on the text.
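As a quick sketch (the article does not name a library, so using NLTK here is my assumption), tokenizing a sentence could look like this:

```python
# Tokenizing a sentence with NLTK's word_tokenize (library choice is an assumption).
import nltk
nltk.download("punkt", quiet=True)  # tokenizer model used by word_tokenize

from nltk.tokenize import word_tokenize

text = "I am now following you on Instagram."
tokens = word_tokenize(text)
print(tokens)
# ['I', 'am', 'now', 'following', 'you', 'on', 'Instagram', '.']
```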

Tokenizing alone is not enough to use the tokens for analysis directly. The tokens also need further preprocessing. Consider that some methods rely on metrics like frequency; raw tokens may mislead such a method toward unsuccessful results. For example, to such a method the word ‘follow’ and the word ‘following’ are totally different and have no relation. But as human beings we can see a relation between these two words, because ‘following’ is the continuous form of ‘follow’. On the other hand, the word may appear in sentences like ‘I am now following you on Instagram’ or ‘Can you please read the following sentences out loud?’. In these two sentences the meaning of ‘following’ is different. This ambiguity is another challenge in any NLP-related study, but I will ignore it for this article.

Let’s go back to the previous example of ‘following’ and ‘follow’. As I mentioned above, seeing the relation between these two words is not hard for me as a human, but it is hard for the computer. So people created different approaches to this problem, making it possible to separate the suffix from the root word. In other words, removing the ‘-ing’ part and recovering the root word ‘follow’ is possible thanks to these methods.

Stemming & Lemmatization

The approaches of stemming and lemmatization are actually very similar. Both focus on extracting the root word from a text token by removing the additional parts of that token. Most of the time, using either one of these methods is enough. But of course there are differences between them. Let’s see a token example first.

Assume that we have a text and we parsed this text into its tokens. Now we have a list of tokens belonging to the input text.

Let’s apply another step, removing stop words, before using stemming or lemmatization. Stop words are words that do not have much effect on the meaning of the text, such as ‘a’, ‘the’, ‘but’, ‘and’, ‘that’, etc.
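As a minimal sketch (again assuming NLTK; the exact stop-word list is an implementation detail), removing stop words from a token list could look like this:

```python
# Filtering stop words out of a token list with NLTK's English stop-word list.
import nltk
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

tokens = ["Can", "you", "please", "read", "the", "following", "sentences", "loudly", "?"]
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
# Roughly: ['please', 'read', 'following', 'sentences', 'loudly', '?']
```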

Now imagine we applied stemming to this token list (stop words were removed before this step).
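Here is a hedged sketch of that stemming step with NLTK’s PorterStemmer (one common stemmer; the article does not say which one was used), run on a few hand-picked tokens:

```python
# Stemming tokens with NLTK's PorterStemmer (the specific stemmer is an assumption).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

tokens = ["changing", "guilty", "political", "outcry", "following"]
print([stemmer.stem(t) for t in tokens])
# Roughly: ['chang', 'guilti', 'polit', 'outcri', 'follow']
```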

As you can see, there are some inappropriate output words like ‘chang’, ‘guilti’, ‘polit’, and ‘outcri’. These words have no meaning, but they are the outputs of stemming. Now let’s see the result of lemmatization.
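A comparable sketch with NLTK’s WordNetLemmatizer on the same hand-picked tokens (setup details are assumptions; note that without a part-of-speech hint it treats every token as a noun):

```python
# Lemmatizing tokens with NLTK's WordNetLemmatizer (setup details are assumptions).
import nltk
nltk.download("wordnet", quiet=True)

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

tokens = ["changing", "guilty", "political", "outcry", "following"]
print([lemmatizer.lemmatize(t) for t in tokens])
# Roughly: ['changing', 'guilty', 'political', 'outcry', 'following']

# Passing a part-of-speech tag helps with verb forms:
print(lemmatizer.lemmatize("changing", pos="v"))   # 'change'
print(lemmatizer.lemmatize("following", pos="v"))  # 'follow'
```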

The result of lemmatization is almost totally fine, except for the word ‘guilty’, because we expect to get the root word ‘guilt’ from it. But the result is still much better than stemming.

What is the difference?

The main difference between these two methods is that stemming can return an outcome that has no meaning, but lemmatization does not, because lemmatization looks up the related words in the WordNet corpus while stemming sometimes produces words that do not actually exist. This corpus lookup makes lemmatization slower than stemming but more accurate in its results. So while stemming offers speed, lemmatization offers accuracy.
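To make the contrast concrete, here is a small hedged comparison (same NLTK assumption as above) on a single word:

```python
# Stemmer vs. lemmatizer on the same word: the stemmer can produce a non-word,
# while the lemmatizer looks the word up in WordNet and returns a dictionary word.
import nltk
nltk.download("wordnet", quiet=True)

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "studies"
print(stemmer.stem(word))          # 'studi' -- fast, but not a real word
print(lemmatizer.lemmatize(word))  # 'study' -- slower WordNet lookup, real word
```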

What is WordNet corpus?

WordNet is a lexical database for the English language, created at Princeton University. It is used for finding the meanings of words, synonyms, antonyms, and more. For more information, use the following link ;)
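As a small illustration (NLTK exposes WordNet, so this interface is an assumption on my part, not something the article shows), you can browse a word’s senses like this:

```python
# Looking up senses of a word in WordNet through NLTK's corpus reader.
import nltk
nltk.download("wordnet", quiet=True)

from nltk.corpus import wordnet as wn

for synset in wn.synsets("follow")[:3]:
    print(synset.name(), "-", synset.definition())
```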
