Lemmatization [NLP, Python]

Yash Jain
4 min read · Feb 22, 2022


Lemmatization is the process of replacing a word with its root or head word, called the lemma. The aim is to reduce inflectional forms to a common base form.

A lemmatizer uses a knowledge base of word synonyms and word endings to ensure that only words that mean similar things are consolidated into a single token.

If you have access to information about connections between the meanings of various words, you might be able to associate several words together even if their spellings are quite different. This more extensive normalization, down to the semantic root of a word, is what lemmatization does, and that root is called the lemma.

For example, “am,” “are,” “is,” “was,” and “were” would all be treated as “be” in an NLP pipeline with lemmatization, even though their surface forms are quite different.

Some points to remember:
- It is slower than stemming
- It is more accurate than stemming
- It uses a dictionary-based approach
- It is preferred when retaining the meaning of a sentence matters
- It depends heavily on the POS tag to find the correct root word (lemma)
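To make the stemming-vs-lemmatization contrast concrete, here is a toy sketch (purely illustrative, not any real library): a stemmer blindly strips suffixes, while a lemmatizer looks up known word forms in a dictionary. The suffix list and lemma dictionary below are made up for illustration.

```python
# Toy illustration only -- not a real stemmer or lemmatizer.

def toy_stem(word):
    # A stemmer crudely strips common suffixes, which can mangle words.
    for suffix in ('ing', 'ies', 'es', 's'):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# A lemmatizer instead consults a dictionary of known inflected forms.
TOY_LEMMAS = {'was': 'be', 'were': 'be', 'is': 'be',
              'better': 'well', 'studies': 'study'}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for w in ['studies', 'was', 'better']:
    print(f"{w}: stem={toy_stem(w)}, lemma={toy_lemmatize(w)}")
# studies: stem=stud, lemma=study
# was: stem=wa, lemma=be
# better: stem=better, lemma=well
```

Notice the stemmer turns “studies” into the non-word “stud” and “was” into “wa”, while the dictionary lookup recovers the real base forms — which is exactly why lemmatization is slower but more accurate.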

Applications
- Building a more generalized document-term matrix instead of a sparse one
- Widely used in web search results
- Information retrieval

Let’s try it hands-on.

Let’s get some background on WordNet before applying the WordNet lemmatizer →
WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

NLTK wordnet lemmatizer

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for item in ['am', 'are', 'is', 'was', 'were']:
    print(lemmatizer.lemmatize(item, pos='v'), end='\t')

Output:

be	be	be	be	be

Notice that we passed pos='v', i.e. we gave the lemmatizer the POS tag manually as a verb ('v'). Otherwise it would have used the default POS tag 'n' (noun), which gives the wrong result: am are is wa were. So we have to supply the POS tag manually here.

Let’s automate this process: detect the POS tag with nltk.pos_tag and pass it to the WordNetLemmatizer.

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
import nltk

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map POS tag to the first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].lower()
    tag_dict = {"a": wordnet.ADJ,
                "n": wordnet.NOUN,
                "v": wordnet.VERB,
                "r": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

for items in ['best', 'well', 'better'] + ['be', 'was', 'were', 'is', 'am']:
    print(f"{items:<6}--> {lemmatizer.lemmatize(items, pos=get_wordnet_pos(items))}")

Output:

best  --> best
well  --> well
better--> well
be    --> be
was   --> be
were  --> be
is    --> be
am    --> be

Notice that ‘best’ remained unchanged here, but it will be converted to ‘well’ by the spaCy lemmatizer.

spaCy lemmatizer

import spacy

nlp = spacy.load('en_core_web_lg')
for token in nlp('best well better be was were is am'):
    print(f"{token.text:<6}--> {token.lemma_}")

Output:

best  --> well
well  --> well
better--> well
be    --> be
was   --> be
were  --> be
is    --> be
am    --> be

Compared to the NLTK WordNetLemmatizer, you can notice that ‘best’ has been converted to ‘well’ too. So spaCy seems to work better here.

Let’s look at the POS tags and verb forms to understand lemmatization better.

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp('best well better be was were is am')
print(f"{'Text':{8}} | {'Lemma':{6}} | {'POS':{6}} | {'POS explained':{14}} | {'TAG':{4}} | {'Tag explained'}\n", '-' * 65)
for token in doc:
    print(f'{token.text:{8}} | {token.lemma_:{6}} | {token.pos_:{6}} | {spacy.explain(token.pos_):{14}} | {token.tag_:{4}} | {spacy.explain(token.tag_)}')

Output: a table listing each token with its lemma, coarse POS tag, fine-grained tag, and their explanations.

You can also check out the spaCy glossary for tag explanations here.
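Those tag explanations come from spacy.explain, which is just a glossary lookup — no language model needs to be loaded. A quick sketch (assuming only that spaCy is installed):

```python
import spacy

# spacy.explain resolves tag names against spaCy's built-in glossary;
# it works without loading any pipeline or model.
for tag in ['VB', 'VBD', 'VBP', 'JJR', 'RBR']:
    print(f"{tag:<4} -> {spacy.explain(tag)}")
# VB   -> verb, base form
# VBD  -> verb, past tense
# VBP  -> verb, non-3rd person singular present
# JJR  -> adjective, comparative
# RBR  -> adverb, comparative
```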

Stanza lemmatizer

Lemmatization in Stanza is performed by the LemmaProcessor and can be invoked with the name lemma. The LemmaProcessor requires the TokenizeProcessor, MWTProcessor, and POSProcessor.

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma')
doc = nlp('best well better be was were is am')
print(*[f'word: {word.text}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

Output:

word: best	lemma: well
word: well	lemma: well
word: better	lemma: well
word: be	lemma: be
word: was	lemma: be
word: were	lemma: be
word: is	lemma: be
word: am	lemma: am

Notice that ‘am’ was not converted to ‘be’; ‘am’ is the VBP form (verb, non-3rd person singular present).

LemmInflect

LemmInflect is tested against the Automatically Generated Inflection Database (AGID) as a baseline. Although it takes more time, its accuracy is fairly high.

Installation (pip install lemminflect)

| Package          | Verb  | Noun  | ADJ/ADV | Overall | Speed   |
|------------------|-------|-------|---------|---------|---------|
| LemmInflect      | 96.1% | 95.4% | 93.9%   | 95.6%   | 42.0 µs |
| CLiPS/pattern.en | 93.6% | 91.1% | 0.0%    | n/a     | 3.0 µs  |
| Stanford CoreNLP | 87.6% | 93.1% | 0.0%    | n/a     | n/a     |
| spaCy            | 79.4% | 88.9% | 60.5%   | 84.7%   | 5.0 µs  |
| NLTK             | 53.3% | 52.2% | 53.3%   | 52.6%   | 13.0 µs |

So let’s try this one

import spacy
import lemminflect

nlp = spacy.load('en_core_web_sm')
doc = nlp('best well better be was were is am')
print(f"{'Text':{8}} | {'Lemma':{6}}\n")
for token in doc:
    print(f'{token.text:{8}} | {token._.lemma():{6}}')

Output:

Text     | Lemma

best     | well
well     | well
better   | good
be       | be
was      | be
were     | be
is       | be
am       | am

In this case ‘better’ was converted to ‘good’. This may well be correct: the tag assigned to a token depends on the whole sentence, and here it was likely treated as the comparative form of the adjective rather than of the adverb.

To look up a lemma directly, use the getLemma method. It takes a word and a universal POS tag (upos) and returns a tuple of the possible lemma(s).

from lemminflect import getLemma
getLemma('watches', upos='VERB')

Output:
('watch',)

We have covered four packages here. You can experiment with more corpora/texts and test whether spaCy, LemmInflect, or any other library suits your needs.


Yash Jain

Data Scientist/ Data Engineer at IBM | Alumnus of @niituniversity | Natural Language Processing | Pronouns: He, Him, His