NLP Pipeline: Lemmatization (Part 3)

Edward Ma
3 min read · May 27, 2018

Source: https://www.tell-a-tale.com/unbox-idea-social-open-mic-tell-a-story-to-change-world/

In English (and in other languages as well), the same word can take different forms, such as “affected”, “affects” and “affect”.

To get a smaller vocabulary and a better representation for NLP problems, we want a single word to represent all of these forms in some scenarios. In this article, we will go through some libraries that perform lemmatization.
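To make the vocabulary argument concrete, here is a minimal sketch that counts distinct tokens before and after mapping word forms to a lemma. The `LEMMA_LOOKUP` table and the sample tokens are made up for illustration; a real pipeline would get its lemmas from a library such as the ones covered below.

```python
# Hypothetical hand-written lemma table, for illustration only.
LEMMA_LOOKUP = {
    "affected": "affect",
    "affects": "affect",
}

tokens = ["affect", "affects", "affected", "affects", "the", "result"]

# Map each token to its lemma, falling back to the token itself.
lemmas = [LEMMA_LOOKUP.get(token, token) for token in tokens]

vocab_before = set(tokens)
vocab_after = set(lemmas)

print('Vocabulary size before: %d' % len(vocab_before))
print('Vocabulary size after: %d' % len(vocab_after))
```

The three forms of “affect” collapse into one entry, so the vocabulary shrinks while the remaining tokens are untouched.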

Before lemmatization, the article has to be tokenized. If you are not familiar with word tokenization, you may want to have a look at this article.

Source: https://spacy.io/

Step 1: Environment Setup

Install spaCy (2.0.11) and the small English model used below

pip install spacy==2.0.11
python -m spacy download en_core_web_sm

Step 2: Import library

import spacy
print('spaCy Version: %s' % (spacy.__version__))
spacy_nlp = spacy.load('en_core_web_sm')

Step 3: Normalize

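# Sample text (the same sentence that appears in the result below)
article = "Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form."

doc = spacy_nlp(article)
tokens = [token.text for token in doc]
print('Original Article: %s' % (article))
print()
# Print only the tokens whose lemma differs from the original text
for token in doc:
    if token.text != token.lemma_:
        print('Original : %s, New: %s' % (token.text, token.lemma_))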

Result:

Original Article: Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

Original : Lemmatisation, New: lemmatisation
Original : linguistics, New: linguistic
Original : is, New: be
Original : grouping, New: group
Original : inflected, New: inflect
Original : forms, New: form
Original : they, New: -PRON-
Original : analysed, New: analyse
Original : identified, New: identify

spaCy converts words to lower case and reduces past-tense and gerund forms (and other inflections as well) to their base form. Also, “they” is normalized to “-PRON-”, which stands for pronoun.

Source: https://geonaut.co.uk/projects/programming/

Step 1: Environment Setup

pip install nltk==3.2.5

Step 2: Import library

import nltk 
print('NLTK Version: %s' % (nltk.__version__))
nltk.download('wordnet')
nltk.download('punkt')  # tokenizer models required by nltk.word_tokenize
wordnet_lemmatizer = nltk.stem.WordNetLemmatizer()

Step 3: Check

tokens = nltk.word_tokenize(article)
print('Original Article: %s' % (article))
print()
for token in tokens:
    lemmatized_token = wordnet_lemmatizer.lemmatize(token)

    if token != lemmatized_token:
        print('Original : %s, New: %s' % (token, lemmatized_token))

Result:

Original Article: Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

Original : forms, New: form
Original : as, New: a

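The result is totally different from spaCy's. Only two words are lemmatized, and one of them, “as”, looks strange. It seems that a trailing “s” gets removed, so “as” is converted to “a”. One likely reason so little changes is that WordNetLemmatizer.lemmatize() treats every token as a noun unless a part-of-speech tag is passed, so verbs such as “is” and “grouping” are left as they are.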

Conclusion

The demonstration can be found in the Jupyter Notebook.

spaCy's result is better and closer to what I expected. Taking “as” as an example, spaCy seems to have some kind of “intelligence”: it does not convert “as” to “a”. I therefore studied the source code further, and it appears that there are well-defined word lists and rules to support lemmatization.

# Copy from spacy/lang/en/lemmatizer/_lemma_rules.py
ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"]
]

# Copy from spacy/lang/en/lemmatizer/_nouns_irreg.py
NOUNS_IRREG = {
    "aardwolves": ("aardwolf",),
    "abaci": ("abacus",),
    "aboideaux": ("aboideau",),
    "aboiteaux": ("aboiteau",),
    "abscissae": ("abscissa",),
    "acanthi": ("acanthus",),
    "acari": ("acarus",),
    ...
}

For French, spaCy does something similar.

# Copy from spacy/lang/fr/lemmatizer.py
LOOKUP = {
    "Ap.": "après",
    "Apr.": "après",
    "Auxerroises": "Auxerrois",
    "Av.": "avenue",
    "Ave.": "avenue",
    "Avr.": "avril",
    "Bd.": "boulevard",
    "Boliviennes": "Bolivien",
    "Canadiennes": "Canadien",
    "Cannoises": "Cannois",
    ...
}
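A lookup table like this can be applied with a plain dictionary lookup. The sketch below copies a few entries from the excerpt above; the `lookup_lemmatize` helper is a hypothetical name of mine, not a spaCy function, and it simply returns the word unchanged when it is not in the table.

```python
# A few entries copied from the LOOKUP excerpt above.
LOOKUP = {
    "Av.": "avenue",
    "Bd.": "boulevard",
    "Canadiennes": "Canadien",
}


def lookup_lemmatize(word, table=LOOKUP):
    """Return the lemma from the table, or the word itself if it is missing."""
    return table.get(word, word)


for word in ["Canadiennes", "Bd.", "Paris"]:
    print('Original : %s, New: %s' % (word, lookup_lemmatize(word)))
```

The trade-off of a pure lookup approach is that any word missing from the table passes through untouched, so coverage depends entirely on the size of the table.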

TL;DR

How does spaCy perform lemmatization in English? From the source code, it first looks at the POS (part of speech) tag. Lemmatization is performed only if the word is a noun, verb, adjective or adverb. It then checks whether the word exists in the irregular list; if it does, the lemmatized word from that list is returned. Otherwise, it falls back to the pre-defined suffix rules.
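The flow described above can be sketched in a few lines of plain Python. This is a simplified illustration and not spaCy's actual code: the `lemmatize` function, the tiny tables, the POS tags in the demo loop and the `KNOWN_WORDS` set are assumptions made for the example (only the adjective rules and the “abaci” entry come from the excerpts quoted earlier). Accepting a suffix-rule candidate only when it is a known word is likewise an assumption about how wrong candidates get filtered out.

```python
# Illustrative tables; a tiny made-up subset, not spaCy's real data.
LEMMA_POS = {"NOUN", "VERB", "ADJ", "ADV"}

IRREGULAR = {
    "NOUN": {"abaci": "abacus"},
    "VERB": {"is": "be"},
}

# (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = {
    "NOUN": [("s", "")],
    "VERB": [("ing", ""), ("ed", "")],
    "ADJ": [("er", ""), ("est", ""), ("er", "e"), ("est", "e")],
}

# Hypothetical list of known base forms used to accept a candidate.
KNOWN_WORDS = {"form", "group", "inflect", "large"}


def lemmatize(word, pos):
    """Return a lemma for `word` given its part-of-speech tag."""
    word = word.lower()
    # 1. Only nouns, verbs, adjectives and adverbs are lemmatized.
    if pos not in LEMMA_POS:
        return word
    # 2. Irregular forms are looked up directly.
    irregular = IRREGULAR.get(pos, {})
    if word in irregular:
        return irregular[word]
    # 3. Otherwise try the suffix rules and keep the first known candidate.
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)] + replacement
            if candidate in KNOWN_WORDS:
                return candidate
    return word


# The POS tags below are assumed for the demo, not produced by a tagger.
for word, pos in [("forms", "NOUN"), ("is", "VERB"), ("largest", "ADJ"), ("as", "ADP")]:
    print('Original : %s, New: %s' % (word, lemmatize(word, pos)))
```

Because “as” is tagged as something other than a noun, verb, adjective or adverb in this sketch, it never reaches the suffix rules, which is one way the “as” to “a” problem can be avoided.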

Edward Ma

Focus in Natural Language Processing, Data Science Platform Architecture. https://makcedward.github.io/