In English (and in other languages as well), the same word may take different forms, such as “affected”, “affects” and “affect”.
To get a smaller vocabulary and a better representation for NLP problems, we sometimes want a single word, such as “affect”, to represent “affected” and “affects”. In this article, we will go through some libraries that perform lemmatization.
Before lemmatization, the article has to be tokenized. If you are not familiar with word tokenization, you may have a look at this article.
Step 1: Environment Setup
Install spaCy (2.0.11)
pip install spacy==2.0.11
Step 2: Import library
import spacy
print('spaCy Version: %s' % (spacy.__version__))
spacy_nlp = spacy.load('en_core_web_sm')
Step 3: Normalize
# Sample article text (the same text shown in the result below)
article = "Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form."

doc = spacy_nlp(article)
tokens = [token.text for token in doc]

print('Original Article: %s' % (article))
print()

for token in doc:
    if token.text != token.lemma_:
        print('Original : %s, New: %s' % (token.text, token.lemma_))
Result:
Original Article: Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.
Original : Lemmatisation, New: lemmatisation
Original : linguistics, New: linguistic
Original : is, New: be
Original : grouping, New: group
Original : inflected, New: inflect
Original : forms, New: form
Original : they, New: -PRON-
Original : analysed, New: analyse
Original : identified, New: identify
spaCy converts words to lower case and changes past tense and gerund forms (other tenses as well) to the base form. Also, “they” is normalized to “-PRON-”, which is a placeholder for pronouns.
Step 1: Environment Setup
Install NLTK (3.2.5)
pip install nltk==3.2.5
Step 2: Import library
import nltk
print('NLTK Version: %s' % (nltk.__version__))

nltk.download('wordnet')
wordnet_lemmatizer = nltk.stem.WordNetLemmatizer()
Step 3: Check
# The punkt tokenizer data is required by nltk.word_tokenize
nltk.download('punkt')

# Sample article text (the same text shown in the result below)
article = "Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form."

tokens = nltk.word_tokenize(article)

print('Original Article: %s' % (article))
print()

for token in tokens:
    lemmatized_token = wordnet_lemmatizer.lemmatize(token)
    if token != lemmatized_token:
        print('Original : %s, New: %s' % (token, lemmatized_token))
Result:
Original Article: Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.
Original : forms, New: form
Original : as, New: a
The result is totally different from spaCy’s. Only two words are lemmatized, and one of them, “as”, looks strange. It seems that a trailing “s” is removed, so “as” is converted to “a”.
Conclusion
The demonstration can be found in the Jupyter Notebook.
spaCy’s result is better and matches expectations. Taking “as” as an example, spaCy seems to have a kind of “intelligence”: it does not convert “as” to “a”. I therefore studied the source code further, and it turns out there are well-defined word lists and rules to support lemmatization.
# Copy from spacy/lang/en/lemmatizer/_lemma_rules.py
ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"]
]

# Copy from spacy/lang/en/lemmatizer/_nouns_irreg.py
NOUNS_IRREG = {
"aardwolves": ("aardwolf",),
"abaci": ("abacus",),
"aboideaux": ("aboideau",),
"aboiteaux": ("aboiteau",),
"abscissae": ("abscissa",),
"acanthi": ("acanthus",),
"acari": ("acarus",),
...
}
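To see why there are two rules per suffix (e.g. “er” → “” and “er” → “e”), here is a hypothetical pure-Python sketch of how such rules could be applied: each rule generates a candidate, and a candidate is only accepted if it is a known word. spaCy checks its own lexical index for this; the `KNOWN_WORDS` set below is just a stand-in for illustration.

```python
# Rules copied from spacy/lang/en/lemmatizer/_lemma_rules.py
ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"],
]

KNOWN_WORDS = {"large", "small"}  # stand-in vocabulary for illustration

def lemmatize_adj(word):
    # Try each suffix rule in order and keep the first candidate
    # that is a known word
    for suffix, repl in ADJECTIVE_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + repl
            if candidate in KNOWN_WORDS:
                return candidate
    return word

print(lemmatize_adj("larger"))    # "er" -> "" gives "larg" (rejected); "er" -> "e" gives "large"
print(lemmatize_adj("smallest"))  # "est" -> "" gives "small"
```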
For French, spaCy does a similar thing.
# Copy from spacy/lang/fr/lemmatizer.py
LOOKUP = {
"Ap.": "après",
"Apr.": "après",
"Auxerroises": "Auxerrois",
"Av.": "avenue",
"Ave.": "avenue",
"Avr.": "avril",
"Bd.": "boulevard",
"Boliviennes": "Bolivien",
"Canadiennes": "Canadien",
"Cannoises": "Cannois",
...
}
TL;DR
How does spaCy perform lemmatization in English? From the source code, it goes through POS (part of speech) tagging first. Lemmatization is performed only if the word is a noun, verb, adjective or adverb. It then checks whether the word exists in the irregular list; if so, the lemmatized form from that list is returned. Otherwise, it falls back to the pre-defined suffix rules.
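The order of operations above can be sketched in plain Python (illustrative names and heavily simplified data, not spaCy’s actual implementation):

```python
OPEN_CLASS = {"NOUN", "VERB", "ADJ", "ADV"}  # only these POS are lemmatized

NOUNS_IRREG = {"aardwolves": ("aardwolf",)}  # entry from _nouns_irreg.py
NOUN_RULES = [["s", ""], ["ses", "s"]]       # illustrative suffix rules

def lemmatize(word, pos):
    # 1. Only nouns, verbs, adjectives and adverbs are lemmatized
    if pos not in OPEN_CLASS:
        return word
    # 2. Irregular forms take priority over suffix rules
    if word in NOUNS_IRREG:
        return NOUNS_IRREG[word][0]
    # 3. Otherwise, fall back to the pre-defined suffix rules
    for suffix, repl in NOUN_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

print(lemmatize("aardwolves", "NOUN"))  # aardwolf (from the irregular list)
print(lemmatize("forms", "NOUN"))       # form (from a suffix rule)
print(lemmatize("as", "ADP"))           # as (POS is not lemmatized)
```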