How to build a Lemmatizer

And why

Tiago Duque
Published in Analytics Vidhya · 11 min read · Mar 2, 2020


If you’re into NLP, you’ve probably stumbled over a dozen tools that have this neat feature named “lemmatization”. In this article, I’ll do my best to guide you through what Lemmatization is, why it is useful and how we can build a Lemmatizer!

If you’re coming from my previous article on how to make a PoS Tagger, you’ve already grasped the important prerequisites for Lemmatization. If not, I’ll gently present them throughout this article, so let’s get started!

What is Lemmatization?

Lemmas are also called the “Dictionary Form” of a word.

Lemmatization is a Natural Language Processing technique that proposes to reduce a word to its Lemma, or Canonical Form. What is a Lemma? A hint — it is also called Dictionary Form (there are many names for a simple concept, don’t you think?).

So a lemma is the base form of a word, which means that any variation related to tense or number is removed. For example, plural nouns (girls, boys, corpora) get reduced to their singular forms (girl, boy, corpus), and verb tense/participle variants (ate, brought, chatting) go back to the base form (to eat, to bring, to chat).

In some cases, lemmatization can also include removing gender variation (doctress → doctor), although this is very uncommon in English, since the language has moved towards gender neutrality. Still, it can be done in some specific cases (such as focusing on a specific gender, e.g. ‘bull’ → ‘cow’, ‘rooster’ → ‘chicken’).

What is a lemma, according to the WordWeb free dictionary?

Even though lemmatization might not seem that useful at first, it is a powerful tool for text normalization, since it normalizes text in a more syntactic manner (verbs stay verbs, nouns stay nouns and so on) than stemming, which we analyzed in another article.

Now, as you might have noticed, a word’s lemma is closely tied to its part of speech. Why is that? Because each word is “house” to one or more lexemes. A Lexeme, in turn, is the “basic abstract unit of meaning”[1] in lexical terms. In other “words”, every word can have one or more meanings (oh my, too many ‘word’ words for a single sentence!). I won’t dive any deeper into this morphological/lexical/semantic discussion, but keep this last point in mind.

So, a word tied to one role in the sentence (part of speech) can evoke a certain lexeme, while in another role it evokes a second one. That is the case for “living”: it can evoke a lexeme related to “work” if it is a noun, to “alive” if it is an adjective, or to “to live” if it is a verb. Complicated? Try to go further than that, and then you tell me what’s complicated! (Also, to the linguists out there, please forgive me for oversimplifying the concept of lexeme.)

Now, what is a Lexeme, Mr. WordWeb?

With this in mind, I hope you can start to grasp the utility and importance of Lemmatization. If not, let us make it clearer.

By the way, I think it goes without saying that Lemmatization is a task specific to each language, right?

Why and When should I use Lemmatization?

Here’s a scenario: you’re building a Sentiment Analysis tool for a real estate company based on Twitter posts about “housing”. As a good Data Scientist, you managed to get a stream of incoming tweets containing that word. Now, you want to analyze whether these tweets are positive or negative.

Housing prices at Gotham are too expensive! #myhouseisatfire #batmangetthesecorrupts #holyexpensive

Since you don’t know what Lemmatization is good for, you start building your analyzer: you use a stemmer to reduce the dimensionality of your input, you label some examples and train a machine learning classifier with 90%+ accuracy! Good job, Mr. Data Scientist! However… one day, there’s a fiery discussion about a certain nation that is “housing” terrorists in its embassy. What do you expect is going to happen to your ‘carefully’ engineered machine learning model?

That’s a situation where Lemmatization is useful. For instance, if you lemmatized entries before processing, you’d be able to discard tweets where “housing” is a form of the verb “to house” rather than the noun “housing”. So Lemmatization can help you reduce dimensionality just as a stemmer does; however, it is more precise at keeping the sentence structure and the correct lexemes in place (even though a sentence might read awkwardly after Lemmatization).

If you want to learn more about the difference between stemming and lemmatization, I recommend you go read this very well tailored theoretical article:

Another situation where Lemmatization is useful is in model-based approaches. Most NLP today is used together with Machine Learning to generate models and inferences automatically. However, studies demonstrate that, in some cases, a curated model is more effective, if not necessary, to accomplish a task.

Picture from my master’s thesis: lemmatization can help in building Knowledge Graphs.

During my master’s, I hand-modeled a Knowledge Graph about the dairy farming domain to use in Question Answering. In that case, a model-based approach was useful because I had neither a large corpus nor contextualized tools to work with dairy farming vocabulary. One thing my work depended on was a Lemmatizer to help preprocess each question word so it would match the gender/tense/number-neutral word nodes (or lexeme representations) that composed the Knowledge Graph.

So lemmatized entries are a good choice for matching: they reduce the need to model every single variation of a word, helping you focus less on the keys used to access the model and more on the model itself (which is very important if you’re working with a language like Portuguese, where there are more than six verb tenses plus gender variation).

Now, if you’re a word2vec baby, you might not see the real use of this technique (“just vectorize it”, you might say). Just wait: I’m sure that over your career you’ll find situations where a vector is not the best solution.

How to build a Lemmatizer

Time to get to work! The hardest part of lemmatization is retrieving the Part of Speech of a word. Thankfully, we’ve already done that in the last article, and there are many tools that can do it for us.

Here are some quick examples:
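If you don’t want to roll your own tagger, NLTK and spaCy are two common choices. The minimal sketches below use my own picks of tools and model names, purely for illustration:

# Quick PoS tagging with NLTK (download the tagger resources once;
# on newer NLTK versions some resource names may differ slightly)
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

tokens = nltk.word_tokenize("There are many dogs living in my living place")
print(nltk.pos_tag(tokens, tagset='universal'))  # (word, coarse universal tag) pairs

# Quick PoS tagging with spaCy (requires: python -m spacy download en_core_web_sm)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("There are many dogs living in my living place")
print([(token.text, token.pos_) for token in doc])  # Universal Dependencies tags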

Having each word’s PoS, we can discuss how to do Lemmatization. There are two main methods:

  • Rule-based method: uses a set of rules that tell how a word should be modified to extract its lemma. Example: if the word is a verb and it ends in -ing, do some substitutions… This method is very tricky and probably won’t give the best results (it is hard to generalize in English); see the tiny sketch right after this list.
  • Corpus-based method: uses a tagged corpus (or an annotated dataset) to provide the lemma for each word. Basically it is a huge list of words and their related lemmas for each PoS (or without the PoS, if you’re doing a naive approach). This, of course, requires access to an annotated corpus, which might be tricky (or expensive) to get.
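To make the rule-based idea concrete, here is a deliberately tiny sketch that handles only -ing verb forms (my own toy illustration, not the method we’ll build below):

def naive_verb_lemma(word):
    # Toy rule: strip '-ing' and undo consonant doubling.
    # It shows the problem well: 'chatting' -> 'chat' and 'bringing' -> 'bring'
    # come out right, but 'living' -> 'liv' is wrong and would need yet another rule.
    if word.endswith('ing') and len(word) > 5:
        stem = word[:-3]
        if len(stem) > 2 and stem[-1] == stem[-2]:
            return stem[:-1]
        return stem
    return word

print([naive_verb_lemma(w) for w in ['chatting', 'bringing', 'living']])
# ['chat', 'bring', 'liv']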

For this tutorial I’ll use a combination of both: the corpus-based method for general words and a rule-based method for plural noun normalization. So, let’s get started!

Building a Corpus-based Lemmatizer

Previously in this series we trained a PoS Tagger on the GUM Corpus, made available by the Universal Dependencies project. This corpus is distributed in the CoNLL-U format, a specific format for linguistically annotated files. Aside from providing the Part of Speech for the words in the annotated sentences, the CoNLL-U files of the GUM Corpus also provide each word’s lemma, which will allow us to create a dictionary to quickly retrieve lemmas during the lemmatization process.

I’ve merged the whole GUM corpus into a single file, so we can enjoy lemmas from the train, dev and test sets (we don’t need to benchmark here). You can download the merged CoNLL-U file here.

Let’s build a dictionary out of the words annotated in the file. For that we use the same conllu module used for PoS Tagging:
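A minimal sketch of this step, assuming the merged file is saved as 'gum_merged.conllu' (the filename is just a placeholder) and indexing lemmas by lowercased word form and UD PoS tag:

from collections import defaultdict
from conllu import parse_incr  # pip install conllu

word_lemma_dict = defaultdict(dict)

with open('gum_merged.conllu', 'r', encoding='utf-8') as f:
    for sentence in parse_incr(f):
        for token in sentence:
            lemma = token['lemma']
            if not lemma or lemma == '_':   # skip tokens without a usable lemma
                continue
            word = token['form'].lower()
            pos = token['upos']             # 'upostag' on older versions of the conllu package
            word_lemma_dict[word].setdefault(pos, [])
            if lemma.lower() not in word_lemma_dict[word][pos]:
                word_lemma_dict[word][pos].append(lemma.lower())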

However, there’s a problem. What happens when we check the length of the resulting dictionary?

>>> len(word_lemma_dict.keys())
12516

Eeww! Although the GUM corpus is made of annotated texts from distinct contexts, the number of distinct words is way below what we need. Just so you know, the vocabulary of an adult native English speaker ranges between 20k and 35k distinct words, so 12k (which includes Proper Nouns) is probably not enough. What can we do?

Well, one way is to use other corpora/word lists. There’s an interesting one provided at https://lexically.net/, a website made specifically with word lists in mind. The one we’re interested in is the “lemma list 10 with c5” (Download here: ‘BNClemma10_3_with_c5.txt’), which is a list of lemmas and their variants extracted from the British National Corpus (BNC) and annotated with the C5 tagset (the tagset used by the BNC). Here’s a short view of it:

Lemma list from lexically.net. They have been extracted from the British National Corpus (BNC) and then tagged.

To use it, we have to make another converter (for it to work with our current PoS tags):
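Below is my own rough prefix-based approximation of such a converter; the exact mapping choices are a judgment call (as the next paragraph admits), so treat this table as illustrative:

def c5_to_ud(c5_tag):
    # Rough mapping from BNC C5 tags to UD PoS tags. Tags such as '<UNC>'
    # appear wrapped in angle brackets in the lemma list, hence the strip().
    tag = c5_tag.strip('<> ').upper()
    if tag.startswith('AJ'):
        return 'ADJ'
    if tag.startswith('AV'):
        return 'ADV'
    if tag.startswith('NP'):
        return 'PROPN'
    if tag.startswith('NN') or tag == 'UNC':   # '<UNC>' treated as NOUN (see note below)
        return 'NOUN'
    if tag.startswith('V'):
        return 'VERB'
    if tag.startswith('PN'):
        return 'PRON'
    if tag.startswith('PR'):
        return 'ADP'
    if tag.startswith('CJ'):
        return 'CCONJ'
    if tag.startswith('DT') or tag in ('AT0', 'DPS'):
        return 'DET'
    if tag in ('CRD', 'ORD'):
        return 'NUM'
    if tag == 'ITJ':
        return 'INTJ'
    if tag in ('TO0', 'XX0', 'POS'):
        return 'PART'
    return 'X'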

Okay, this was an ugly one, but it gets the job done. Also, I had to do it on my own, with no help from linguists or grammar specialists. Notice that I converted “<UNC>” to “NOUN” because, in most cases, the nouns present in this word list are tagged “<UNC>” instead of an NN tag.

Now, let us make use of what we already have and add some more words to our dictionary:
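Here’s a sketch of the merge. I’m assuming a line layout of ‘lemma -> <C5> variant, <C5> variant, …’ and a UTF-8-ish encoding; inspect ‘BNClemma10_3_with_c5.txt’ and adapt the parsing if the real layout differs:

# ASSUMED line format: lemma -> <C5> variant, <C5> variant, ...
with open('BNClemma10_3_with_c5.txt', 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        line = line.strip()
        if not line or '->' not in line:
            continue
        lemma_part, variants_part = line.split('->', 1)
        lemma = lemma_part.strip().lower()
        for chunk in variants_part.split(','):
            pieces = chunk.strip().split()     # expected: ['<C5>', 'variant']
            if len(pieces) < 2:
                continue
            pos = c5_to_ud(pieces[0])
            variant = pieces[1].lower()
            word_lemma_dict[variant].setdefault(pos, [])
            if lemma not in word_lemma_dict[variant][pos]:
                word_lemma_dict[variant][pos].append(lemma)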

Results?

>>> len(word_lemma_dict.keys())
30987

Way better! Of course, some niche words won’t be there, but let us consider it enough. Time to put that to use. Let me show you how easy it is to lemmatize a word now (by the way, I’m just leaving the word untouched if it and its PoS are not in the dict, which might not be the best solution):
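A minimal helper along those lines:

def lemmatize(word, pos):
    # Return the first lemma registered for (word, PoS); if the pair is not
    # in the dictionary, return the word unchanged (not ideal, as noted above).
    word = word.lower()
    if word in word_lemma_dict and pos in word_lemma_dict[word]:
        return word_lemma_dict[word][pos][0]
    return word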

Test it:

>>> words = [('living', 'ADJ'), ('living', 'NOUN'),('living','VERB'), ('guns','NOUN')]
>>> for word_tuple in words:
...     print(lemmatize(word_tuple[0], word_tuple[1]))
...
living
living
live
guns

Working, but not perfect. ‘Guns’ was not taken to the singular form. What can we do? Just add it to our dict:

>>> word_lemma_dict['guns']['NOUN'] = ['gun']

Sure, that’s not the best solution, but, hey, we have 30k words; if a couple fail, it won’t be that bad. We can also keep adding more corpora to our dict (if we find them). For example, we could work with the English Web Treebank, which has over 254k tagged words! Try it yourself. I made it to 37k distinct words.

Let’s save this dictionary so we can use it in our main tool:

>>> import pickle
>>> pickle.dump(word_lemma_dict, open('word_lemma_dict.p', 'wb'))

A rule-based Lemmatizer for plural Nouns

Now, that problem with ‘guns’ bugged me. I checked and found that none of our lists give the singular form as the lemma of a plural noun. We could make an expert system to deal with this. Here’s what I’ve done using the rules from https://www.grammar.cl/Notes/Plural_Nouns.htm.

First, I got a list of the most common irregular nouns from http://www.esldesk.com/vocabulary/irregular-nouns. I’ve cleaned it up and made a csv that you can download here.

I then loaded it into a dict and pickled it the same way as the main dictionary, using the plural as the key and the singular as the value. Here’s the function to do noun inflection (it’ll be bundled as a utility in our tool):
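Here’s a sketch of both steps. The csv column order (plural, singular) and the file names 'irregular_nouns.csv' / 'irregular_nouns_dict.p' are placeholders of mine; the suffix rules follow the grammar.cl page linked above:

import csv
import pickle

# Build and pickle the irregular-noun lookup (plural -> singular).
irregular_nouns = {}
with open('irregular_nouns.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        if len(row) >= 2:
            irregular_nouns[row[0].strip().lower()] = row[1].strip().lower()
pickle.dump(irregular_nouns, open('irregular_nouns_dict.p', 'wb'))

VOWELS = set('aeiou')

def inflect_noun_singular(noun):
    # Reduce a plural noun to its singular form: irregular lookup first,
    # then the reversed spelling rules. Still a heuristic, so a few words
    # (e.g. 'movies', 'houses') will come out slightly wrong.
    noun = noun.lower()
    if noun in irregular_nouns:                       # children -> child, feet -> foot
        return irregular_nouns[noun]
    if noun.endswith('ives'):
        return noun[:-4] + 'ife'                      # knives -> knife, wives -> wife
    if noun.endswith('ves'):
        return noun[:-3] + 'f'                        # wolves -> wolf, leaves -> leaf
    if noun.endswith('ies') and len(noun) > 3 and noun[-4] not in VOWELS:
        return noun[:-3] + 'y'                        # cities -> city, babies -> baby
    if noun.endswith(('sses', 'shes', 'ches', 'xes', 'zes', 'oes')):
        return noun[:-2]                              # glasses -> glass, boxes -> box
    if noun.endswith('s') and not noun.endswith(('ss', 'us', 'is')):
        return noun[:-1]                              # guns -> gun, dogs -> dog
    return noun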

Simple and straightforward, it works for the cases mentioned in the link. Also, our ‘guns’ are finally lemmatized to ‘gun’! Now that we have a solution for most cases, it is time to implement these in our tool suite.

Working on NLPTools:

Since the beginning of this series, we’ve been building an NLP tool suite from scratch, and this section continues that process. The project so far is available here.

For the folder structure, we have this:

As a reminder: greens are new additions, yellows are modifications. First, we make a new folder scaffold and add our word lemma dictionary and our irregular noun dictionary (preloaded/dictionaries/lemmas/).

I also created a utils folder and added a word_utils.py file with the inflect_noun_singular function mentioned above.

Then, we can create our lemmatization.py. It is the simplest file so far, even though I did my best to test it and make it fail-proof.
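Here’s a minimal sketch of what the file can contain. The class name, the import path for the noun utility and the dictionary path follow the folder structure described above, but treat them as placeholders and check the commit linked at the end for the real code:

# lemmatization.py (sketch)
import os
import pickle

from .utils.word_utils import inflect_noun_singular   # placeholder import path

class Lemmatizer:

    def __init__(self, dict_path=None):
        default = os.path.join(os.path.dirname(__file__), 'preloaded',
                               'dictionaries', 'lemmas', 'word_lemma_dict.p')
        with open(dict_path or default, 'rb') as f:
            self.word_lemma_dict = pickle.load(f)

    def lemmatize(self, word, pos):
        word = word.lower()
        lemmas = self.word_lemma_dict.get(word, {}).get(pos)
        if lemmas:
            return lemmas[0]
        if pos == 'NOUN':                  # rule-based fallback for plural nouns
            return inflect_noun_singular(word)
        return word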

The last modification is in __init__.py, where I added lemmatization to the pipeline (removed stemming by default) and have set the PoSTagger to default to UD tags:
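Roughly, the wiring looks like the sketch below; every name other than Lemmatizer is illustrative, and the real __init__.py in the repo is the source of truth:

# __init__.py (sketch of the relevant part)
from .lemmatization import Lemmatizer
from .postagging import PoSTagger           # illustrative module/class names

_tagger = PoSTagger(tagset='UD')            # PoS tags now default to UD tags
_lemmatizer = Lemmatizer()

def process(text):
    doc = _tokenize_and_split(text)         # placeholder for the earlier pipeline steps
    for sent in doc.sentences:
        _tagger.tag(sent)
        for token in sent.tokens:
            if token.PoS is not None:       # skip the <SOS>/<EOS> markers
                token.set(_lemmatizer.lemmatize(token.get(), token.PoS))   # 'set' is a placeholder setter
    return doc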

Checking if it works:

>>> doc = NLPTools.process("There are many dogs living in my living place")
>>> for sent in doc.sentences:
...     print([(word.get(), word.PoS) for word in sent.tokens])
[('<SOS>', None), ('there', 'PRON'), ('are', 'VERB'), ('many', 'ADJ'), ('dog', 'NOUN'), ('live', 'VERB'), ('in', 'ADP'), ('my', 'DET'), ('living', 'NOUN'), ('place', 'NOUN'), ('<EOS>', None)]

Wonderful!

If you recall the stemmer tutorial, we tested the stemmer by checking its efficiency on the Universal Declaration of Human Rights: we went from 551 distinct words down to 476. Let us see how it goes with lemmatization:
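Here’s a sketch of the check, assuming the declaration’s text is saved locally as 'udhr.txt' and that each token keeps its original surface form in a 'raw' attribute (a placeholder name for whatever the Token class actually exposes):

with open('udhr.txt', encoding='utf-8') as f:
    doc = NLPTools.process(f.read())

original_words, lemmatized_words = set(), set()
for sent in doc.sentences:
    for token in sent.tokens:
        if token.PoS is None:               # skip <SOS>/<EOS>
            continue
        original_words.add(token.raw.lower())       # 'raw' is a placeholder attribute
        lemmatized_words.add(token.get().lower())

print(f"The number of distinct words in the Universal Declaration of Human Rights is: "
      f"{len(original_words)}. After lemmatization, the number is: {len(lemmatized_words)}")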

One detail: here I’m using our full pipeline to process the data, while in the stemming test I just used a simple split(), which left some extra words in the set (probably words with a punctuation mark attached). With that considered, here are the results for the lemmatizer:

The number of distinct words in the Universal Declaration of Human Rights is: 538. After lemmatization, the number is: 486

Not too shabby! Especially considering that we now preserve part of the syntactic function of the words. Also, we should expect the reduction to be smaller, since we’re not reducing words to their roots, but rather to their dictionary forms.

So we’re done with Lemmatizing!

Next, we’ll take a break from new concepts to play a little with the pipeline we already have and learn more about text normalization and preprocessing.

Oh, here’s the link for the current commit.

You can always check where the project is so far by looking at the main Git Repo:

Don’t refrain from leaving a comment or suggestion. Also, if you find any bug in the code, feel free to submit a fix to the repo.

[1] https://en.wikipedia.org/wiki/Lexeme
