spaCy’s lemmatizer: lowercase limitations

Jade Moillic
Besedo Engineering Blog
Nov 28, 2022 · 6 min read
Lemmatization and lowercase limitations

Why are uppercases a problem?

spaCy matchers work with attributes and one of them is the lemma of the word. You can find more information about the matchers and how to use them in our other blog post:

Lemmatization is the process of turning a word into its canonical form, which is the form of a word you find in a dictionary. For example, the lemma of a verb will be its infinitive form: I was → I be.

The way the matchers work is really nice when you have clean data, but noisy data can prevent the matchers from working correctly. This is the case with uppercase letters inside a word: unexpected uppercase letters can disrupt POS tagging and, more importantly, the lemmatization of a word. The following code examples use spaCy 3.4.

For instance, let’s take this sentence where every word begins with an uppercase: I Love Dogs And Cats !

import spacy

nlp = spacy.load("en_core_web_sm")
text = "I Love Dogs And Cats !"
doc = nlp(text)
print("Word\tLemma\tPos")
for el in doc:
    print(el.text, el.lemma_, el.pos_, sep="\t")

Word Lemma Pos
I I PRON
Love love VERB
Dogs Dogs PROPN
And and CCONJ
Cats cat NOUN
! ! PUNCT

Here you can see that the uppercase letters don’t cause trouble, except for the word Dogs: it is recognized as a proper noun and its lemmatization does not work properly, as its lemma should be dog.

Now let’s imagine a matcher that needs to find the lemmas cat and dog: the first one will be matched, but not the second.
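To make this concrete, here is a small sketch of such a matcher. We set the lemmas produced above directly on a Doc, so the example does not depend on the statistical model:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Simulate the lemmas the pipeline produced for "I Love Dogs And Cats !"
words = ["I", "Love", "Dogs", "And", "Cats", "!"]
lemmas = ["I", "love", "Dogs", "and", "cat", "!"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

matcher = Matcher(nlp.vocab)
matcher.add("CAT", [[{"LEMMA": "cat"}]])
matcher.add("DOG", [[{"LEMMA": "dog"}]])

matches = [nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc)]
print(matches)  # ['CAT']: the DOG pattern never fires because the lemma stayed "Dogs"
```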

Ways to solve this problem

This problem has already been discussed in the spaCy community, and we have gathered here several ways to solve it.

The easiest way to solve this problem would be to lowercase the entire text. However, by doing that we would lose information, which could cause problems in the POS tagging and even in the lemmatization.

Use the LOWER attribute in the matcher

The first solution, and probably the easiest, would be to use the LOWER attribute, which matches words in their lowercase form, as suggested in this discussion.

The problem here is that the LOWER attribute is tied to the current surface form of a word, not its lemma. The lemma is the form of a word that represents all its other forms: the dictionary entry (e.g. the infinitive for a verb). Without the lemma, we would need to list every single form of a given word, which would take time and more resources.
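For illustration, here is a minimal sketch of what matching on LOWER looks like; note that every surface form has to be listed by hand:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# LOWER compares the lowercased surface form, so every inflection
# ("dog", "dogs", ...) has to be listed explicitly
matcher.add("DOG", [[{"LOWER": {"IN": ["dog", "dogs"]}}]])

doc = nlp("I Love Dogs And Cats !")
print([doc[start:end].text for _, start, end in matcher(doc)])  # ['Dogs']
```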

A solution would be to use both the LOWER and LEMMA attributes at once, but this is not possible according to this discussion on spaCy’s GitHub.

Use Regex

Someone opened an issue on the Prodigy Support page (Prodigy is an annotation tool created by the makers of spaCy) in August 2020.

Here, the suggested idea is to use the REGEX operator of the Matcher with a case-insensitive pattern, so that both the uppercase and lowercase versions of the word are covered. The patterns would look like this:

patterns = [
    [{"LEMMA": {"REGEX": "(?i)cat"}}],
    [{"LEMMA": {"REGEX": "(?i)dog"}}],
]

This solution may work for some, but in our case it would be hard to use. Our filters are made of a lot of patterns, and regexes can really increase the inference time, which is something we cannot allow. Besides, our idea was to use the simpler matcher, called Matcher, that works with plain word lists.
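As a quick sketch of how the case-insensitive pattern behaves (again simulating the lemmas directly, so no model is needed):

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# The lemma of "Dogs" stayed "Dogs", as in the example above
doc = Doc(nlp.vocab, words=["I", "Love", "Dogs"], lemmas=["I", "love", "Dogs"])

matcher = Matcher(nlp.vocab)
# (?i) makes the regex case-insensitive, so it also matches the lemma "Dogs"
matcher.add("DOG", [[{"LEMMA": {"REGEX": "(?i)dog"}}]])
print([doc[start:end].text for _, start, end in matcher(doc)])  # ['Dogs']
```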

Lower the lemma after the lemmatization

Another idea is to lowercase the lemma before applying the filters, by adding a small component right after the lemmatizer in the pipeline. The code to do that was shared in this forum:

from spacy.language import Language

# Create a pipe that converts lemmas to lowercase
@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc):
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

# Add it to the pipeline, after the lemmatizer has assigned the lemmas
nlp.add_pipe("lower_case_lemmas", after="lemmatizer")

To illustrate what it would do, here is an example:

Dogs -> lemmatizer -> Dogs -> lowercase -> dogs

As you can see from the above example, this would allow us to catch the lowercase lemma.

The issue with this solution is that the wanted lemma of Dogs is dog, not dogs. A matcher that is supposed to catch the lemma dog will still not match it.
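A tiny sketch of the mismatch, with the lowercased lemma set directly on the Doc:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# After the lowercasing pipe, the lemma of "Dogs" is "dogs", not "dog"
doc = Doc(nlp.vocab, words=["Dogs"], lemmas=["dogs"])

matcher = Matcher(nlp.vocab)
matcher.add("DOG", [[{"LEMMA": "dog"}]])
print(len(matcher(doc)))  # 0: "dogs" is not "dog", so the pattern never fires
```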

Change the implementation of the lemmatizer

An option would be to have a custom Lemmatizer directly in spaCy’s code. This solution is mentioned in the previous forum, but also in a discussion we opened on spaCy’s GitHub which can be found here.

For us, this would be the best option, and the following part should help you implement this solution.

❗ With this solution, every proper noun will be lowercased, which may cause issues. We suggest thinking about the most appropriate way to lowercase text for your use case.

How to change the lemmatizer’s implementation

As we said in the previous section, the best option for us is to change the implementation of the lemmatizer in spaCy’s source code. This would allow us to lowercase the word before searching for the lemma in spaCy’s tables.

What needs to change?

Here, the function we want to change is rule_lemmatize() in lemmatizer.py. The following goes through the different things we changed in detail.

First of all, we modified how the functions are called when the lemmatizer’s mode is rule or rule_propn, so that the rule_lemmatize() function gets a boolean argument replace_propn that is False by default and True when the rule_propn mode of the lemmatizer is used.

Then, the main thing we want to do is change the POS tag of proper nouns (PROPN) to nouns (NOUN) inside the lemmatizer, so that they are treated the same way and the correct lemma is found, while keeping the proper-noun tag in the output. This has to be done in several places:

  • Change the univ_pos:

The univ_pos is the lowercase POS tag of the word; here we want to change it to noun so that the lemmatizer searches for the lemma of a noun.

  • Change the token.pos_:

This needs to be done because the is_base_form function works on the token. For example, without this change, the lemma of ass would be as. The function is implemented differently for every language, but here’s an example for English from spaCy’s GitHub.

The pos_ then needs to be changed back to PROPN so that the lemmatization outputs the right POS at the end and we don’t alter it.

  • Append orig.lower():

When spaCy’s rules don’t match any lemma, it falls back to the form of the word (the string). We added this line so that when a lemma is unknown, spaCy returns the lowercase version of the word.
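To give an idea of what this could look like without patching spaCy’s source directly, here is a hypothetical sketch of a Lemmatizer subclass implementing a rule_propn mode along the lines described above. The class name PropnLemmatizer and the exact wiring are our own; the sketch assumes spaCy’s Lemmatizer dispatches to a {mode}_lemmatize method and loads its lookup tables via get_lookups_config:

```python
from spacy.pipeline import Lemmatizer

# Hypothetical sketch: a Lemmatizer subclass with a "rule_propn" mode that
# treats proper nouns as nouns during rule lemmatization, then lowercases.
class PropnLemmatizer(Lemmatizer):
    @classmethod
    def get_lookups_config(cls, mode):
        # "rule_propn" needs the same lookup tables as the "rule" mode
        if mode == "rule_propn":
            mode = "rule"
        return super().get_lookups_config(mode)

    def rule_propn_lemmatize(self, token):
        if token.pos_ == "PROPN":
            token.pos_ = "NOUN"    # look the word up as a noun
            lemmas = self.rule_lemmatize(token)
            token.pos_ = "PROPN"   # restore the proper-noun tag in the output
            return [lemma.lower() for lemma in lemmas]
        return self.rule_lemmatize(token)
```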

The last step is to pass the new mode into the pipeline’s config to be able to use it. You can do that when loading the pipeline by adding a config argument:

nlp = spacy.load("en_core_web_sm", config={"components.lemmatizer.mode": "rule_propn"})

Here is an example of a result of the pipeline as we changed it:

Word Lemma Pos
I I PRON
Love love VERB
Dogs dog PROPN
And and CCONJ
Cats cat NOUN
! ! PUNCT

Guide to integrating the changes

For now, the way to apply these changes is to modify lemmatizer.py in your spaCy installation. To do that, you can use a git patch, which lets you share only the changes you made to a file instead of the entire file. The patch then needs to be applied to spaCy’s lemmatizer. To find the lemmatizer script on your computer, you can use this Python code:

import spacy
print(spacy.__path__)

The printed path is the location of the installed spaCy package, which will allow you to find and modify the lemmatizer.
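For example, assuming the usual spaCy 3 package layout (where the lemmatizer lives under pipeline/), the full path can be built like this:

```python
import os
import spacy

# Assumes the standard spaCy 3 layout: spacy/pipeline/lemmatizer.py
lemmatizer_path = os.path.join(spacy.__path__[0], "pipeline", "lemmatizer.py")
print(lemmatizer_path)
```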

We hope this blog post was useful to you and that you found the best solution for you on how to lowercase spaCy’s lemmatizer.
