spaCy matchers guidelines

Jade Moillic
Besedo Engineering Blog
7 min read · Jul 11, 2022

Co-authored by Roxane Bois

In this post, we will look at the matchers spaCy provides for creating semantic and/or syntactic filters.

❓What is spaCy?

spaCy is an NLP tool that supports more than 64 languages. It can be used to train models and to linguistically analyze texts via processes like NER, POS-tagging, morphological analysis, and lemmatization. The major advantage of spaCy is that it is a very fast NLP tool, and you can customize its pipeline.

What will be of interest for us here is, firstly, the Language Processing Pipelines that consist of four major steps, represented in the following table:

spaCy’s pipeline

If you are unfamiliar with what these different components do, here are some short definitions; you can also find more information on spaCy’s website. Note that the lemmatization step is also really important for creating a filter, so we will present it as well.

  • tokenizer: Segment text into tokens, e.g. “I visited Paris” → “I”, “visited”, “Paris”
  • tagger: Assign part-of-speech tags, e.g. “I” → “PRON” (pronoun), “visited” → “VERB”…
  • parser: Assign dependency labels, e.g. “visited” is the head of the sentence, “I” is the subject, “Paris” the object.
  • ner: Detect and label named entities, e.g. “Paris” → “GPE” (GPE = Geopolitical Entity)
  • lemmatizer: Assign base forms of the words, e.g. “visited” → “visit”
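
Once spaCy is installed and a model downloaded (see the installation section below), you can check which of these components are active in a loaded pipeline by inspecting nlp.pipe_names; the exact list depends on the model version:

import spacy

nlp = spacy.load("en_core_web_sm")
# List the components of the loaded pipeline
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']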

💡 If you want to know more about this and how to use it, we invite you to follow the spaCy course Chapter 1: Finding words, phrases, names and concepts from lessons 1 to 9.

🔍 spaCy’s Matchers

spaCy allows you to use two different kinds of matchers, that have different attributes: PhraseMatcher and Matcher.

💡 You can find more information about both of them and their differences here.

spaCy PhraseMatcher

The PhraseMatcher of spaCy allows you to match lists of tokens in a text. This matcher is perfect for matching simple patterns, such as a single token or a short sequence of tokens.
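
As a quick illustration, here is a minimal sketch (the rule name "CITIES" and the example phrases are ours; the full setup is detailed below):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)
# Match the exact token sequences "New York" and "Paris"
phrase_matcher.add("CITIES", [nlp("New York"), nlp("Paris")])
doc = nlp("I visited Paris last year.")
for match_id, start, end in phrase_matcher(doc):
    print(doc[start:end].text)  # Paris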

💡 The spaCy page linked to this matcher can be found here.

spaCy Matcher

spaCy’s Matcher is a rule-based matcher that is quite interesting for creating semantic filters. Compared to regex, the Matcher does not only find parts of the text in the data: it can also match any information, or sequence of information, produced by its pipeline (POS tags, syntactic tags, named entities, lemmas, …).
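
For example, here is a minimal sketch of a pattern that regex could not easily express, reusing the nlp object from the sketch above (the rule name "VISIT_PLACE" is ours):

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Match any inflection of the lemma "visit" followed by a proper noun
matcher.add("VISIT_PLACE", [[{"LEMMA": "visit"}, {"POS": "PROPN"}]])
doc = nlp("She visits Rome every summer.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # visits Rome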

💡 The documentation of Matcher is linked here. If you want to learn how to use spaCy Matcher, we invite you to follow the spaCy course Chapter 1: Finding words, phrases, names and concepts from lessons 10 to 12.

💻 How to use both Matchers?

Install spaCy

spaCy is a Python library, which means you can run the following lines in your terminal to install it. For specific cases, you can follow this guide for download.

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm

The first two lines install the library; the third downloads the resource it requires, the English model in its small size (sm). You can also download a medium or a large size, but the larger the model, the longer it will take to run the pipeline.
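
For instance, the medium and large English models can be downloaded the same way:

python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg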

💡 If you need to process other languages, you can search for the right dictionary here.

Parse your doc with spaCy

First, you need to import spaCy and load the dictionary you want to use.

import spacy
nlp = spacy.load("en_core_web_sm")

spaCy transforms a text (a string instance) into a doc (a spaCy Doc object), so you can access all the information you need from its pipeline. There are multiple ways to apply spaCy to a text, but we advise using .pipe(), which is faster when processing many texts.

💡 Learn more about pipelines.

text = "This is a spaCy test."
doc = nlp.pipe(text)

Then, you can access the information you need through specific linguistic attributes. Here we print the raw text, the POS tag, and the lemma of each token.

💡 Learn more about linguistic attributes.

for token in doc:
    print(token.text, token.pos_, token.lemma_)

❗ Do not forget the _ at the end of some attributes; otherwise, they return the hash value of the token.
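
For example, with the token "is" from the sentence above:

token = doc[1]     # "is"
print(token.pos)   # an integer ID (the hash value)
print(token.pos_)  # the readable label, e.g. "AUX"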

Create a filter with the matchers

Import the matchers

First, you need to import spaCy matchers in addition to the spaCy library and the dictionary we imported above.

from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher

Initialize the matchers

First, you need to initialize the matchers with spaCy’s vocabulary.

matcher = Matcher(nlp.vocab)
phrase_matcher = PhraseMatcher(nlp.vocab)

💡 You can also initialize the PhraseMatcher with the attr argument to specify the token attribute you want to match on, e.g. attr="TEXT" matches the verbatim text of the token, while attr="LEMMA" matches its lemma.
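
For instance, a PhraseMatcher that matches on lemmas will also catch inflected forms (the variable and rule names here are ours):

# Match on lemmas instead of the exact text
phrase_matcher_lemma = PhraseMatcher(nlp.vocab, attr="LEMMA")
phrase_matcher_lemma.add("LOVE", [nlp("love")])
doc = nlp("They loved it.")
print([doc[start:end].text for _, start, end in phrase_matcher_lemma(doc)])  # ['loved']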

Lemmas / Patterns to give to the matchers

Patterns to match

You can now add some lemmas and patterns to each matcher. The way of creating them differs between the two matchers.

PhraseMatcher takes a list of one or several tokens:

lexicon = ["like", "love", "i like", "loving"]
lexicon_2 = ["do not like", "enjoy"]

❗ As a precaution, you can lemmatize the lists you have just created with spaCy, to make sure they contain the correct lemmas to match. For example, in the list lexicon, loving will become love. This is useful because loving is not a lemma and therefore would never be matched. As love is already in the list, do not forget to delete duplicates as a final step.

lexicon will then look like this: lexicon = ["like", "love", "I like"]
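
A minimal sketch of that clean-up step could look like this (note that isolated entries do not always receive the expected tag and lemma, so check the output):

# Lemmatize each entry, then drop duplicates while preserving order
lemmatized = [" ".join(token.lemma_ for token in doc) for doc in nlp.pipe(lexicon)]
lexicon = list(dict.fromkeys(lemmatized))
print(lexicon)  # e.g. ['like', 'love', 'I like']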

Matcher takes a list of patterns, where each pattern is a list of dictionaries (one dictionary per token).

💡 You can find all you need about the pattern format in the Matcher documentation.

patterns = [
    # A pronoun + "love"
    [{"POS": "PRON"}, {"LEMMA": "love"}],
    # A pronoun + the verb "like"
    [{"POS": "PRON"}, {"LEMMA": "like", "POS": "VERB"}],
]
patterns_2 = [
    # A pronoun + "love" with 0 or 1 word between them (e.g. "I really love")
    [{"POS": "PRON"}, {"IS_ALPHA": True, "OP": "?"}, {"LEMMA": "love"}]
]

You can then add your lexicons and patterns to the initialized matchers. You will need to specify an ID key and one or more patterns. If the ID already contains some patterns, the list of patterns is extended.

# Add the lexicons to the PhraseMatcher
phrase_matcher.add("some_lemmas", [nlp(word) for word in lexicon])
phrase_matcher.add("more_lemmas", [nlp(word) for word in lexicon_2])
# Add the patterns to the Matcher
matcher.add("some_patterns", patterns)
matcher.add("1_more_pattern", patterns_2)

💡 When adding the patterns, the greedy="LONGEST" argument helps avoid matching the same span several times when a pattern contains an "OP": "?" operator: only the longest match is taken into account.
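
For example, it is passed to Matcher.add like this:

# Keep only the longest match when patterns overlap
matcher.add("1_more_pattern", patterns_2, greedy="LONGEST")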

Apply the matchers

After parsing your doc with spaCy, you just have to apply your matchers to it. Note that it is also possible to add the matchers directly to the spaCy pipeline; the steps to do that are detailed later.

# Parse the doc
doc = nlp("This is a spaCy test.")
# Apply your Matchers to the doc
matches = phrase_matcher(doc)
matches += matcher(doc)

The matches come as tuples: (hash_value, start_index, end_index). You can access the results with the following:

# Print the hash value, start index and end index of the matches
for result in matches:
    print(result)

# Print the number of matches
print("Total matches found:", len(matches))

# Print results
for match_id, start, end in matches:
    # Print the rule ID
    print(nlp.vocab.strings[match_id])
    # Print the matched text
    print(doc[start:end])

💡 It is also possible to return the matches as spans directly by adding the as_spans=True argument when calling the matchers (e.g. matches = matcher(doc, as_spans=True)). This returns a list of Span objects, using the match_id as the span label.
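
For example:

# Each match is returned as a Span whose label is the rule ID
for span in matcher(doc, as_spans=True):
    print(span.text, span.label_)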

📤 Open external files as matchers

In some cases, it will be easier to use external files as matchers. In that case, you can follow these guidelines on how to store and open them.

Store the matchers in a text file

To be able to store the matchers and reuse them easily later, it is better to store them in a txt file. The way to put the information in the txt is similar for both matchers: for the PhraseMatcher, we will have one line per entry (one or more tokens), and for the Matcher, we will have one line per pattern.

Let’s take the same examples as earlier:

lexicon = ["like", "love", "I like"]patterns = [
# A pronoun + "love"
[{"POS": "PRON"}, {"LEMMA": "love"}],
# A pronoun + the verb "like"
[{"POS": "PRON"}, {"LEMMA": "like", "POS": "VERB"}],
]

This lexicon can be written to a txt file this way:

like
love
I like

The patterns linked to the Matcher are put in the txt file in this format, one pattern per line:

[{"POS":"PRON"}, {"LEMMA":"love"}]
[{"POS":"PRON"}, {"LEMMA":"like", "POS":"VERB"}]

Open the matchers

To open the files and load their contents into the matchers, you will need the following code.
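
Here is a minimal sketch, assuming the two files are named lexicon.txt and patterns.txt (these names are ours; each pattern line must be valid JSON, i.e. use double quotes):

import json

# Load the PhraseMatcher entries: one entry per line
with open("lexicon.txt", encoding="utf-8") as f:
    lexicon = [line.strip() for line in f if line.strip()]
phrase_matcher.add("some_lemmas", [nlp(entry) for entry in lexicon])

# Load the Matcher patterns: one JSON pattern per line
with open("patterns.txt", encoding="utf-8") as f:
    patterns = [json.loads(line) for line in f if line.strip()]
matcher.add("some_patterns", patterns)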

➡️ Use the matchers in spaCy’s pipeline

Add the matchers component and output the pipeline

To create a pipeline containing a component for the matchers, we can add it to the pipeline and export it. The implementation here is different from what we showed earlier: instead of a count of the matches, we store a dictionary in the doc.cats of each document. This change allows different outputs according to the information we want to catch. Here, we are interested in the number of matches for one doc (n_matches), the spans matched (spans), and the filter that matched each span (filter_name). To do this, it is possible to use the following code.
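
Here is a minimal sketch of such a component, assuming the matchers created above are available in the same script (the component name matcher_component is our assumption):

from spacy.language import Language

@Language.component("matcher_component")
def matcher_component(doc):
    # Apply both matchers and store a summary of the matches in doc.cats
    matches = phrase_matcher(doc) + matcher(doc)
    doc.cats = {
        "n_matches": len(matches),
        "spans": [doc[start:end] for _, start, end in matches],
        "filter_name": [doc.vocab.strings[match_id] for match_id, _, _ in matches],
    }
    return doc

# Add the component at the end of the pipeline, then export the pipeline
nlp.add_pipe("matcher_component", last=True)
nlp.to_disk("pipeline_matcher")

Note that a script loading this exported pipeline must also register the component, hence the import step mentioned in the next section.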

Once the code is launched, you will find your new pipeline at the path given when exporting it: nlp.to_disk("pipeline_matcher").

Use the customized pipeline

To be able to use the previously exported pipeline, you will need to import or add the matcher component defined above to your code, and open the matchers with the Python lines from Open the matchers. Then, all you need to do is load the pipeline via the following lines:

nlp = spacy.load("pipeline_matcher")doc = nlp("Do you like it or do you love it ?")
print(doc.cats)
{'n_matches': 2,
'spans': [like,love],
'filter_name': ['Tokens','Patterns']}

We really hope this blog post was useful to you and wish you good luck on your adventures with spaCy matchers!
