An Overview of spaCy’s Token Matcher and Phrase Matcher

Maximinusjoshus
Published in featurepreneur
Apr 8, 2021

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. It features NER, POS tagging, dependency parsing, word vectors, and more. This article will explore spaCy’s token matcher and phrase matcher.

The Token Matcher

spaCy features a rule-based matching engine, the Matcher, that operates over tokens, much as regular expressions operate over characters. The Matcher lets us specify rules built from token attributes and flags such as IS_PUNCT and IS_DIGIT.

When we pass a sentence through an nlp pipeline in spaCy, it returns a Doc object that contains the same text, split into tokens with attributes added to them. Each word and each punctuation mark is a separate token in the Doc. Say we want to match “March 4, 2021” in a sentence: we have to specify a match pattern for each token in the string, in our case “March”, “4”, “,” and “2021”. The four token patterns are:

  1. A token whose lowercase form is “march”, e.g. “march” or “March”.
  2. A token whose IS_DIGIT flag is true, i.e. a token made up only of digits.
  3. A token whose IS_PUNCT flag is true, i.e. any punctuation mark.
  4. A token whose IS_DIGIT flag is true.

We then create a list of dictionaries in which each dictionary holds the match pattern for one token.

pattern = [{"LOWER": "march"}, {"IS_DIGIT": True}, {"IS_PUNCT": True}, {"IS_DIGIT": True}]
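To see why this pattern lines up with “March 4, 2021”, it can help to print the attributes the pattern checks for each token. The following is a minimal sketch, assuming a blank English pipeline (`spacy.blank("en")`) is enough here because only the tokenizer and lexical flags are involved; the variable name `rows` is just for illustration.

```python
import spacy

# Assumption: a tokenizer-only pipeline is sufficient, since LOWER,
# IS_DIGIT and IS_PUNCT are lexical attributes and need no trained model.
nlp = spacy.blank("en")
doc = nlp("March 4, 2021")

# Collect the attributes that the pattern relies on, one row per token.
rows = [(token.text, token.lower_, token.is_digit, token.is_punct) for token in doc]
for row in rows:
    print(row)
```

Each of the four tokens should satisfy exactly the condition listed for its position in the pattern.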

Now let us implement the matcher to find our string in a sentence.

import spacy
# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# Add the pattern to the matcher
# (spaCy v3 takes a list of patterns; v2 used matcher.add("DATE_PATTERN", None, pattern))
pattern = [{"LOWER": "march"}, {"IS_DIGIT": True}, {"IS_PUNCT": True}, {"IS_DIGIT": True}]
matcher.add("DATE_PATTERN", [pattern])
# Process some text
doc = nlp("SpaceX's Starlink 17 mission lifts off on a Falcon 9 rocket from Launch Complex 39A at NASA's Kennedy Space Center in Florida, on March 4, 2021")
# Call the matcher on the doc
matches = matcher(doc)
# Each match is a (match_id, start, end) tuple; doc[start:end] is the matched span
for match_id, start, end in matches:
    print(doc[start:end])

This code snippet will print “March 4, 2021” which is the matched string.

March 4, 2021
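Each match also carries a `match_id`, which is the hash of the rule name we registered. As a rough sketch, the name can be looked up again through `nlp.vocab.strings`. The example sentence and the `results` list below are made up for illustration, and a blank English pipeline is assumed to be enough because the pattern uses lexical attributes only.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; no trained model is needed for this pattern.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "march"}, {"IS_DIGIT": True}, {"IS_PUNCT": True}, {"IS_DIGIT": True}]
matcher.add("DATE_PATTERN", [pattern])  # spaCy v3 signature: key, list of patterns

# Illustrative sentence, not taken from the article.
doc = nlp("The launch took place on March 4, 2021 in Florida.")

results = []
for match_id, start, end in matcher(doc):
    rule_name = nlp.vocab.strings[match_id]  # hash -> "DATE_PATTERN"
    results.append((rule_name, start, end, doc[start:end].text))
print(results)
```

Keeping the rule name next to the span is handy once several patterns are registered on the same matcher.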

The Phrase Matcher

The phrase matcher is useful when a large list of terms has to be matched. It is used much like the token matcher, but instead of writing token-level rules and patterns, we can pass in the strings to match!

Say we have to match a list of fruit names in a sentence:

fruit_list = ['apple', 'orange', 'banana']

First we have to obtain a Doc object for each fruit name. We store the Doc objects in a list and pass that list to the phrase matcher as its patterns.

patterns = [nlp(fruit) for fruit in fruit_list]
matcher.add('FRUIT_PATTERN',patterns)
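Calling `nlp(fruit)` runs the whole pipeline on every term, which is more work than the phrase matcher needs when it compares the verbatim token text. For long term lists, one option is `nlp.make_doc`, which only tokenizes. A small sketch, assuming a blank English pipeline:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a tokenizer-only pipeline is enough to build the patterns.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

fruit_list = ["apple", "orange", "banana"]
# make_doc only tokenizes the text; no tagger, parser or NER is run.
patterns = [nlp.make_doc(fruit) for fruit in fruit_list]
matcher.add("FRUIT_PATTERN", patterns)

print("FRUIT_PATTERN" in matcher)
```

With three terms the difference is negligible, but it tends to matter once the list grows to thousands of entries.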

Now let us implement our pattern to match the fruit names in a sentence.

import spacy
# Import the phrase matcher
from spacy.matcher import PhraseMatcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")
# Initialize the matcher with the shared vocab
matcher = PhraseMatcher(nlp.vocab)
# Create the list of words to match
fruit_list = ['apple', 'orange', 'banana']
# Obtain a Doc object for each word in the list and store them in a list
patterns = [nlp(fruit) for fruit in fruit_list]
# Add the patterns to the matcher
matcher.add("FRUIT_PATTERN", patterns)
# Process some text
doc = nlp("An orange contains citric acid and an apple contains oxalic acid")
# Call the matcher on the doc and print each matched span
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

This code snippet prints the matched words, ‘orange’ and ‘apple’, in the order they appear in the sentence.

orange
apple
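By default the phrase matcher compares the exact token text, so “Orange” or “APPLE” would not be found by the patterns above. One way around this is the `attr` argument of `PhraseMatcher`, which tells it to compare another token attribute such as `LOWER`. A hedged sketch, again assuming a blank English pipeline and an illustrative sentence:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline; LOWER is a lexical attribute.
nlp = spacy.blank("en")
# attr="LOWER" makes the matcher compare lowercase forms instead of exact text.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

patterns = [nlp.make_doc(name) for name in ["apple", "orange", "banana"]]
matcher.add("FRUIT_PATTERN", patterns)

# Illustrative sentence with mixed casing.
doc = nlp("An Orange contains citric acid and an APPLE contains oxalic acid")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

The matched spans keep their original casing, since the span is a slice of the processed text and not of the pattern.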

And this is how we can use the Token Matcher and the Phrase Matcher to match strings in a document. Happy matching!
