A Closer Look at EntityRuler in SpaCy Rule-based Matching

Published in

Python in Plain English

3 min read · May 10, 2020

I have been playing with Rule-based Matching in SpaCy for a few hours. Both the phrase matcher and the token matcher are easy to use and produce the desired results quickly. If you are interested in more background, please refer to A basic Named entity recognition (NER) with SpaCy in 10 lines of code in Python.

Then I ran into a question:

How can I classify content better by consolidating similar terms into the same category? For example, IT and Information Technology mean the same thing, and I may want to classify anything that mentions Marketing or Digital Marketing as Marketing.

The answer is EntityRuler. The EntityRuler was introduced in v2.1 as a new pipeline component that lets you add named entities based on pattern dictionaries. Entity rules can be phrase patterns for exact string matches or token patterns for full flexibility. EntityRuler can also load patterns from disk, so you do not have to write dozens or even hundreds of patterns in your script.
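To make the two kinds of rules concrete, here is a minimal sketch rather than a definitive recipe. The `make_ruler` helper, the `TOPIC` label and the sample sentence are my own assumptions for illustration; the helper only exists to cope with the different ways spaCy v2 and v3 attach the component.

```python
import spacy


def make_ruler(nlp):
    """Hypothetical helper: attach an EntityRuler on either spaCy v2 or v3."""
    if spacy.__version__.startswith("2."):
        from spacy.pipeline import EntityRuler  # v2: build the component yourself
        ruler = EntityRuler(nlp)
        nlp.add_pipe(ruler)
    else:
        ruler = nlp.add_pipe("entity_ruler")  # v3: add it by its registered name
    return ruler


nlp = spacy.blank("en")  # tokenizer only, no statistical model required
ruler = make_ruler(nlp)
ruler.add_patterns([
    # Phrase pattern: a plain string, matched exactly (case-sensitive).
    {"label": "TOPIC", "pattern": "Digital Marketing"},
    # Token pattern: a list of per-token dicts, here matched case-insensitively.
    {"label": "TOPIC", "pattern": [{"LOWER": "marketing"}]},
])

doc = nlp("Digital Marketing and marketing budgets")  # made-up sample text
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

The phrase pattern only fires on the exact spelling "Digital Marketing", while the token pattern picks up "marketing" in any casing.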

The Starting Point

I have a token matcher working that can identify “cloud” and “cloud computing”. If I find a high occurrence of these terms, I know the document is about cloud computing.

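import spacy
from collections import Counter
from spacy.matcher import Matcher

def myTokenMatcher(content):
    # Load the small English model and build a token matcher on its vocabulary
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    print("Match_By_Token============================")
    pattern1 = [{"LOWER": "cloud"}, {"LOWER": "computing"}]
    pattern2 = [{"LOWER": "cloud"}]
    # spaCy v2 signature: key, optional callback, then the patterns
    # (v3 expects a list instead: matcher.add(key, [pattern1, pattern2]))
    matcher.add("Match_By_Token", None, pattern1, pattern2)
    doc = nlp(content)
    matches = matcher(doc)
    matchedTokens = []
    for match_id, start, end in matches:
        span = doc[start:end]
        # print(span.text)
        matchedTokens.append(span.text.lower())
    # Count each matched string and print the five most common
    c = Counter(matchedTokens)
    for token, count in c.most_common(5):
        print('%s: %7d' % (token, count))
    return c.most_common(5)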

The result tells me that this document mentions cloud/cloud computing 36 times (I will come back to this number later :D) and IT/information technology 13 times. There are two problems with this approach:

I have to consolidate the various similar tokens afterwards.
If I have thousands of tokens to match, the matcher.add call will soon get ugly.
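To show what "consolidating afterwards" means in practice, here is a small sketch in plain Python. The counts are the cloud figures reported in this article (26 and 10); the `CATEGORY_OF` lookup table and the `consolidate` function are hypothetical names I made up for the illustration.

```python
from collections import Counter

# Counts reported by the token matcher in this article.
token_counts = Counter({"cloud": 26, "cloud computing": 10})

# Hypothetical lookup table: every surface form needs its own entry.
CATEGORY_OF = {"cloud": "CLOUD", "cloud computing": "CLOUD"}


def consolidate(counts, category_of):
    """Sum per-term counts into per-category counts."""
    merged = Counter()
    for term, count in counts.items():
        # Terms without a mapping keep their own name as the category.
        merged[category_of.get(term, term)] += count
    return merged


consolidated = consolidate(token_counts, CATEGORY_OF)
print(consolidated)  # Counter({'CLOUD': 36})
```

This works, but the mapping lives apart from the patterns and has to be kept in sync with them by hand.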

Let’s try EntityRuler and compare it with the token matcher.

Try with EntityRuler

You add patterns to the EntityRuler, which is then added to the pipeline. The simple but amazing thing about EntityRuler is that each pattern can carry an ID tag.

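from collections import Counter
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

def myEntityRulerMatcher(content):
    # A blank English pipeline is enough: the ruler only needs the tokenizer
    nlp = English()
    ruler = EntityRuler(nlp)
    print("Match_By_Entity Ruler============================")
    # Patterns that share an "id" are counted as the same category
    patterns = [{"label": "PRODUCT", "pattern": [{"LOWER": "cloud"}, {"LOWER": "computing"}], "id": "CLOUD"},
                {"label": "PRODUCT", "pattern": [{"LOWER": "cloud"}], "id": "CLOUD"},
                {"label": "PRODUCT", "pattern": [{"LOWER": "information"}, {"LOWER": "technology"}], "id": "IT"},
                {"label": "PRODUCT", "pattern": [{"TEXT": "IT"}], "id": "IT"}
                ]
    ruler.add_patterns(patterns)
    # spaCy v2 style; in v3 the component is added by name: nlp.add_pipe("entity_ruler")
    nlp.add_pipe(ruler)
    doc = nlp(content)

    # Collect the pattern ID of every entity the ruler found
    matchedIds = []
    for ent in doc.ents:
        matchedIds.append(ent.ent_id_.lower())
    c = Counter(matchedIds)
    for token, count in c.most_common(5):
        print('%s: %7d' % (token, count))
    return c.most_common(5)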

Notice that I tag “cloud” and “cloud computing” with the ID CLOUD, and “IT” and “Information Technology” with the ID IT. Once a pattern matches, its ID is available as ent_id_ on each entity in doc.ents. Now, let’s run it:

This is exactly what I wanted! But wait a second. The token matcher reported cloud 26 times and cloud computing 10 times, so shouldn’t the total be 36? How come it’s still 26 here? The reason is that the token matcher matches the phrase “Cloud Computing” twice, once as cloud and once as cloud computing, so its 26 cloud matches already include the 10 cloud computing ones. The EntityRuler keeps a single entity for each span, so the unique count is indeed 26! That is another reason to use EntityRuler.
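The double counting is easy to reproduce on a tiny input. The sketch below is not the run from this article: the sample sentence is made up, and the list form of `matcher.add` assumes spaCy 2.2.2 or later. It feeds the same two cloud patterns to a token matcher and to an EntityRuler and compares the counts.

```python
from collections import Counter

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
text = "Cloud computing is growing. The cloud is everywhere."  # made-up sample
cloud_computing = [{"LOWER": "cloud"}, {"LOWER": "computing"}]
cloud = [{"LOWER": "cloud"}]

# Token matcher: every pattern that fits is reported, overlaps included.
matcher = Matcher(nlp.vocab)
matcher.add("CLOUD", [cloud_computing, cloud])  # list form, spaCy >= 2.2.2
doc = nlp(text)
matcher_counts = Counter(doc[s:e].text.lower() for _, s, e in matcher(doc))

# EntityRuler: overlapping candidates collapse into the longest span.
if spacy.__version__.startswith("2."):
    from spacy.pipeline import EntityRuler
    ruler = EntityRuler(nlp)
    nlp.add_pipe(ruler)
else:
    ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": cloud_computing, "id": "CLOUD"},
    {"label": "PRODUCT", "pattern": cloud, "id": "CLOUD"},
])
ruler_counts = Counter(ent.ent_id_ for ent in nlp(text).ents)

print(matcher_counts)  # "cloud" is counted inside "Cloud computing" as well
print(ruler_counts)    # one entity per mention
```

The sample text contains two mentions. The token matcher reports three matches for them, while the EntityRuler reports two entities, which is the same effect as 36 versus 26 above.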

You can use EntityRuler(nlp).from_disk to manage a large number of patterns. I will leave that to you for now. I hope this was helpful!
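If you want a starting point, here is a rough sketch of loading patterns from a file. It assumes the patterns are stored as JSONL (one JSON object per line), which is the format from_disk reads when the path ends in .jsonl. The temporary directory, the `patterns.jsonl` file name and the sample sentence are placeholders of mine.

```python
import json
import tempfile
from pathlib import Path

import spacy

patterns = [
    {"label": "PRODUCT", "pattern": [{"LOWER": "cloud"}], "id": "CLOUD"},
    {"label": "PRODUCT", "pattern": [{"TEXT": "IT"}], "id": "IT"},
]

# Write one JSON object per line; in real use this file is maintained separately.
path = Path(tempfile.mkdtemp()) / "patterns.jsonl"  # hypothetical location
path.write_text("\n".join(json.dumps(p) for p in patterns) + "\n", encoding="utf-8")

nlp = spacy.blank("en")
if spacy.__version__.startswith("2."):
    from spacy.pipeline import EntityRuler
    ruler = EntityRuler(nlp).from_disk(path)  # v2: load, then add to the pipeline
    nlp.add_pipe(ruler)
else:
    ruler = nlp.add_pipe("entity_ruler").from_disk(path)  # v3: add by name, then load

doc = nlp("IT moved everything to the cloud")  # made-up sample text
loaded_ids = [ent.ent_id_ for ent in doc.ents]
print(loaded_ids)
```

Keeping the patterns in a file means the script stays the same size no matter how many terms you add.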

You may also want to check out my other article to see how I use SpaCy to create an index of my archive of PPT presentations with Python.

Written by Aaron Yu