How to Extract Named Entities from Text using spaCy Rule-Based Matching

Published in birdie.ai, Jul 5, 2021

In NLP, many tasks require extracting entities from text based on a pattern. There are many options, and the best solution depends on the goal and the needs of the project. In this article we’ll discuss a baseline solution for this extraction: Rule-Based Matching with spaCy. We’ll cover the installation steps, the definition of rules, the extraction code in Python, and a few notes on how to improve the results.

What is Rule-Based Matching

Rule-Based Matching is a text extraction technique that uses predefined rules to identify entities matching a pattern. In spaCy we can do this with the “Matcher” class, which lets us define those rules and retrieve the spans they match. Importantly, patterns can combine literal words, part-of-speech tags, and lexical attributes.
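As a rough illustration of those three kinds of pattern, the sketch below matches a literal word, a part-of-speech tag, and a lexical attribute on a tiny document. The sentence, the rule names, and the hand-assigned POS tags are assumptions made for illustration, and a blank English pipeline is used so the example runs without a trained model.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Blank English pipeline: tokenizer and vocabulary only, no trained components.
nlp = spacy.blank("en")

# Hand-built document; the POS tags are assigned manually (an assumption),
# where a trained pipeline would normally predict them.
doc = Doc(
    nlp.vocab,
    words=["The", "screen", "is", "6", "inches"],
    pos=["DET", "NOUN", "AUX", "NUM", "NOUN"],
)

matcher = Matcher(nlp.vocab)
matcher.add("LITERAL_WORD", [[{"LOWER": "screen"}]])   # literal word, case-insensitive
matcher.add("POS_TAG", [[{"POS": "NUM"}]])             # part-of-speech tag
matcher.add("LEXICAL_ATTR", [[{"IS_DIGIT": True}]])    # lexical attribute

# Each match is a (match_id, start, end) tuple; the id maps back to the rule name.
found = {
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
}
print(sorted(found))
```

Each pattern is a list of dictionaries, one dictionary per token, so longer patterns are built by adding more dictionaries to the list.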

Install

First, we need to install spaCy 3.0. The two commands below install the library and the spaCy English pipeline.

pip install -U spacy 
python -m spacy download en_core_web_sm

Definition of Rules

This is an important part of the task. We’ll define two basic rules just to introduce the problem; in practice you’ll want more, depending on what exactly you expect to extract.

For instance, I’ll consider “attributes” to be the product entities that consumers mention in the comments section of an e-commerce site.

We can also use spaCy’s Rule-based Matcher Explorer to test our rules and see the dependency structure.

Rule 1

Comment: “Great smartphone. I love the screen size.”

Important attributes: “smartphone” and “screen size”.

We can create the rules:

Smartphone = Noun
Screen Size = Compound + Noun

Both cases are covered by one rule: a token with the part-of-speech (POS) tag “noun”, optionally preceded by a token with the dependency label (DEP) “compound”. To make the compound token optional, we use the operator (OP) “?”. Let’s call this rule “Noun and compound”; we’ll write it into a Python dictionary later in this article.
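A minimal sketch of the “Noun and compound” rule, under a few assumptions: the document is built by hand with manually assigned POS tags, heads, and dependency labels so that it runs without the downloaded pipeline, and `greedy="LONGEST"` is my addition to keep “screen size” instead of also returning “screen” and “size” on their own.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")  # vocabulary only; no trained pipeline required

# "Noun and compound": an optional "compound" token followed by a noun.
noun_and_compound = [{"DEP": "compound", "OP": "?"}, {"POS": "NOUN"}]

# Hand-annotated stand-in for nlp("Great smartphone. I love the screen size.").
# The tags, heads, and labels are assumptions, not model output.
doc = Doc(
    nlp.vocab,
    words=["Great", "smartphone", ".", "I", "love", "the", "screen", "size", "."],
    spaces=[True, False, True, True, True, True, True, False, False],
    pos=["ADJ", "NOUN", "PUNCT", "PRON", "VERB", "DET", "NOUN", "NOUN", "PUNCT"],
    heads=[1, 1, 1, 4, 4, 7, 7, 4, 4],
    deps=["amod", "ROOT", "punct", "nsubj", "ROOT", "det", "compound", "dobj", "punct"],
)

matcher = Matcher(nlp.vocab)
# greedy="LONGEST" keeps only the longest of any overlapping matches for this rule.
matcher.add("Noun and compound", [noun_and_compound], greedy="LONGEST")

# Sort by position, since greedy filtering does not return matches in document order.
spans = sorted(matcher(doc, as_spans=True), key=lambda span: span.start)
attributes = [span.text for span in spans]
print(attributes)  # ['smartphone', 'screen size']
```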

Rule 2

Comment: “This phone is water-resistant?”

Important attributes: “phone” and “water-resistant”.

We can create the rules:

Phone = Noun
Water Resistant = Noun + Adjective

This case needs a new rule: a token with the part-of-speech (POS) tag “noun” followed by a token with the POS tag “adjective”. Let’s call this rule “Noun and adjective”.
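The “Noun and adjective” rule could be sketched as follows. One caveat: spaCy’s English tokenizer generally splits a hyphenated word such as “water-resistant” into three tokens (“water”, “-”, “resistant”), so this sketch puts an optional hyphen token between the noun and the adjective. That extra token, like the hand-assigned POS tags, is an assumption on my part rather than part of the rule as described above.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")  # vocabulary only; no trained pipeline required

# "Noun and adjective": a noun followed by an adjective.
# The optional "-" token is an assumption added to handle hyphenated words.
noun_and_adjective = [{"POS": "NOUN"}, {"ORTH": "-", "OP": "?"}, {"POS": "ADJ"}]

# Hand-annotated stand-in for nlp("This phone is water-resistant?").
# The POS tags are assumptions, not model output.
doc = Doc(
    nlp.vocab,
    words=["This", "phone", "is", "water", "-", "resistant", "?"],
    spaces=[True, True, True, False, False, False, False],
    pos=["DET", "NOUN", "AUX", "NOUN", "PUNCT", "ADJ", "PUNCT"],
)

matcher = Matcher(nlp.vocab)
matcher.add("Noun and adjective", [noun_and_adjective])

attributes = [span.text for span in matcher(doc, as_spans=True)]
print(attributes)  # ['water-resistant']
```

On its own this rule does not return “phone”; that attribute is a plain noun and is picked up by the “Noun and compound” rule.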

spaCy Rule Definition

Let’s join both rules in a single Python dictionary and use it in our model in the next section.
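One possible shape for that dictionary is sketched here. The variable name `rules` is hypothetical, and the optional hyphen token in the second rule is an assumption for hyphenated words such as “water-resistant”. Each value is a list of patterns, which is the form `Matcher.add` accepts.

```python
# Rule name -> list of token patterns (each pattern is a list of per-token dicts).
rules = {
    # Optional "compound" token followed by a noun, e.g. "smartphone", "screen size".
    "Noun and compound": [
        [{"DEP": "compound", "OP": "?"}, {"POS": "NOUN"}],
    ],
    # Noun followed by an adjective, e.g. "water-resistant".
    # The optional "-" token is an assumption, not part of the rule as stated.
    "Noun and adjective": [
        [{"POS": "NOUN"}, {"ORTH": "-", "OP": "?"}, {"POS": "ADJ"}],
    ],
}

for name, patterns in rules.items():
    print(name, patterns)
```

Keeping the rules in a plain dictionary makes it easy to add a rule, or a second pattern for an existing rule, without touching the matching code.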

Model

After defining all the rules for our attributes, we need to code the matcher that will extract them. We create a spaCy “Matcher” object and add every rule defined previously. At that point the model is ready to extract attributes from any text we pass in.
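A minimal sketch of that step, assuming the rules live in a dictionary of rule name to pattern list. The helper name `build_matcher` is hypothetical, and `spacy.blank("en")` stands in for `spacy.load("en_core_web_sm")` so the snippet runs without the downloaded pipeline.

```python
import spacy
from spacy.matcher import Matcher

# Assumed rule dictionary: rule name -> list of token patterns.
rules = {
    "Noun and compound": [[{"DEP": "compound", "OP": "?"}, {"POS": "NOUN"}]],
    # The optional "-" token is an assumption for hyphenated words.
    "Noun and adjective": [[{"POS": "NOUN"}, {"ORTH": "-", "OP": "?"}, {"POS": "ADJ"}]],
}


def build_matcher(vocab, rules):
    """Create a Matcher on the given vocab and register every rule in `rules`."""
    matcher = Matcher(vocab)
    for name, patterns in rules.items():
        matcher.add(name, patterns)
    return matcher


nlp = spacy.blank("en")  # stand-in for spacy.load("en_core_web_sm")
matcher = build_matcher(nlp.vocab, rules)
print(len(matcher))  # number of registered rules
```

The matcher has to share its vocabulary with the documents it will process, which is why it is built from `nlp.vocab`.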

Extraction

The model is ready, and we can now extract attributes by running the matcher over a text.
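The extraction step might look like the sketch below. It makes several assumptions: the two example comments are rebuilt by hand with manually assigned tags and dependency labels instead of being parsed by `en_core_web_sm`, the helper `extract_attributes` is a hypothetical name, and overlapping matches are resolved with `spacy.util.filter_spans`, which prefers the longest span.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

rules = {
    "Noun and compound": [[{"DEP": "compound", "OP": "?"}, {"POS": "NOUN"}]],
    # The optional "-" token is an assumption for hyphenated words.
    "Noun and adjective": [[{"POS": "NOUN"}, {"ORTH": "-", "OP": "?"}, {"POS": "ADJ"}]],
}


def extract_attributes(doc, matcher):
    """Return the longest non-overlapping matches as strings, in document order."""
    spans = matcher(doc, as_spans=True)
    return [span.text for span in filter_spans(spans)]


nlp = spacy.blank("en")  # stand-in for spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
for name, patterns in rules.items():
    matcher.add(name, patterns)

# Hand-annotated stand-ins for the two comments; all annotations are assumptions.
first = Doc(
    nlp.vocab,
    words=["Great", "smartphone", ".", "I", "love", "the", "screen", "size", "."],
    spaces=[True, False, True, True, True, True, True, False, False],
    pos=["ADJ", "NOUN", "PUNCT", "PRON", "VERB", "DET", "NOUN", "NOUN", "PUNCT"],
    heads=[1, 1, 1, 4, 4, 7, 7, 4, 4],
    deps=["amod", "ROOT", "punct", "nsubj", "ROOT", "det", "compound", "dobj", "punct"],
)
second = Doc(
    nlp.vocab,
    words=["This", "phone", "is", "water", "-", "resistant", "?"],
    spaces=[True, True, True, False, False, False, False],
    pos=["DET", "NOUN", "AUX", "NOUN", "PUNCT", "ADJ", "PUNCT"],
    heads=[1, 2, 2, 5, 5, 2, 2],
    deps=["det", "nsubj", "ROOT", "npadvmod", "punct", "acomp", "punct"],
)

first_attributes = extract_attributes(first, matcher)
second_attributes = extract_attributes(second, matcher)
print(first_attributes)   # ['smartphone', 'screen size']
print(second_attributes)  # ['phone', 'water-resistant']
```

With the pipeline installed earlier, the hand-built documents would be replaced by `nlp = spacy.load("en_core_web_sm")` and `doc = nlp(text)`. The tags the model predicts may differ from the ones assumed here, so the extracted attributes can differ as well.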

With that, we achieve our goal of extracting important attributes from text in the context of smartphone reviews.

How to improve

This is a baseline solution for extracting entities from text, and it can be a fast solution to many problems. That said, it’s very restricted and gives good results only when the text follows a very structured pattern. If that’s your case, it may be good enough.

The upside is that it doesn’t require training data, which can save a lot of time: it’s often more practical to define a rule that matches most scenarios and get good results from it.

If you’re dealing with a more complex task, it’s recommended to use a statistical model that can learn more complex patterns, even though that requires training data. spaCy itself offers other tools for that problem, for instance “Named Entity Recognition” (NER).

References

Rule-Based Matching: https://spacy.io/usage/rule-based-matching

Linguistic Features: https://spacy.io/usage/linguistic-features