Train a Custom Named Entity Recognition Model with spaCy v3

Johni Douglas Marangon
8 min read · Jun 19, 2023


A few months ago, I worked on a NER project. It was my first contact with spaCy for this kind of problem, so I decided to create a quick tutorial to share the knowledge I acquired along the way.

In this post, we’ll:

  • Learn about spaCy’s architecture and pipelines.
  • Create a rule-based named entity recognizer.
  • See the steps to train a custom Named Entity Recognition pipeline.
  • Visualize the named entities.

What is Named Entity Recognition (NER)?

NER is a Natural Language Processing (NLP) task that locates and categorizes key pieces of information, also known as entities, in text. An entity is a word or sequence of words that can be categorized and extracted from unstructured text, such as person names, organizations, locations, addresses, etc.

What’s spaCy?

spaCy is a free, open-source library for advanced NLP in Python. If you’re working with a lot of text, you’ll eventually want to know more about it.

spaCy is designed specifically for production use and helps you build applications that process and understand large volumes of text.

spaCy offers many features for working with text, such as tokenization, text classification, and part-of-speech (PoS) tagging. In this article, we’ll work with Named Entity Recognition.

Architecture

The central data structures in spaCy are the Language class, the Vocab and the Doc object.

  • The Language class is used to process a text and turn it into a Doc object. It’s typically stored in a variable called nlp. It is created when you call spacy.load and contains the shared vocabulary of a specific language, e.g. English, Portuguese, Spanish, etc.
  • The Vocab centralizes strings, word vectors and lexical attributes.
  • The Doc object owns the sequence of tokens and all their linguistic annotations.
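
To make these relationships concrete, here is a minimal sketch, using a blank English pipeline so no download is needed, of how the three objects fit together:

import spacy


# Language object: created here as a blank pipeline instead of spacy.load
nlp = spacy.blank("en")

# Doc object: the processed text, a sequence of Token objects
doc = nlp("spaCy is fast.")
print([token.text for token in doc])

# Vocab: shared storage, e.g. mapping strings to hash values
print(nlp.vocab.strings["fast"])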

See more about the spaCy Architecture.

Getting Started

To prepare our development environment, let’s take a look at some commands.

To install spaCy, or upgrade it to the latest version, run:

!pip install -U spacy -q

Next, download the Portuguese pre-trained pipeline, which we’ll use to give an overview of spaCy:

!python -m spacy download pt_core_news_sm -q

The list of available pipelines by language can be found here.

Finally, you can inspect some information about spaCy and the language pipeline:

!python -m spacy info
!python -m spacy info pt_core_news_sm

As you can see above, we are using the spaCy CLI. You can learn more about it here.

Pipelines

A spaCy pipeline is a sequence of processing steps applied to a text. In short, each step in the pipeline modifies the data or extracts information from it. In some cases, one component produces a result that the next component consumes, creating a dependency between them.

Below, you can see how to invoke the spaCy pipeline and check the existing components in the pipeline:

import spacy


nlp = spacy.load("pt_core_news_sm")

print(nlp.pipe_names)

tok2vec and ner are the built-in pipeline components used to extract the named entities and populate the ents property of the Doc object.

When you call nlp on a text, spaCy first tokenizes it to produce a Doc object. The Doc is then processed by several different components; this is also referred to as the processing pipeline. spaCy provides a range of built-in pipeline components for different languages and also allows adding custom pipeline components. A custom pipeline component is a function that receives a Doc object, modifies it and returns it.

text = """
O Bitcoin (BTC) recuperou parte das perdas registradas em meio à
batalha regulatória.
"""


doc = nlp(text)
print(doc.ents)

We can create our own pipeline components to extend these tasks, as shown in the sketch below. If you need to recognize custom named entities, though, you will probably need to train a new pipeline.
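
As a minimal illustration, the sketch below adds a toy component (the name entity_counter is my own, not a built-in) that receives the Doc, reports the entities found so far and returns it unchanged:

import spacy
from spacy.language import Language


@Language.component("entity_counter")
def entity_counter(doc):
    # A toy component: inspects the Doc and returns it unchanged
    print(f"Entities found so far: {len(doc.ents)}")
    return doc


nlp = spacy.load("pt_core_news_sm")
nlp.add_pipe("entity_counter", last=True)  # runs after the built-in ner
print(nlp.pipe_names)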

More about Named Entity Recognition (NER)

Named Entity Recognition or NER is a way to find real-world objects, like persons, companies or locations in a text. We can recognize various types of named entities in a document. This doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

doc = nlp("""
Meu nome é Johnny B. Goode e hoje estou
tocando em Hollywood no Teatro Álvaro de Carvalho
""")

for ent in doc.ents:
print(f"{ent.label_} : {ent.text}")

The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

Visualizing named entities

spaCy also comes with a built-in named entity visualizer that lets you check your model’s predictions in your browser. You can pass in one or more Doc objects and start a web server, export HTML files or view the visualization directly from a Jupyter Notebook.

Using spaCy’s built-in displaCy visualizer you can explore an entity recognition model’s behavior interactively. If you’re training a model, it can also be incredibly helpful in speeding up development and debugging your code and training process.

from spacy import displacy


colors = {"PER": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"colors": colors}

displacy.render(doc, style="ent", jupyter=True, options=options)

This tool will help you save time while developing or debugging a new pipeline model.
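
For instance, outside a notebook you can write the visualization to an HTML file instead of rendering it inline. Here is a minimal sketch (the file name entities.html is arbitrary):

from spacy import displacy


# With jupyter=False, render returns the markup as a string;
# page=True wraps it in a complete HTML page
html = displacy.render(doc, style="ent", page=True, jupyter=False)

with open("entities.html", "w", encoding="utf-8") as f:
    f.write(html)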

Training a new pipeline

Some of spaCy’s components are powered by statistical models whose predictions are based on the models’ current weight values. Those weight values are estimated from the examples the model has seen during training.

Training is context dependent: the tokens around your entities are taken into account when finding an entity.

Also, your training data should always be representative of the data you want to process.

This also means that, in order to know how the model is performing and whether it’s learning the right things, you need not only training data but also evaluation data.

The recommended way to train your spaCy pipelines is via the spaCy train command line.

To show how NER works in spaCy, we’ll use this text written in Brazilian Portuguese.

Using pre-trained pipelines

The spaCy v3 trained pipelines are designed to be efficient and configurable: they balance speed, size and accuracy when the pipeline is executed.

Let’s see how the pre-trained model performs:

import spacy
import urllib.request


url = "https://gist.githubusercontent.com/johnidm/157acebd00fcb70d8044b43cc02ab884/raw/99a97a9d1f866dab9e2b54378f039fc435ffbf4e/document.txt"

nlp = spacy.load("pt_core_news_sm")

document = urllib.request.urlopen(url).read().decode("utf-8")
doc = nlp(document)

spacy.displacy.render(doc, style="ent", jupyter=True)

As you can see, all entities have been highlighted for easy identification in the text.

Using rule-based entity recognition

It is possible to use rule-based entity recognition to find entities. The EntityRuler is a pipeline component that’s typically added via nlp.add_pipe.

When the nlp object is called on a text, it will find matches in the doc and add them as entities to the doc.ents, using the specified pattern label as the entity label.

See how to do that in the example below:

import spacy
import urllib.request


nlp = spacy.load("pt_core_news_sm")

url = "https://gist.githubusercontent.com/johnidm/157acebd00fcb70d8044b43cc02ab884/raw/99a97a9d1f866dab9e2b54378f039fc435ffbf4e/document.txt"

# Add the EntityRuler before the statistical ner component
entity_ruler = nlp.add_pipe("entity_ruler", before="ner")

patterns = [
    {
        "label": "CRYPTO",
        "pattern": [
            {"LOWER": {"IN": ["bitcoin", "tether", "ether", "ethereum"]}}
        ],
    }
]

entity_ruler.add_patterns(patterns)

document = urllib.request.urlopen(url).read().decode("utf-8")
doc = nlp(document)

spacy.displacy.render(doc, style="ent", jupyter=True)

Sometimes other solutions, such as regular expressions, are simpler.

Rule-based systems are a good choice if there’s a more or less finite number of examples that you want to find in the data, or if there’s a very clear, structured pattern you can express with token rules or regular expressions. For instance, country names, IP addresses or URLs.
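
As a point of comparison, here is a plain-regex sketch that finds the same crypto names with no spaCy machinery at all (the pattern and sample text are illustrative):

import re


CRYPTO_RE = re.compile(r"\b(bitcoin|tether|ether|ethereum)\b", re.IGNORECASE)

text = "O Bitcoin (BTC) recuperou parte das perdas registradas."
for match in CRYPTO_RE.finditer(text):
    # Print each match with its character offsets
    print(match.group(), match.start(), match.end())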

For complex tasks, it’s usually better to train a statistical entity recognition model. This is especially true at the start of a project: you can use a rule-based approach as part of a data collection process, to help you “bootstrap” a statistical model.

Training a custom NER pipeline

To train your custom NER model, you need to follow four steps:

  • Create a data annotation set.
  • Convert the data to binary format.
  • Create a config file.
  • Run the training process.

Data annotation set

In a real-world project, you need to use a NER annotation tool, like doccano or similar, to create your dataset.

The format required for training a custom NER looks like this:

[
    "My name is Johnny Kolly and I'm a producer. My boss is Jimmy Smith.",
    {
        "entities": [
            (11, 23, "PERSON"),
            (55, 66, "PERSON"),
            (34, 42, "JOB"),
        ]
    }
]

The entities are PERSON and JOB. Note that each annotation records the exact character positions where the entity appears in the text.
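
A quick way to double-check these offsets is to slice the text with them; each slice should print exactly the annotated entity:

text = "My name is Johnny Kolly and I'm a producer. My boss is Jimmy Smith."

for start, end, label in [(11, 23, "PERSON"), (55, 66, "PERSON"), (34, 42, "JOB")]:
    print(f"{label}: {text[start:end]}")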

I prepared a dataset to use in this post. Let’s load the dataset and split it into a training set and a dev set:

import json
import urllib.request


url = "https://gist.githubusercontent.com/johnidm/0971d537443515fce71ab28907ecaef5/raw/f1cc41b94345516720bcc98c1984581f028b9486/dataset.json"

data = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))

dataset = data["annotations"]
TRAIN_DATA = dataset[:30]
DEV_DATA = dataset[30:]

Convert data to spaCy’s binary training format

We need to convert our files into spaCy’s binary training data format, a serialized DocBin, for use with the train command and other experiment management functions. When using the spacy convert CLI, the converter can be specified on the command line or chosen based on the file extension of the input file.

The binary format is extremely efficient in storage, especially when packing multiple documents together.

Training data for NLP projects comes in many different formats such as CoNLL, IOB, etc.

Let’s convert the raw data to .spacy format. See the code snippet below:

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm


def convert(path, dataset):
    nlp = spacy.blank("pt")
    db = DocBin()
    for text, annot in tqdm(dataset):
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            # char_span returns None if the offsets don't align with token boundaries
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk(path)


convert("train.spacy", TRAIN_DATA)
convert("dev.spacy", DEV_DATA)

Now, we have the data ready for training!
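
As a quick sanity check, a minimal sketch assuming the files were written as above, you can load a DocBin back from disk and inspect the annotated documents:

import spacy
from spacy.tokens import DocBin


# A shared Vocab is needed to rebuild the serialized documents
nlp = spacy.blank("pt")
docs = list(DocBin().from_disk("train.spacy").get_docs(nlp.vocab))

print(len(docs), "documents")
print(docs[0].ents)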

Let’s train a NER model by adding our custom entities.

Config file

The training run is driven by a configuration file that includes all settings and hyperparameters.

The recommended config settings generated by the quickstart widget and the init config command are based on some general best practices and things we’ve found to work well in our experiments. The goal is to provide you with the most useful defaults.

!python -m spacy init config config.cfg --lang pt --pipeline ner --optimize efficiency --force

After you save the config file, you can start the training process.

Train process

The train command expects data in spaCy’s binary format and a config file as input, and it saves the best model from all epochs as well as the last one, along with the final pipeline.

!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./dev.spacy

Once training finishes, the model can be loaded for use.
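
Optionally, you can also score the trained pipeline against the dev set with spaCy’s built-in evaluate command, which reports precision, recall and F-score:

!python -m spacy evaluate ./model-best ./dev.spacy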

Loading the model

Congratulations, now you are able to use your trained pipeline to extract your custom entities. The pipeline is saved in the model-best directory. Load and test it as follows:

import urllib.request

import spacy


url = "https://gist.githubusercontent.com/johnidm/157acebd00fcb70d8044b43cc02ab884/raw/99a97a9d1f866dab9e2b54378f039fc435ffbf4e/document.txt"

document = urllib.request.urlopen(url).read().decode("utf-8")
print(document[:60])  # preview the beginning of the document

nlp = spacy.load("model-best")

doc = nlp(document)

colors = {"CRYPTO": "linear-gradient(315deg, #f5d020, #f53803)"}
options = {"ents": ["CRYPTO"], "colors": colors}

spacy.displacy.render(doc, style="ent", options=options, jupyter=True)

To see the final result, check out the playable notebook here.

Conclusion

I hope you have learned how to use NER in spaCy. The goal here was to understand how to train a custom NER model, how to visualize entities in a text document, and how to load the final model. These tasks are essential for building your own NER pipeline in spaCy.

If you found this content useful, clap 👏 and subscribe to my blog.
