Building a custom Named Entity Recognition model using spaCy —Training a Model — Part 2

Johni Douglas Marangon
4 min read · Nov 21, 2023


In today’s post, we will learn how to train a NER model. In the previous post we walked through the steps to get the data and create the annotations; now we will use that data to build our custom model with spaCy.

At the end of this article, you will be able to train a NER model with a custom dataset.

I recommend reading the post — Train a Custom Named Entity Recognition with spaCy v3 — where I explain the steps involved in training a custom NER model in more detail. Here we will be more direct.

Loading the dataset

To start, we will download the full dataset created in the previous post. It contains all 757 documents annotated with the affiliation label. The DOI entity will be extracted with a string pattern.

Download the dataset:

wget https://gist.githubusercontent.com/johnidm/1cb74e2603177d3d3d554b6b1fe79728/raw/49e83c8178a16a76518894af8c00c2f4fb80a171/dataset.jsonl

Load the dataset from the file:

import json


with open('dataset.jsonl', 'r') as f:
    lines = list(f)

training_data: list = []

for line in lines:
    row = json.loads(line)
    training_data.append([row["text"], {"entities": row["label"]}])

print(len(training_data))
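Each line in dataset.jsonl is a JSON object holding the document text and a list of character-offset annotations. A hypothetical record (the text and offsets below are illustrative, not taken from the real dataset) looks like this:

```python
import json

# Illustrative record only; the real dataset's texts and offsets differ.
sample = '{"text": "Department of Physics, MIT", "label": [[0, 26, "AFFILIATION"]]}'

row = json.loads(sample)
entry = [row["text"], {"entities": row["label"]}]
print(entry)
```

Each annotation is a `[start, end, label]` triple of character offsets into the text, which is exactly the shape the loading loop above expects.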

Split the data into training and dev set:

train_split = int(len(training_data) * 0.8)  # 80% training and 20% dev set

train_data = training_data[:train_split]
dev_data = training_data[train_split:]
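One caveat: if the JSONL file happens to be ordered (for example, by source), a plain slice can give the dev set a skewed distribution. A small sketch of shuffling before splitting (the seed value is arbitrary, chosen only to make the split reproducible):

```python
import random

data = list(range(10))  # stand-in for training_data

random.seed(42)  # arbitrary seed, for a reproducible split
random.shuffle(data)

split = int(len(data) * 0.8)
train, dev = data[:split], data[split:]
print(len(train), len(dev))  # 8 2
```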

Convert the dataset into spaCy binary format:

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm


def convert(path, dataset):
    nlp = spacy.blank("en")
    db = DocBin()
    for text, annot in tqdm(dataset):
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping nil entity")
            elif span.text != span.text.strip():
                print("Skipping entity span with whitespace")
            else:
                ents.append(span)
        doc.ents = ents

        db.add(doc)
    db.to_disk(path)


convert("train.spacy", train_data)
convert("dev.spacy", dev_data)

You should now have two important files available: train.spacy and dev.spacy. They contain all the data in spaCy’s binary format, ready to be used in the training step.
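To sanity-check what was written, a DocBin can be loaded back and its docs inspected. The sketch below round-trips a DocBin through bytes with a made-up doc; `DocBin().from_disk("train.spacy")` works the same way for the files on disk:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()
db.add(nlp.make_doc("University of Somewhere"))  # illustrative doc

# Round-trip through the binary format and recover the docs.
restored = DocBin().from_bytes(db.to_bytes())
docs = list(restored.get_docs(nlp.vocab))
print(len(docs), docs[0].text)
```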

Training a NER

In this section, we will apply a sequence of steps to train a NER model in spaCy. We will use the training data to teach the model to recognize and classify the affiliation entity in a text document.

Install the spaCy library:

pip install spacy -q

Check information about the spaCy environment:

python -m spacy info

Create a config file named config.cfg.

python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency --force

Check if the data is ready to use:

python -m spacy debug data config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy

Finally, train the model:

python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./dev.spacy

Once the pipeline is trained, the best model is stored in the output directory, and we can load it to test on other examples.

Test the model

To test our trained model, we will download a random paper in PDF format, extract the text, and apply the model. If necessary, install the library to convert the PDF into text:

pip install pdfminer.six -q

Download the paper (feel free to choose another file):

wget https://www.theoj.org/joss-papers/joss.05160/10.21105.joss.05160.pdf

Extract DOI entity

As I mentioned in the previous post, the DOI entity can be extracted using a string pattern. In spaCy, entity rules are a way to apply patterns to recognize named entities in text. These rules consist of a combination of tokens and entity labels that define the structure and characteristics of named entities.

See the example of how to apply a rule to identify a DOI entity:

from pdfminer.high_level import extract_text
import spacy

text = extract_text("10.21105.joss.05160.pdf")

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    {
        "label": "DOI",
        "pattern": [
            {"LOWER": {"REGEX": r"\d{2}\.\d{5}"}},
            {"TEXT": "/"},
            {"LOWER": {"REGEX": r"joss\.\d{5}"}},
        ],
    }
]
ruler.add_patterns(patterns)

doc = nlp(text)

for ent in doc.ents:
    print(ent.label_, ent.text)

Entity rules are a complementary approach to spaCy’s statistical NER models. By creating and applying them, you can “teach” spaCy to identify custom entities based on specific patterns found in the text, allowing for more accurate and customizable entity recognition.
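The two approaches can also live in one pipeline, so a single nlp(...) call yields both entity types. The sketch below uses a blank pipeline and a made-up sentence to stay self-contained; with the trained model you would call spacy.load("model-best") and add the ruler with nlp.add_pipe("entity_ruler", before="ner") instead:

```python
import spacy

# A blank pipeline stands in for the trained one here; with the real model,
# use spacy.load("model-best") and nlp.add_pipe("entity_ruler", before="ner").
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {
        "label": "DOI",
        "pattern": [
            {"LOWER": {"REGEX": r"\d{2}\.\d{5}"}},
            {"TEXT": "/"},
            {"LOWER": {"REGEX": r"joss\.\d{5}"}},
        ],
    }
])

doc = nlp("The paper is available at 10.21105/joss.05160")
print([(ent.label_, ent.text) for ent in doc.ents])
```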

Extract affiliation entity

Run the code below to identify the affiliation entity in the paper using the trained model:

from pdfminer.high_level import extract_text
import spacy


text = extract_text("10.21105.joss.05160.pdf")

nlp = spacy.load("model-best")
doc = nlp(text)

for ent in doc.ents:
    print(ent.label_, ent.text)

For a better visualization of the NER entities in the document, use displaCy:

colors = {'AFFILIATION': "#FFFF33"}
options = {"ents": ['AFFILIATION'], "colors": colors}

spacy.displacy.render(doc, style="ent", jupyter=True, options=options)
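Note that jupyter=True only works inside a notebook. Outside one, render returns the markup as a string, which you can write to a file and open in a browser; the sketch below builds a tiny hand-labeled doc so it runs standalone:

```python
import spacy
from spacy import displacy

nlp = spacy.blank("en")
doc = nlp("MIT is a research university.")
doc.ents = [doc.char_span(0, 3, label="AFFILIATION")]  # hand-labeled for the demo

colors = {"AFFILIATION": "#FFFF33"}
options = {"ents": ["AFFILIATION"], "colors": colors}

# Without jupyter=True, render returns the HTML markup as a string.
html = displacy.render(doc, style="ent", options=options)
with open("entities.html", "w") as f:
    f.write(html)
```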

As you can see, the NER model performs very well at extracting the entities.

Closing Remarks

As you can see, there is no great difficulty in training a NER model; it is really easy to do. Congratulations to the spaCy team for that.

The resulting model in the model-best folder will be used in the next post to write a RESTful API.

You can find the reproducible notebook here.

See you in the final post of this series. Happy learning.
