Building a custom Named Entity Recognition model using spaCy —Training a Model — Part 2
In today’s post, we will learn how to train a NER model. In the previous post, we walked through the steps to get the data and create the annotations; now we will use this data to build our custom model with spaCy.
At the end of this article, you will be able to train a NER model with a custom dataset.
I recommend looking at the post Train a Custom Named Entity Recognition with spaCy v3, where I explain the steps involved in training a custom NER model in more detail. Here we will be more direct.
Loading the dataset
To start, we will download the full dataset built in the previous post. This dataset contains all 757 documents annotated with the affiliation label. The DOI entity will be extracted with a string pattern.
Download the dataset:
wget https://gist.githubusercontent.com/johnidm/1cb74e2603177d3d3d554b6b1fe79728/raw/49e83c8178a16a76518894af8c00c2f4fb80a171/dataset.jsonl
Load the dataset from the file:
import json

with open('dataset.jsonl', 'r') as f:
    lines = list(f)

training_data: list = []
for line in lines:
    row = json.loads(line)
    training_data.append([row["text"], {"entities": row["label"]}])

print(len(training_data))
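To see what the loop above expects, here is a minimal sketch that parses a single JSONL line. The text and the label offsets are made up for illustration; the real dataset uses the annotations produced in the previous post.

```python
import json

# Hypothetical line from dataset.jsonl: the raw text plus a list of
# [start, end, label] character offsets (values here are invented).
sample = '{"text": "MIT, Cambridge, USA", "label": [[0, 19, "AFFILIATION"]]}'

row = json.loads(sample)
text, annotations = row["text"], {"entities": row["label"]}

start, end, label = annotations["entities"][0]
print(label, text[start:end])  # the annotated span
```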
Split the data into training and dev set:
train_split = int(len(training_data) * 0.8)  # 80% training and 20% dev set
train_data = training_data[:train_split]
dev_data = training_data[train_split:]
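Because the split above is positional, an ordered dataset (for example, documents grouped by source) could leave the dev set unrepresentative. A simple option is to shuffle with a fixed seed before splitting; a sketch with stand-in data:

```python
import random

data = list(range(100))  # stand-in for training_data

random.seed(42)      # fixed seed so the split is reproducible
random.shuffle(data) # break any ordering before the positional split

split = int(len(data) * 0.8)
train, dev = data[:split], data[split:]
print(len(train), len(dev))  # → 80 20
```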
Convert the dataset into spaCy binary format:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
def convert(path, dataset):
    nlp = spacy.blank("en")
    db = DocBin()
    for text, annot in tqdm(dataset):
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping nil entity")
            elif span.text != span.text.strip():
                print("Skipping entity span with whitespace")
            else:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk(path)
convert("train.spacy", train_data)
convert("dev.spacy", dev_data)
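If you want to confirm the conversion worked, a DocBin can be round-tripped from disk and inspected. This is a small sanity check with an invented one-document example, not part of the training pipeline:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Hypothetical single document with one annotated entity.
doc = nlp.make_doc("MIT, Cambridge, USA")
doc.ents = [doc.char_span(0, 3, label="AFFILIATION")]

db = DocBin()
db.add(doc)
db.to_disk("sanity.spacy")

# Load it back and check the entity survived serialization.
loaded = DocBin().from_disk("sanity.spacy")
docs = list(loaded.get_docs(nlp.vocab))
print(len(docs), docs[0].ents[0].label_)  # → 1 AFFILIATION
```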
You need to have these important files available: train.spacy and dev.spacy. They contain all the data in spaCy’s binary format to be used in the training step.
Training a NER
In this section, we will apply a sequence of processes to train a NER model in spaCy. We will use the training data to teach the model to recognize the affiliation entity and classify it in a text document.
Install the spaCy library:
pip install spacy -q
Check information about the spaCy environment:
python -m spacy info
Create a config file named config.cfg:
python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency --force
Check if the data is ready to use:
spacy debug data config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
Finally, train the model:
python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./dev.spacy
Once the pipeline is trained, it will store the best model in the output directory, which we can load to test on other examples.
Test the model
To test our trained model, we will download a random paper in PDF format, extract the text, and apply the model. If necessary, install the library to convert the PDF into text:
pip install pdfminer.six -q
Download the paper (feel free to choose another file):
wget https://www.theoj.org/joss-papers/joss.05160/10.21105.joss.05160.pdf
Extract DOI entity
As I mentioned in the previous post, the DOI entity can be extracted using a string pattern. In spaCy, entity rules are a way to apply patterns to recognize named entities in text. These rules combine token attributes and entity labels to define the structure and characteristics of named entities.
See the example of how to apply a rule to identify a DOI entity:
from pdfminer.high_level import extract_text
import spacy

text = extract_text("10.21105.joss.05160.pdf")

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    {
        "label": "DOI",
        "pattern": [
            {"LOWER": {"REGEX": r"\d{2}\.\d{5}"}},
            {"TEXT": "/"},
            {"LOWER": {"REGEX": r"joss\.\d{5}"}},
        ],
    }
]

ruler.add_patterns(patterns)

doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, ent.text)
Entity rules are a complementary approach to spaCy’s statistical NER models. By creating and applying these rules, users can “teach” spaCy to identify custom entities based on specific patterns found in the text. This allows for more accurate and customizable entity recognition.
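Before wiring a pattern into the ruler, you can check the underlying regular expression against a sample string with plain Python. The DOI below is invented for illustration and simply follows the same shape as the token pattern above:

```python
import re

# Same shape as the ruler pattern: two digits, a dot, five digits,
# a slash, then "joss." and five digits.
doi_re = re.compile(r"\d{2}\.\d{5}/joss\.\d{5}")

sample = "Available at https://doi.org/10.21105/joss.05160"
match = doi_re.search(sample)
print(match.group(0))  # → 10.21105/joss.05160
```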
Extract affiliation entity
Run the code below to identify the affiliation entity in the paper using the trained model:
from pdfminer.high_level import extract_text
import spacy

text = extract_text("10.21105.joss.05160.pdf")

nlp = spacy.load("model-best")

doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, ent.text)
To better visualize the NER entities in the document, use displaCy:
colors = {'AFFILIATION': "#FFFF33"}
options = {"ents": ['AFFILIATION'], "colors": colors}
spacy.displacy.render(doc, style="ent", jupyter=True, options=options)
As you can see, the NER model performs very well at extracting the entities.
Closing Remarks
As you can see, there is no great difficulty in training a NER model; it is really easy to do. Congratulations to the spaCy team for that.
The resulting model in the model-best folder will be used in the next post to build a RESTful API.
You can find the reproducible notebook here.
See you in the final post about this series. Happy learning.