Building a custom Named Entity Recognition model using spaCy — Data Labeling — Part 1

Johni Douglas Marangon
5 min read · Nov 21, 2023


Welcome to the first post about training a custom Named-Entity Recognition (NER) model with spaCy. This series of posts covers all the tasks needed to build an end-to-end application that extracts named entities from documents.

The series is divided into three parts:

  • Part 1 — Data Labeling: getting the data with a crawler and annotating the training documents with Doccano.
  • Part 2 — Train the Model: training a NER model with spaCy.
  • Part 3 — Create an e2e Application: creating a REST API to make the NER model available.

By the end, you will be able to build your own project to recognize entities in text documents. Grab a coffee and let’s get started.

Getting the Data

Getting the data refers to the process of obtaining and preparing the dataset that will be used to train, validate, and test a machine learning model. This involves identifying relevant data sources, obtaining the data from these sources, cleaning the data to remove errors and inconsistencies, and formatting the data in a way that is suitable for analysis. The quality and relevance of the data are crucial factors that significantly impact the performance of the resulting model.

Here are some key steps involved in getting the data for machine learning:

  • Data Collection: This is the initial step where raw data is gathered.
  • Data Cleaning: Raw data is often messy and may contain missing values, errors, or outliers.
  • Data Preprocessing/Preparation: This step involves transforming the raw data into a format suitable for training a machine learning model.
  • Data Splitting: The dataset is typically split into training, validation, and test sets.

Getting the right data is a critical aspect of building effective models. The choice of features, the size and quality of the dataset, and the preprocessing steps all play a role in determining the success of the trained model.

The process of getting the data is an iterative one, as data scientists may need to revisit previous steps to address data quality issues or refine the dataset based on their understanding of the problem.
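To make the splitting step above concrete, here is a minimal sketch of how the text files we will crawl later in this post could be divided into training, validation, and test sets. The 80/10/10 ratio is only an assumption for illustration; the docs folder is the one created further down.

import glob
import random

# Collect the extracted .txt documents (the "docs" folder is created later in this post)
files = sorted(glob.glob("docs/*.txt"))
random.seed(42)
random.shuffle(files)

# Assumed 80/10/10 split; adjust the ratios to your dataset size
n_train = int(len(files) * 0.8)
n_val = int(len(files) * 0.1)

train_files = files[:n_train]
val_files = files[n_train:n_train + n_val]
test_files = files[n_train + n_val:]

print(len(train_files), len(val_files), len(test_files))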

To demonstrate how we can create an end-to-end NER project, I chose to identify two entities in papers published by The Journal of Open Source Software (JOSS). JOSS is a developer-friendly, free, open-access, peer-reviewed journal for research software packages. This academic journal has a formal peer review process.

The entities are:

  • Digital Object Identifier (DOI): a standardized, unique string of letters and numbers used to identify articles, papers, or e-books published online.
  • Affiliations: in research papers, affiliations are the institutions where the research was conducted. They are listed as part of the author information.

The problem you will solve is to extract the DOI and the list of affiliations from JOSS research papers in PDF format, as you can see below:

Before we start getting the data, here are the challenges for this project: 1) the DOI follows a specific string pattern; 2) an affiliation is a composite value: it contains more than one word and starts with a number. The paper above has two affiliations.
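To give an idea of what these patterns look like, here is a small, hypothetical check for DOI-like strings and numbered affiliation lines. The regular expressions and the sample text are only illustrative; they are not used for training the model.

import re

# A commonly used DOI pattern: "10." followed by a registrant code and a suffix
DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

# Affiliations in JOSS papers typically start with a number, e.g. "1 University of ..."
AFFILIATION_RE = re.compile(r"^\d+\s+\S.+", re.MULTILINE)

sample = """DOI: 10.21105/joss.01234
1 Some University, Some Country
2 Another Research Institute"""

print(DOI_RE.findall(sample))
print(AFFILIATION_RE.findall(sample))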

Writing a Crawler to Get the Papers

A crawler, also known as a web crawler or spider, is a computer program that systematically browses the internet in order to index and collect information from websites. It starts by visiting a list of known web addresses and then follows the links on those pages to discover new information.

To extract all PDFs from JOSS we will write a crawler. First of all, install the Python dependencies:

pip install beautifulsoup4 python-slugify pdfminer.six -q

The variable SEED_URL contains the base URL of the JOSS pagination; it will be used to get the list of papers on each page. Pay attention to the constant MAX_PAGE: it holds the total number of pages and can be changed to crawl more (or fewer) documents.

SEED_URL = "https://joss.theoj.org/papers/published?page={page}"

MAX_PAGE = 203

pages = [SEED_URL.format(page=i + 1) for i in range(MAX_PAGE)]

print(len(pages))

The next snippet finds the paper title and the paper PDF URL in the raw HTML. The documents variable holds the list of crawled papers.

import io
import urllib.request

from bs4 import BeautifulSoup
from slugify import slugify


documents = []

for page in pages:
    # Download the listing page and parse the HTML
    html_doc = io.BytesIO(urllib.request.urlopen(page).read())
    soup = BeautifulSoup(html_doc, "html.parser")

    # Each paper title links to the paper detail page
    items = soup.find_all("h2", class_="paper-title")
    for item in items:
        a_tag = item.find("a", href=True)
        if a_tag:
            documents.append((slugify(a_tag.text), a_tag["href"]))

print(len(documents))
print(documents[0])

Now we will download the PDF files and extract their text; both will be saved in a folder named docs:

import os
import urllib.request

from pdfminer.high_level import extract_text


directory = "docs"

os.makedirs(directory, exist_ok=True)

for slug, url in documents:
    try:
        filename_pdf = os.path.join(directory, f"{slug}.pdf")
        filename_txt = os.path.join(directory, f"{slug}.txt")

        # The PDF of a JOSS paper is available at the paper URL plus a ".pdf" suffix
        url_pdf = f"{url}.pdf"

        urllib.request.urlretrieve(url_pdf, filename_pdf)

        # Extract the raw text from the downloaded PDF and save it alongside it
        with open(filename_txt, "w") as f:
            text = extract_text(filename_pdf)
            f.write(text)

    except Exception as ex:
        print(url, str(ex))

The docs directory now contains all the crawled documents in both PDF and txt format. You are ready to start the next task: creating the NER annotations for these documents.

Data Annotation with Doccano

Data annotation is the process of labeling data to make it understandable for machines. It involves adding metadata, tags, or labels to different types of data. Data annotation can be done manually by human annotators, or it can be automated using machine learning algorithms. The quality of data annotation is critical for the success of NER models. If the data is not annotated accurately, the trained models will not be able to learn from the data.

Doccano is an open-source text annotation tool for machine learning practitioners. It allows users to label text data for natural language processing tasks, such as named entity recognition, text classification, and sequence labeling. To get started with Doccano, you can create a project, upload data, and start annotating. You can build a dataset in a few hours.

The simplest way to run Doccano is to use Docker. See the usage section in the official documentation for more ways to start Doccano.

Follow the commands below:

docker pull doccano/doccano

docker container create --name doccano \
-e "ADMIN_USERNAME=admin" \
-e "ADMIN_EMAIL=admin@admin.com" \
-e "ADMIN_PASSWORD=admin" \
-v doccano-db:/data \
-p 8000:8000 doccano/doccano

docker container start doccano

The command docker container stop doccano -t 5 should be used to stop the container safely so that all created data is persisted.

The Doccano server is now running at http://127.0.0.1:8000/. Access the app and let’s start annotating the data.

Let’s take a look at the video below to understand how to load the raw documents, create the labels, make the annotations, and export the data in JSONL format from Doccano.

https://youtu.be/9TNaRTCqeZk
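For reference, a Doccano NER export is a JSONL file: one JSON object per line with the document text and its labeled spans. The exact field names depend on the Doccano version, so the sketch below assumes the common text/label layout where each label is a [start, end, label] triple; adjust the file name and keys to match your export.

import json

# Assumed path to the file exported from Doccano; adjust to your export
with open("admin.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        text = record["text"]
        # Each entry is assumed to be a [start_offset, end_offset, label_name] triple
        for start, end, label in record.get("label", []):
            print(label, "->", text[start:end])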

In real-world projects this task takes a long time. I created a complete dataset to use in the next article.

Closing Remarks

In this post, I covered all the steps involved in building a consistent dataset to train a custom NER model.

Keep in mind that quality data is a prerequisite for a successful NER model. In my opinion, creating the dataset is the most important step, and it requires full attention during annotation.

In the next post we will start the training step.

If you enjoyed this content, 👏 clap and follow me 🎯. Happy coding.
