Building Custom Named-Entity Recognition (NER) Models

A complete walk-through: we start with a dataset, iteratively annotate it programmatically, and finish with a CRF model.

Labeling datasets can be very expensive and slow. Many products exist to facilitate the process, ranging from manual point-and-click tools to complete outsourcing.

This post aims to provide an alternative to the latter, speeding up labeling and model building for real-world problems.

The complete project can be found at this repository.

Workflow / Process

  1. Randomly select a small subset of instances
  2. Programmatically annotate subset
  3. Inspect annotations
  4. Save data for our model
  5. Repeat

This entire process will be managed through the command line using the extr-ds library (Github Repository).

pip install extr-ds

Project Setup

extr-ds --init

This command will create a number of directories / files (see image below).

  • extr-config.json: contains configurable settings
  • labels.py: will contain all of our custom label rules.
  • utils.py: a couple of methods to further interact with the text (e.g., clean, tokenize).
  • directories: /1 through /4 will manage the data as we iterate through the workflow/process.
(image: workspace)

Source File

After initializing the workspace, add your data to the /1 directory (see image below). By default, the process expects source.txt, but that can be changed in the extr-config.json file.

(image: Visual Studio Code)

For this example, I used play-by-play data scraped from ESPN. The pbp repository comes set up with this dataset.

(12:52–2nd) T.Pollard left tackle to PHI 44 for -4 yards (M.Williams).

labels.py

For my play-by-play dataset, I leveraged a knowledge base and pattern matching.

  • kb: a dictionary of terms where the key is the label.
  • entity-patterns: more complex labeling that relies on regular expressions / pattern matching.
kb = {
    'PERIOD': [
        '1st',
        '2nd',
        '3rd',
        '4th',
        'OT',
    ],
    'TEAM': [
        'ARZ',
        'Arizona',
        'ATL',
        'Atlanta',
        'BLT',
        ...
    ],
}

entity_patterns = [
    RegExLabel(
        label='TIME',
        regexes=[
            RegEx(expressions=[
                r'\b[0-9]{1,2}:[0-9]{2}\b',
            ]),
        ],
    ),
    ...
]
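
As a sketch, a similar pattern could cover the PLAYER label that shows up in the annotations later on. The exact expression used in the repository may differ; this is just an illustration of adding another RegExLabel to entity_patterns:

# hypothetical PLAYER pattern matching abbreviated names like 'T.Pollard' or 'M.Williams'
RegExLabel(
    label='PLAYER',
    regexes=[
        RegEx(expressions=[
            r'\b[A-Z]\.\s?[A-Z][A-Za-z\'-]+\b',
        ]),
    ],
),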

utils.py

For my play-by-play dataset, I wanted to use the nltk tokenizer.

I also needed to add a few transformers to clean up some bad text. That can be done through the transform_text method.

from typing import List

from nltk.tokenize import word_tokenize


def word_tokenizer(text: str) -> List[str]:
    return word_tokenize(text)

def transform_text(document: str) -> str:
    return document.strip()
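
If the scraped text contains stray whitespace or other artifacts, transform_text is the place to clean it up. A minimal sketch, assuming we only need to collapse runs of whitespace (the actual clean-up rules depend on your source text):

import re

def transform_text(document: str) -> str:
    # hypothetical clean-up: collapse repeated whitespace before stripping
    document = re.sub(r'\s+', ' ', document)
    return document.strip()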

Split / Annotate the Data

To start the workflow / process, run

extr-ds --split

This will partition our source.txt file into dev.txt and holdouts.txt files found in the /2 directory, where dev.txt will contain a small number of instances (configurable, see extr-config.json).

This command also annotates the dev.txt file. The output from annotating can be found in the /3 directory.

  • dev-ents.txt: xml annotations.

(<TIME>10:40</TIME> — <PERIOD>3rd</PERIOD>) <PLAYER>S.Darnold</PLAYER> <EVENT>pass incomplete</EVENT> <DIRECTION>short right</DIRECTION>.

  • dev-ents-redacted.txt: The instance with the entity redacted.

( — ) .

  • dev-ents.html: html page to view labeled entities in a more natural way.
  • dev-ents.stats.json: contains all of the parsed entities, key=label, value=list of text that was classified as that label.
{
    "QUANTITY": [
        "-3",
        "-4",
        ...
    ],
}
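
A quick way to review the stats file is to print how many distinct strings were captured per label; a small sketch:

import json
import os

# count the distinct strings captured for each label in dev-ents.stats.json
with open(os.path.join('3', 'dev-ents.stats.json'), encoding='utf-8') as handle:
    stats = json.load(handle)

for label, values in sorted(stats.items()):
    print(label, len(values))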

Inspect

This is pretty straightforward if you use an IDE (Visual Studio Code). Open dev-ents-redacted.txt and dev-ents.txt in two editor views. This allows you to quickly see what was annotated and what is left, in case we need to refine our labeling rules. Over time, dev-ents-redacted.txt will show fewer lines as we collect data, since similar outcomes are filtered out.

(image: labels vs. leftover text)

Additionally, you can view dev-ents.html for a more natural view (see below). For custom styles, add a styles.css file in the project root.

.lb-PLAYER { background-color: lightblue }

(image: dev-ents.html)

Save

If everything looks good,

extr-ds --save -ents

This appends what we just inspected to a final file, ents.txt, found in the /4 directory. This file represents a collection of instances that we think our labeling engine has correctly annotated, and it will be used to build the CRF model.

It also appends the redacted text to ents-redacted.txt, a collection of templates used to avoid re-surfacing the same outcomes in our dev-ents-redacted.txt file.

If you notice an issue during inspection, just update your rules in the labels.py file and re-annotate.

extr-ds --annotate -ents

This will run our labeling rules again and refresh our files. If we need to start over,

extr-ds --reset

This will clear out the /4 directory.

Iterate

We can repeat the process to build up our dataset of annotated instances.

  1. Split / Annotate the Data
  2. Inspect / fix issues
  3. Save the examples

Build the Model

To keep things simple, I used the setup found in this tutorial for the CRF model. A bit of work is needed to get the dataset in the correct format (see below).

import json
import os

from nltk import pos_tag
from extr_ds.manager.utils.filesystem import load_document


def make_crf_dataset():
    records = json.loads(
        load_document(os.path.join('4', 'ents-iob.json'))
    )

    train_set = []
    for record in records:
        tokens = record['tokens']
        labels = record['labels']

        # each training sentence is a list of (token, pos-tag, label) tuples
        train_set.append(
            list(
                zip(
                    tokens,
                    list(map(lambda a: a[1], pos_tag(tokens))),
                    labels
                )
            )
        )

    return train_set
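
The output of make_crf_dataset provides the train_sents used in the next snippet:

train_sents = make_crf_dataset()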

Once you have the dataset, the rest flows with the CRF tutorial.

import sklearn_crfsuite

from models.features import sent2features, sent2labels
from sklearn.model_selection import train_test_split

X = [sent2features(s) for s in train_sents]
y = [sent2labels(s) for s in train_sents]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.15)

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)

crf.fit(X_train, y_train)
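
Before inspecting individual differences, an aggregate score helps track progress between iterations. A sketch using sklearn_crfsuite's metrics module (filtering out the 'O' tag, as in the tutorial):

from sklearn_crfsuite import metrics

# score only the entity labels, not the 'O' tag
labels = [label for label in crf.classes_ if label != 'O']

y_test_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_test_pred, labels=labels, digits=3))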

Finishing up, we can print the differences between the actual and predicted labels for the test set.

y_test_pred = crf.predict(X_test)

print()

# check_for_differences compares the actual and predicted label sequences token by token
for i, outcomes in enumerate(zip(y_test_pred, y_test)):
    differences = check_for_differences(outcomes[1], outcomes[0])
    if differences.has_diffs:
        for diff in differences.diffs_between_labels:
            print(X_test[i][diff.index]['word.lower()'], '-', diff.diff_type)
            print('s1:', outcomes[1][diff.index], 'vs s2:', outcomes[0][diff.index])
        print()
(image: differences between actual and predicted labels)
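
As a final sketch, the trained model can be applied to a new play-by-play line by reusing the same tokenizer and feature function (sent2features is the one imported above; predict_single comes from sklearn_crfsuite.CRF):

from nltk import pos_tag
from nltk.tokenize import word_tokenize

text = '(12:52–2nd) T.Pollard left tackle to PHI 44 for -4 yards (M.Williams).'
tokens = word_tokenize(text)

# build (token, pos, label) tuples with placeholder labels, then featurize
sent = [(token, tag, 'O') for token, tag in pos_tag(tokens)]
predicted = crf.predict_single(sent2features(sent))

print(list(zip(tokens, predicted)))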

Next?
