Convert Transformer Inference Output back to IOB2 Format

dp
3 min read · May 18, 2023

Annotating by hand is time-consuming and error-prone. We can speed up the process by using transformer models to pre-annotate text.

Even a simple transformer model trained on a small dataset can be extremely helpful when annotating a much larger one. This post walks through converting the model's inference output back to IOB2 format.

Transformer Model

This post uses a custom model built in an earlier post, Building Custom Named-Entity Recognition (NER) Models — Transformers.

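from transformers import pipeline

# Path or name of the fine-tuned NER checkpoint.
model_output_checkpoint = '...'

# 'simple' aggregation merges sub-word tokens into whole entities
# and reports character offsets ('start', 'end') for each one.
classifier = pipeline(
    'ner',
    model=model_output_checkpoint,
    aggregation_strategy='simple'
)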

Classifier Output

[
    {'entity_group': 'TIME', 'score': 0.9888856, 'word': '6 : 51', 'start': 1, 'end': 5},
    {'entity_group': 'PERIOD', 'score': 0.9887093, 'word': '1st', 'start': 8, 'end': 11},
    {'entity_group': 'FORMATION', 'score': 0.98260975, 'word': 'Shotgun', 'start': 14, 'end': 21},
    {'entity_group': 'PLAYER', 'score': 0.9936474, 'word': 'P. Mahomes', 'start': 23, 'end': 32},
    {'entity_group': 'EVENT', 'score': 0.69440436, 'word': 'scrambles', 'start': 33, 'end': 42},
    {'entity_group': 'DIRECTION', 'score': 0.88298887, 'word': 'right', 'start': 43, 'end': 48},
    {'entity_group': 'TEAM', 'score': 0.97735167, 'word': 'LAC', 'start': 56, 'end': 59},
    {'entity_group': 'QUANTITY', 'score': 0.9734075, 'word': '34', 'start': 60, 'end': 62},
    {'entity_group': 'QUANTITY', 'score': 0.9110169, 'word': '2', 'start': 67, 'end': 68},
    {'entity_group': 'PLAYER', 'score': 0.9935433, 'word': 'S. Joseph', 'start': 76, 'end': 84},
    {'entity_group': 'PLAYER', 'score': 0.9919572, 'word': 'K. Van Noy', 'start': 86, 'end': 95},
    {'entity_group': 'PLAYER', 'score': 0.9934915, 'word': 'S. Joseph', 'start': 107, 'end': 115},
    {'entity_group': 'TEAM', 'score': 0.97411484, 'word': 'LAC', 'start': 134, 'end': 137},
    {'entity_group': 'QUANTITY', 'score': 0.9710606, 'word': '34', 'start': 138, 'end': 140}
]
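
Note that 'word' is rebuilt from the model's tokens, so it can differ from the source text ('6 : 51' versus '6:51', 'P. Mahomes' versus 'P.Mahomes'), while 'start' and 'end' are character offsets into the original string. Here is a minimal sketch of recovering the exact surface text by slicing, using the first four results above with the scores omitted. The helper name surface_text is made up for illustration.

```python
text = (
    "(6:51 - 1st) (Shotgun) P.Mahomes scrambles right end to LAC 34 "
    "for 2 yards (S.Joseph; K.Van Noy). FUMBLES (S.Joseph), "
    "and recovers at LAC 34."
)

# First four classifier results from the output above (scores omitted).
results = [
    {'entity_group': 'TIME', 'word': '6 : 51', 'start': 1, 'end': 5},
    {'entity_group': 'PERIOD', 'word': '1st', 'start': 8, 'end': 11},
    {'entity_group': 'FORMATION', 'word': 'Shotgun', 'start': 14, 'end': 21},
    {'entity_group': 'PLAYER', 'word': 'P. Mahomes', 'start': 23, 'end': 32},
]


def surface_text(text: str, result: dict) -> str:
    """Hypothetical helper: slice the original text with the character offsets."""
    return text[result['start']:result['end']]


surface = [surface_text(text, r) for r in results]
for result, original in zip(results, surface):
    # Show the model's reconstructed 'word' next to the text as written.
    print(f"{result['entity_group']:<10} word={result['word']!r:<14} text={original!r}")
```

This is why the conversion code later in the post builds each entity from the offsets rather than from 'word'.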

Convert Response to IOB2

Convert the classifier response to a list of Entity objects. Then pass the text and entities to the labeler to get the IOB2 format.

This process relies on the extr-ds library (GitHub repository).

pip install extr-ds

from typing import Any, Dict, List
import nltk
from extr import Entity, Location
from extr_ds.labelers.iob import Labeler


labeler = Labeler(nltk.tokenize.word_tokenize)

examples = [
    '(6:51 - 1st) (Shotgun) P.Mahomes scrambles right end to LAC 34 for 2 yards (S.Joseph; K.Van Noy). FUMBLES (S.Joseph), and recovers at LAC 34.',
]

threshold = .5
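annotations: List[Dict[str, Any]] = []
for text in examples:
    # Keep only predictions at or above the confidence threshold.
    responses = [r for r in classifier(text) if r['score'] >= threshold]

    entities = []
    for response in responses:
        # Build each entity from the character offsets, not from 'word',
        # which may not match the source text exactly.
        location = Location(start=response['start'], end=response['end'])
        entities.append(
            Entity(
                len(entities) + 1,
                label=response['entity_group'],
                text=location.extract(text),
                location=location
            )
        )

    # Tokenize the text and tag each token with B-, I-, or O.
    iobs = []
    for grouping in labeler.label(text, entities):
        iobs.append({
            'tokens': [tk.text for tk in grouping.tokens],
            'labels': grouping.labels,
        })

    annotations.append({
        'text': text,
        'iob': iobs
    })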
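
To make the labeling step less opaque, here is a rough, library-free sketch of how character spans can be turned into IOB2 tags. It is not how extr-ds is implemented. The regex tokenizer is a stand-in for nltk's word_tokenize, and the names tokenize_with_offsets and spans_to_iob2 are assumptions made for illustration. The idea: a token that falls inside an entity span gets B- if it is the first token of that entity and I- otherwise, and everything else gets O.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive

# Assumed tokenizer: words (allowing inner '.' or ':') or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:[.:]\w+)*|[^\w\s]")


def tokenize_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start, end) triples for every token in the text."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]


def spans_to_iob2(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Tag each token with B-/I-/O based on the entity span that contains it."""
    tokens: List[str] = []
    labels: List[str] = []
    previous: Optional[int] = None  # index of the span the previous token was in
    for token, start, end in tokenize_with_offsets(text):
        # Find the first span that fully contains this token, if any.
        current = next(
            (i for i, (s, e, _) in enumerate(spans) if s <= start and end <= e),
            None,
        )
        if current is None:
            labels.append('O')
        elif current == previous:
            labels.append('I-' + spans[current][2])  # continues the same entity
        else:
            labels.append('B-' + spans[current][2])  # first token of an entity
        tokens.append(token)
        previous = current
    return tokens, labels


# A fragment of the example sentence, with offsets recomputed for the fragment.
fragment = '(S.Joseph; K.Van Noy).'
fragment_spans = [(1, 9, 'PLAYER'), (11, 20, 'PLAYER')]
fragment_tokens, fragment_labels = spans_to_iob2(fragment, fragment_spans)
for token, label in zip(fragment_tokens, fragment_labels):
    print(f'{token}\t{label}')
```

Tracking the span index rather than the label keeps two adjacent entities of the same type apart: each one still starts with its own B- tag.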

IOB2 Format

[
    {
        "text": "(6:51 - 1st) (Shotgun) P.Mahomes scrambles right end to LAC 34 for 2 yards (S.Joseph; K.Van Noy). FUMBLES (S.Joseph), and recovers at LAC 34.",
        "iob": [
            {
                "tokens": ["(", "6:51", "-", "1st", ")", "(", "Shotgun", ")", "P.Mahomes", "scrambles", "right", "end", "to", "LAC", "34", "for", "2", "yards", "(", "S.Joseph", ";", "K.Van", "Noy", ")", ".", "FUMBLES", "(", "S.Joseph", ")", ",", "and", "recovers", "at", "LAC", "34", "."],
                "labels": ["O", "B-TIME", "O", "B-PERIOD", "O", "O", "B-FORMATION", "O", "B-PLAYER", "B-EVENT", "B-DIRECTION", "O", "O", "B-TEAM", "B-QUANTITY", "O", "B-QUANTITY", "O", "O", "B-PLAYER", "O", "B-PLAYER", "I-PLAYER", "O", "O", "O", "O", "B-PLAYER", "O", "O", "O", "O", "O", "B-TEAM", "B-QUANTITY", "O"]
            }
        ]
    }
]
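
In IOB2, every entity starts with a B- tag, and an I- tag may only follow a B- or I- tag of the same type. Since these annotations are machine-generated, a quick sanity check before reviewing them can be handy. The sketch below is one possible way to do that; the function names is_valid_iob2 and to_columns are hypothetical and not part of extr-ds.

```python
from typing import List


def is_valid_iob2(labels: List[str]) -> bool:
    """Check that every I-X tag continues a B-X or I-X tag of the same type."""
    previous = 'O'
    for label in labels:
        if label.startswith('I-'):
            entity_type = label[2:]
            if previous not in ('B-' + entity_type, 'I-' + entity_type):
                return False
        previous = label
    return True


def to_columns(tokens: List[str], labels: List[str]) -> str:
    """Render one 'token<TAB>label' pair per line for easier manual review."""
    if len(tokens) != len(labels):
        raise ValueError('tokens and labels must have the same length')
    return '\n'.join(f'{token}\t{label}' for token, label in zip(tokens, labels))


# A slice of the tokens and labels shown above.
tokens = ['(', 'S.Joseph', ';', 'K.Van', 'Noy', ')']
labels = ['O', 'B-PLAYER', 'O', 'B-PLAYER', 'I-PLAYER', 'O']

print(is_valid_iob2(labels))
print(to_columns(tokens, labels))
```

The one-pair-per-line layout is only a convenience for eyeballing the tags; the JSON structure above stays the source of truth.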