Convert Transformer Inference Output back to IOB2 Format

May 18, 2023

Annotating text by hand is time-consuming and error-prone. We can speed up the process by using transformer models to help annotate text quickly.

Even a simple transformer model trained on a small dataset can be a big help when annotating a larger one. This post walks through converting the model's inference output back to IOB2.

Transformer Model

This post uses a custom model built in an earlier post, Building Custom Named-Entity Recognition (NER) Models — Transformers.

from transformers import pipeline

model_output_checkpoint = '...'

classifier = pipeline(
    'ner', 
    model=model_output_checkpoint,
    aggregation_strategy='simple'
)

Classifier Output

[
  {'entity_group': 'TIME', 'score': 0.9888856, 'word': '6 : 51', 'start': 1, 'end': 5},
  {'entity_group': 'PERIOD', 'score': 0.9887093, 'word': '1st', 'start': 8, 'end': 11},
  {'entity_group': 'FORMATION', 'score': 0.98260975, 'word': 'Shotgun', 'start': 14, 'end': 21},
  {'entity_group': 'PLAYER', 'score': 0.9936474, 'word': 'P. Mahomes', 'start': 23, 'end': 32},
  {'entity_group': 'EVENT', 'score': 0.69440436, 'word': 'scrambles', 'start': 33, 'end': 42}, 
  {'entity_group': 'DIRECTION', 'score': 0.88298887, 'word': 'right', 'start': 43, 'end': 48}, 
  {'entity_group': 'TEAM', 'score': 0.97735167, 'word': 'LAC', 'start': 56, 'end': 59}, 
  {'entity_group': 'QUANTITY', 'score': 0.9734075, 'word': '34', 'start': 60, 'end': 62}, 
  {'entity_group': 'QUANTITY', 'score': 0.9110169, 'word': '2', 'start': 67, 'end': 68}, 
  {'entity_group': 'PLAYER', 'score': 0.9935433, 'word': 'S. Joseph', 'start': 76, 'end': 84}, 
  {'entity_group': 'PLAYER', 'score': 0.9919572, 'word': 'K. Van Noy', 'start': 86, 'end': 95},
  {'entity_group': 'PLAYER', 'score': 0.9934915, 'word': 'S. Joseph', 'start': 107, 'end': 115}, 
  {'entity_group': 'TEAM', 'score': 0.97411484, 'word': 'LAC', 'start': 134, 'end': 137}, 
  {'entity_group': 'QUANTITY', 'score': 0.9710606, 'word': '34', 'start': 138, 'end': 140}
]
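
Notice that the `word` field does not always match the source text (`'6 : 51'` here, while the sentence contains `6:51`). It appears to be rebuilt from the model's tokens, so the `start` and `end` offsets are the safer way back to the original string. A minimal sketch, using three entries copied from the output above and the example sentence used later in this post:

```python
text = '(6:51 - 1st) (Shotgun) P.Mahomes scrambles right end to LAC 34 for 2 yards (S.Joseph; K.Van Noy). FUMBLES (S.Joseph), and recovers at LAC 34.'

# Three entries copied from the classifier output above.
response = [
    {'entity_group': 'TIME', 'score': 0.9888856, 'word': '6 : 51', 'start': 1, 'end': 5},
    {'entity_group': 'PLAYER', 'score': 0.9936474, 'word': 'P. Mahomes', 'start': 23, 'end': 32},
    {'entity_group': 'PLAYER', 'score': 0.9919572, 'word': 'K. Van Noy', 'start': 86, 'end': 95},
]

# Slice the source text by character offsets instead of trusting 'word'.
spans = [(r['entity_group'], text[r['start']:r['end']]) for r in response]

for label, surface in spans:
    print(label, repr(surface))
```

This is why the conversion code in the next section extracts the entity text from its location rather than reusing `word`.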

Convert Response to IOB2

Convert the classifier response into a list of Entity objects, then pass the text and entities to the labeler to get the IOB2 format.

This process relies on the extr-ds library (GitHub repository).

pip install extr-ds

from typing import Any, Dict, List
import nltk
from extr import Entity, Location
from extr_ds.labelers.iob import Labeler


# The labeler tokenizes with NLTK; word_tokenize needs NLTK's tokenizer data to be downloaded.
labeler = Labeler(nltk.tokenize.word_tokenize)

examples = [
    '(6:51 - 1st) (Shotgun) P.Mahomes scrambles right end to LAC 34 for 2 yards (S.Joseph; K.Van Noy). FUMBLES (S.Joseph), and recovers at LAC 34.',
]

threshold = .5
annotations = []
for text in examples:
    entities = []
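    # Keep only predictions at or above the confidence threshold.
    for response in filter(lambda r: r['score'] >= threshold, classifier(text)):
        # Use the character offsets to locate the entity in the source text.
        location = Location(start=response['start'], end=response['end'])
        entities.append(
            Entity(
                len(entities) + 1,
                label=response['entity_group'],
                text=location.extract(text),
                location=location
            )
        )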

    iobs = []
    for grouping in labeler.label(text, entities):
        iobs.append({
            'tokens': [tk.text for tk in grouping.tokens],
            'labels': grouping.labels,
        })

    annotations.append({
        'text': text,
        'iob': iobs
    })
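
For intuition about what happens to those character offsets, the span-to-tag alignment can be sketched without any library. This is not how extr-ds is implemented. It is a minimal sketch that assumes a plain whitespace tokenizer and a punctuation-free sample sentence; `tokenize_with_offsets` and `to_iob2` are hypothetical helpers written for illustration:

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end is exclusive


def tokenize_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Whitespace tokenizer (an illustrative stand-in for word_tokenize)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r'\S+', text)]


def to_iob2(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag each token B-/I-/O depending on whether it lies inside a span."""
    tagged = []
    for token, start, end in tokenize_with_offsets(text):
        tag = 'O'
        for span_start, span_end, label in spans:
            if start >= span_start and end <= span_end:
                # The first token of a span gets B-, the following ones get I-.
                prefix = 'B' if start == span_start else 'I'
                tag = f'{prefix}-{label}'
                break
        tagged.append((token, tag))
    return tagged


sample = 'K.Van Noy recovers at LAC 34'
sample_spans = [(0, 9, 'PLAYER'), (22, 25, 'TEAM'), (26, 28, 'QUANTITY')]
result = to_iob2(sample, sample_spans)

for token, tag in result:
    print(token, tag)
```

A whitespace tokenizer leaves punctuation attached to words, so a token such as `(S.Joseph;` would not fit inside its span. That is one reason to use a real word tokenizer, as the code above does.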

IOB2 Format

[
  {
    "text": "(6:51 - 1st) (Shotgun) P.Mahomes scrambles right end to LAC 34 for 2 yards (S.Joseph; K.Van Noy). FUMBLES (S.Joseph), and recovers at LAC 34.",
    "iob": [
      {
        "tokens": ["(", "6:51", "-", "1st", ")", "(", "Shotgun", ")", "P.Mahomes", "scrambles", "right", "end", "to", "LAC", "34", "for", "2", "yards", "(", "S.Joseph", ";", "K.Van", "Noy", ")", ".", "FUMBLES", "(", "S.Joseph", ")", ",", "and", "recovers", "at", "LAC", "34", "."],
        "labels": ["O", "B-TIME", "O", "B-PERIOD", "O", "O", "B-FORMATION", "O", "B-PLAYER", "B-EVENT", "B-DIRECTION", "O", "O", "B-TEAM", "B-QUANTITY", "O", "B-QUANTITY", "O", "O", "B-PLAYER", "O", "B-PLAYER", "I-PLAYER", "O", "O", "O", "O", "B-PLAYER", "O", "O", "O", "O", "O", "B-TEAM", "B-QUANTITY", "O"]
      }
    ]
  }
]
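
A quick way to sanity-check the result is to read the labels back into entities and compare them with what you expect. A minimal sketch, where `collect_entities` is a hypothetical helper and the tokens and labels are a slice of the output above:

```python
from typing import List, Optional, Tuple


def collect_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- runs back into (label, text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Store the entity collected so far, if any.
        if current_label is not None and current_tokens:
            entities.append((current_label, ' '.join(current_tokens)))

    for token, tag in zip(tokens, labels):
        if tag.startswith('B-'):
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # 'O', or an I- tag that does not continue the open entity.
            flush()
            current_label, current_tokens = None, []
    flush()
    return entities


tokens = ['(', 'S.Joseph', ';', 'K.Van', 'Noy', ')', '.']
labels = ['O', 'B-PLAYER', 'O', 'B-PLAYER', 'I-PLAYER', 'O', 'O']
entities = collect_entities(tokens, labels)
print(entities)
```

Tokens are rejoined with single spaces here, so the recovered text is only an approximation of the original when the source spacing differs.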