BIO / IOB Tagged Text to Original Text

Jeril Kuriakose
Published in Analytics Vidhya
3 min read · Dec 19, 2019

In this post we will see how to convert BIO tagged text back to the original text. The BIO / IOB format (short for inside, outside, beginning) is a common format for tagging tokens in chunking tasks in computational linguistics (e.g. named-entity recognition). A B- prefix before a tag indicates that the token begins a chunk, and an I- prefix indicates that the token is inside a chunk. In the original IOB scheme, the B- prefix is used only when a chunk immediately follows another chunk of the same type with no O tokens between them; in the widely used IOB2 variant, which the example in this post follows, every chunk starts with B-. An O tag indicates that a token belongs to no entity / chunk.
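To make the difference between the two B- conventions concrete, here is a minimal sketch that rewrites tags from the original IOB scheme into the IOB2 style. The function name iob1_to_iob2 is hypothetical and not part of NLTK; the sketch assumes every tag is either 'O' or has the form 'B-type' / 'I-type'.

```python
def iob1_to_iob2(tags):
    """Rewrite IOB1 tags so that every chunk starts with a B- tag (IOB2)."""
    converted = []
    prev_type = None  # chunk type of the previous token; None after an 'O'
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            prev_type = None
            continue
        prefix, _, chunk_type = tag.partition("-")
        # An I- tag that does not continue a chunk of the same type
        # is really the start of a new chunk, so it becomes B-.
        if prefix == "I" and prev_type != chunk_type:
            tag = "B-" + chunk_type
        converted.append(tag)
        prev_type = chunk_type
    return converted


print(iob1_to_iob2(["O", "I-geo", "O", "I-per", "B-per", "I-per"]))
```

Tags that already follow the IOB2 style pass through unchanged, so the function is safe to apply when you are unsure which variant a dataset uses.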

The following figure shows what a BIO tagged sentence looks like:
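In code, a BIO tagged sentence is simply two parallel lists, one of tokens and one of tags. The sketch below prints a slice of the sentence used later in this post as token / tag pairs; sample_tokens, sample_tags and pairs are illustrative names only.

```python
# A slice of the example sentence, with one tag per token.
sample_tokens = ["in", "New", "York", ",", "Prime", "Minister",
                 "Fouad", "Siniora", "said"]
sample_tags = ["O", "B-geo", "I-geo", "O", "B-per", "O",
               "B-per", "I-per", "O"]

# Pair every token with its tag and print them side by side.
pairs = list(zip(sample_tokens, sample_tags))
for token, tag in pairs:
    print(f"{token:<10}{tag}")
```

Reading down the tag column, each B- tag opens a chunk, the I- tags that follow extend it, and O marks tokens outside any chunk.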

Models such as Bidirectional Encoder Representations from Transformers (BERT) are commonly fine-tuned for named-entity recognition on BIO tagged data, and the predictions they return are BIO tagged as well. To use those predictions we have to convert them back to the original text, that is, to the entity strings and their labels, and in this post we will see how to do that.

Dependencies

nltk

First let’s import the dependencies:

from nltk import pos_tag
from nltk.tree import Tree
from nltk.chunk import conlltags2tree

Now we create variables to store the sentence and label.

tokens = ['In', 'Beirut', ',', 'a', 'string', 'of', 'officials',
'voiced', 'their', 'anger', ',', 'while', 'at',
'the', 'United', 'Nations', 'summit', 'in', 'New',
'York', ',', 'Prime', 'Minister', 'Fouad', 'Siniora',
'said', 'the', 'Lebanese', 'people', 'are', 'resolute',
'in', 'preventing', 'such', 'attempts', 'from',
'destroying', 'their', 'spirit', '.']
tags = ['O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
'O', 'O', 'O', 'O', 'B-org', 'I-org', 'O', 'O', 'B-geo',
'I-geo', 'O', 'B-per', 'O', 'B-per', 'I-per', 'O', 'O',
'B-gpe', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
'O', 'O', 'O']

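Next we tag each token with its part of speech, because conlltags2tree expects (word, POS tag, chunk tag) triples. Note that pos_tag needs NLTK's perceptron tagger model, which can be fetched once with nltk.download('averaged_perceptron_tagger') (recent NLTK releases name the resource 'averaged_perceptron_tagger_eng').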

pos_tags = [pos for token, pos in pos_tag(tokens)]

Then we convert the BIO / IOB tags to a tree:

conlltags = [(token, pos, tg) for token, pos, tg in zip(tokens, pos_tags, tags)]
ne_tree = conlltags2tree(conlltags)
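One thing to watch for in this step: zip stops at the shortest input, so if the token and tag lists differ in length (for example after subword tokenization) the extra items are dropped silently. The sketch below is a small helper, with the hypothetical name build_conll_triples, that checks the lengths before building the triples. It also fills in a placeholder POS tag when none is supplied, on the assumption that conlltags2tree only copies the POS tag into the tree leaves and does not inspect it; that assumption is worth verifying against your NLTK version.

```python
def build_conll_triples(tokens, tags, pos_tags=None, placeholder_pos="X"):
    """Return (token, pos, tag) triples after checking that the inputs align."""
    if len(tokens) != len(tags):
        raise ValueError(
            f"{len(tokens)} tokens but {len(tags)} tags; inputs must align"
        )
    if pos_tags is None:
        # No POS tags available: use a placeholder for every token.
        pos_tags = [placeholder_pos] * len(tokens)
    elif len(pos_tags) != len(tokens):
        raise ValueError(
            f"{len(tokens)} tokens but {len(pos_tags)} POS tags; inputs must align"
        )
    return list(zip(tokens, pos_tags, tags))


print(build_conll_triples(["New", "York"], ["B-geo", "I-geo"]))
```

The returned list has the same shape as the conlltags list built above, so it can be used in its place.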

Finally we walk the tree to get our original text:

original_text = []
for subtree in ne_tree:
    # tokens tagged 'O' stay as plain (token, pos) tuples; only chunks become Tree nodes
    if isinstance(subtree, Tree):
        original_label = subtree.label()
        original_string = " ".join(token for token, pos in subtree.leaves())
        original_text.append((original_string, original_label))
print(original_text)

Output

[('Beirut', 'geo'),
('United Nations', 'org'),
('New York', 'geo'),
('Prime', 'per'),
('Fouad Siniora', 'per'),
('Lebanese', 'gpe')]
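For comparison, the same extraction can be sketched without NLTK by walking the two lists directly. This is not the approach used in this post, only a dependency-free alternative; the function name bio_to_entities is hypothetical, and the sketch makes the design choice of treating an I- tag that does not continue the current chunk as the start of a new chunk.

```python
def bio_to_entities(tokens, tags):
    """Collect (entity_text, label) pairs from parallel token and BIO tag lists."""
    entities = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # An I- tag with the same label extends the chunk that is open.
        if prefix == "I" and label == current_label:
            current_tokens.append(token)
            continue
        # Any other tag closes the open chunk, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        # B- (or a stray I-) opens a new chunk; 'O' opens nothing.
        if prefix in ("B", "I"):
            current_tokens, current_label = [token], label
    # Flush a chunk that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


example_tokens = ["In", "New", "York", ",", "Prime", "Minister", "Fouad", "Siniora"]
example_tags = ["O", "B-geo", "I-geo", "O", "B-per", "O", "B-per", "I-per"]
print(bio_to_entities(example_tokens, example_tags))
```

Because it skips the POS tagging step, this version needs no tagger model, at the cost of not producing a tree that other NLTK tools can consume.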

The entire code is as follows:

from nltk import pos_tag
from nltk.tree import Tree
from nltk.chunk import conlltags2tree
tokens = ['In', 'Beirut', ',', 'a', 'string', 'of', 'officials',
'voiced', 'their', 'anger', ',', 'while', 'at',
'the', 'United', 'Nations', 'summit', 'in', 'New',
'York', ',', 'Prime', 'Minister', 'Fouad', 'Siniora',
'said', 'the', 'Lebanese', 'people', 'are', 'resolute',
'in', 'preventing', 'such', 'attempts', 'from',
'destroying', 'their', 'spirit', '.']
tags = ['O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
'O', 'O', 'O', 'O', 'B-org', 'I-org', 'O', 'O', 'B-geo',
'I-geo', 'O', 'B-per', 'O', 'B-per', 'I-per', 'O', 'O',
'B-gpe', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
'O', 'O', 'O']
# tag each token with pos
pos_tags = [pos for token, pos in pos_tag(tokens)]
# convert the BIO / IOB tags to tree
conlltags = [(token, pos, tg) for token, pos, tg in zip(tokens, pos_tags, tags)]
ne_tree = conlltags2tree(conlltags)
# parse the tree to get our original text
original_text = []
for subtree in ne_tree:
    # tokens tagged 'O' stay as plain (token, pos) tuples; only chunks become Tree nodes
    if isinstance(subtree, Tree):
        original_label = subtree.label()
        original_string = " ".join(token for token, pos in subtree.leaves())
        original_text.append((original_string, original_label))
print(original_text)

This way we can convert the BIO / IOB tags to the original text.

You can find the repo here.

Happy Coding !!!
