Building Custom Relation Extraction (RE) Models — Part 2


This is a complete two-part walk-through in which we start with a raw dataset, iteratively annotate and label it programmatically, and finish with a trained Relation Extraction (RE) model.

The first part in this series went through the process of programmatically building out a custom labeled dataset.

We will now fine-tune a transformer using the custom labeled dataset from Part 1 to classify relationships between two named entities.

Data

To keep things simple, we will build a model that classifies a single relationship, r("TEAM", "QUANTITY"). The e1 entity is annotated as <e1:TEAM> and the e2 entity as <e2:QUANTITY>.

[
  {
    "sentence": "(1:04 - 3rd) (Shotgun) M.Jones pass short left to T.Thornton to <e1:TEAM>NE</e1:TEAM> 42 for <e2:QUANTITY>5</e2:QUANTITY> yards (A.Hamilton).",
    "label": "NO_RELATION",
    "definition": "r(\"TEAM\", \"QUANTITY\")"
  },
  {
    "sentence": "(1:04 - 3rd) (Shotgun) M.Jones pass short left to T.Thornton to <e1:TEAM>NE</e1:TEAM> <e2:QUANTITY>42</e2:QUANTITY> for 5 yards (A.Hamilton).",
    "label": "is_spot_of_ball",
    "definition": "r(\"TEAM\", \"QUANTITY\")"
  },
  ...
]
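
Before training, it is worth checking how balanced the labels are. A minimal sketch, assuming the rels.json file produced in Part 1 lives at 4/rels.json:

import json
from collections import Counter

# Tally how often each relation label appears in the labeled dataset.
with open('4/rels.json', encoding='utf-8') as handle:
    rels = json.load(handle)

print(Counter(row['label'] for row in rels))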

Libraries

pip install extr-ds
pip install tensorflow
pip install transformers
pip install datasets
pip install evaluate

Global Settings

We will be fine-tuning the bert-base-cased checkpoint.


epochs = 5
model_checkpoint = 'bert-base-cased'
model_output_checkpoint = 'transformers/nfl_pbp_relation_classifier'

labels = ['NO_RELATION', 'is_spot_of_ball']
num_labels = len(labels)
label2id = { label:i for i, label in enumerate(labels) }
id2label = { i:label for i, label in enumerate(labels) }

Tokenizer and Model Setup

The tokenizer for bert-base-cased does not know our annotation markers and would split them into subword pieces. By adding them to the vocabulary, the tokenizer treats each marker as a single token. Whenever we add tokens, we also need to let the model know via resize_token_embeddings so it can grow its embedding matrix.

from transformers import AutoTokenizer, \
    TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(
    model_checkpoint,
    use_fast=True,
)

# Register the annotation markers so they are kept as single tokens.
tokenizer.add_tokens([
    '<e1:TEAM>',
    '</e1:TEAM>',
    '<e2:QUANTITY>',
    '</e2:QUANTITY>',
])

model = TFAutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

# The vocabulary grew, so the embedding matrix must grow with it.
model.resize_token_embeddings(len(tokenizer))
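
To see the effect of add_tokens, an optional sanity check (not in the original post): tokenize an annotated span and confirm the markers come back intact instead of being shattered into subword pieces.

# The added markers survive tokenization as single tokens.
print(tokenizer.tokenize('<e1:TEAM>NE</e1:TEAM>'))
# e.g. ['<e1:TEAM>', 'NE', '</e1:TEAM>']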

Train / Test Datasets

We need to get our JSON file into a format the transformers library expects. We shuffle and split the dataset, convert each split into a Dataset object, and tokenize the instances; padding is deferred to the data collator at batch time.

import json
import os
import random

from datasets import Dataset
from transformers import DataCollatorWithPadding
from extr_ds.manager.utils.filesystem import load_document

def get_dataset(tokenizer):
    def tokenize_data(item):
        # Truncate to the model's max length; padding happens per batch.
        return tokenizer(item['text'], truncation=True)

    data_collator = DataCollatorWithPadding(
        tokenizer,
        return_tensors='tf'
    )

    rels = json.loads(
        load_document(os.path.join('4', 'rels.json'))
    )

    random.shuffle(rels)

    data = [
        {
            'text': row['sentence'],
            'label': label2id[row['label']]
        }
        for row in rels
    ]

    # 80 / 20 train / test split.
    n = len(rels)
    split_point = .8
    pivot = int(n * split_point)
    print('len#:', n, 'pivot:', pivot)

    columns = ['attention_mask', 'input_ids']
    label_cols = 'label'

    train_dataset = Dataset.from_list(data[:pivot])
    tf_train_set = train_dataset.map(
        tokenize_data,
        batched=True
    ).to_tf_dataset(
        columns=columns,
        label_cols=label_cols,
        shuffle=True,
        batch_size=16,
        collate_fn=data_collator,
    )

    test_dataset = Dataset.from_list(data[pivot:])
    tf_test_set = test_dataset.map(
        tokenize_data,
        batched=True
    ).to_tf_dataset(
        columns=columns,
        label_cols=label_cols,
        shuffle=True,
        batch_size=16,
        collate_fn=data_collator,
    )

    return tf_train_set, tf_test_set
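
Once get_dataset(tokenizer) has been called (as in the training section below), an optional way to verify the pipeline is to peek at one collated batch:

# Each batch is a (features, labels) pair: a dict with 'input_ids' and
# 'attention_mask' tensors plus a label tensor, batch size 16.
for features, batch_labels in tf_train_set.take(1):
    print(features['input_ids'].shape, batch_labels.shape)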

Callbacks

We create a callback that computes and prints validation-set accuracy after each epoch.

import numpy
import evaluate
from transformers.keras_callbacks import KerasMetricCallback

load_accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair; argmax picks the predicted class.
    predictions, labels = eval_pred
    predictions = numpy.argmax(predictions, axis=1)
    return load_accuracy.compute(
        predictions=predictions,
        references=labels
    )

# tf_test_set is produced by get_dataset(tokenizer) in the next section.
callbacks = [
    KerasMetricCallback(
        metric_fn=compute_metrics,
        eval_dataset=tf_test_set,
    )
]
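
As a quick smoke test of compute_metrics itself, with made-up inputs (not from the original post):

import numpy

# Two examples: logits favor class 1 then class 0; true labels are both 1.
dummy_logits = numpy.array([[0.1, 0.9], [0.8, 0.2]])
dummy_labels = numpy.array([1, 1])
print(compute_metrics((dummy_logits, dummy_labels)))
# {'accuracy': 0.5}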

Compile / Fit Model

The default Adam learning rate in Keras (1e-3) is too high for fine-tuning a pretrained transformer; we drop it to 2e-5 for stable convergence. We also compile without an explicit loss, so the transformers model falls back to its internal loss computation.

import tensorflow as tf

tf_train_set, tf_test_set = get_dataset(tokenizer)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5)
)

model.fit(
    x=tf_train_set,
    validation_data=tf_test_set,
    epochs=epochs,
    callbacks=callbacks
)

Save Tokenizer / Model

for model_to_save in [tokenizer, model]:
    model_to_save.save_pretrained(model_output_checkpoint)

Invoking Custom Model

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model=model_output_checkpoint,
    top_k=None
)

examples = [
    '(4:11 - 3rd) (Shotgun) K.Murray pass deep middle to Z.Ertz to <e1:TEAM>SEA</e1:TEAM> <e2:QUANTITY>43</e2:QUANTITY> for 32 yards (R.Neal).',
    '(15:00 - 3rd) (Shotgun) T.Siemian sacked at <e1:TEAM>CHI</e1:TEAM> 18 for <e2:QUANTITY>-7</e2:QUANTITY> yards (sack split by N.Shepherd and J.Franklin-Myers).'
]

responses = classifier(examples)
print(responses)
[
  [
    {'label': 'is_spot_of_ball', 'score': 0.9583994746208191},
    {'label': 'NO_RELATION', 'score': 0.041600555181503296}
  ],
  [
    {'label': 'NO_RELATION', 'score': 0.995627760887146},
    {'label': 'is_spot_of_ball', 'score': 0.004372249357402325}
  ]
]
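
With top_k=None, the pipeline returns a score for every label. A small helper (my naming, not from the original post) reduces each response to its best label:

def top_label(response):
    # Each response is a list of {'label', 'score'} dicts; keep the best one.
    return max(response, key=lambda item: item['score'])['label']

print([top_label(response) for response in responses])
# ['is_spot_of_ball', 'NO_RELATION']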

Code can be found in this rels.py file.

