This is a complete two-part walk-through: we start with a dataset, iteratively annotate and label it programmatically, and finish with a Relation Extraction (RE) model.
The first part of the series covered programmatically building a custom labeled dataset.
In this part, we fine-tune a transformer on the custom labeled dataset from Part 1 to classify the relationship between two named entities.
Data
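To keep things simple, we will build a model that classifies a single relationship type, r("TEAM", "QUANTITY"). In each sentence, the first entity (e1) is wrapped in <e1:TEAM> tags and the second entity (e2) is wrapped in <e2:QUANTITY> tags. The same sentence can appear more than once with different entity pairs tagged, each with its own label.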
[
    {
        "sentence": "(1:04 - 3rd) (Shotgun) M.Jones pass short left to T.Thornton to <e1:TEAM>NE</e1:TEAM> 42 for <e2:QUANTITY>5</e2:QUANTITY> yards (A.Hamilton).",
        "label": "NO_RELATION",
        "definition": "r(\"TEAM\", \"QUANTITY\")"
    },
    {
        "sentence": "(1:04 - 3rd) (Shotgun) M.Jones pass short left to T.Thornton to <e1:TEAM>NE</e1:TEAM> <e2:QUANTITY>42</e2:QUANTITY> for 5 yards (A.Hamilton).",
        "label": "is_spot_of_ball",
        "definition": "r(\"TEAM\", \"QUANTITY\")"
    },
    ...
]
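To make the annotation format concrete, here is a minimal sketch that pulls the two tagged entities out of a sentence with a regular expression. The helper name `extract_entities` and the pattern are illustrative assumptions; they are not part of extr-ds or of the original pipeline.

```python
import re

# Hypothetical pattern: matches <e1:TYPE>text</e1:TYPE> and <e2:TYPE>text</e2:TYPE>.
# The backreferences (\1, \2) require the closing tag to match the opening tag.
TAG_PATTERN = re.compile(r'<(e[12]):([A-Z_]+)>(.*?)</\1:\2>')

def extract_entities(sentence):
    """Return {slot: (entity_type, surface_text)} for each tagged entity."""
    return {
        slot: (entity_type, text)
        for slot, entity_type, text in TAG_PATTERN.findall(sentence)
    }

sentence = (
    '(1:04 - 3rd) (Shotgun) M.Jones pass short left to T.Thornton to '
    '<e1:TEAM>NE</e1:TEAM> <e2:QUANTITY>42</e2:QUANTITY> for 5 yards (A.Hamilton).'
)
entities = extract_entities(sentence)
print(entities)
```

A check like this can be useful for spotting malformed rows before training, since a sentence missing either tag gives the classifier nothing to anchor on.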
Libraries
pip install extr-ds
pip install tensorflow
pip install transformers
pip install datasets
pip install evaluate
Global Settings
We will be fine-tuning the bert-base-cased checkpoint.
epochs = 5
model_checkpoint = 'bert-base-cased'
model_output_checkpoint = 'transformers/nfl_pbp_relation_classifier'
labels = ['NO_RELATION', 'is_spot_of_ball']
num_labels = len(labels)
label2id = { label:i for i, label in enumerate(labels) }
id2label = { i:label for i, label in enumerate(labels) }
Tokenizer and Model Setup
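The tokenizer for bert-base-cased does not know our annotation tags, so by default it would split them into several word pieces. Adding them with add_tokens makes the tokenizer treat each tag as a single token. Because this grows the vocabulary, we also need to resize the model's embedding matrix to match by calling resize_token_embeddings.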
from transformers import AutoTokenizer, \
    TFAutoModelForSequenceClassification

# Truncation and padding are call-time options, so they are not passed
# when the tokenizer is loaded.
tokenizer = AutoTokenizer.from_pretrained(
    model_checkpoint,
    use_fast=True,
)

# Register the entity tags so that each one is kept as a single token.
tokenizer.add_tokens([
    '<e1:TEAM>',
    '</e1:TEAM>',
    '<e2:QUANTITY>',
    '</e2:QUANTITY>',
])

model = TFAutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

# The vocabulary grew, so the embedding matrix must grow with it.
model.resize_token_embeddings(len(tokenizer))
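Conceptually, resizing the embeddings appends one new row per added token while leaving the pretrained rows untouched. The following is a toy sketch of that idea in plain Python. `resize_embeddings` is a hypothetical stand-in with a made-up three-word vocabulary; the real resize_token_embeddings operates on the model's weight tensors and initializes the new rows in its own way.

```python
import random

def resize_embeddings(matrix, new_vocab_size, seed=0):
    """Toy stand-in: keep existing rows, append small random rows for new tokens."""
    rng = random.Random(seed)
    dim = len(matrix[0])
    # Copy the rows that already exist so the pretrained values are preserved.
    resized = [row[:] for row in matrix[:new_vocab_size]]
    # Append one freshly initialized row per new vocabulary entry.
    while len(resized) < new_vocab_size:
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized

vocab = ['[PAD]', 'pass', 'yards']  # toy vocabulary, not BERT's
embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]

added = ['<e1:TEAM>', '</e1:TEAM>', '<e2:QUANTITY>', '</e2:QUANTITY>']
vocab += added
embeddings = resize_embeddings(embeddings, len(vocab))

print(len(vocab), len(embeddings))
```

The new rows carry no learned meaning at this point; they only become useful once fine-tuning updates them.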
Train / Test Datasets
We need to get our JSON file into a format that the transformers library can consume. We shuffle and split the records, convert each split into a Dataset object, and tokenize the instances. Padding is applied per batch by the data collator.
import json
import os
import random

from datasets import Dataset
from transformers import DataCollatorWithPadding
from extr_ds.manager.utils.filesystem import load_document

def get_dataset(tokenizer):
    def tokenize_data(item):
        # Truncate anything longer than the model's maximum input length.
        return tokenizer(item["text"], truncation=True)

    # Pads each batch to the length of its longest sequence.
    data_collator = DataCollatorWithPadding(
        tokenizer,
        return_tensors='tf'
    )

    rels = json.loads(
        load_document(os.path.join('4', 'rels.json'))
    )
    random.shuffle(rels)

    data = [
        {
            'text': row['sentence'],
            'label': label2id[row['label']]
        }
        for row in rels
    ]

    # 80 / 20 train / test split.
    n = len(rels)
    split_point = .8
    pivot = int(n * split_point)
    print('len#:', n, 'pivot:', pivot)

    columns = ['attention_mask', 'input_ids']
    label_cols = 'label'

    train_dataset = Dataset.from_list(data[:pivot])
    tf_train_set = train_dataset.map(
        tokenize_data,
        batched=True
    ).to_tf_dataset(
        columns=columns,
        label_cols=label_cols,
        shuffle=True,
        batch_size=16,
        collate_fn=data_collator,
    )

    test_dataset = Dataset.from_list(data[pivot:])
    tf_test_set = test_dataset.map(
        tokenize_data,
        batched=True
    ).to_tf_dataset(
        columns=columns,
        label_cols=label_cols,
        shuffle=False,  # evaluation does not need shuffling
        batch_size=16,
        collate_fn=data_collator,
    )

    return tf_train_set, tf_test_set
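The collator pads each batch to the length of its longest sequence rather than to a fixed maximum. Here is a plain-Python sketch of that behavior, assuming a pad id of 0 (the id bert-base-cased uses for [PAD]). `pad_batch` is a hypothetical helper for illustration, not the library implementation, and the token ids are toy values.

```python
def pad_batch(batch, pad_id=0):
    """Pad every input_ids list to the longest one and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        padding = longest - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Toy token ids; real ids come from the tokenizer.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch)
```

Padding per batch keeps short play descriptions from being inflated to the model's maximum length, which would waste compute on padding tokens.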
Callbacks
We create a callback that reports accuracy on the held-out split after each epoch.
import numpy
import evaluate
from transformers.keras_callbacks import KerasMetricCallback

load_accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert logits to predicted class ids.
    predictions = numpy.argmax(predictions, axis=1)
    return load_accuracy.compute(
        predictions=predictions,
        references=labels
    )

# tf_test_set is returned by get_dataset(tokenizer), so that call
# must run before this list is built.
callbacks = [
    KerasMetricCallback(
        metric_fn=compute_metrics,
        eval_dataset=tf_test_set,
    )
]
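What compute_metrics does can be sketched without any libraries: take the highest-scoring class in each row of logits and compare it with the reference label. The function names and the logits below are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {'accuracy': correct / len(labels)}

# Illustrative values only: columns are [NO_RELATION, is_spot_of_ball].
logits = [[2.1, -1.3], [-0.4, 1.7], [0.9, 0.2], [-1.0, 0.5]]
labels = [0, 1, 1, 1]
result = accuracy_from_logits(logits, labels)
print(result)
```

With a label set as small as this one, accuracy alone can hide a model that mostly predicts the majority class, so it may be worth checking the label balance of the test split as well.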
Compile / Fit Model
The default learning rate of the Keras Adam optimizer (1e-3) is too high for fine-tuning a pretrained transformer. We lower it to 2e-5 to get more stable convergence.
import tensorflow as tf

tf_train_set, tf_test_set = get_dataset(tokenizer)

# No loss is passed: the Transformers TF model falls back to its
# built-in loss for the task.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5)
)

model.fit(
    x=tf_train_set,
    validation_data=tf_test_set,
    epochs=epochs,
    callbacks=callbacks
)
Save Tokenizer / Model
for model_to_save in [tokenizer, model]:
    model_to_save.save_pretrained(model_output_checkpoint)
Invoking Custom Model
from transformers import pipeline

# top_k=None returns a score for every label instead of only the best one.
classifier = pipeline(
    "text-classification",
    model=model_output_checkpoint,
    top_k=None
)

examples = [
    '(4:11 - 3rd) (Shotgun) K.Murray pass deep middle to Z.Ertz to <e1:TEAM>SEA</e1:TEAM> <e2:QUANTITY>43</e2:QUANTITY> for 32 yards (R.Neal).',
    '(15:00 - 3rd) (Shotgun) T.Siemian sacked at <e1:TEAM>CHI</e1:TEAM> 18 for <e2:QUANTITY>-7</e2:QUANTITY> yards (sack split by N.Shepherd and J.Franklin-Myers).'
]

responses = classifier(examples)
print(responses)
[
    [
        {'label': 'is_spot_of_ball', 'score': 0.9583994746208191},
        {'label': 'NO_RELATION', 'score': 0.041600555181503296}
    ],
    [
        {'label': 'NO_RELATION', 'score': 0.995627760887146},
        {'label': 'is_spot_of_ball', 'score': 0.004372249357402325}
    ]
]
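Because top_k=None returns a score for every label, downstream code usually needs to pick the best one per example. Here is a short sketch that does so using the response shown above; `top_label` is a hypothetical helper, not part of the transformers API.

```python
def top_label(scores):
    """Return the (label, score) pair with the highest score."""
    best = max(scores, key=lambda item: item['score'])
    return best['label'], best['score']

# The pipeline output from the two examples above.
responses = [
    [
        {'label': 'is_spot_of_ball', 'score': 0.9583994746208191},
        {'label': 'NO_RELATION', 'score': 0.041600555181503296},
    ],
    [
        {'label': 'NO_RELATION', 'score': 0.995627760887146},
        {'label': 'is_spot_of_ball', 'score': 0.004372249357402325},
    ],
]

predicted = [top_label(scores)[0] for scores in responses]
print(predicted)
```

Taking the maximum explicitly avoids relying on the order in which the pipeline happens to return the labels.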
Code can be found in this rels.py file.