Building Custom Relation Extraction (RE) Models — Part 2

3 min read · May 13, 2023

This is the second part of a two-part walk-through in which we start with a dataset, iteratively annotate and label it programmatically, and finish with a Relation Extraction (RE) model.

The first part in this series went through the process of programmatically building out a custom labeled dataset.

We will now fine-tune a transformer on the custom labeled dataset from Part 1 to classify the relationship between two named entities.

Data

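To keep things simple, we will build a model that classifies a single relationship, r("TEAM", "QUANTITY"). The e1 entity is annotated as <e1:TEAM> and the e2 entity is annotated as <e2:QUANTITY>. Note that the two examples below share the same sentence: only the quantity tagged as e2 differs, and the label changes with it.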

[
  {
    "sentence": "(1:04 - 3rd) (Shotgun) M.Jones pass short left to T.Thornton to <e1:TEAM>NE</e1:TEAM> 42 for <e2:QUANTITY>5</e2:QUANTITY> yards (A.Hamilton).",
    "label": "NO_RELATION",
    "definition": "r(\"TEAM\", \"QUANTITY\")"
  },
  {
    "sentence": "(1:04 - 3rd) (Shotgun) M.Jones pass short left to T.Thornton to <e1:TEAM>NE</e1:TEAM> <e2:QUANTITY>42</e2:QUANTITY> for 5 yards (A.Hamilton).",
    "label": "is_spot_of_ball",
    "definition": "r(\"TEAM\", \"QUANTITY\")"
  },
  ...
]
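
To make the annotation format concrete, here is a minimal sketch that pulls the e1 and e2 spans out of an annotated sentence with a regular expression. The helper name parse_annotated and the pattern are illustrative assumptions; they are not part of extr-ds or of the Part 1 pipeline.

```python
import re

# Matches <e1:TEAM>NE</e1:TEAM>-style annotations. The backreferences make
# sure the closing tag carries the same slot (e1/e2) and entity type.
ANNOTATION = re.compile(r'<(e[12]):([A-Z_]+)>(.*?)</\1:\2>')

def parse_annotated(sentence):
    """Hypothetical helper: return entities keyed by slot, plus the plain text."""
    entities = {
        slot: {'type': entity_type, 'text': text}
        for slot, entity_type, text in ANNOTATION.findall(sentence)
    }
    # Drop the markers but keep the annotated text itself.
    plain = ANNOTATION.sub(lambda match: match.group(3), sentence)
    return entities, plain

sentence = (
    '(1:04 - 3rd) (Shotgun) M.Jones pass short left to T.Thornton to '
    '<e1:TEAM>NE</e1:TEAM> <e2:QUANTITY>42</e2:QUANTITY> for 5 yards (A.Hamilton).'
)
entities, plain = parse_annotated(sentence)
print(entities)
print(plain)
```

A check like this can be handy for spotting rows where a marker is missing or malformed before they reach the tokenizer.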

Libraries

pip install extr-ds
pip install tensorflow
pip install transformers
pip install datasets
pip install evaluate

Global Settings

We will be fine-tuning the bert-base-cased checkpoint.

epochs = 5
model_checkpoint = 'bert-base-cased'
model_output_checkpoint = 'transformers/nfl_pbp_relation_classifier'

labels = ['NO_RELATION', 'is_spot_of_ball']
num_labels = len(labels)
label2id = { label:i for i, label in enumerate(labels) }
id2label = { i:label for i, label in enumerate(labels) }

Tokenizer and Model Setup

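The bert-base-cased tokenizer does not know our annotation markers and would split them into several pieces. Adding them with add_tokens makes the tokenizer treat each marker as a single token. Because this grows the vocabulary, the model's embedding matrix has to grow with it, which is what resize_token_embeddings does.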

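from transformers import (
  AutoTokenizer,
  TFAutoModelForSequenceClassification,
)

# Truncation and padding are call-time arguments, so they are not passed here;
# padding is handled per batch by DataCollatorWithPadding later on.
tokenizer = AutoTokenizer.from_pretrained(
  model_checkpoint,
  use_fast=True,
)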

tokenizer.add_tokens([
  '<e1:TEAM>',
  '</e1:TEAM>',
  '<e2:QUANTITY>',
  '</e2:QUANTITY>',
])

model = TFAutoModelForSequenceClassification.from_pretrained(
  model_checkpoint,
  num_labels=num_labels,
  id2label=id2label,
  label2id=label2id,
)

model.resize_token_embeddings(len(tokenizer))

Train / Test Datasets

We need to get our JSON file into a format that transformers likes. We shuffle the rows, split them 80/20 into train and test sets, and convert each split into a Dataset object. From there, we tokenize the sentences and let the data collator pad each batch.

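import json
import os
import random
from datasets import Dataset
from transformers import DataCollatorWithPadding
from extr_ds.manager.utils.filesystem import load_document

def get_dataset(tokenizer):
  def tokenize_data(item):
    # Truncate to the model's maximum length; padding is left to the collator.
    return tokenizer(item["text"], truncation=True)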

  data_collator = DataCollatorWithPadding(
    tokenizer,
    return_tensors='tf'
  )

  rels = json.loads(
    load_document(os.path.join('4', 'rels.json'))
  )

  random.shuffle(rels)

  data = [
    {
      'text': row['sentence'],
      'label': label2id[row['label']]
    }
    for row in rels
  ]

  n = len(rels)
  split_point = .8
  pivot = int(n * split_point)
  print('len#:', n, 'pivot:', pivot)

  columns = ['attention_mask', 'input_ids']
  label_cols = 'label'
  
  train_dataset = Dataset.from_list(data[:pivot])
  tf_train_set = train_dataset.map(
    tokenize_data,
    batched=True
  ).to_tf_dataset(
    columns=columns,
    label_cols=label_cols,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
  )

  test_dataset = Dataset.from_list(data[pivot:])
  # The evaluation set does not need to be shuffled.
  tf_test_set = test_dataset.map(
    tokenize_data,
    batched=True
  ).to_tf_dataset(
    columns=columns,
    label_cols=label_cols,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
  )

  return tf_train_set, tf_test_set
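
Because get_dataset shuffles with an unseeded random.shuffle, the train/test split changes on every run. If reproducible splits matter, the shuffle-and-pivot step could be sketched as follows. The function name split_rows and the seed value are assumptions made for illustration; they do not appear in the original code.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=42):
    """Hypothetical helper: shuffle a copy of rows with a fixed seed, then split."""
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # private RNG, same order every run
    pivot = int(len(shuffled) * train_fraction)
    return shuffled[:pivot], shuffled[pivot:]

# Toy rows in the same {'text', 'label'} shape that get_dataset builds.
rows = [{'text': f'sentence {i}', 'label': i % 2} for i in range(10)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Using a private random.Random instance keeps the seed from affecting any other code that relies on the global random state.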

Callbacks

We create a callback that reports accuracy on the validation set after each epoch.

import numpy
import evaluate
from transformers.keras_callbacks import KerasMetricCallback

load_accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  predictions = numpy.argmax(predictions, axis=1)
  return load_accuracy.compute(
    predictions=predictions,
    references=labels
  )

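# tf_test_set is returned by get_dataset(tokenizer), so that call
# (shown in the next section) has to run before this block.
callbacks = [
  KerasMetricCallback(
    metric_fn=compute_metrics,
    eval_dataset=tf_test_set,
  )
]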
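
To see what compute_metrics does with the model output, here is a small dependency-free sketch of the same two steps: take the argmax of each row of logits, then compare against the reference labels. The logits and labels are made-up values, and accuracy_from_logits is a hypothetical stand-in for the evaluate metric, not its implementation.

```python
def argmax(row):
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Hypothetical stand-in for compute_metrics: argmax, then fraction correct."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {'accuracy': correct / len(labels)}

# Made-up logits for four sentences; column 0 is NO_RELATION, 1 is is_spot_of_ball.
logits = [
    [2.0, -1.0],
    [0.1, 0.3],
    [-0.5, 1.5],
    [1.2, 0.4],
]
labels = [0, 1, 0, 0]

print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```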

Compile / Fit Model

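The default Adam learning rate in Keras (1e-3) is too high for fine-tuning a pretrained transformer, so we lower it to 2e-5 to get better convergence. No loss is passed to compile because the Transformers TensorFlow models fall back to their built-in task loss when none is given.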

import tensorflow as tf

tf_train_set, tf_test_set = get_dataset(tokenizer)

model.compile(
  optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5)
)

model.fit(
  x=tf_train_set,
  validation_data=tf_test_set,
  epochs=epochs,
  callbacks=callbacks
)

Save Tokenizer / Model

for model_to_save in [tokenizer, model]:
  model_to_save.save_pretrained(model_output_checkpoint)

Invoking Custom Model

from transformers import pipeline

classifier = pipeline(
  "text-classification", 
  model=model_output_checkpoint,
  top_k=None
)

examples = [
  '(4:11 - 3rd) (Shotgun) K.Murray pass deep middle to Z.Ertz to <e1:TEAM>SEA</e1:TEAM> <e2:QUANTITY>43</e2:QUANTITY> for 32 yards (R.Neal).',
  '(15:00 - 3rd) (Shotgun) T.Siemian sacked at <e1:TEAM>CHI</e1:TEAM> 18 for <e2:QUANTITY>-7</e2:QUANTITY> yards (sack split by N.Shepherd and J.Franklin-Myers).'
]

responses = classifier(examples)
print(responses)

[
  [
    {'label': 'is_spot_of_ball', 'score': 0.9583994746208191},
    {'label': 'NO_RELATION', 'score': 0.041600555181503296}
  ],
  [
    {'label': 'NO_RELATION', 'score': 0.995627760887146},
    {'label': 'is_spot_of_ball', 'score': 0.004372249357402325}
  ]
]
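
With top_k=None the pipeline returns a score for every label, so downstream code has to choose one. A minimal sketch of that step follows, using the output printed above. The helper top_labels and the 0.9 threshold are assumptions for illustration, not part of the transformers API.

```python
def top_labels(responses, threshold=0.9):
    """Hypothetical helper: best label per example, or None below the threshold."""
    results = []
    for scores in responses:
        best = max(scores, key=lambda item: item['score'])
        label = best['label'] if best['score'] >= threshold else None
        results.append((label, best['score']))
    return results

# The pipeline output shown above, copied verbatim.
responses = [
    [
        {'label': 'is_spot_of_ball', 'score': 0.9583994746208191},
        {'label': 'NO_RELATION', 'score': 0.041600555181503296},
    ],
    [
        {'label': 'NO_RELATION', 'score': 0.995627760887146},
        {'label': 'is_spot_of_ball', 'score': 0.004372249357402325},
    ],
]

for label, score in top_labels(responses):
    print(label, round(score, 3))
```

Returning None for low-confidence predictions makes it easy to route those sentences back to manual review instead of trusting the classifier.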

Code can be found in this rels.py file.