Building Custom Relation Extraction (RE) Models — Part 1

dp · May 4, 2023

This aims to be a complete two-part walk-through where we start with a dataset, iteratively annotate/label it programmatically, and finish up with a transformer-based Relation Extraction (RE) model.

This first post only goes through the process of building out a custom labeled dataset, ultimately ending up with a Rule-Based Relation Extraction model. The second part will take the file created here and fine-tune a transformer model to classify the relationships.

Workflow / Process

  1. Identify Named-Entities
  2. Programmatically classify relationship
  3. Inspect classifications
  4. Save dataset
  5. Repeat

This entire process will be managed through the command line using the extr-ds library (Github Repository).

pip install extr-ds

Entities

Before we can classify a relationship, we need entities. Labeling/Extracting entities was previously covered in another post. That prior process allowed us to build a decent Rule-Based Named-Entity Recognition (NER) model that we will leverage here.

Define Relationships

A relationship is a labeled, ordered pair of entities: r(e1, e2) == <label>, where e1 and e2 are entities.

extr-config.json

  • Each instance we try to label can produce many relation candidates, so it is recommended to keep the amount we observe per round low.

{
    ...,
    "split": {
        "amount": 5
    },
    ...,
}

labels.py

In the same file where we specified our entity patterns, we will also set up our relationships.

  • relation_defaults — This list of tuples specifies which e1 and e2 labels go together and what label to apply when both exist but a relationship was not determined. Only the relationships in this list will be labeled. It may make sense to have only one active at a time, commenting out the ones you are not actively working on.
relation_defaults: List[Tuple[str, str, str]] = [
    ## (e1, e2, label)
    ('PERIOD', 'TIME', 'NO_RELATION'),
    ('TEAM', 'QUANTITY', 'NO_RELATION'),
]
  • relation_patterns — This list specifies the search patterns between e1 and e2, and what to call that relationship if found, i.e. r('PERIOD', 'TIME') = 'is_at'.
relation_patterns: List[RegExLabel] = [
    RegExRelationLabelBuilder('is_at') \
        .add_e2_to_e1(
            e2='TIME',
            relation_expressions=[
                r'(\s-\s)',
            ],
            e1='PERIOD'
        ) \
        .build(),
    RegExRelationLabelBuilder('is_spot_of_ball') \
        .add_e1_to_e2(
            e1='TEAM',
            relation_expressions=[
                r'\s+',
            ],
            e2='QUANTITY',
        ) \
        .build()
]

Classify Instances

Similar to building NER datasets, run the --split command to start. This will split, annotate, and label a small subset of data. All output can be found in the /3 directory.

extr-ds --split
  • dev-rels.json — JSON dataset of annotations and labels. e1 and e2 are annotated in the sentence to mark which entities we want to classify.

{
    "sentence": "(<e2:TIME>0:24</e2:TIME> - <e1:PERIOD>3rd</e1:PERIOD>) (No Huddle, Shotgun) PENALTY on ARZ - D.Williams, False Start, 5 yards, enforced at ARZ 30 - No Play.",
    "label": "is_at",
    "definition": "r(\"PERIOD\", \"TIME\")"
},
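For a quick sanity check between rounds, you can tally the labels in that output. A small sketch (the file path and exact schema are assumed from the snippet above; here an in-memory sample stands in for reading the file from the /3 directory):

```python
import json
from collections import Counter

# In practice: rows = json.load(open('3/dev-rels.json', encoding='utf-8'))
rows = json.loads('''[
  {"sentence": "...", "label": "is_at",       "definition": "r(\\"PERIOD\\", \\"TIME\\")"},
  {"sentence": "...", "label": "NO_RELATION", "definition": "r(\\"PERIOD\\", \\"TIME\\")"},
  {"sentence": "...", "label": "is_at",       "definition": "r(\\"PERIOD\\", \\"TIME\\")"}
]''')

counts = Counter(row['label'] for row in rows)
print(counts)
# -> Counter({'is_at': 2, 'NO_RELATION': 1})
```

A skewed count (e.g. everything landing in NO_RELATION) is usually the first hint that a relation expression needs work.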
  • dev-rels.html — html page for a more natural way to inspect the outcomes.

Inspect Classifications

The easiest way to do this is to view the dev-rels.html file in Visual Studio Code / browser, similar to entities in the previous post.

dev-rels.html

During inspection, you will likely come across mislabeled examples (see above). In the case above, you notice that rows #23 and #25 should be 'is_at' instead of 'NO_RELATION'. To fix this, we can either update our rules in labels.py and run the --annotate command, or we can update the label through the command line.

extr-ds --relate -label is_at=23,25
dev-rels.html after label fix

To ignore a row,

extr-ds --relate -delete 0,3,6

To undo the delete,

extr-ds --relate -recover 0,3,6

To reset after rule changes,

extr-ds --annotate -rels

Save Data

When everything looks fine,

extr-ds --save -rels

which will append what we just inspected to rels.json in the /4 directory. If the same instance comes in but is labeled differently, a message is logged and the instance is ignored.
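The conflict check described above can be sketched roughly like this (this is not the extr-ds internals, just the idea): when appending newly inspected rows, skip any instance that is already stored under a different label.

```python
def merge_rows(existing, incoming):
    """Append incoming rows to existing ones, ignoring duplicates and
    logging (then skipping) any instance stored with a different label."""
    seen = {(row['sentence'], row['definition']): row['label'] for row in existing}
    merged = list(existing)
    for row in incoming:
        key = (row['sentence'], row['definition'])
        if key in seen:
            if seen[key] != row['label']:
                print('conflicting label, skipping:', row['sentence'][:40])
            continue  # duplicate or conflict - never append twice
        merged.append(row)
        seen[key] = row['label']
    return merged
```

Keying on the sentence plus the relation definition means the same sentence can still appear once per relationship type.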

At this point, if you iteratively updated your labels.py file, you may have ended up with a pretty decent Rule-Based Relation Extraction model.

In the next post, we will go over fine-tuning a transformer model to classify the relationships between specific entities using the dataset we just built.
