Simple Relation Extraction with a Bi-LSTM Model — Part 2

This article is the final part of a two-part tutorial on Relation Extraction for NLP. You can find the first part here.

Marion Valette
southpigalle
7 min read · Jun 7, 2019


Photo by Mike Enerio on Unsplash

In this second part, we will continue the development of a relation extractor. Remember, in the first part, we defined a Keras Bi-LSTM model to classify sentences from the New York Times. The results weren’t very good, as the dataset was really small. Thankfully, we found a second dataset aggregated by Riedel et al. (2010) with different relations but a lot more data (more than 600,000 sentences).

Second Model

Dataset Overview

Like the first dataset, it contains sentences from NYT articles and, more importantly, a huge number of sentences whose entities aren’t related.
We use the processed version available on the GitHub page of the RESIDE project (another RE model). Some file manipulations are necessary to extract the sentences (encoded with word indices that can be reversed) and the positions of the entities into a pandas.DataFrame. After dropping exact duplicates, here is the class distribution over the whole dataset. 43 classes have fewer than 5,000 examples, so they are grouped into an artificial Other class.

Number of examples in each class

The scale is logarithmic, as the differences between the classes are large. Here the entities aren’t explicitly annotated in the text, but their positions in the sentence are provided.
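As an illustration of the preparation step described above, here is a minimal sketch, assuming the sentences (decoded back from their word indices), the entity positions and the relation labels have already been extracted into Python lists; the column names are placeholders, and the 5,000 threshold comes from the description above:

```python
import pandas as pd

# Gather the extracted fields and drop exact duplicates
df = pd.DataFrame({"sentence": sentences,
                   "entity1_pos": entity1_positions,
                   "entity2_pos": entity2_positions,
                   "relation": relations})
df = df.drop_duplicates()

# Group every class with fewer than 5,000 examples into an artificial "Other" class
counts = df["relation"].value_counts()
rare_classes = counts[counts < 5000].index
df.loc[df["relation"].isin(rare_classes), "relation"] = "Other"
```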

Training

We apply the Bi-LSTM model from Part 1 to this data, restricting the number of sentences without a relation to minimize their impact. As a first try, we choose to sample this class down to 100,000 examples and keep the other classes unchanged. We won’t consider the positions for now, so the model doesn’t know which words in the sentence are the supposedly related entities.
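A minimal sketch of this sampling, reusing the hypothetical df from the snippet above (the label “No relation” is used here for readability; the actual label string depends on the dataset files):

```python
# Keep 100,000 "No relation" examples and leave the other classes untouched
no_relation = df[df["relation"] == "No relation"].sample(n=100_000, random_state=42)
related = df[df["relation"] != "No relation"]

# Recombine and shuffle before training
df_sampled = pd.concat([no_relation, related]).sample(frac=1, random_state=42)
```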

Classification matrix with a small sample of No relation examples

The class Contains (2nd column) seems to add noise to the data, as every class has a lot of sentences predicted as Contains. This can be explained by the fact that it encompasses some other classes like Administrative divisions and Capital. We therefore chose to remove this class from the dataset. Furthermore, some classes appear to overlap, like Capital and Administrative divisions: lots of Capital examples (4th row) are predicted as Administrative divisions (7th column). When we take a look at the sentences in those two classes, they seem very close, as many sentences in the class Administrative divisions link a country and its capital.

Examples of sentences

Capital:
1. much of iraq is out of control, including most of baghdad.
2. harrison freeman matthews jr. was born on dec. 31, 1927, in bogotá, colombia.

Administrative divisions:
1. beijing’s embassy in nairobi arranged for her to visit china to attend zheng he celebrations.
2. ms. ahn arrived in new york from seoul, south_korea, when she was a teenager.

In the final model, these two classes will be merged under Administrative divisions.

Impact of an Unbalanced Dataset

As a second experiment, we apply the model to the initial classes, without the sampling we’ve just done, to see the effects of an unbalanced dataset.

Classification matrix with all sentences

We see that the confusion matrix seems “cleaner” (with more zeros and small values outside the diagonal).

A Better Metric

To measure the cleanliness of the matrix, we can’t rely on the standard metrics, as they are biased by the overrepresentation of one class. To bypass this, we compute a new Precision for each class that doesn’t take the No relation class into account. Here’s the formula for each class i:

The higher these Precisions, the better the model.
The overrepresentation of the No relation class introduces a bias, enhancing the Precision of the other classes. In other words, if the model predicts a relation for an entity pair, there is a good chance that this relation is correct. In the best case, the predicted output is the right class, and in the worst case it is No relation, so the model is not misleading us with a wrong relation.
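As an illustration, here is a sketch of how such an adapted Precision could be computed from a confusion matrix, under the assumption that false positives coming from true No relation sentences are simply left out of the denominator; cm (rows = true classes, columns = predicted classes) and no_rel_idx are placeholder names:

```python
import numpy as np

def adapted_precision(cm, no_rel_idx):
    """Per-class Precision that ignores misclassified 'No relation' sentences.

    Assumption: the adapted metric drops the 'No relation' row entirely, so
    sentences without a relation never count as false positives.
    """
    cm = np.asarray(cm, dtype=float)
    cm_wo_no_rel = cm.copy()
    cm_wo_no_rel[no_rel_idx, :] = 0           # drop the 'No relation' row
    predicted = cm_wo_no_rel.sum(axis=0)      # predictions per class, true relations only
    true_positives = np.diag(cm_wo_no_rel)    # correct predictions per class
    # The entry for the 'No relation' class itself is not meaningful and can be ignored
    return np.divide(true_positives, predicted,
                     out=np.zeros_like(predicted), where=predicted > 0)
```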

Final Results

As merging the classes creates duplicates, we remove them before training. Here is the new distribution of the classes:
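A minimal sketch of this reorganisation on the hypothetical df used in the earlier snippets (the exact label strings in the files may differ):

```python
# Drop the noisy 'Contains' class and merge 'Capital' into 'Administrative divisions'
df = df[df["relation"] != "Contains"].copy()
df["relation"] = df["relation"].replace({"Capital": "Administrative divisions"})

# Merging creates duplicates (same sentence, same entities, same label): drop them
df = df.drop_duplicates()
```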

Final distribution of the examples

And here are the final results with the reorganization of the classes and an overrepresentation of sentences with no relation.

Below you can see the adapted Precisions for each class.

For the classes Neighborhood of and Administrative divisions, results are very meaningful, as both rows of the confusion matrix are almost completely white. If our classifier predicts one of those classes, we can be sure that it is right. We’re making progress!

A New Model with the Positions

Relative Positions

To further improve the model, we can add information about the positions of the entities. In the dataset, some sentences are already duplicated with different pairs of entities, or with different labels for the same pair. The “or” isn’t exclusive, as some sentences present both configurations. Adding the positions might help classify those examples.

Two choices are available to incorporate the positions into the model: either explicitly marking the entities in the text with tags like the ones in the first dataset (“<e1>” and “</e1>” for example), or using relative positions. In the latter case, for each sentence, we create two lists of distances from each word to each entity, and we concatenate them to the sentence embedding (like here).

In the first case, the model doesn’t change, and the results are similar to those obtained before.
The second is more interesting, as it requires some modifications to the model.

The relative position of each word with respect to each entity is calculated from the absolute positions given in the dataset. Each list is padded to the same maximum length as the sentences and split into train/test subsets (with the same division as before). Beware of the offsets induced by filtering out punctuation during preprocessing: in the original dataset, each punctuation mark counts as a word.
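A minimal sketch of this computation, assuming tokenized_sentences holds the word-index sequences fed to the model, entity1_positions / entity2_positions the (offset-corrected) word positions of the two entities, and max_len the padding length used for the sentences; all of these names are placeholders:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

def relative_positions(sentence_length, entity_position):
    # Signed distance of each word to the entity (0 at the entity itself)
    return [i - entity_position for i in range(sentence_length)]

pos1 = [relative_positions(len(s), e1) for s, e1 in zip(tokenized_sentences, entity1_positions)]
pos2 = [relative_positions(len(s), e2) for s, e2 in zip(tokenized_sentences, entity2_positions)]

# Pad to the same maximum length as the sentences
pos1 = pad_sequences(pos1, maxlen=max_len, padding="post")
pos2 = pad_sequences(pos2, maxlen=max_len, padding="post")
```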

The images below show the differences between absolute and relative positions.

Functional Model

In order to concatenate the word embeddings and the relative positions, we have to move to a Keras functional model. Here, we manually define the inputs and their shapes, and each layer is a function of the previous one. Once the model is instantiated, the training and prediction phases are the same as with the Sequential API. For more details, you can read a great tutorial here on the possibilities offered by this API.

The first input is the Word Embeddings:
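A minimal sketch, assuming a precomputed embedding_matrix of shape (vocab_size, embedding_dim) and the max_len used when padding the sentences; whether the embeddings are kept frozen is a choice, not a requirement:

```python
from tensorflow.keras.layers import Input, Embedding

# Padded word-index sequences
words_input = Input(shape=(max_len,), dtype="int32", name="words")

# Pretrained word embeddings, kept frozen in this sketch
word_embeddings = Embedding(input_dim=vocab_size,
                            output_dim=embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False,
                            name="word_embeddings")(words_input)
```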

The second and third are the Positions:
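A minimal sketch, assuming the padded relative-position arrays computed earlier; a Reshape adds a trailing feature axis so they can later be concatenated with the word embeddings:

```python
from tensorflow.keras.layers import Input, Reshape

# Relative distances of each word to the first and second entity
pos1_input = Input(shape=(max_len,), name="positions_entity1")
pos2_input = Input(shape=(max_len,), name="positions_entity2")

# Add a feature axis so the positions can be concatenated with the embeddings
pos1_features = Reshape((max_len, 1))(pos1_input)
pos2_features = Reshape((max_len, 1))(pos2_input)
```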

And then we get the model:
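A minimal sketch of the assembly; the LSTM size and dropout rate are placeholders, not necessarily the values used in the article:

```python
from tensorflow.keras.layers import Concatenate, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

# Concatenate word embeddings and position features along the last axis
features = Concatenate(axis=-1)([word_embeddings, pos1_features, pos2_features])

# Bi-LSTM encoder followed by a softmax over the relation classes
hidden = Bidirectional(LSTM(64, dropout=0.2))(features)
output = Dense(n_classes, activation="softmax")(hidden)

model = Model(inputs=[words_input, pos1_input, pos2_input], outputs=output)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```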

We train and predict:
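A minimal sketch; the epoch count, batch size and array names are placeholders, and y_train is assumed to be one-hot encoded:

```python
import numpy as np

# The three inputs must be passed in the same order as in the Model definition
model.fit([X_train_words, X_train_pos1, X_train_pos2], y_train,
          validation_split=0.1, epochs=10, batch_size=64)

# Predicted class = highest softmax probability
probabilities = model.predict([X_test_words, X_test_pos1, X_test_pos2])
predicted_classes = np.argmax(probabilities, axis=1)
```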

The matrix below shows the results.

And the Precisions:

This model yields high Precisions for all classes, meaning that we can trust its predictions. It also has a better recall than the previous model, because fewer sentences than before are predicted as No relation. Consequently, it is more likely to give useful information about an incoming sentence in a production environment.

Conclusion

This relation detector and classifier works well on these sentences from New York Times articles, but generalization to different types of sentences may not yield the same results. This is a common issue with supervised learning, as production use is limited by the chosen dataset. One solution could be to train a binary classifier to detect the presence of a relation in a sentence, followed by a second model that tells us which relation it is, based, for example, on the entity types (more details can be found here). But alas, the quest for a better production-ready relation extractor continues!

Feel free to comment with any thoughts you might have on the Bi-LSTM Relation Extraction model, and any ideas for improving the results!
