Simple Relation Extraction with a Bi-LSTM Model — Part 1
“Let’s schedule a meeting next Friday at our offices in Paris.” It doesn’t seem like it, but for an NLP-oriented task, such a sentence is full of challenges. To correctly make the appointment, we need to understand that “meeting” and “Friday” are related, and to find out what the relation between those terms is. It could be something like date_of(meeting, Friday), which reads as “The date of the meeting is Friday”. Other relations in that sentence would be location_of(offices, Paris) and location_of(meeting, offices).
Overview of the Relation Extractor
Presentation of the Task
Relation Extraction (RE) is the task of finding the relation that holds between two words (or groups of words) in a sentence. We call those words ‘entities’. There are hundreds of possible relations between entities of different types, which include (non-exhaustively) a person, a location on any scale from the Universe to your desk, a date, an organization or an event. Relations range from place_of_birth and place_lived to cause_effect, component_whole or founded_by.
This task is considered one of the most difficult in NLP. For a great summary of why Relation Extraction is so hard, I recommend the Introduction and Related Work sections of this paper by Lin et al., 2016.
Relation Extraction actually involves several subtasks:
- Extraction of entities (by Named Entity Recognition or matching of keywords).
- Detection of the existence of a relation between each pair of entities detected in a sentence.
- Classification to determine what each detected relation is.
In this article, we will focus on relation detection and classification, and we will train a model which can perform both of these tasks simultaneously.
Dataset and Model
The goal here is to classify sentences based on the relations they express. The datasets used in this article are available here (from SemEval-2010 Task 8) and here (a preprocessed version of the New York Times corpus, originally from Riedel et al., 2010).
We choose a Bi-LSTM model, as this type of Recurrent Neural Network has proven well suited to tasks where remembering long-term dependencies is crucial (see an example applied to RE here). Keeping track of past words and grammatical structures is essential to detect long-distance relation patterns. This blog article provides a great explanation of how an LSTM model works (see this answer on StackOverflow for an explanation of bidirectionality), and this one deals with their implementation in Keras. We implement the model using the Keras Sequential API within TensorFlow.
As there is more than one relation in the dataset, we are facing a multi-class classification problem. In this case, global metrics like Accuracy don’t give us a clear understanding of the model’s performance. This is why we favour confusion matrices and the individual Precision of each class (check out this article for more details on the Accuracy trap and the necessary alternatives, Precision and Recall). Confusion matrices are a good indicator of the model’s behaviour: in the perfect case, they have non-zero numbers only on the diagonal. Below is an example of the confusion matrix of a well-performing model, as the matrix is almost empty outside the diagonal.
The perfect matrix is what we’re aiming for here, and we will explore different ways of improving our model to get there. The metrics used for our model are the ones provided by sklearn.
Building a First Model
We will first consider the SemEval-2010 dataset. It contains 8000 training sentences and 2717 testing ones, split into 10 classes, one of which is the “Other” class. Every sentence in the dataset therefore expresses exactly one relation, so our model will only be a relation classifier, not a detector. Here is the class distribution over the whole dataset:
Intuitively, the names of some classes are very close, so we can expect them to overlap (even though they are very precisely defined in the annotation guidelines of the task).
We can then manipulate the data using a pandas.DataFrame with columns “sentences” and “labels”.
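As a sketch, here are a couple of made-up rows in that format (the actual sentences and label encoding come from the SemEval files):

```python
import pandas as pd

# Two hypothetical rows in the expected format: raw sentences (still
# carrying the <e1>/<e2> entity markers) and an integer relation label.
df = pd.DataFrame({
    "sentences": [
        "The <e1>meeting</e1> is scheduled for next <e2>Friday</e2>.",
        "Their <e1>offices</e1> are located in <e2>Paris</e2>.",
    ],
    "labels": [3, 7],
})
print(df.shape)  # (2, 2)
```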
The first preprocessing step applied here is the removal of the entity markers (e.g. “&lt;e1&gt;”) from the sentences.
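A minimal way to strip those markers, assuming they follow the `<e1>…</e1>` / `<e2>…</e2>` pattern of SemeVal-style annotations (the article’s own preprocessing code may differ):

```python
import re

def remove_entity_tags(sentence):
    # Drop the <e1>, </e1>, <e2>, </e2> markers but keep the entity words.
    return re.sub(r"</?e[12]>", "", sentence)

print(remove_entity_tags("The <e1>meeting</e1> is next <e2>Friday</e2>."))
# The meeting is next Friday.
```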
Then, we have to transform our sentences into lists of word indices. Each word has a unique index in the vocabulary built from the training set. The lists are padded with zeros up to a maximum length so that all sentences have the same shape. For example, if these two sentences were extracted from a bigger set, it could give:
All of this is done using a Keras Tokenizer. To determine its parameters, a quick analysis of the corpus reveals that the sentences contain almost 150000 unique tokens (words and punctuation marks), among which 145000 are used more than once, and around 15000 more than 100 times. From this, we set the size of the vocabulary to 20000. Furthermore, the sentences have at most 100 tokens, which allows us to set the maximum length to 100 without losing any information.
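With those parameters, the indexing and padding can be sketched as follows (the `oov_token` and the padding side are assumptions; the article doesn’t specify them):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 20000  # from the corpus analysis above
MAX_LEN = 100       # no sentence is longer than this

sentences = [
    "the meeting is scheduled for next friday",
    "their offices are located in paris",
]

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<UNK>")
tokenizer.fit_on_texts(sentences)                    # build the vocabulary
sequences = tokenizer.texts_to_sequences(sentences)  # words -> indices
# Pad with zeros so every sentence has the same length.
X = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")
print(X.shape)  # (2, 100)
```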
We have one last thing to do before moving on to the actual model. Our labels need to be transformed into one-hot encoding. For example, 0 becomes [1, 0, 0, 0, …, 0, 0, 0] and 3 becomes [0, 0, 0, 1, 0, …, 0, 0, 0].
It is done by the following function:
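A minimal NumPy version of such a function (a sketch; Keras’s `to_categorical` would also do the job):

```python
import numpy as np

def to_one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows, e.g. 3 -> [0, 0, 0, 1, 0, ...]."""
    one_hot = np.zeros((len(labels), num_classes), dtype=int)
    one_hot[np.arange(len(labels)), labels] = 1
    return one_hot

print(to_one_hot([0, 3], num_classes=10))
```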
It’s time to split! We set the test set size to 20% of the whole dataset, and fix the random state for the sake of reproducibility.
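With scikit-learn, and stand-in arrays of the right shapes, the split looks like this (`random_state=42` is an arbitrary choice, not from the article):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the padded sequences and one-hot labels built above.
X = np.random.randint(0, 20000, size=(100, 100))
y = np.eye(10)[np.random.randint(0, 10, size=100)]

# 20% held out for testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (80, 100) (20, 100)
```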
Our model consists of one Bi-LSTM layer followed by a Dense layer with softmax activation. They are preceded by a Keras Embedding layer. This layer is trained with the model and learns features from the sentences that are relevant for our task. Here is a great response on StackOverflow on the differences between this approach and the Word2Vec model.
We train it with a validation split to detect possible overfitting:
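Putting the two previous paragraphs together, here is a sketch of the model and its training; the embedding and hidden sizes, batch size and validation fraction are assumptions, as the article doesn’t state them:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 100, 10
EMBED_DIM, HIDDEN = 128, 64  # assumed sizes

model = Sequential([
    # The Embedding layer is trained jointly with the rest of the model.
    Embedding(VOCAB_SIZE, EMBED_DIM),
    Bidirectional(LSTM(HIDDEN)),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Random stand-in data; replace with the real padded sequences and labels.
X_train = np.random.randint(0, VOCAB_SIZE, size=(64, MAX_LEN))
y_train = np.eye(NUM_CLASSES)[np.random.randint(0, NUM_CLASSES, size=64)]

# Hold out 10% of the training data to watch for overfitting.
history = model.fit(X_train, y_train, validation_split=0.1,
                    epochs=3, batch_size=32, verbose=0)
```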
Evaluation of the Model
As said before, Accuracy alone doesn’t truly reflect the performance of our model, so calling the evaluate function isn’t enough. We therefore use the predict function, which outputs a probability for each class; the index of the maximum corresponds to the predicted class. The code below displays the confusion matrix and the classification report, which neatly presents the Precision, Recall and F1 score of each class.
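A sketch of that code, where the hard-coded probabilities stand in for the output of `model.predict(X_test)`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Stand-in for model.predict(X_test): one probability per class per sentence.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.2, 0.3, 0.5]])
y_true = np.array([1, 0, 2])

y_pred = probs.argmax(axis=1)  # index of the maximum = predicted class
print(confusion_matrix(y_true, y_pred))       # rows: true, columns: predicted
print(classification_report(y_true, y_pred))  # Precision/Recall/F1 per class
```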
Here is the confusion matrix after training for 3 epochs. Each row corresponds to the true label and each column to the prediction. The colours reflect the percentage of sentences of a class i being predicted as a class j, so the scale bar is expressed in percentages.
Some classes, like Product-Producer (5th column), are never predicted, and the class Other (first column) is overpredicted. The diagonal is slightly visible, but it is far from perfect.
To improve the results, we train longer, for 10 epochs. The outcomes are a bit better, but there are still non-negligible numbers (relative to the class sizes) outside the diagonal, which means the classes aren’t properly separated.
We could keep going and increase the number of epochs (within reason, to avoid overfitting: a quick test showed that the validation loss stops decreasing after 12 epochs) or fine-tune the hyperparameters, but actually, there is one other thing that may be holding back the results we’re looking for.
This dataset is rather small for training a neural network (8000 training sentences). This is a common issue in NLP: annotating data for hard tasks is far from obvious and thus very time-consuming.
Using the SemEval-2010 dataset, we have built a Bi-LSTM model to classify relations. Unfortunately, the results are not good enough to use it in a production environment. That’s why we will keep working on it in a second part.