Hi folks! My name is Anastasiia, and I am a Ph.D. student in computer science at the University of Vienna. Less than a month ago, I finished my Summer of Code, the fantastic program organized by Google. It gives students an opportunity to experience the reality of working in an open-source organization and to make a real contribution. My challenge was to create an entire relation extraction component in the DeepPavlov library, and here is the story of what came out of it.
My path in NLP started with applied linguistics and brought me to the computational linguistics master’s program with a minor in computer science at the University of Munich. After trying a bit of this and a bit of that, I found the topic that inspired me the most at that time: relation extraction. I worked on it primarily for my master’s thesis, where I developed a system for correcting bias in weak supervision for relation extraction tasks.
In my Ph.D. studies, however, my focus switched entirely to weak supervision, that is, supervised training without manually annotated data. Currently, I lead a group developing Knodle, a modular framework that aims to make weak supervision accessible to everyone: it lets you annotate data automatically, improve the annotations (with one of the provided methods or your own), and train a PyTorch model on the result. So, for me, the chief maintainer of a newborn open-source project, it was crucial to see from the inside how already thriving projects work.
The organization that hosted me was DeepPavlov, the developer behind the open-source Conversational AI stack for building chatbots and virtual assistants. My first impression from their website was: “Wow, these guys are a real team.” They had a vibe, an attitude to the work, and a passion I truly admired, and that made me want to work with them. Although I had never worked on chatbots, a relation extraction project was among those proposed. That is very much my topic, so I decided to apply.
At the beginning of May, I got an acceptance letter, and my coding summer started. For three months, I worked on relation extraction for English and Russian. And now, with both of these components already shipped in the DeepPavlov Library, I will talk a bit about the tasks, challenges, and results. Let’s go :)
The primary task was to implement a relation extraction module for both English and Russian languages. The main steps we defined were:
- Design and develop the neural network for the relation extraction;
- Create the relation extraction pipeline and incorporate it into the DeepPavlov Library;
- Choose which datasets the relation extraction models will be trained on and prepare them for training;
- Train reliable RE models for both languages and test them on the different test examples.
As a result, the relation extraction component for both English and Russian languages was included in DeepPavlov Release 0.17.0 and is now ready for your experiments!
Not only the technical side but also entering a new team and working with new people is always quite challenging. My challenges were:
- Learning about the organization;
- Adapting to the organizational shenanigans;
- Balancing my GSoC project with my primary work (unlike other students, I had no vacation and still had my regular work to do).
Now is the right time to thank everyone I met for their involvement! They made sure I integrated smoothly into the team and were genuinely supportive and cooperative, which kept me motivated and passionate throughout.
Demo Time: Relation Extraction for English
Right off the bat, let’s take a look at the result of my Summer of Code: Relation Extraction!
Suppose we have the sentence “Barack Obama is married to Michelle Obama, born Michelle Robinson”, and let's say we want to know whether a relation holds between Barack Obama and Michelle Obama/Michelle Robinson.
In this case, we first download the already pre-trained English RE model…
python -m deeppavlov download re_docred
… and then call it on our test sample:
See? The model detects a relation “spouse” and returns it with the corresponding id in Wikidata (“P26”) — that is, obviously, a correct answer.
You can also train your own English RE model with different parameters from scratch — simply use the following command:
python -m deeppavlov train re_docred
Demo Time: Relation Extraction for Russian
Let’s look at the Russian sentence “Илон Маск живет в Сиэтле” (“Elon Musk lives in Seattle”). Let’s say we want to find a relation between Илон Маск and Сиэтл. In this case, we first download the pre-trained Russian RE model…
python -m deeppavlov download re_rured
… and call it on our test sample:
The output is the detected relation “место жительства” (“residence”) with the corresponding Wikidata id (“P551”). Correct again!
In order to train the Russian RE model from scratch, use the following command:
python -m deeppavlov train re_rured
Relation Extraction: Behind The Scenes
So far so good. Now let’s dig into the details of the relation extraction problem itself.
Formally, relation extraction is a sub-task of information extraction that involves finding and classifying the semantic relations between entities in unstructured text. For example, knowing that Austral is a domestic airline of Argentina and the sister company of Aerolineas Argentinas, we can find the relation “org:subsidiaries” between these two entities.
The principal practical applications of relation extraction are building new databases and augmenting existing ones. An extensive collection of relational triples (i.e., two entities and the relation that holds between them) can later be converted into a structured database of facts about the real world, potentially of even better quality than manually created ones. In most conventional applications, the text entities between which a relation holds correspond to named entities or to underlying entities obtained with coreference resolution.
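To make the idea of turning triples into a database a bit more tangible, here is a minimal sketch. The `Triple` type and the `build_fact_index` helper are hypothetical names invented for this illustration; the two facts are the ones from the demos above, with their Wikidata ids.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted fact: two entities and the relation between them."""
    subject: str
    relation: str  # Wikidata property id, e.g. "P26" for spouse
    obj: str


def build_fact_index(triples):
    """Group triples by (subject, relation) so facts can be looked up like rows."""
    index = defaultdict(set)
    for triple in triples:
        index[(triple.subject, triple.relation)].add(triple.obj)
    return index


triples = [
    Triple("Barack Obama", "P26", "Michelle Obama"),  # spouse
    Triple("Илон Маск", "P551", "Сиэтл"),             # residence
    Triple("Barack Obama", "P26", "Michelle Obama"),  # repeated extraction, stored once
]
facts = build_fact_index(triples)
print(sorted(facts[("Barack Obama", "P26")]))
```

Storing the objects in a set means that the same fact extracted from several documents ends up in the index only once.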
Pattern-Based, Unsupervised, or Supervised?
- Pattern-Based Relation Extraction. The most straightforward approach to extracting relations from a text collection. The main idea of this method is to manually write down a set of patterns that express different relations and use them to find the relations in text;
- Unsupervised Relation Extraction. A more sophisticated approach that extracts the relations between entities in sentences without using any labeled data, external data sources, or even a predefined set of target relations;
- Supervised Relation Extraction. Typically, supervised relation extraction is done by a classifier trained on a relatively extensive annotated training set with a known relation label for each training instance. The trained classifier can then predict the relation in unseen test samples.
We decided to implement Supervised Relation Extraction because, from our point of view, it is the most common and best-performing approach today.
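To see why the pattern-based approach is the most straightforward one (and also where it breaks), here is a small sketch. The two patterns, the `ENTITY` regex, and the function name are all made up for illustration and are not part of DeepPavlov.

```python
import re

# Hypothetical entity matcher: one or more capitalized words in a row.
ENTITY = r"[A-Z]\w+(?: [A-Z]\w+)*"

# Hypothetical hand-written patterns: each surface template maps to one relation.
PATTERNS = [
    (re.compile(rf"(?P<head>{ENTITY}) is married to (?P<tail>{ENTITY})"), "spouse"),
    (re.compile(rf"(?P<head>{ENTITY}) lives in (?P<tail>{ENTITY})"), "residence"),
]


def extract_with_patterns(text):
    """Return (head, relation, tail) triples for every pattern match in the text."""
    triples = []
    for pattern, relation in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group("head"), relation, match.group("tail")))
    return triples


print(extract_with_patterns("Barack Obama is married to Michelle Obama, born Michelle Robinson."))
print(extract_with_patterns("Michelle Obama is the wife of Barack Obama."))
```

The second call finds nothing, because no pattern covers the wording “is the wife of”: every new way of phrasing a relation needs another hand-written pattern, which is the main limitation of this approach.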
Sentence-Level or Document-Level?
The supervised relation extraction algorithms differ in the kinds of samples used for training and testing the classifier.
The most common is the Sentence-Level approach, where the data sample is a single sentence. It usually assumes only one relation inside the given sentence, which seems far from real life: most sentences in everyday speech contain more than one relation. Moreover, such an approach limits the extraction of relations from discourse: if the two entities are mentioned in different sentences, such a model would not detect the relation between them.
That is why we decided to implement Document-Level relation extraction, which, unlike the classic sentence-level approach, allows us to find multiple relations within the same sentence as well as relations between entities in different sentences.
To better grasp how document-level relation extraction works, let’s look at what the training data contains. Our model was trained on DocRED, a corpus for document-level relation extraction (more details about that are in the next sections).
Each data sample there is a text snippet with marked entities and the relations that hold between them. One entity can be mentioned differently in different sentences and can also take part in different relational triples. And we want our model to trace them all!
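A toy instance in the style of DocRED might look like the sketch below. The field names (`sents`, `vertexSet`, `labels`, `h`, `t`, `r`, `evidence`) are assumed to follow the publicly released DocRED JSON files, and the document itself is a made-up illustration, not a real corpus entry. The two helper functions are hypothetical as well.

```python
# Made-up two-sentence document in a DocRED-like layout (assumed field names).
doc = {
    "title": "Toy example",
    "sents": [
        ["Barack", "Obama", "is", "married", "to", "Michelle", "Obama", "."],
        ["Obama", "was", "born", "in", "Honolulu", "."],
    ],
    # One inner list per entity; each entity may have several mentions.
    "vertexSet": [
        [{"name": "Barack Obama", "sent_id": 0, "pos": [0, 2], "type": "PER"},
         {"name": "Obama", "sent_id": 1, "pos": [0, 1], "type": "PER"}],
        [{"name": "Michelle Obama", "sent_id": 0, "pos": [5, 7], "type": "PER"}],
        [{"name": "Honolulu", "sent_id": 1, "pos": [4, 5], "type": "LOC"}],
    ],
    # h and t index into vertexSet, r is a Wikidata property id.
    "labels": [
        {"h": 0, "t": 1, "r": "P26", "evidence": [0]},  # spouse
        {"h": 0, "t": 2, "r": "P19", "evidence": [1]},  # place of birth
    ],
}


def mention_text(doc, mention):
    """Recover the surface form of a mention from its sentence id and token span."""
    start, end = mention["pos"]
    return " ".join(doc["sents"][mention["sent_id"]][start:end])


def iter_triples(doc):
    """Yield (head name, relation id, tail name), using each entity's first mention."""
    for label in doc["labels"]:
        head = doc["vertexSet"][label["h"]][0]["name"]
        tail = doc["vertexSet"][label["t"]][0]["name"]
        yield head, label["r"], tail


for triple in iter_triples(doc):
    print(triple)
```

Note how the first entity has two mentions in two different sentences and takes part in both triples, and how the second triple can only be found by reading beyond the first sentence.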
Relation Extraction Challenges
However, document-level relation extraction systems face several challenges. Some of them are:
- Multi-Entity Problem. In document-level relation extraction, one document often contains multiple entity pairs with different relations. Ideally, a robust relation extraction system should detect and classify all of them at once. Moreover, the same entity may appear in various forms across the text (e.g., “John Smith”, “Mr. Smith”, “John”, etc.);
- Multi-Label Problem. One entity pair can occur multiple times in the document with different relations, in contrast to the one relation per entity pair assumed in sentence-level relation extraction. For example, the entities “John Smith” and “New York” may easily express the relations “place of birth” and “place of death” at the same time if some unknown John Smith happened to be born and die in the same city, New York;
- Negative Samples. A vital part of preparing the data for training any supervised relation extraction model is generating negative training samples, i.e., samples with no relation between the entities. Tuning the number of negative samples is usually done through experimentation and often turns out to be a delicate task requiring additional attention.
We tuned the number of negative samples in some experiments and overcame the first two challenges with a specific model architecture we will talk about in the next section.
Relation Extraction Model
We propose a new RE model based on ATLOP (Adaptive Thresholding and Localized Context Pooling). As the name suggests, its two core ideas are the Adaptive Threshold and Localized Context Pooling.
- Adaptive Threshold. A learnable, entity-dependent threshold replaces the usual global threshold for converting the output probabilities of the relation extraction classifier into relation labels. In order to learn its value, we introduce it as an extra class, which we train like all other classes. During prediction, the positive classes (i.e., the relations that actually hold in the sample) are the ones with higher logits than the threshold class. All others are negative (i.e., there are no such relations between the given entities).
- Localized Context Pooling. An additional local context embedding related to both entities enhances the embedding of each entity pair. Such a representation, which attends to the context in the document relevant to the entity pair, helps to decide the relation for this particular pair. To derive the context information, we directly use attention heads.
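Both ideas can be sketched in a few lines of plain Python. This is not the DeepPavlov implementation: the function names are hypothetical, the threshold class is assumed to sit at index 0, and the pooling assumes the scheme from the original ATLOP formulation (multiply the attention of the two entities per head, sum over heads, normalize, and use the result to average the token embeddings).

```python
def predict_relations(logits, threshold_index=0):
    """Adaptive thresholding at prediction time for one entity pair.

    logits: one score per class; the class at threshold_index is the learned
    threshold class. Returns the indices of all relation classes whose logit
    is higher than the threshold logit (an empty list means "no relation").
    """
    threshold = logits[threshold_index]
    return [i for i, score in enumerate(logits)
            if i != threshold_index and score > threshold]


def localized_context(head_attention, tail_attention, token_embeddings):
    """Localized context pooling for one entity pair.

    head_attention, tail_attention: [num_heads][num_tokens] attention of each
    entity over the document tokens. token_embeddings: [num_tokens][dim].
    Returns a context vector of length dim.
    """
    num_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    # Product per head, summed over heads: a token scores high only if
    # BOTH entities attend to it.
    scores = [
        sum(head[i] * tail[i] for head, tail in zip(head_attention, tail_attention))
        for i in range(num_tokens)
    ]
    total = sum(scores) or 1.0  # guard against division by zero
    weights = [score / total for score in scores]
    # Weighted average of the token embeddings.
    return [
        sum(weights[i] * token_embeddings[i][d] for i in range(num_tokens))
        for d in range(dim)
    ]


# Classes 1 and 3 beat the threshold class 0, so both relations are predicted.
print(predict_relations([1.0, 2.5, -0.3, 1.2]))
# Only the middle token is attended to by both entities.
print(localized_context([[0.5, 0.5, 0.0]], [[0.0, 0.5, 0.5]],
                        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Because every entity pair is compared against its own threshold logit, the number of predicted relations can differ from pair to pair, which is what makes this scheme suitable for the multi-label setting.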
One of the most important adjustments we made to the original ATLOP model is using the entities’ NER tags as an additional input to improve the model’s performance. Thus, the input of our RE model is the following:
- Text document as a list of tokens;
- List of entity positions (i.e., the start and end positions of all mentions of both entities);
- List of NER tags of both entities (we adopted the entity tags used in DocRED and RuRED for the English and Russian models respectively, which gives us 6 NER tags for the English model and 29 NER tags for the Russian model. The full lists of NER tags are in the official documentation).
To encode the input text, we used the BERT base model (uncased).
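For the English demo sentence, the three inputs could be assembled roughly as follows. This is a sketch under assumptions: the `find_mentions` helper and the dictionary keys are hypothetical, token spans are assumed to use an exclusive end index, and the tag name `PER` is assumed to be the person tag of DocRED. Please check the official documentation for the exact format the model expects.

```python
def find_mentions(tokens, mention):
    """Return all (start, end) token spans where `mention` occurs; end is exclusive."""
    size = len(mention)
    return [(i, i + size)
            for i in range(len(tokens) - size + 1)
            if tokens[i:i + size] == mention]


tokens = ["Barack", "Obama", "is", "married", "to", "Michelle", "Obama", ",",
          "born", "Michelle", "Robinson", "."]

# The first entity has one mention, the second one has two different mentions.
head_spans = find_mentions(tokens, ["Barack", "Obama"])
tail_spans = (find_mentions(tokens, ["Michelle", "Obama"])
              + find_mentions(tokens, ["Michelle", "Robinson"]))

sample = {
    "tokens": tokens,                        # text document as a list of tokens
    "entity_pos": [head_spans, tail_spans],  # mention positions of both entities
    "entity_tags": ["PER", "PER"],           # one NER tag per entity (assumed tag name)
}
print(sample["entity_pos"])
```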
The output of relation extraction is one or several relations found between the given entities. We adopted the relations from the authors of DocRED and RuRED.
In DocRED there are 97 English relations, such as:
- located in the administrative territorial entity,
- country of citizenship,
- publication date,
The Russian relations are not as fine-grained as the English ones: there are 30 of them in total.
You can learn more about the supported relations and other details in the official documentation. In the English RE model, the output is the Wikidata relation id and the English relation name. In the Russian RE model, there is also an additional Russian relation name if it is available.
Among the challenges we have already mentioned, the crucial one is the generation of negative samples. In the current implementation, negative samples are created by randomly selecting two entities in a data sample that are not connected by any relation and claiming that they are not related to each other. This is quite a strong assumption, yet it works surprisingly well in most cases. The dataset is then augmented with a new data sample in which these two entities are marked as holding “no_relation”.
As a part of the preprocessing of the relation extraction corpora, we also added an option to create different numbers of negative samples: there can be as many negatives as positives, twice as many, three times as many, or none at all. We obtained the best result with the following proportions: twice as many negatives as positives in the training set and an equal number of negatives and positives in the validation and test sets.
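The sampling procedure described above can be sketched as follows. The function name, its parameters, and the choice to treat entity pairs as ordered are assumptions made for this illustration; the actual preprocessing code in the library may differ.

```python
import random
from itertools import permutations


def generate_negatives(entities, positive_pairs, ratio=2, seed=0):
    """Sample entity pairs with no annotated relation and label them "no_relation".

    entities: entity identifiers found in one data sample.
    positive_pairs: set of (head, tail) pairs connected by at least one relation.
    ratio: negatives per positive (0 = none, 1 = equal, 2 = twice, 3 = thrice).
    """
    # Every ordered pair of distinct entities without a relation is a candidate.
    candidates = [pair for pair in permutations(entities, 2)
                  if pair not in positive_pairs]
    # Never ask for more negatives than there are candidates.
    wanted = min(len(candidates), ratio * len(positive_pairs))
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    return [(head, tail, "no_relation")
            for head, tail in rng.sample(candidates, wanted)]


entities = ["A", "B", "C", "D"]
positives = {("A", "B")}
negatives = generate_negatives(entities, positives, ratio=2)
print(negatives)
```

With `ratio=2` the training set gets two negatives for every positive pair, which corresponds to the proportion that worked best in our experiments; `ratio=1` would match the validation and test sets.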
We trained the RE model for the English language on the DocRED corpus. It was constructed from Wikipedia and Wikidata and is now the most extensive human-annotated English dataset for document-level RE from plain text. After several adaptations of the corpus (e.g., moving some of the samples from the validation set to the training set, creating additional negative samples, etc.), we ended up with the following numbers of data samples: 130650 training samples, 3406 validation samples, and 3545 test samples.
We trained the RE model for the Russian language on the RuRED corpus, which is based on the Lenta.ru news corpus. The additionally generated negative samples follow the same proportions as in the English data, resulting in the following numbers: 12855 training samples, 1076 validation samples, and 1072 test samples.
Conclusion & Future Directions
Now, at the end of this journey, I am extremely satisfied to see relation extraction in the newest DeepPavlov release. Of course, there is still plenty of room for improvement: increasing the number of NER tags for English and the number of relations for Russian, trying out different model parameters, growing the training corpora with weakly supervised data… Stay tuned :)
It was an incredible Summer of Code, and I never regretted embarking on this three-month adventure. To everyone who is still in doubt about whether to apply for Google Summer of Code: just do it! It is totally worth it!