Advanced Model Attempt #1: Neural-Based Definition Extraction

Tam Dang
NLP Capstone Blog
May 3, 2018

We last left off on the idea of using an FSA with a restricted vocabulary (restricted in the sense that the vocabulary comes from extracted sentences), coupled with a neural language model to ensure semantic quality while still allowing a generative RNN model a reasonable amount of improvisation to produce abstractive definitions.

Here, we discuss our approach for the extractive component of this model, and consider it our first attempt at an advanced model for the task.

Introducing Extractive Summarization

Recall that extractive summarization is the idea of reducing a text down to a subset of its sentences that still preserves its semantic integrity. In particular, we intend to build on the work of a successful neural-based extractive summarizer and tailor it to solve our task.

SummaRuNNer is an RNN-based extractive summarization algorithm developed by Nallapati et al. that encodes documents from the word level up to and across the sentence level before performing inference. Essentially, the model is a binary classifier over the sentences of a document, deciding for each whether it should be included in the summary. Its decisions are conditioned on the following (a rough sketch of this encoding follows the list):

  • Average-pooled word-level hidden states of the sentence
  • Average-pooled sentence-level hidden states of the document
  • An abstract representation of the summary built so far (average-pooling of the word-level pooled hidden states of sentences selected thus far)
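
A minimal sketch of this hierarchical pooling, assuming a bidirectional GRU at both levels (the class name, layer sizes, and the tanh on the document representation are our own choices, not details taken from the paper):

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Average-pools word-level RNN states into sentence representations,
    then sentence-level RNN states into a document representation."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_rnn = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.sent_rnn = nn.GRU(2 * hidden_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, doc):
        # doc: (num_sentences, max_words) word-id tensor for one document
        word_states, _ = self.word_rnn(self.embed(doc))          # (S, W, 2H)
        sent_reps = word_states.mean(dim=1)                      # average-pool words -> (S, 2H)
        sent_states, _ = self.sent_rnn(sent_reps.unsqueeze(0))   # (1, S, 2H)
        sent_states = sent_states.squeeze(0)                     # (S, 2H)
        doc_rep = torch.tanh(sent_states.mean(dim=0))            # average-pool sentences -> (2H,)
        return sent_states, doc_rep
```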

From these, the model computes several affine transformations conducive to selecting and filtering sentences (sketched in code after the list):

  • Content: affine on the abstract sentence representation that measures semantic richness
  • Salience: bilinear affine on the abstract sentence representation and the document representation to measure cohesiveness
  • Novelty: bilinear affine on the abstract sentence representation and the running summary representation to address redundancy
  • Absolute and Relative Positioning: two separate affines on the embedded index of the sentence, so that how far we are into the document can influence inference
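
Putting these pieces together, one sentence at a time, the probability of keeping a sentence is a sigmoid over the sum of these terms. Here is a rough sketch under our own naming and sizing assumptions, with the running summary built as a probability-weighted sum of the sentence representations selected so far:

```python
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    """Scores each sentence for inclusion using the content / salience /
    novelty / position terms described above (a sketch, not the official model)."""
    def __init__(self, hidden_dim=200, pos_dim=50, max_sents=500, segments=10):
        super().__init__()
        d = 2 * hidden_dim
        self.content = nn.Linear(d, 1, bias=False)                 # semantic richness
        self.W_salience = nn.Parameter(torch.randn(d, d) * 0.01)   # sentence vs. document
        self.W_novelty = nn.Parameter(torch.randn(d, d) * 0.01)    # sentence vs. running summary
        self.abs_pos = nn.Embedding(max_sents, pos_dim)
        self.rel_pos = nn.Embedding(segments, pos_dim)
        self.pos_score = nn.Linear(2 * pos_dim, 1, bias=False)     # positional affines
        self.bias = nn.Parameter(torch.zeros(1))
        self.segments = segments

    def forward(self, sent_states, doc_rep):
        n = sent_states.size(0)
        summary = torch.zeros_like(doc_rep)      # abstract representation of the summary so far
        probs = []
        for j, h in enumerate(sent_states):
            pos = torch.cat([self.abs_pos.weight[j],
                             self.rel_pos.weight[(j * self.segments) // n]])
            logit = (self.content(h)
                     + h @ self.W_salience @ doc_rep                # salience
                     - h @ self.W_novelty @ torch.tanh(summary)     # novelty penalizes redundancy
                     + self.pos_score(pos)
                     + self.bias)
            p = torch.sigmoid(logit)
            summary = summary + p * h            # weight each kept sentence by its probability
            probs.append(p)
        return torch.cat(probs)                  # P(include) for each sentence
```

With the encoder above, scorer(sent_states, doc_rep) yields one inclusion probability per sentence in the document.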

As of now, we have built our own unofficial implementation of this model from scratch, drawing inspiration from another unofficial implementation, and it is capable of summarizing documents in the format we use. What’s left is to tailor this model to fit the task.

A Slight Twist on an Established Task

As of now, the model summarizes documents. We’d like it to instead zero in on a query term we give it for a research paper, and intelligently extract from that paper only the sentences conducive to defining that term.

Our approach for augmenting SummaRuNNer to be a definition extractor involves

  • Encoding the query term with a character-level RNN and using its concatenated hidden states as its representation
  • Introducing this new query-term abstract representation when constructing the document representation through a bilinear affine
  • Feeding the query term into many of the non-bilinear affines (content, positioning, and possibly new ones for the task) so that inference is further conditioned on it.

Essentially, the sentences we extract from the document are conditioned on the term we’re trying to define. Encoding technical terms with a character-level RNN allows similar technical terms to have similar hidden representations. For example, if we see the term “Chronic Lymphocytic Leukemia” in the training data and encounter “Chronic Myelogenous Leukemia” in the testing data, we have more of an idea of how to approach this new term because of its character-level similarities to the term we already saw during training. This might help us break down more complicated novel technical terms at test time.
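
A minimal sketch of the character-level term encoder, assuming a bidirectional GRU whose final hidden states are concatenated into the term representation (the class name, character inventory, and dimensions here are our own assumptions):

```python
import string
import torch
import torch.nn as nn

CHARS = {c: i + 1 for i, c in enumerate(string.printable)}  # 0 reserved for padding/unknown

class CharTermEncoder(nn.Module):
    """Encodes a query term character by character; similar surface forms
    (e.g. 'Chronic Lymphocytic Leukemia' vs. 'Chronic Myelogenous Leukemia')
    end up with nearby representations."""
    def __init__(self, char_emb_dim=25, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(len(CHARS) + 1, char_emb_dim, padding_idx=0)
        self.rnn = nn.GRU(char_emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, term):
        ids = torch.tensor([[CHARS.get(c, 0) for c in term]])   # (1, term length)
        _, h_n = self.rnn(self.embed(ids))                      # h_n: (2, 1, H)
        return torch.cat([h_n[0, 0], h_n[1, 0]])                # concatenated final states, (2H,)

encoder = CharTermEncoder()
q = encoder("Chronic Lymphocytic Leukemia")   # query-term representation, shape (200,)
# q could then interact with the document representation through a new bilinear term
# (e.g. q @ W_q @ doc_rep) and be fed into the content and positional affines as well.
```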

Experiments on the effectiveness of this approach have yet to be conducted; they will be discussed in Advanced Model Attempt #1 (cont.), a later post in this series covering the results of the groundwork we’ve laid out here.

Training Methods

Collecting Training Data with UMLS and ROUGE:

Recall that SummaRuNNer is a model that aims to extract the sentences in a document that summarize it best. It does so by training on examples that teach the model which sentences to extract from the document.

SummaRuNNer uses a distant supervision method that relies on ROUGE to produce training examples for the model. This portion of the architecture, which we refer to as the “extractor”, selects the sentences from each document that maximize the ROUGE score when compared against the gold-standard definition for the term in question. The extractor in a summarization context can use a greedy approach as follows:

  • Look at each sentence in the document one at a time and consider appending it to the extracted sentences that we have already chosen.
  • Calculate the ROUGE score of the old extracted sentences + this new sentence in comparison to the gold-standard summary of the document.
  • If the ROUGE score increases from the previous ROUGE score, keep the new sentence.
  • Otherwise, we don’t keep the new sentence and move on.

Although this method may not produce the most compact or optimal set of relevant sentences, it is faster and reasonable. The output of the extractor for each document is a tensor whose length is the number of sentences in the document, with each element 0 if the corresponding sentence is tagged O and 1 if it is tagged I.
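
A minimal sketch of this greedy tagger; the rouge1_f helper below is a simple ROUGE-1 F1 stand-in we use for illustration, where a real pipeline would call a full ROUGE implementation:

```python
from collections import Counter

def rouge1_f(candidate_tokens, reference_tokens):
    """Simple ROUGE-1 F1 stand-in (unigram overlap)."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if not candidate_tokens or not reference_tokens or not overlap:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

def greedy_io_tags(sentences, gold_definition):
    """Greedily keep sentences that improve ROUGE against the gold definition,
    returning a 0/1 (O/I) tag per sentence."""
    reference = gold_definition.split()
    selected, best, tags = [], 0.0, [0] * len(sentences)
    for i, sent in enumerate(sentences):
        score = rouge1_f(selected + sent.split(), reference)
        if score > best:                 # keep the sentence only if ROUGE improves
            best = score
            selected += sent.split()
            tags[i] = 1
    return tags
```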

To tailor this style of data collection to our task, however, we optimize ROUGE with respect to an entity’s gold-standard definition instead of a gold-standard summary of the document. We collect entity-definition pairs through UMLS and create training examples of the form

  • Entity (the technical term to define)
  • Gold-standard definition for the entity
  • The target sentence IO tags found via distant supervision with ROUGE over the sentences of a research paper, with the gold-standard definition as the reference
  • A Semantic Scholar research paper in which the sentences came from (provides the sentences in which to perform inference)

With this data, we can train the definition extraction model discussed earlier: using these <entity, IO-tagged sentences, publication> examples, we learn a tagger that extracts the sentences most relevant to a term given a publication.
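
Concretely, one training example might be assembled as follows, reusing the greedy tagger sketched above (the field names, helper, and toy strings are our own illustrative choices, not a fixed schema):

```python
def build_example(entity, gold_definition, paper_sentences):
    """Package one <entity, IO-tagged sentences, publication> training example."""
    return {
        "entity": entity,                       # technical term to define
        "definition": gold_definition,          # gold-standard definition from UMLS
        "sentences": paper_sentences,           # sentences of the Semantic Scholar paper
        "tags": greedy_io_tags(paper_sentences, gold_definition),  # distant supervision
    }

example = build_example(
    "Chronic Lymphocytic Leukemia",
    "A chronic leukemia characterized by ...",   # illustrative placeholder definition
    ["CLL is a malignancy of B lymphocytes.", "Samples were processed as described previously."],
)
```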

While this may result in an unnecessarily large number of training data points, we can also consider pruning this dataset later on if an entity is irrelevant to a particular document. For example, a training example that pairs the entity “dental cavity” with a document about blood cancers is probably not worth keeping, because there wouldn’t be much of a correlation between the two. To do this, we can introduce a ROUGE threshold, keeping a training example only if the ROUGE score of the sentences extracted by the tagger is above that threshold. This might be an optimization for the future.
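
Such a filter could look like the sketch below, reusing the rouge1_f helper from earlier; the threshold value is purely hypothetical and would need tuning:

```python
ROUGE_THRESHOLD = 0.1   # hypothetical cutoff, not a tuned value

def keep_example(example):
    """Keep a training example only if its extracted sentences overlap the gold definition enough."""
    extracted = [s for s, tag in zip(example["sentences"], example["tags"]) if tag == 1]
    score = rouge1_f(" ".join(extracted).split(), example["definition"].split())
    return score >= ROUGE_THRESHOLD
```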

Our previous approach was unsupervised, relying only on the Semantic Scholar dataset to produce definitions. Our current approach is an extension of SummaRuNNer, which requires gold-standard definitions for the entities we’d like to define in each paper. We chose to focus on medical terms, and one of the most complete datasets for medical terms and their definitions happens to be UMLS. It includes the Metathesaurus, which contains, amongst many other pieces of data, medical terms and their definitions. These definitions serve as the references for ROUGE in the tagging phase above.

In summary

Training is fairly straightforward: the loss between predicted and target sentence labels is computed with log loss (each sentence in a document is IO-tagged, where sentences labeled I are to be included in the definition). Essentially, the definition extractor, much like SummaRuNNer, is trained as a sentence tagger.
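
A sketch of one training step under these assumptions, reusing the encoder and scorer sketched earlier (the function and optimizer setup are ours, not the exact training loop):

```python
import torch
import torch.nn.functional as F

def training_step(encoder, scorer, optimizer, doc_tensor, io_tags):
    """One gradient step: the extractor is trained as a sentence tagger with log loss."""
    sent_states, doc_rep = encoder(doc_tensor)                 # hierarchical encoding
    probs = scorer(sent_states, doc_rep)                       # P(include) per sentence
    targets = torch.tensor(io_tags, dtype=torch.float)         # 1 = 'I', 0 = 'O'
    loss = F.binary_cross_entropy(probs, targets)              # log loss over IO tags
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```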

Attention as a Stretch Goal

The first part of our basic SummaRuNNer-based model uses a document representation to predict tags for sentences in a document. The current document representation is constructed by averaging the hidden states of the words in each sentence and then averaging the hidden states of each sentence in the document. However, we believe that simply averaging may not be the best way to construct the latent document representation. One of our stretch goals for optimizing the model is to attend to the most important parts of the sentences in each document, using the method proposed in Hierarchical Attention Networks for Document Classification (Yang et al., 2016).

This approach introduces a word-level context vector and a sentence-level context vector, which allow us to calculate attention coefficients on the fly for every word in each sentence and every sentence in the document. In this manner, we take a weighted sum of the hidden states rather than a plain average, which will hopefully produce better document representations overall. The word-level and sentence-level context vectors can be initialized randomly and learned throughout training.
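
A minimal sketch of such an attention layer, which could replace the plain average at either the word or sentence level (the class name, projection, and initialization are our own assumptions):

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    """Attention with a learned context vector, in the spirit of Yang et al. (2016):
    scores each hidden state against the context vector and returns a weighted sum."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        self.context = nn.Parameter(torch.randn(hidden_dim) * 0.01)  # learned, randomly initialized

    def forward(self, states):
        # states: (seq_len, hidden_dim) word- or sentence-level hidden states
        u = torch.tanh(self.proj(states))                    # (seq_len, hidden_dim)
        weights = torch.softmax(u @ self.context, dim=0)     # attention coefficients
        return weights @ states                              # weighted sum instead of a plain average
```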

Conclusion

We are very excited to have found a supervised approach to this task per the advice of AI2 researchers. It’s a straightforward approach with measurable loss and clearer metrics.

We also hope to have enough time before the capstone is over to introduce attention!
