Advanced Model Update: From Definition Extraction to Entity Discovery

Tam Dang
NLP Capstone Blog
May 23, 2018

Over the past few weeks, developing a dataset to test our model and flesh out this novel task has proven to be a challenge in itself. Here, we discuss what worked, what didn't, how the development of our dataset has influenced our perspective on the task, and ultimately what we now expect out of our advanced model.

To recap, we began making a preliminary version of our dataset using ROUGE, cosine similarity, and skip-bigrams. In particular, given a definition-document pair, we aimed to extract the sentences from the document that were most conducive to describing and re-creating the definition. From there, our model can compute latent representations of the term, the sentences, and the document as a whole in order to learn how to extract the sentences we've chosen.

The Heuristics that Failed

Unfortunately, we can’t all be winners.

Cosine Similarity

We used spaCy's implementation of cosine similarity over context vectors. Admittedly, cosine similarity does a great job of ruling out sentences that have nothing to do with the current definition during extraction. However, there was far too little variation in the scores that sentence-definition pairs received. They often ranged from 0.80 to 0.94, and tended to cluster around 0.83–0.87 and 0.90–0.93. Not only do many sentence-definition pairs score highly, but the line between what we should keep and what we shouldn't becomes very fine.
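For reference, here is a minimal sketch of how this scoring looks with spaCy, assuming a model that ships with word vectors such as en_core_web_md (the example definition and sentences are invented):

```python
import spacy

# Requires a model with word vectors, e.g. `python -m spacy download en_core_web_md`.
nlp = spacy.load("en_core_web_md")

def cosine_scores(definition, sentences):
    """Score each candidate sentence against the gold definition."""
    definition_doc = nlp(definition)
    # Doc.similarity() is cosine similarity over averaged word vectors.
    return [(sent, nlp(sent).similarity(definition_doc)) for sent in sentences]

definition = "A chronic disease characterized by high levels of glucose in the blood."
candidates = [
    "Diabetes mellitus is marked by persistently elevated blood glucose.",
    "The study enrolled 412 participants across three hospitals.",
]
for sent, score in cosine_scores(definition, candidates):
    print(f"{score:.3f}  {sent}")
```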

We attempted to be extremely strict and only keep pairs that scored at least 0.94, but this often ruled out many great pairs. We speculate that medical language in general tends to cluster together relative to the rest of the vocabulary on which spaCy's word vectors were trained. We also speculate that, because this is a bag-of-words notion of similarity, much of the richness in context and word order that makes differences and similarities obvious at a glance is washed away. Although the method can clearly separate contrived sentence-definition pairs, in the landscape of our data it fails to draw the line the way we'd like it to.

Skip-bigrams

This heuristic was troublesome in that many UMLS definitions are 1–2 sentences long, while others run several paragraphs. Using the raw number of overlapping skip-bigrams between a sentence-definition pair therefore severely punishes shorter glosses, and a long definition with little relevance to a sentence may still match it.
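To make the length bias concrete, here is a toy sketch of the raw overlap count (the sentence and glosses are invented for illustration); because the score is an unnormalized count, a long, mostly tangential gloss can outscore a short, on-point one simply by offering more skip-bigrams:

```python
from itertools import combinations

def skip_bigrams(tokens):
    """All ordered token pairs (i < j), i.e. skip-bigrams with an unlimited gap."""
    return set(combinations(tokens, 2))

def overlap(sentence, gloss):
    """Raw count of skip-bigrams shared by a document sentence and a gloss."""
    return len(skip_bigrams(sentence.lower().split())
               & skip_bigrams(gloss.lower().split()))

sentence = "the results of the study showed that the drug lowers blood pressure in patients"
short_relevant = "a drug that lowers blood pressure"
long_tangential = ("the results of the first phase of the study were reported to the "
                   "board that oversees the enrollment of patients in the trial and "
                   "the handling of the data in the registry")

# The long gloss wins on raw counts despite saying nothing about the drug itself.
print(overlap(sentence, short_relevant))   # smaller count
print(overlap(sentence, long_tangential))  # larger count
```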

The Heuristics that Worked

After several attempts at tuning the above heuristics, we decided to look for more. The following are the heuristics with which our final dataset will be constructed.

Google’s Top 10,000 Words

There's currently a repository containing the top 1,000 and top 10,000 words according to an n-gram frequency analysis of Google's Trillion Word Corpus. In particular, we are using the no-swears list.

Given the roughly 800,000 glosses that UMLS provides, we shave this down to roughly 165,000 by removing every definition that has a synonym also contained in the Google no-swears top-10k list. This drastically reduces our search space when creating examples, and we are ultimately okay with it since common words are trivial to define.
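A rough sketch of this filter, assuming the glosses are keyed by concept and stored with their synonym lists, and that the no-swears list has been downloaded locally (the filename below is an assumption):

```python
# Load the top-10k common-word list (one word per line); filename is an assumption.
with open("google-10000-english-no-swears.txt") as f:
    common_words = {line.strip().lower() for line in f}

def is_common(synonyms):
    """True if any synonym of the concept appears in the top-10k list."""
    return any(syn.lower() in common_words for syn in synonyms)

# glosses: {concept_id: {"synonyms": [...], "definition": "..."}}
def filter_common_concepts(glosses):
    """Drop every gloss whose concept has a synonym in the common-word list."""
    return {cid: entry for cid, entry in glosses.items()
            if not is_common(entry["synonyms"])}
```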

First 15%

Given a definition-document pair, we only attempt extraction if at least one of the aliases (synonyms) that the definition defines occurs within the first 15% of the document's sentences.
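A sketch of this check, assuming the document has already been split into sentences and aliases are matched with a simple case-insensitive substring test (the real matching could be stricter):

```python
def passes_first_15_percent(sentences, aliases, fraction=0.15):
    """Keep the pair only if some alias appears in the first 15% of sentences."""
    cutoff = max(1, int(len(sentences) * fraction))
    head = [sent.lower() for sent in sentences[:cutoff]]
    return any(alias.lower() in sent for sent in head for alias in aliases)
```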

Word Embeddings

For the definition-document pairs that make it through the first two filters, we then extract all sentences in the document that contain any of the term's aliases.

We then calculate a similarity score between each sentence and the gold-standard definition of the term. We do so by using pre-trained word vectors, namely GloVe vectors, to better represent sentences. Each sentence is represented as the average of its word vectors, and similarity is defined as the Euclidean distance between the gold-standard vector and the sentence vector. We sort the candidates by this distance and choose the five closest sentences, if there are that many. We believe this gives us better representations of sentences than the heuristics we tried previously. Although we did consider training our own set of word vectors (the large size of the Semantic Scholar corpus would allow us to), we felt that, given the time constraints, GloVe vectors were sufficient for now.
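A sketch of this ranking step, assuming the GloVe vectors have already been loaded into a plain dict from word to NumPy array (e.g. from the glove.6B files) and that the candidates are the alias-containing sentences extracted above; out-of-vocabulary words are simply skipped:

```python
import numpy as np

def sentence_vector(tokens, glove, dim=300):
    """Average of the GloVe vectors of a sentence's in-vocabulary tokens."""
    vecs = [glove[tok] for tok in tokens if tok in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def top_k_sentences(candidates, definition, glove, k=5):
    """Return up to k candidate sentences closest, by Euclidean distance,
    to the gold-standard definition vector."""
    gold = sentence_vector(definition.lower().split(), glove)
    return sorted(
        candidates,
        key=lambda s: np.linalg.norm(sentence_vector(s.lower().split(), glove) - gold),
    )[:k]
```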

We then filter the document down to only the mentions of the entity we are trying to extract (or its aliases). This approach is made possible by UMLS pairing each definition with all of the aliases it defines.

After choosing the sentences for each term-document pair, we then incorporate aliases when creating the final training examples. Recall that each training example includes a term, a gold-standard definition, sentences within the document, and the target vector. To encourage the model to associate synonyms with each other, we can swap the term and its aliases in and out of the target sentences, randomly inserting an alias or the term in places where another alias or the term appears. This not only gives us a way to produce more training examples, it also helps the model understand the contexts of similar words, which might help it discover entities.
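One way this augmentation might look in code, as a simplification (real mentions would need proper tokenization and case handling, and the term and aliases below are illustrative):

```python
import random

def swap_aliases(sentence, term, aliases, rng=random):
    """Randomly replace the term or any of its aliases with another name
    for the same concept, so the model sees synonyms in interchangeable contexts."""
    names = [term] + list(aliases)
    out = sentence
    for name in names:
        if name in out:
            out = out.replace(name, rng.choice(names))
    return out

# Each call may yield a different surface form of the same sentence.
print(swap_aliases("Aspirin inhibits platelet aggregation.", "Aspirin",
                   ["acetylsalicylic acid", "ASA"]))
```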

Reframing the Problem to Entity Discovery

Originally, our task was to generate definitions of entities consistent with our corpus. We then reframed the task as an extractive process.

The dataset described above, however, will allow us to solve a task that could be described as a 'superclass' of definition extraction: one that aims to extract all sentences relevant to an entity rather than only the sentences that help define it. We call it 'Entity Discovery', a term coined by AI2 when they originally proposed this type of task during the early stages of the capstone. Given our ranking scheme, we will still tend toward selecting sentences conducive to definitions, but we're not confident that every sentence our heuristics choose will resemble a definition or add to one.

Rather, we now see the potential of our model (which, given this dataset, does not have to change at all!). This new dataset will allow us to train a model to learn latent representations of queries and map them to latent representations of sentences. Note that chemical and medical terms often have numerous aliases that take different but systematic forms. A model reasonably trained on such data should theoretically generalize to novel terms, learning what their synonyms look like and the contexts in which they appear, even for terms that have not yet been added to knowledge bases. This could aid researchers and medical students in learning about ill-defined terms whose synonyms and references have not yet been fully documented.

We call this a 'superclass' of definition extraction because, if we were successful at extracting all sentences pertinent to an entity, definition extraction would simply take a subset of those sentences.

In Conclusion

It has taken some time and experimentation to find our footing in creating this dataset, but we’ve found it. The scripts are running, the data looks reasonable, and we are excited to finally see how our model will perform.
