Summary: Building Language Models for Text with Named Entities (ACL 2018)

Sameer Singh · Published in UCI NLP · Oct 15, 2018

Authors: Md Rizwan Parvez, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang

Language models don’t work well with named entities because the vocabulary is fairly open (it consists of many rare/infrequent words). There may not be enough information in the training data to adequately learn how to generate these words. To take an example from the paper, for modeling recipes, you might have many different variations of cooking oil, such as canola oil, grape oil (really? sounds tasty!), sunflower oil, and olive oil. Since each of them may not appear many times in the training corpus (grape oil, for example!), we probably won’t learn a good model for generating such entities.

The entity types, however, are a fairly small, closed set, so the language model should first generate just the type, and then, given the entity type, generate the tokens of the entity. This is what the paper proposes. In the above example, the model would first generate cooking oil as the entity type (the type model), and then generate the specific entity (the entity composite model). Here’s a figure from the paper, with vegetable and protein as the types, and broccoli and chicken as the entities.

From the original paper (link: https://arxiv.org/abs/1805.04836)
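To make the two-step idea above concrete, here is a tiny toy sketch of mine (not from the paper; the types, entities, and probabilities are all made up, and the paper’s actual models are neural language models conditioned on the full context) of sampling an entity type first and then an entity of that type:

```python
import random

# Toy illustration of the two-step generation described above.
# The type/entity tables and probabilities are invented for illustration.
TYPE_PROBS = {"cooking_oil": 0.6, "vegetable": 0.3, "protein": 0.1}
ENTITY_PROBS = {
    "cooking_oil": {"olive oil": 0.5, "canola oil": 0.3, "grape oil": 0.2},
    "vegetable": {"broccoli": 0.7, "carrot": 0.3},
    "protein": {"chicken": 1.0},
}

def sample_entity():
    # Step 1: the type model picks an entity type.
    entity_type = random.choices(list(TYPE_PROBS), weights=list(TYPE_PROBS.values()))[0]
    # Step 2: the entity composite model picks a surface form of that type.
    candidates = ENTITY_PROBS[entity_type]
    surface = random.choices(list(candidates), weights=list(candidates.values()))[0]
    return entity_type, surface

print(sample_entity())  # e.g. ('cooking_oil', 'olive oil')
```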

Training assumes that the corpus is already typed, so the two models can be trained independently (on nearly identical data, except that entities are replaced by their types for the type model). Generation combines the two probabilities by multiplying them (it’s slightly more involved, since you also have to account for the probability that the next word is not an entity).
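As a rough sketch of that combination (again with invented numbers, and glossing over multi-token entities and the separate contexts each model sees): an entity token is scored as the type model’s probability of its type times the entity composite model’s probability of the token given that type, while a plain word is scored by the type model directly.

```python
# Rough sketch of combining the two models at scoring time.
# All numbers are invented; the paper's formulation also handles
# multi-token entities and conditions each model on its own context.
TYPE_MODEL = {"cooking_oil": 0.6, "heat": 0.3, "the": 0.1}  # P(type or plain word | context)
ENTITY_MODEL = {"cooking_oil": {"olive oil": 0.5, "canola oil": 0.3, "grape oil": 0.2}}

def word_probability(token, entity_type=None):
    if entity_type is None:
        # Plain (non-entity) word: the type model scores it directly,
        # which is where "the next word is not an entity" comes in.
        return TYPE_MODEL[token]
    # Entity token: P(type | context) * P(token | type, context).
    return TYPE_MODEL[entity_type] * ENTITY_MODEL[entity_type][token]

print(word_probability("heat"))                      # 0.3
print(word_probability("olive oil", "cooking_oil"))  # 0.6 * 0.5 = 0.3
```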

The paper evaluates on two benchmarks. The first is a corpus of recipes in which only the ingredients are typed (into 8 super-ingredients), manually, using a lexicon-based approach. The second is a corpus of open-source Android projects where the types are annotated automatically from the syntax tree. The source code had to be preprocessed to make language modeling feasible: only method scopes are modeled, string variables and global information are cleaned up, and it is changes to existing methods that are modeled. The proposed method obtains better perplexity than competitive language models (even when those are given type information), and the authors also include interesting case studies, such as a fill-in-the-blank evaluation on the recipes.

Overall, this is a simple yet elegant approach to modeling named entities, and an increasingly important one as language modeling becomes central to NLP. There are some concerns here about the need to provide the types (apart from the annotation effort, what is the right set of types? can they be latent?), about multi-token entities (shouldn’t the composite model decide how many tokens to generate?), and about other design choices (why restrict ourselves to named entities? should the two models really be trained independently?). A recent EMNLP 2018 paper by Harvard NLP addresses many of these concerns for the data-to-text setting, and extending that work to pure language modeling would be quite exciting!
