Summary: Commonsense for Generative Multi-Hop Question Answering Tasks (EMNLP 2018)

Anthony Chen
Published in UCI NLP · Nov 26, 2018

Authors: Lisa Bauer, Yicheng Wang, Mohit Bansal

Question answering (QA) has become a popular research topic in the NLP community. As datasets grow more challenging, multi-hop reasoning and the incorporation of external knowledge sources are increasingly important.

This paper addresses both: it proposes a multi-hop reasoning model that can incorporate facts drawn from a knowledge base.

Model

Fig 1. Proposed model

The proposed model is displayed in Figure 1. At a high level, the query and context are embedded and passed through k reasoning cells. You can think of each pass through a reasoning cell as a “hop” of reasoning. The output of the reasoning layer is fed into a self-attention layer and then into a generative decoder which generates an answer. The novelty of this paper lies in the construction of the reasoning cells.

Fig 2. Baseline Reasoning Cell

As a baseline, the authors propose the Baseline Reasoning Cell, which takes the context and query and feeds them into a BiDAF model. The output is an updated representation of the context, which can be fed into the next reasoning cell. I found this to be a nice, simple way of turning BiDAF, a single-hop reasoning model, into a multi-hop one.

Fig 3. NOIC: Necessary and Optional Information Cell

To incorporate external facts, the authors augment the Baseline Reasoning Cell with an attention mechanism that attends over commonsense facts retrieved from ConceptNet. They call this new reasoning cell NOIC (Figure 3). The attention mechanism scores each commonsense fact against the context representation and uses the scores to fold the facts back into the context representation. See the paper for details on the fact retrieval system.

Results

Fig 4. Results on NarrativeQA. MHPGM is the proposed model

The authors evaluate on NarrativeQA, a difficult generative QA dataset, and achieve state-of-the-art results. They also provide an enlightening ablation study in Figure 5.

Fig 5. Ablation study on the validation set of NarrativeQA

The first row is their model with the Baseline Reasoning Cell. From there, we can see that using only a single hop of reasoning (k=1) or dropping contextualized embeddings (-ELMo) severely hurts performance, while adding commonsense information (+NOIC) gives a nice performance bump.

Conclusion

With the large number of QA models published every year, strong baselines are essential. In the single-hop setting, BiDAF has served as a strong baseline and as a component in more complex models. This paper offers a strong model that can serve as a baseline for two important research directions: multi-hop reasoning and the incorporation of facts from knowledge bases.
