In natural language discourse, speakers and writers often rely on implicit, “common sense” inference to signal the kind of contribution they are making to the conversation, as well as the key relationships that justify their point of view. The early AI literature is full of case studies suggesting that this inference is complex, open-ended and knowledge-heavy (e.g., Charniak 1973, Schank and Abelson 1977, Hobbs 1979). However, recent work on discourse coherence offers a different approach. Take the following example from Pitler et al. (2008):
Alice thought the story was predictable. She found it boring.
This discourse shows the classic pattern of implicit information. The overall point is that Alice had a negative opinion of the story; the underlying explanation is that the story was not interesting because it had no surprises. Given available lexical resources and sentiment detection methods, however, we can capture such inferences systematically by recognizing that they follow common general patterns, known as “coherence relations”, and are guided by shallow cues. This has led to an active and successful line of research that predicts which coherence relations are present in a text from shallow information. These relations are then used to draw inferences from the text. The value of such shallow approaches has been demonstrated in a variety of applications. For example, knowing whether a statement reiterates or contradicts earlier discourse has improved the quality of text summarization models and sentiment analysis classifiers (Wilson et al., 2005; Marcu, 2000). In this blog post, I give an overview of current techniques, emphasizing the value of linguistic resources and of theoretically motivated representations of coherence in machine learning methods.
The standard methodology for recognizing discourse relations is supervised learning (Feng et al. 2014, Lin et al. 2014). We start from a large labeled dataset, such as the Penn Discourse Treebank (PDTB) (Prasad et al. 2008), and learn to predict coherence relations from features present in the data. For instance, Pitler et al. (2008) introduced a set of linguistically informed features that can successfully mark implicit relations using a simple binary classifier. One of these features is polarity tags: the sentiment of each of the two text spans helps indicate whether they stand in a comparison or an expansion relation.
The Alice example above illustrates this. Both “predictable” and “boring” belong to the same sentiment class (according to the Multi-Perspective Question Answering (MPQA) Opinion Corpus). This means that the second statement is much more likely to reiterate and expand on the first statement than to contrast with it.
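To make the polarity-tag feature concrete, here is a minimal sketch of how it might be computed for a pair of text spans. The tiny lexicon, the majority-vote rule and the function names are my own illustrative assumptions, not the feature set or resources used by Pitler et al.

```python
# Toy polarity lexicon. These entries are illustrative assumptions;
# a real system would look words up in a sentiment resource.
POLARITY = {
    "predictable": "negative",
    "boring": "negative",
    "dull": "negative",
    "exciting": "positive",
    "original": "positive",
}


def span_polarity(span):
    """Tag a span as positive, negative or neutral by majority vote."""
    tokens = [t.strip(".,!?").lower() for t in span.split()]
    pos = sum(POLARITY.get(t) == "positive" for t in tokens)
    neg = sum(POLARITY.get(t) == "negative" for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def polarity_pair_feature(arg1, arg2):
    """Build the polarity features for two adjacent text spans."""
    p1, p2 = span_polarity(arg1), span_polarity(arg2)
    return {
        "arg1_polarity": p1,
        "arg2_polarity": p2,
        # Matching non-neutral polarity hints at expansion, not comparison.
        "same_polarity": p1 == p2 and p1 != "neutral",
    }


features = polarity_pair_feature(
    "Alice thought the story was predictable.",
    "She found it boring.",
)
print(features)
```

A classifier would consume these values alongside other features; on their own they are only a weak hint about the relation.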
Another useful feature is whether the verbs in the two statements belong to the same Levin verb class. If they do, the relation between the two statements is likely to be an expansion relation.
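The verb-class feature can be sketched in the same spirit. The table below is a toy stand-in for a verb lexicon: its class names and members are illustrative assumptions, and the verbs are assumed to arrive already lemmatized, since finding the main verb of a span would need a parser.

```python
# Toy verb-class table (an illustrative assumption, not a real lexicon).
# A verb maps to a set because verbs can belong to more than one class.
VERB_CLASSES = {
    "break": {"break_verbs"},
    "shatter": {"break_verbs"},
    "crack": {"break_verbs"},
    "run": {"run_verbs"},
    "walk": {"run_verbs"},
    "jog": {"run_verbs"},
}


def same_verb_class(verb1, verb2):
    """Return True if the two lemmatized verbs share at least one class."""
    classes1 = VERB_CLASSES.get(verb1.lower(), set())
    classes2 = VERB_CLASSES.get(verb2.lower(), set())
    return bool(classes1 & classes2)


print(same_verb_class("shatter", "crack"))
print(same_verb_class("shatter", "walk"))
```

Verbs missing from the table share no class with anything, so the feature simply stays off for out-of-vocabulary verbs.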
Following the two consecutive CoNLL Shared Tasks in 2015 and 2016, a large number of deep neural network architectures have been developed for marking implicit discourse relations in many languages. These methods range from embedding-based neural networks that use dense embeddings (Kim et al., 2016; Zhang et al., 2015b; Ling et al., 2015) to convolutional neural nets that work with manually specified indicator features (Zhou et al., 2010; Park and Cardie, 2012; Biran and McKeown, 2013; Rutherford and Xue, 2014), as well as more recent adversarial models (Qin et al., 2017).
Interestingly, some of the linguistically informed features still perform very well when compared to neural network models. This can be seen in the graph below for two different types of relations: expansion, as explained above, and contingency, which signals that one statement is causally influenced by another. The red and the gray columns show the results reported by Pitler et al. (2008) and Qin et al. (2017) respectively.
However, there are three primary drawbacks to supervised approaches:
- Getting annotated data is hard and is human-resource intensive.
- There is often a class-imbalance problem, where the data is not evenly distributed among the classes, which can make learning harder. (Solutions for this can be found in Li and Nenkova, 2014.)
- The data is often domain-specific and is not transferable.
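The class-imbalance drawback can be made concrete with one generic remedy: weighting each class inversely to its frequency during training. This sketch is not the method of Li and Nenkova, and the label counts are invented purely for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Invented label distribution, skewed toward one relation.
labels = ["Expansion"] * 6 + ["Contingency"] * 3 + ["Comparison"] * 1
weights = inverse_frequency_weights(labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.2f}")
```

Rare relations receive larger weights, so a misclassified rare example costs the learner more than a misclassified frequent one.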
To get around these issues, Narasimhan and Barzilay (2015) introduced an unsupervised learning approach for marking discourse relations. Their idea is to learn coherence relations from the role that they play in licensing textual inferences. Given a question-answering dataset describing machine comprehension problems, they treat the coherence relation between segments as a latent variable: an unobserved attribute of the text interpretation that is predictable from combinations of surface features.
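The latent-variable idea can be illustrated with a small sketch: since the relation is never observed, each candidate answer is scored by summing over all possible relations. This is a simplified illustration of marginalizing over a latent relation, not the model from the paper; the relation names, probabilities and function name are all invented for the example.

```python
def answer_score(relation_probs, answer_given_relation):
    """Marginalize over the latent relation r:
    P(answer) = sum over r of P(r | text) * P(answer | r, text).
    """
    return sum(relation_probs[r] * answer_given_relation[r]
               for r in relation_probs)


# Invented distribution over the unobserved relation between two segments.
relation_probs = {"Causal": 0.6, "Temporal": 0.3, "Other": 0.1}

# Invented likelihood of each candidate answer under each relation.
candidates = {
    "A": {"Causal": 0.8, "Temporal": 0.2, "Other": 0.3},
    "B": {"Causal": 0.1, "Temporal": 0.7, "Other": 0.4},
}

scores = {name: answer_score(relation_probs, probs)
          for name, probs in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In a learned model both sets of probabilities would come from surface features, and only answer correctness would supply a training signal.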
More flexible learning techniques can also be applied. Jansen et al. (2014) introduced a model for non-factoid answer re-ranking that considers both a shallow representation centered on discourse markers and a deep representation based on Rhetorical Structure Theory. Discourse markers, such as “however” and “meanwhile”, are used as shallow cues, and a discourse parser explores deeper, less obvious relations.
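As a rough illustration of the shallow side of such a model, the sketch below spots discourse markers in a text and attaches a coarse relation label to each. The marker list and the marker-to-relation mapping are simplifying assumptions of mine, not the representation used by Jansen et al.

```python
import re

# Illustrative marker table. The coarse labels are an assumption;
# real markers are often ambiguous between several relations.
MARKERS = {
    "however": "Comparison",
    "but": "Comparison",
    "meanwhile": "Temporal",
    "because": "Contingency",
}


def find_markers(text):
    """Return (marker, relation) pairs in order of appearance."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [(tok, MARKERS[tok]) for tok in tokens if tok in MARKERS]


print(find_markers("The plot was thin. However, the acting was strong."))
```

Texts with no explicit marker return an empty list, which is exactly where a deeper discourse parser has to take over.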
Most of the code and data of the works that are mentioned in this post are available on GitHub. For instance, a useful wrapper for analysing PDTB can be found here.
Now that we have powerful computational tools such as neural networks and great linguistic resources such as verb lexicons that can help in inferring implicit relations, it is time to put all of these together to further improve automatic inference from text and conversation. More of the “common sense” understanding can be captured by better surveying how inferences are expressed in real texts, with resources like the PDTB, on the one hand, and, on the other, by drawing on new knowledge sources, such as neural models of semantic similarity, that can suggest plausible connections between descriptions on a larger scale.
Charniak 1973. Jack and Janet in search of a theory of knowledge. IJCAI.
V. W. Feng and G. Hirst. A linear-time bottom-up discourse parser with constraints and post-editing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, pages 511–521, 2014.
Hobbs 1979. Coherence and Coreference. Cognitive Science.
P. Jansen, M. Surdeanu, and P. Clark. Discourse complements lexical semantics for non-factoid answer reranking. In ACL (1), pages 977–986, 2014.
J. J. Li and A. Nenkova. Addressing class imbalance for improved recognition of implicit discourse relations. In Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 142–150, 2014.
Z. Lin, H. T. Ng, and M. Kan. A PDTB-styled end-to-end discourse parser. Natural Language Engineering, pages 151–184, 2014.
D. Marcu. The theory and practice of discourse parsing and summarization. MIT Press, Cambridge, 2000.
K. Narasimhan and R. Barzilay. Machine comprehension with discourse relations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1253–1262, 2015.
Schank and Abelson 1977. Scripts, plans, goals and understanding. Lawrence Erlbaum.
T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354. Association for Computational Linguistics, 2005.
Z. Zhou, Y. Xu, Z. Niu, M. Lan, J. Su, and C. Lim Tan. 2010. Predicting discourse connectives for implicit discourse relation recognition. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 1507–1514, Beijing, China, August.