An Annotated Reading List of Conversational AI

Haixun Wang
Published in AI Graduate, Apr 6, 2018
Think more. Speak less.

A lot has been written about conversational AI, and a majority of it focuses on vertical chatbots, messenger platforms, business trends, and startup opportunities.

This annotated reading list, on the other hand, consists mostly of academic papers, so of course a majority of them are about using deep learning and other advanced machine learning techniques for conversation.

AI’s capability of understanding natural language is still limited (see my previous post). As a result, creating fully-automated, open-domain conversational assistants has remained an open challenge. It should not be surprising that most of the techniques proposed in the listed papers are not mature enough for commercial use.

Nevertheless, these works could serve as great starting points for people who are seeking the next breakthrough in conversational AI. At the same time, we want to remain alert, and never cease questioning whether existing efforts and directions will eventually give AI the real capability of conversation.

Challenges and opportunities

Conversation assistants are not new. The field of spoken dialogue systems (SDS) has a long history and is featured in scientific conferences such as SIGDial and Interspeech. A spoken dialogue system typically uses a manually created tree-like structure to model turn-by-turn conversations. The problem is that such structures require a lot of handcrafting, and as a result, generalizing across different domains is difficult.

Current commercial dialogue agents (e.g., those used in call centers) are often brittle pattern-matching systems. So are some of the best chatbots created in the recent fervor. Below is a real life example.

Neural network models have the potential to revolutionize conversational AI: Researchers are hopeful that distributed representations of natural language and the availability of big dialogue corpora will give rise to end-to-end statistical conversation models, lifting chatbots beyond rigid pattern matching.

While some neural conversation agents are capable of carrying out social chit-chat to a certain degree, big challenges remain to build end-to-end task-oriented conversation systems. For example, a task-oriented system usually needs to interact with external databases or knowledge bases. How to incorporate and reason over external data while still preserving the end-to-end trainability of a conversation system is an open problem. Furthermore, the need to incorporate external data also makes it difficult to build cross-domain conversational systems.

General topics

In 2017, Amazon challenged 15 academic teams to build “a socialbot that can converse coherently and engagingly with humans on popular topics for 20 minutes.” How good are the socialbots in the competition? You may give it a try by saying “Alexa, let’s chat” to any Alexa-powered device. This is a good place to start understanding the state-of-the-art of conversational AI, as many socialbots in the competition employ both rule-based methods and machine learning approaches to engage with humans in conversation.

Several papers came out of the Alexa Prize experience [1, 2, 3]. For example, the team from McGill University introduced a deep reinforcement learning chatbot [3]. It consists of an ensemble of models, including NLG, retrieval models, template-based models, sequence-to-sequence neural network models, etc. It uses reinforcement learning to select an appropriate response from the models in the ensemble.

[1] Conversational AI: The Science Behind the Alexa Prize (2018)

[2] An Ensemble Model with Ranking for Social Dialogue (2017)

[3] A Deep Reinforcement Learning Chatbot (2017)

Another interesting and widely deployed socialbot is Microsoft’s XiaoIce. The following paper [4] may serve as a survey of social chatbot systems, as the architecture and components it describes are quite general. Also, during a conversation, XiaoIce may analyze human voices to predict what a person will say next, when to pause, and when it’s appropriate to interrupt the person. All these lead to more naturalistic conversations.

[4] From Eliza to XiaoIce: Challenges and Opportunities with Social Chatbots (2018)

On the other hand, we are not aware of any formal overview or survey of new developments in the field of conversational AI. One may check out some general introductory materials, including a Stanford lecture [5], as well as some recently formed communities, such as the NIPS 2017 workshop [6] on conversational AI.

[5] Stanford Lecture (CS124): Conversational agents

[6] NIPS Workshop on Conversational AI (2017)

Background and previous systems

There has been a lot of previous work on conversational systems, ranging from specific solutions such as an early Prolog-based conversational agent [7] that took context into consideration, to general discussions of NLP components required for conversational agents [8], as well as introductions to the field of spoken dialogue systems [9].

[7] Managing Context in a Conversational Agent (2002)

[8] Conversational agents (in Handbook of Internet Computing) (2004)

[9] Spoken dialogue systems (in Error Handling in Spoken Dialogue Systems) (2007)

A conversation is a special type of discourse, and hence work on linguistic analysis of discourse also applies to conversations. Here is an example. Grosz et al [10] described three separate but interrelated components of discourse: a structure of the sequence of utterances (called the linguistic structure), a structure of purposes (called the intentional structure), and the state of focus of attention (called the attentional state). These apply directly to conversations.

[10] Attention, intentions, and the structure of discourse (1986)

The best-known earlier conversational systems include Eliza (MIT, 1966), Parry (the first system that passed the Turing test, 1975), and Alice (created in AIML, the Artificial Intelligence Markup Language, 2009). They have been described in many papers.

As AI is fundamentally not ready to support fully-automated and open-domain conversations, one solution is to put humans in the loop. Chorus [11] is a conversational agent powered by a crowd of human actors, which enables it to work across multiple domains. Evorus [12] gets multiple responses from crowd workers and chatbots, and uses a voting mechanism to decide which responses to send to the end-user. Evorus also learns from the crowd feedback to automate itself over time for tasks such as (i) chatbot selection, (ii) reusing of old responses, and (iii) automatic voting.

[11] Chorus: A Crowd-powered Conversational Assistant (2013)

[12] Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time (2018)

Retrieval-based methods

If we have a large repository of conversations (e.g., open-domain conversations on a social network [13] or vertical-domain conversations between customers and human agents or crowd workers [12]), then for any user message, we may select a most appropriate response (originally made to other messages) from the repository, and send it as a reply to the user.

Specifically, Ji et al [13] formalizes short text conversation as a search problem and employs state-of-the-art IR techniques to carry out the task.

[13] An information retrieval approach to short text conversation (2014)
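
The retrieval idea can be illustrated with a minimal sketch. It assumes a tiny hypothetical repository of (message, response) pairs and plain TF-IDF cosine matching; it is not the ranking pipeline of [13], and the names `REPOSITORY` and `select_response` are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy repository of (message, response) pairs.
REPOSITORY = [
    ("what time does the store open", "We open at 9 am every day."),
    ("how do i reset my password", "Click 'Forgot password' on the login page."),
    ("where is my order", "You can track your order from the Orders page."),
]

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def build_idf(messages):
    """Smoothed inverse document frequency over the stored messages."""
    n = len(messages)
    df = Counter()
    for message in messages:
        df.update(set(tokenize(message)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf_vector(text, idf):
    counts = Counter(tokenize(text))
    # Terms that never occur in the repository are ignored.
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_response(user_message, repository=REPOSITORY):
    """Return the stored response whose original message best matches the input."""
    idf = build_idf([message for message, _ in repository])
    query = tfidf_vector(user_message, idf)
    scored = [(cosine(query, tfidf_vector(m, idf)), r) for m, r in repository]
    return max(scored, key=lambda pair: pair[0])[1]

print(select_response("I forgot my password, how can I reset it"))
```

A real system would index millions of pairs and rank candidates with learned features; the sketch only shows the "reuse an old response" step.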

To improve the quality of the response, Wu et al [14] learns topics (through LDA) for messages and responses, and uses a topic-aware convolutional neural network to select a response.

[14] Response Selection with Topic Clues for Retrieval-based Chatbots (2016)

A follow-up work [15] takes more context (in a multi-turn conversation) into consideration when selecting a response. The challenge is to maintain the context for an on-going conversation.

[15] Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots (2017)

Generation-based methods

Unlike retrieval-based methods that reuse old messages to respond to new messages, generation-based methods construct new responses for new messages.

One of the early generation-based works uses phrase-based statistical machine translation (SMT) techniques to construct responses [16]. It “translates” a status post of a Twitter user into a plausible response. However, translating a message and responding to a message are quite different. A response is not a semantically equivalent paraphrase of a message, and a message can have a great variety of responses. Also, phrase-based machine translation is about word alignment, but word alignment is quite irrelevant in question answering or conversation.

[16] Data-Driven Response Generation in Social Media (2011)

A majority of generation-based methods use the deep learning technique known as sequence-to-sequence translation [17]. A Long Short-Term Memory (LSTM) model is used to encode an input sequence to a vector, and then another LSTM model is used to decode the vector to a target sequence. The decoding is often quite brittle, as errors may accumulate over time. Ranzato et al [18] addresses this issue by proposing a sequence level training algorithm that directly optimizes the metric used at test time (such as BLEU or ROUGE). This enhanced version of sequence-to-sequence translation is adopted by a few deep reinforcement learning methods we will discuss later on.

[17] Sequence to Sequence Learning with Neural Networks (2014)

[18] Sequence level training with recurrent neural networks (2016)

Quite a few LSTM-based conversation models have been proposed [19, 20]. The strength of LSTM and its variants lies in their simplicity and generality. For example, the input sequence could be a concatenation of what has been conversed so far, which means the model naturally takes context into consideration in generating a response.

[19] A Neural Conversation Model (2015)

[20] Neural responding machine for short-text conversation (2015)

But sequence-to-sequence models are insufficient to model dialogues and conversations. As the authors of the neural conversation model [19] stated, “the objective function being optimized does not capture the actual objective achieved through human communication, which is typically longer term and based on exchange of information rather than next step prediction. The lack of a model to ensure consistency and general world knowledge is another obvious limitation of a purely unsupervised model.”

Still, if we are able to narrow the problem domain then it is possible to find use cases for the sequence-to-sequence conversational model. In 2016, Google introduced an end-to-end method for automatically generating short email responses [21, 22]. Here, the conversation is not multi-turn and the responses are short (e.g., “Yes, I can”, “I will be there”, etc.), so to a certain extent, the problem has been reduced to that of classifying a message.

[21] Smart Reply: Automated Response Suggestion for Email (2016)

[22] Smart Reply and Implicit Semantics (2017)

Fine tuning response generation

Sequence-to-sequence models over-simplify the mechanism of human conversations. Currently, we don’t have techniques that can model natural languages, so of course we are not able to effectively model conversations. Given this, how do we improve the generative methods? We notice that although the generated responses are syntactically well-formed, they are often off-context, uninformative, or vague. We may statistically improve these aspects by introducing measures of context, diversity, persona, emotion, etc., into the response generation process. However, such improvements do not necessarily lead to meaningful conversations.

Context.

Context can be loosely defined as a sequence of past conversation exchanges of any length. What matters is how conversation systems represent contexts internally. For example, in a rule-based system [9], a context is represented explicitly as a set of rules that handle a particular situation. A matched rule may switch the current context to a new context, and the course of a conversation is a traversal of a graph in which each node is a context.

Explicit, rule-based context representation does not scale well to open-domain conversations. Here is where neural networks come in. Neural networks use continuous representations (embeddings) for input, output, and context, enabling them to generalize to any input. A straightforward approach is to train a recurrent neural network language model (RLM) to generate responses word by word. Sordoni et al [23] extends a traditional RLM by conditioning it on an (input, context) pair. The paper proposes three different approaches. First, it concatenates every (input, output, context) triple into a single sentence, and trains the RLM on such sentences. To generate a response, it performs forward propagation for (input, context) to obtain a hidden state, and computes the likelihood of the response from that hidden state. Second, it uses a feed-forward network to map (input, context) from a bag of words to a fixed-size vector, and uses the vector to update the recurrent state of the RLM. The third approach is similar to the second, except that it represents input and context as two separate bags of words in the feed-forward network.

[23] A Neural Network Approach to Context-Sensitive Generation of Conversational Responses (2015)

There are several other works worth mentioning. Dusek et al [24] proposes a context-aware response generator in a task-oriented setting where input and context come with semantic frames and slots annotations. The Latent Variable Hierarchical Encoder-Decoder Model [25] includes a latent variable at the decoder, which is trained by maximizing a variational lower-bound on the log-likelihood. The goal is to model conversation in a two-step generation process—first sampling the latent variable, and then generating the output—while maintaining long-term context.

[24] A Context-aware Natural Language Generator for Dialogue Systems (2016)

[25] A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues (2015)

Speaker role & Persona

The concept of speaker role is best explained by the following example [27]. A user reports a problem. The vanilla model produces a generic, uninformative response, while the speaker role (technical support) model endeavors to respond in a problem solving mode.

User: I am getting a loop back to login page.
Vanilla LSTM model: Ah, ok. Thanks for the info.
Speaker-role enhanced model: I’m sorry to hear that. Have you tried clearing your cache and cookies?

Luan et al [26] differentiates between a questioner role and an answerer role in the training of an RLM. In addition, LDA modeling is used to derive a topic representation of the context, which is then used to update the recurrent state of the RLM. In their follow-up work [27], the authors jointly train two models that share the decoder parameters: a seq2seq conversational model trained on general conversation data and an auto-encoder trained on personal data from target speakers.

Multi-task learning: the decoders share parameters [27].

[26] LSTM based Conversation Models (2016)

[27] Multi-Task Learning for Speaker-Role Adaptation in Neural Conversation Models (2017)

Instead of modeling a class of speakers (e.g., personnel of IT support roles), Li et al [28] focuses on the coherent personality of an individual speaker. The goal is to avoid the situation illustrated by the following example. Clearly, the lack of a coherent personality makes it difficult for sequence-to-sequence based systems to pass the Turing test. The work encodes persona information into a speaker embedding and allows conversation data of similar users on social media to be shared for model training.

message: Where do you live now?
response: I live in Los Angeles.
message: In which city do you live now?
response: I live in Madrid.

[28] A Persona-Based Neural Conversation Model (2016)

Diversity.

Another important quality measure of response is diversity. For example, it makes sense to avoid meaningless but universally relevant replies (e.g., “I don’t know.” or “I’m OK”). To address this problem, for any message, Mou et al [29] predicts a noun that is most relevant to the message, and generates a response containing the noun.

PMI predicts the word “Osaka”; the backward seq2seq generates “from am I”, and the forward seq2seq uses the query and “I am from” to generate the rest [29]

[29] Sequence to Backward and Forward Sequences: A Content-Introducing Approach to Generative Short-Text Conversation (2016)
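
The keyword prediction step can be sketched with pointwise mutual information (PMI) computed from co-occurrence counts. The sketch below assumes a tiny hypothetical corpus, a hand-picked set of candidate nouns, and a score that sums PMI over the query words; unseen word pairs are given a PMI of 0 purely to keep the toy example simple. The backward and forward seq2seq generators are not shown.

```python
import math
from collections import Counter

# Hypothetical toy corpus of tokenized (query, reply) pairs.
PAIRS = [
    (["where", "are", "you", "from"], ["i", "am", "from", "osaka"]),
    (["where", "do", "you", "live"], ["i", "live", "in", "osaka"]),
    (["what", "do", "you", "eat"], ["i", "eat", "sushi"]),
    (["how", "are", "you"], ["i", "am", "fine"]),
]
# Hypothetical candidate keywords (nouns only).
CANDIDATE_NOUNS = {"osaka", "sushi"}

def pmi_table(pairs):
    """Return a function pmi(query_word, reply_word) estimated from the pairs."""
    n = len(pairs)
    q_count, r_count, joint = Counter(), Counter(), Counter()
    for query, reply in pairs:
        q_words, r_words = set(query), set(reply)
        q_count.update(q_words)
        r_count.update(r_words)
        joint.update((q, r) for q in q_words for r in r_words)

    def pmi(q, r):
        if joint[(q, r)] == 0:
            return 0.0  # assumption: unseen pairs contribute nothing
        # log [ p(q, r) / (p(q) p(r)) ] with probabilities estimated by counts / n
        return math.log(joint[(q, r)] * n / (q_count[q] * r_count[r]))

    return pmi

def predict_keyword(query, pairs=PAIRS, candidates=CANDIDATE_NOUNS):
    """Pick the candidate noun with the highest summed PMI against the query words."""
    pmi = pmi_table(pairs)
    return max(sorted(candidates), key=lambda r: sum(pmi(q, r) for q in query))

print(predict_keyword(["where", "are", "you", "from"]))
```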

More generally, Li et al [30] proposes to replace the log likelihood target function, which is prone to producing generic responses, with the Maximum Mutual Information (MMI) objective function, which maximizes the mutual information between each input and response pair, in the seq2seq model.

[30] A Diversity-Promoting Objective Function for Neural Conversation Models (2016)
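
One way to see the effect of an MMI-style objective is to rerank candidate responses by log p(response | message) minus a weighted log p(response), so that responses that are likely regardless of the message are penalized. The sketch below uses made-up log-probabilities and a hypothetical weight `lam`; in practice both scores would come from trained models.

```python
def mmi_score(log_p_response_given_message, log_p_response, lam=0.5):
    """MMI-style score: reward relevance, penalize generic (high-prior) responses."""
    return log_p_response_given_message - lam * log_p_response

def rerank(candidates, lam=0.5):
    """candidates maps response -> (log p(response | message), log p(response))."""
    return max(candidates, key=lambda r: mmi_score(*candidates[r], lam=lam))

# Hypothetical scores: the generic reply has the higher conditional likelihood,
# but it is also very likely a priori.
candidates = {
    "i don't know": (-2.0, -1.0),
    "i am from osaka": (-3.0, -6.0),
}

print(rerank(candidates, lam=0.0))  # plain likelihood ranking
print(rerank(candidates, lam=0.5))  # MMI-style ranking
```

With `lam=0.0` the ranking reduces to plain likelihood and the generic reply wins; with `lam=0.5` the more specific reply is preferred.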

Zhao et al [31] models conversations as a one-to-many problem at the discourse level. It uses a conditional variational autoencoder (CVAE) to sample more informative responses.

Valid responses from B under different assumptions of the latent variables [31]

[31] Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders (2017)

Affect and Emotion

Humans convey feelings through the use of emotionally colored words in conversations. To enable machines to do this, Ghosh et al [32] conditions the LSTM language model on some predefined affect categories. The affect categories are inferred from the context through keyword spotting. For example, the affective representation of the sentence “i will fight in the war” is {“sad”:0, “angry”:1, “anxiety”:0, “negative emotion”:1, “positive emotion”:0}.

[32] Affect-LM: A Neural Language Model for Customizable Affective Text Generation (2017)
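
Keyword spotting of this kind can be sketched in a few lines. The lexicon below is a small hypothetical stand-in for the affect dictionary used in the paper; only the five category names and the example sentence come from the text above.

```python
AFFECT_CATEGORIES = ["sad", "angry", "anxiety", "negative emotion", "positive emotion"]

# Hypothetical mini-lexicon mapping words to affect categories.
AFFECT_LEXICON = {
    "fight": {"angry", "negative emotion"},
    "war": {"angry", "negative emotion"},
    "cry": {"sad", "negative emotion"},
    "worried": {"anxiety", "negative emotion"},
    "love": {"positive emotion"},
}

def affect_vector(sentence):
    """Return a 0/1 indicator per affect category based on spotted keywords."""
    found = set()
    for word in sentence.lower().split():
        found |= AFFECT_LEXICON.get(word, set())
    return {category: int(category in found) for category in AFFECT_CATEGORIES}

print(affect_vector("i will fight in the war"))
```

The resulting vector is what the language model would be conditioned on; the conditioning itself is not shown here.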

For any user input (e.g., “Worst day ever. I arrived late because of the traffic.”), Zhou et al [33] proposed to produce responses guided by a set of predefined emotion categories (e.g., a happy response might be “Keep smiling! Things will get better.” and a disgust response might be “Sometimes life just sucks.”). This is again achieved by conditioning the decoder on either internally or externally decided emotion category signals.

[33] Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory (2017)

Asghar et al [34] augments word embeddings with a 3D affective score by using an external cognitively-engineered affective dictionary, which maps 13,915 lemmatized English words to the 3D space of valence, arousal, and dominance. Then it employs several strategies (loss functions) in training, for example, strategies that either minimize affective dissonance, maximize affective dissonance, or maximize affective content.

[34] Affective Neural Response Generation (2017)

Others.

Many other approaches have been proposed to improve the quality of generated responses. For example, Yao et al [35] uses a neural network to model the attention and intention processes in discourse [10]. Inaba et al [36] uses an RNN to rank candidate utterances with respect to their suitability in relation to a given context. Xu et al [37] trains a seq2seq model for response generation together with a discriminative classifier that measures the differences between human responses and machine-generated ones. Liu et al [38] proposes an adversarial training method for the slot filling task in spoken language understanding that can be shared across multiple domains. Wu et al [39] proposes to use different vocabulary in decoding for different input to improve relevance as well as decoding speed.

[35] Attention with Intention for a Neural Network Conversation Model (2015)

[36] Neural Utterance Ranking Model for Conversational Dialogue Systems (2016)

[37] Neural Response Generation via GAN with an Approximate Embedding Layer (2017)

[38] Multi-Domain Adversarial Learning for Slot Filling in Spoken Language Understanding (2017)

[39] Neural Response Generation with Dynamic Vocabularies (2017)

Dialogue state tracking

Sequence-to-sequence methods build end-to-end trainable dialogue systems, but such systems usually do not have the capability of supporting domain-specific tasks.

Before we discuss task-oriented systems, we briefly review DST (dialog state tracking), which is the process of representing the state of a dialog [40, 41, 42]. DST is a well-studied topic in the field of automatic speech recognition (ASR) and spoken language understanding (SLU).

A dialog state tracker observes signals from ASR and SLU components and accesses content in external databases or knowledge bases. It takes as input a set of possible dialog state hypotheses, where a hypothesis is an assignment of values to slots in the system. It then outputs a probability distribution over the set of hypotheses (the distribution is denoted as the tracker’s belief or the belief state).

[40] The dialog state tracking challenge (2013)

[41] The dialog state tracking challenge series: A review (2016)

[42] Machine Learning for Dialog State Tracking: A Review (2015)
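
To make the notion of a belief state concrete, here is a minimal sketch that keeps a probability distribution over hypotheses and updates it from SLU confidence scores. The multiplicative update, the `floor` parameter, and the hypothesis names are assumptions chosen for illustration; this is not a tracker from the challenge.

```python
def update_belief(belief, slu_scores, floor=0.05):
    """Reweight each hypothesis by its SLU confidence and renormalize.

    belief:     dict mapping hypothesis -> probability (sums to 1)
    slu_scores: dict mapping hypothesis -> SLU confidence for this turn
    floor:      hypothetical minimum weight, so unobserved hypotheses survive
    """
    unnormalized = {
        h: p * max(slu_scores.get(h, 0.0), floor) for h, p in belief.items()
    }
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}

# Uniform prior over three hypotheses for a hypothetical "food" slot.
belief = {"food=thai": 1 / 3, "food=italian": 1 / 3, "food=none": 1 / 3}
belief = update_belief(belief, {"food=thai": 0.8, "food=italian": 0.1})
print(belief)
```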

A variety of belief trackers have been proposed, ranging from rule-based trackers to CRF-based trackers. Most production systems use rule-based heuristics to update the belief state. Advanced statistical methods have been developed to improve the accuracy by exploiting the correlation between turns. Some state-of-the-art belief trackers consider DST as a supervised sequential labeling problem. They use recurrent neural networks (RNN) to update belief states based on a sequence of ASR and NLU outputs [43, 44]. For example, Henderson et al [43] uses an RNN for each slot the system tracks. If there are N possible values for a slot, then the RNN outputs a probability distribution in R^(N+1), with the last component giving the probability of the none hypothesis. Mrkšić et al [44] proposes to use dialog data from different dialog domains to train a general belief tracking model that can operate across all of these domains.

[43] Word-based dialog state tracking with recurrent neural networks (2014)

[44] Multi-domain dialog state tracking using recurrent neural networks (2015)
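
The shape of the per-slot output can be illustrated with a softmax over N+1 scores. The sketch covers only the output layer: the scores (`logits`) are assumed to come from the slot's RNN, which is not shown, and the `<none>` label is a made-up name for the none hypothesis.

```python
import math

def slot_distribution(logits, values):
    """Softmax over N+1 scores: one per slot value, plus a final 'none' entry."""
    if len(logits) != len(values) + 1:
        raise ValueError("expected one score per value plus one for 'none'")
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    labels = list(values) + ["<none>"]
    return {label: e / total for label, e in zip(labels, exps)}

# Hypothetical scores for a "food" slot with N = 2 possible values.
dist = slot_distribution([2.0, 0.5, 0.1], ["thai", "italian"])
print(dist)
```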

Most conversational agents deal with a single task with a simple user goal. Lee et al [45] proposes a statistical DST solution that handles multiple tasks and complex goals (e.g., it handles requests such as “Connection to Manhattan and find me a Thai restaurant, not Italian.”) Unlike previous methods that keep overriding dialog states, the proposed approach organizes possible dialog states into a dynamically growing tree structure with lineages, providing richer possibilities for later processing.

[45] Task Lineages: Dialog State Tracking for Flexible Interaction (2016)

Task-oriented systems and reinforcement learning

Unlike end-to-end learned, open-domain conversation systems, most task-oriented dialog systems use slot-filling methods to capture user intent in a domain specific conversation. A task needs to be pre-defined by a set of manually crafted states with multiple slots. As a result, it is difficult to adapt a dialog system designed for one task to another task.

Recent task-oriented systems cast dialogue as a reinforcement learning problem where rewards are given based on how well a task is completed. (The generative methods we have reviewed up to now produce responses one at a time without considering their long-term effect.) These systems have recently been boosted by deep reinforcement learning, which eliminates the need for feature engineering.

Young et al [46] models a dialogue as a partially observable Markov decision process (POMDP). It consists of a dialogue model M and a policy model P. The dialogue model has a transition probability p(s_t|s_{t−1}, a_{t−1}) and an observation probability p(o_t|s_t), where s_t is the state of the dialogue at time t, a_t is the action taken at time t, and o_t is the observation (the output of the SLU) at time t. The policy model determines which action to take at each turn. As the dialogue progresses, a reward is assigned at each step designed to mirror the desired characteristics of the dialogue system. The dialogue model M and policy model P can then be optimized by maximizing the expected accumulated sum of these rewards either on-line through interaction with users or off-line from a corpus of dialogues collected within a similar domain.

[46] POMDP based statistical spoken dialog systems (2013)
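
Given the transition probability p(s_t|s_{t−1}, a_{t−1}) and the observation probability p(o_t|s_t), the standard POMDP belief update is b_t(s_t) ∝ p(o_t|s_t) Σ p(s_t|s_{t−1}, a_{t−1}) b_{t−1}(s_{t−1}), summing over s_{t−1}. The sketch below implements that update for a tiny hypothetical two-state dialogue; the state names, action name, and probabilities are invented for illustration.

```python
def belief_update(belief, action, observation, transition, observation_model):
    """One step of the POMDP belief update.

    belief:            dict state -> probability at time t-1
    transition:        dict (state, action) -> dict next_state -> probability
    observation_model: dict state -> dict observation -> probability
    """
    states = list(belief)
    new_belief = {}
    for s_next in states:
        # Prediction step: marginalize over the previous state.
        predicted = sum(transition[(s, action)][s_next] * belief[s] for s in states)
        # Correction step: weight by how well s_next explains the observation.
        new_belief[s_next] = observation_model[s_next][observation] * predicted
    total = sum(new_belief.values())
    return {s: v / total for s, v in new_belief.items()}

# Hypothetical model: the user's goal rarely changes after the system asks.
transition = {
    ("wants_thai", "ask_cuisine"): {"wants_thai": 0.9, "wants_italian": 0.1},
    ("wants_italian", "ask_cuisine"): {"wants_thai": 0.1, "wants_italian": 0.9},
}
# Hypothetical noisy SLU output.
observation_model = {
    "wants_thai": {"thai": 0.8, "italian": 0.2},
    "wants_italian": {"thai": 0.3, "italian": 0.7},
}

prior = {"wants_thai": 0.5, "wants_italian": 0.5}
posterior = belief_update(prior, "ask_cuisine", "thai", transition, observation_model)
print(posterior)
```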

Gunasekara et al [47] aims at building a conversation system that separates logic and language. It “delexicalizes” a user input, that is, it replaces instances in a user input by their types (e.g., “Italian food” and “Thai dishes” will be replaced by “cuisine type”). The preprocessed user input is converted into an embedding. It then clusters the embeddings, and as a result, a conversation is represented by a sequence of clusters. From the conversations, we can learn cluster transition probabilities, which can be used to predict responses. The paper also describes an approach of using two classifiers for dialog state update.

[47] Quantized-Dialog Language Model for Goal-Oriented Conversational Systems [Slides] (2016)
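
Two of the steps described above, delexicalization and estimating cluster transition probabilities, can be sketched as follows. The type dictionary and the cluster labels are hypothetical, and the embedding and clustering steps are skipped: each conversation is assumed to be already mapped to a sequence of cluster IDs.

```python
from collections import Counter, defaultdict

# Hypothetical dictionary from instances to their types.
ENTITY_TYPES = {
    "italian food": "<cuisine type>",
    "thai dishes": "<cuisine type>",
}

def delexicalize(utterance):
    """Replace known instances in the utterance by their types."""
    text = utterance.lower()
    for instance, type_token in ENTITY_TYPES.items():
        text = text.replace(instance, type_token)
    return text

def transition_probabilities(conversations):
    """Estimate p(next cluster | current cluster) from cluster-ID sequences."""
    counts = defaultdict(Counter)
    for clusters in conversations:
        for current, following in zip(clusters, clusters[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: k / sum(nexts.values()) for nxt, k in nexts.items()}
        for current, nexts in counts.items()
    }

print(delexicalize("I want Italian food"))

# Hypothetical conversations, each already mapped to cluster IDs.
probs = transition_probabilities([["A", "B", "C"], ["A", "B", "B"], ["A", "C"]])
print(probs["A"])
```

Predicting a response then amounts to picking a likely next cluster and an utterance associated with it.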

Inspired by the success of AlphaGo, Li et al [48] introduces a deep reinforcement learning method for dialogue generation. They simulate the process of two virtual agents talking with each other. One of the challenges is creating the right reward strategy. For this purpose, for each action, they quantify ease-of-answering (measured by the negative log likelihood of responding to an utterance with a dull response), information flow (whether the agent keeps introducing new information, measured by the semantic similarity of consecutive turns), and semantic coherence (the mutual information between the current action and previous turns in the history, which ensures that the generated responses are coherent and appropriate).

[48] Deep Reinforcement Learning for Dialogue Generation (2016)

Wang et al [49] also simulates a self-play between two participants in the conversation, but it takes into consideration the asymmetric roles of a customer and a client in most task-oriented dialogues. It creates a customer model by sequence-to-sequence learning, and then leverages the customer model to train the client model by deep reinforcement learning.

[49] Integrating User and Agent Models: A Deep Task-Oriented Dialogue System (2017)

Peng et al [50] uses a hierarchical deep reinforcement learning approach for complex tasks (e.g., travel planning) that need to be decomposed into a set of subtasks. It introduces a dialogue manager that consists of a top-level dialogue policy that selects among subtasks, a low-level dialogue policy that selects actions to complete the subtask, and a global state tracker that ensures all cross-subtask constraints are satisfied.

[50] Composite task completion bot with Hierarchical RL [Slides] (2017)

Ilievski et al [51] introduces a transfer learning method to address the problem of lack of task-specific training data by exploiting the similarity between a source and a target task (e.g., restaurant and movie booking both model time and location).

[51] Goal-Oriented Chatbot Dialog Management Bootstrapping with Transfer Learning (2018)

Much effort has been devoted to improving goal-oriented dialog systems. For example, Joshi et al [52] considers personalization and Lipton et al [53] focuses on efficient exploration for dialogue policy learning. New applications continue to emerge; for example, Toxtli et al [54] developed a chatbot to coordinate teamwork.

[52] Personalization in Goal-Oriented Dialog (2017)

[53] BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems (2017)

[54] Understanding Chatbot-mediated Task Management (2018)

End-to-end, task-oriented systems

End-to-end conversation systems are easy to train, but they usually do not work in task-oriented settings. In particular, task-oriented dialog systems need to interface with external databases through queries, but neural end-to-end models do not provide intermediate symbolic representations. Is it possible to build task-oriented systems through end-to-end training?

In the following, we look into two strategies of building end-to-end, task-oriented systems. The first strategy [55, 56, 57, 58] parses the incoming message and constructs a symbolic query to the underlying database. One issue is that the database retrieval operation is non-differentiable. Thus, we need to train components (e.g., the parser and the dialog policy module) separately. In other words, it is not a complete online end-to-end system. The second strategy [59, 60, 61] uses memory networks to manage the database as well as the dialog history in the neural network system. This solves the differentiability problem, i.e., it neuralizes database retrieval, and as a result, the entire system can be trained end-to-end in an online fashion. However, the data it can manage is usually small and simple. Thus, it does not work for end-to-end dialog systems that rely on large external datasets.

Wen et al [55] proposes an end-to-end task-oriented system that converts a user input into two internal representations: a distributed representation generated by an intent network (LSTM based) and a probability distribution over slot-value pairs generated by a set of belief trackers. The database operator then composes a query by selecting the most probable values in the belief state, and the result of the query, along with the intent representation and belief state, is transformed and combined by a policy network to form a single vector representing the next system action. This system action vector is then used to condition a response generation network which generates the required system output token by token in skeletal form. The final system response is formed by substituting the actual values of the database entries into the skeletal sentence structure.

[55] A Network-based End-to-End Trainable Task-oriented Dialogue System (2017)
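
The symbolic parts of this pipeline, composing a query from the belief state and filling a skeletal response with database values, can be sketched without any neural components. The database rows, the belief state, and the skeleton below are hypothetical; in the described system the belief state and the skeleton would be produced by the belief trackers and the generation network.

```python
# Hypothetical restaurant database.
DATABASE = [
    {"name": "Bangkok House", "food": "thai", "area": "north"},
    {"name": "Roma", "food": "italian", "area": "centre"},
]

def compose_query(belief_state):
    """Select the most probable value for each slot in the belief state."""
    return {slot: max(dist, key=dist.get) for slot, dist in belief_state.items()}

def run_query(query, database=DATABASE):
    """Return the rows that match every slot-value pair in the query."""
    return [row for row in database if all(row.get(s) == v for s, v in query.items())]

def lexicalize(skeleton, entity):
    """Substitute actual database values into the skeletal sentence."""
    for slot, value in entity.items():
        skeleton = skeleton.replace(f"<{slot}>", value)
    return skeleton

# Hypothetical belief state: one distribution per slot.
belief_state = {
    "food": {"thai": 0.7, "italian": 0.3},
    "area": {"north": 0.6, "centre": 0.4},
}
query = compose_query(belief_state)
matches = run_query(query)
reply = lexicalize("<name> serves <food> food in the <area> of town.", matches[0])
print(reply)
```

The `max` in `compose_query` and the lookup in `run_query` are the discrete, non-differentiable steps that keep such a system from being trained fully end-to-end.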

Williams et al [56] presents a model for task-oriented conversation that combines supervised learning (LSTM) and reinforcement learning (RL). In the beginning, it needs training data provided by domain experts. Once the system operates at scale, it uses RL (with a policy gradient approach) to continue to learn. Specifically, as shown in the picture below, a user input, together with entities extracted from the input, is converted into a feature vector, which is fed into an LSTM. The LSTM outputs a distribution of actions, which will be filtered by an action mask. Next, if RL is off, then it outputs the action that has the highest probability. If RL is on, then it samples from the actions.

Combining sequence-to-sequence learning, business logic, and reinforcement learning in task-oriented dialogue [56]

[56] End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning (2016)
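
The masking and selection step can be sketched as follows. The action probabilities are assumed to come from the LSTM, which is not shown; the function name `select_action` and the renormalization after masking are choices made for this illustration.

```python
import random

def select_action(probabilities, mask, use_rl, rng=random):
    """Apply the action mask, then take the argmax (RL off) or sample (RL on)."""
    masked = [p if allowed else 0.0 for p, allowed in zip(probabilities, mask)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("the action mask filtered out every action")
    masked = [p / total for p in masked]  # renormalize over allowed actions
    if use_rl:
        # Sampling keeps exploring actions other than the current best one.
        return rng.choices(range(len(masked)), weights=masked, k=1)[0]
    return max(range(len(masked)), key=masked.__getitem__)

# Hypothetical distribution over three actions; business logic forbids action 0.
probabilities = [0.5, 0.3, 0.2]
mask = [False, True, True]
print(select_action(probabilities, mask, use_rl=False))
```

The mask is where hand-written business logic enters: an action such as "place call" can be disallowed until a phone number has been collected, no matter what the network prefers.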

Zhao et al [57] proposes an end-to-end framework that jointly optimizes the NLU, the DST and the dialog policy as a single module. In particular, a dialog policy action can modify a query hypothesis, which is of a slot-filling form consisting of most likely slot values given the observed evidence. Given the hypothesis, the database can perform a normal query and give the results as observations and rewards.

[57] Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning (2016)

Li et al [58] performs user simulation to generate training data. An LSTM powered NLU model produces semantic frames, which are used as input to the DST process (which conducts database queries) and then the policy learning module to generate actions.

reinforcement learning trains all components in an end-to-end fashion

[58] End-to-end task completion neural dialogue systems (2017)

Sukhbaatar et al [59] introduced a Memory Network for question answering. As shown in part (a) of the figure below, sentences are represented as BOWs and converted into embeddings. Given a query, we find the most relevant memory, and the controller finds the final answer. In part (b), multiple memory networks are stacked together (each using different embeddings) to perform the same task.

[59] End-To-End Memory Networks (2015)
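A single memory hop of the kind described for [59] can be sketched as follows. The sketch assumes soft attention (a softmax over inner products) rather than a hard choice of one memory, uses untrained random embeddings, and omits the final prediction layer. The vocabulary, dimension, and function names are illustrative assumptions.

```python
import math
import random

rng = random.Random(0)
DIM = 8
VOCAB = ["john", "mary", "went", "to", "the", "kitchen", "garden", "where", "is"]


def random_embedding_table():
    # Untrained random embeddings; a real model learns these.
    return {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in VOCAB}


A = random_embedding_table()  # input memory embeddings
B = random_embedding_table()  # query embeddings
C = random_embedding_table()  # output memory embeddings


def embed_bow(sentence, table):
    """Bag-of-words embedding: the sum of the word vectors."""
    vec = [0.0] * DIM
    for word in sentence.lower().split():
        for i, x in enumerate(table[word]):
            vec[i] += x
    return vec


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def memory_hop(sentences, query):
    u = embed_bow(query, B)
    # Attention over memories: softmax of inner products with the query.
    scores = [sum(a * b for a, b in zip(embed_bow(s, A), u)) for s in sentences]
    p = softmax(scores)
    # Output vector: attention-weighted sum of the output embeddings.
    o = [0.0] * DIM
    for weight, s in zip(p, sentences):
        for i, x in enumerate(embed_bow(s, C)):
            o[i] += weight * x
    # o + u feeds the answer layer, or acts as the query of the next hop.
    return p, [oi + ui for oi, ui in zip(o, u)]


story = ["john went to the kitchen", "mary went to the garden"]
attention, hidden = memory_hop(story, "where is john")
print(attention)
```

Stacking hops, as in part (b) of the figure, amounts to calling the same routine again with `hidden` in place of the query embedding and a different set of embedding tables.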

Bordes et al [60] applies the memory network solution for QA to dialog. The network architecture is similar to that of [59], where the memory component contains the history of the dialog, and the query is the current input from the user. The work does not generate responses; instead, it selects from a set of predefined responses. Note that the database used in this work is small and simple, and so is the training data, which consists of 43 patterns of user utterances and 20 patterns of bot utterances instantiated over the database. The way the training data is generated also means it is hard to support cross-domain dialog.

[60] Learning End-to-End Goal-Oriented Dialog (2017)

Dodge et al [61] uses a memory network for a set of tasks including QA, recommendation, and conversation. The memory network stores long-term memory (sentences in a database that contain words that appeared in the current conversation) and short-term memory (messages of the current conversation). It performs matching with the current user input as usual (inner product followed by softmax). The final prediction is again made by selecting from candidate answers instead of using a generative approach.

[61] Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems (2015)

External databases and knowledge bases

How to interface with databases is at the core of task-oriented conversation systems. In the previous section, we discussed two approaches: one based on symbolic queries, and the other based on distributed representations and memory networks. There are more distributed representation based approaches, which we discuss here, but even the best ones are inadequate. They either represent knowledge bases as unstructured text and use keywords in the context to find relevant knowledge, or they represent knowledge as (key, value) pairs and use the context to match keys in order to return the corresponding values. Clearly, the neural approaches to data access are much weaker than symbolic approaches.

Lowe et al [62] extends the dual encoder model (a combination of one RNN for context and another RNN for response) to include external unstructured knowledge bases, which are modeled by a third RNN. Responses are selected by the three RNNs. The external knowledge base could be very big, so one question is what to include as input to the knowledge RNN. Lowe et al [62] includes external information that contains keywords extracted from the message and the context. This simple treatment has two issues. First, relevant knowledge does not necessarily share keywords with the message or the context. Second, the knowledge base it handles is limited to unstructured text.

[62] Incorporating Unstructured Textual Knowledge Sources into Neural Dialogue Systems (2015)

Ghazvininejad et al [63] takes a similar approach to [62] and uses unstructured text as external knowledge to enhance chit-chatting. In the figure below, the dialog encoder and response decoder together form a sequence-to-sequence model. The model conditions responses on both conversation history and external facts. Here, World Facts are text (e.g., Foursquare, Wikipedia, or Amazon reviews) indexed by named entities as keys. The contextually relevant facts are derived through keyword matching or entity linking on the conversation history, and then joining with the world facts.

[63] A knowledge-grounded neural conversation model (2017)

Han et al [64] uses a knowledge base (KB) to generate rich and relevant responses for chit-chatting (open-domain conversation). Given a question like “Do you like Messi?” a system without KB may respond “Why do you like Messi?” but with KB, it may respond “Do you like Beckham, too?” It extracts named entities from a user utterance, then scans Freebase to obtain information related to this entity. For example, after identifying entity Messi, it obtains the type of the entity (“football player”), and through the type, it finds related top entities (e.g., Beckham), properties (e.g., “position”), etc. This enables the system to ask questions such as “Do you like Beckham, too?” or “What’s the position of Messi?”

[64] Exploiting knowledge base to generate responses for natural language dialog listening agents (2016)
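The lookup chain described for [64] (entity, then type, then related entities and properties) is easy to mimic with a toy knowledge base. The dictionaries below are a hypothetical stand-in for Freebase, and entity detection is reduced to naive string matching. The sketch shows the control flow only, not the paper's system.

```python
# Toy knowledge base; a hypothetical stand-in for Freebase.
ENTITY_TYPES = {"Messi": "football player", "Beckham": "football player"}
TOP_ENTITIES = {"football player": ["Messi", "Beckham"]}
TYPE_PROPERTIES = {"football player": ["position"]}


def find_entity(utterance):
    """Naive entity detection by string matching (a real system would use NER)."""
    for entity in ENTITY_TYPES:
        if entity.lower() in utterance.lower():
            return entity
    return None


def candidate_responses(utterance):
    """Generate follow-up questions from the entity's type, peers, and properties."""
    entity = find_entity(utterance)
    if entity is None:
        return []
    entity_type = ENTITY_TYPES[entity]
    responses = []
    for other in TOP_ENTITIES[entity_type]:
        if other != entity:
            responses.append(f"Do you like {other}, too?")
    for prop in TYPE_PROPERTIES[entity_type]:
        responses.append(f"What's the {prop} of {entity}?")
    return responses


replies = candidate_responses("Do you like Messi?")
print(replies)
```

With this toy data the two candidates are the same two questions used as examples above.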

The system proposed by Zhu et al [65] is quite similar to the above-mentioned work [62, 63, 64]. It detects entities in the input message, then retrieves a set of possible facts from the knowledge base using the detected entities. A message encoder encodes the input message into a set of vectors at each time step, and a reply decoder takes the encoded message and facts to generate a response word by word. Compared with [62, 63, 64], it uses a more sophisticated scoring based approach to select answer entities.

[65] Flexible End-to-End Dialogue System for Knowledge Grounded Conversation (2017)

Instead of worldly facts, Young et al [66] uses commonsense knowledge (concepts covered in the dialogue) to enrich the conversation. For example, if a user mentioned Hawaii, given the commonsense knowledge Hawaii isA volcanic island, the system may generate a response “take some pictures of volcanos!”

[66] Augmenting End-to-End Dialog Systems with Commonsense Knowledge (2018)

Vougiouklis et al [67] builds a dataset that consists of ∼15k sequences of comments (on Reddit) aligned with ∼75k Wikipedia sentences. The goal is to generate a context-sensitive response to a sequence of comments by incorporating background knowledge.

[67] A Neural Network Approach for Knowledge-Driven Response Generation (2016)

J. Yin et al [68] and P. Yin et al [69] propose models for querying knowledge bases via natural language in a fully “neuralized” way.

[68] Neural generative question answering (2016)

[69] Neural enquirer: Learning to query tables (2016)

Dhingra et al [70] addresses the differentiability issue by replacing symbolic queries with an induced “soft” posterior distribution over the knowledge base that indicates which entities the user is interested in. This posterior probability distribution, together with the explicit belief state, is used as input for a reinforcement learning policy. Also, unlike previous works [68, 69] that focus on parsing one complicated natural language query to get the right answer, Dhingra et al [70] supports user interactions through a sequence of short, simple queries (of length less than 5 words).

[70] Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access (2017)
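The idea of a "soft" posterior over the knowledge base in [70] can be illustrated with a heavily simplified sketch. It assumes that slots are independent and that no table values are missing, and it scores each row by the product of the belief probabilities of its slot values. The paper's actual formulation is richer, and the table, belief values, and function name here are made up.

```python
# Toy movie table and belief state; all values are illustrative.
KB = [
    {"title": "Groundhog Day", "actor": "Bill Murray", "year": "1993"},
    {"title": "Australia", "actor": "Nicole Kidman", "year": "2008"},
    {"title": "Mad Max: Fury Road", "actor": "Tom Hardy", "year": "2015"},
]

# One distribution per slot, standing in for the belief trackers' output.
belief = {
    "actor": {"Bill Murray": 0.6, "Nicole Kidman": 0.3, "Tom Hardy": 0.1},
    "year": {"1993": 0.5, "2008": 0.25, "2015": 0.25},
}


def soft_posterior(kb, belief):
    """Score each row by the product of its slot-value beliefs, then normalize."""
    weights = []
    for row in kb:
        w = 1.0
        for slot, dist in belief.items():
            w *= dist.get(row[slot], 0.0)
        weights.append(w)
    total = sum(weights)
    if total == 0:
        # No row is supported by the belief state: fall back to uniform.
        return [1.0 / len(kb)] * len(kb)
    return [w / total for w in weights]


posterior = soft_posterior(KB, belief)
for row, p in zip(KB, posterior):
    print(f"{row['title']}: {p:.4f}")
```

Unlike a symbolic query, which returns a hard set of matching rows, every row keeps some probability mass here, so the lookup no longer relies on a hard argmax over slot values.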

Eric et al [71] uses a key-value retrieval network to support multi-domain, grounded conversation. Similar to Dhingra et al [70], it supports differentiable end-to-end training. Furthermore, it eliminates the need to explicitly model belief trackers and dialogue state information. Its attention-based key-value retrieval mechanism is able to learn how to extract useful information from a structured knowledge base in an end-to-end fashion. Specifically, it converts a knowledge base into a set of (subject, relation, object) triples, where (subject, relation) is considered the key of the triple. As an example, (“meeting”, “time”, “5pm”) is a triple, and its key is the embedding of (“meeting”, “time”). It then expands the output vocabulary of the decoder to include all distinct object values (e.g., “5pm”). During decoding, it takes the decoder hidden state and computes an attention score with the key of each triple. This allows the decoder to output certain object values (e.g., “5pm”).

[71] Key-Value Retrieval Networks for Task-Oriented Dialogue (2017)
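One decoding step of the key-value retrieval mechanism in [71] can be sketched as follows. The sketch makes several assumptions that should not be read as the paper's design: hand-picked toy vectors replace learned embeddings, a plain dot product replaces the learned attention scorer, and the ordinary vocabulary logits are simply set to zero.

```python
import math

VOCAB = ["your", "meeting", "is", "at"]  # ordinary decoder vocabulary (toy)
TRIPLES = [
    ("meeting", "time", "5pm"),
    ("meeting", "room", "100"),
    ("dinner", "time", "7pm"),
]

# Hand-picked toy embeddings; a real model learns these.
WORD_VECS = {
    "meeting": [1.0, 0.0, 0.0, 0.0],
    "dinner": [0.0, 1.0, 0.0, 0.0],
    "time": [0.0, 0.0, 1.0, 0.0],
    "room": [0.0, 0.0, 0.0, 1.0],
}


def key_embedding(subject, relation):
    """Key of a triple: here, the sum of subject and relation embeddings."""
    return [s + r for s, r in zip(WORD_VECS[subject], WORD_VECS[relation])]


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def decode_step(hidden, vocab_logits):
    """Score each KB key against the decoder state and extend the vocabulary."""
    kb_scores = [
        sum(h * k for h, k in zip(hidden, key_embedding(subj, rel)))
        for subj, rel, _ in TRIPLES
    ]
    extended_vocab = VOCAB + [obj for _, _, obj in TRIPLES]
    probs = softmax(vocab_logits + kb_scores)
    return dict(zip(extended_vocab, probs))


# A decoder state that happens to align with the ("meeting", "time") key.
hidden = [2.0, 0.0, 2.0, 0.0]
distribution = decode_step(hidden, vocab_logits=[0.0] * len(VOCAB))
best = max(distribution, key=distribution.get)
print(best)
```

Because the object values live in the same output distribution as ordinary words, the whole lookup stays differentiable, at the price of the key matching being only as good as the learned embeddings.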

Datasets and Evaluation

Finally, here are several papers on datasets for training and metrics for evaluating conversation systems. For example, Lowe et al [75] proposes a dual encoder for training end-to-end dialogue systems, but it is better known for introducing the Ubuntu dataset.

[72] Developing non-goal dialog system based on examples of drama television (2014)

[73] Chatbot Evaluation and Database Expansion via Crowdsourcing (2016)

[74] Not All Dialogues are Created Equal: Instance Weighting for Neural Conversational Models (2017)

[75] Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus (2017)

Conclusion

Recently, conversational AI has attracted a lot of attention in academia as well as in industry. The commercial success of smart speakers gives rise to an illusion that AI is near, but in reality, Google Assistant, Alexa, and Siri rely heavily on handcrafting. Even so, they are very primitive when it comes to real-life conversations. Meanwhile, neural chatbots are considered promising. With distributed representations and the availability of big dialogue corpora, end-to-end statistical conversational models can be trained directly from data. However, task-oriented neural chatbots are still at a very early stage. In fact, until fundamental problems such as how to access external databases and knowledge bases are truly solved, end-to-end neural conversation agents will remain nothing more than toys.

Haixun Wang
AI Graduate

VP of Engineering and Distinguished Scientist @ Instacart.