Episodic Memory Modeling for Conversational Agents
This summer at betaworks, I experimented with an abstract, predictive approach to memory modeling that can be applied directly to any conversational agent.
Conversational agents, or chatbots, are gaining increasing traction in today’s tech community. Advocates of chatbots argue that the app space is collapsing under the weight of its own popularity: the proliferation of mobile apps means that for each new service, users are forced to download and store a new app, to undergo yet another onboarding process, and to familiarize themselves with yet another UI. Chatbots, by contrast, provide similar services while integrating nearly seamlessly with existing messaging services; there is little to no associated storage cost, and the user experience is dictated primarily by an already-familiar messaging app.
The fundamental advantage of the chatbot — the messaging platform — is also its greatest challenge, because it inevitably invites the comparison: how does bot-to-human conversation stack up against traditional human-to-human conversation? On the one hand, many chatbots provide focused functionalities that the average human cannot or will not provide: reporting the weather, recommending news stories, and serving as a dedicated source of information about an individual or organization. (See PonchoBot, DiggBot, and Olabot.) On the other hand, chatbots often fail to meet the standards of natural conversation that we hold with our friends, family, and colleagues, on the same messaging apps, through the same chat medium: they often forget what we say, repeat the same responses, and cannot converse dynamically on topics outside of their intended use.
Here, we will address the first problem: that bots often forget what we say. In particular, bots typically remember only those elements of conversation that they are hard-coded to identify and record, like changes in notification settings or subscriptions for different product-related services. Ideally, these bots should remember components of our conversations as a human conversant would; we want them to remember what we have told them about ourselves and to use this information to create an interesting and fluid conversation. At the same time, we might not want them to remember every single thing we’ve said; we generally do not expect that of human conversants and, in fact, might consider it exceptional or unusual if they did. Accordingly, we would like our bots to have episodic memory: long-term memory of important components of past conversations.
A Predictive Approach to Memory Modeling
A simple answer to the question of episodic memory modeling in bots is to hard-code all elements of conversation that we would like the bot to remember. At every stage of the conversation, we could extract the core information and retain it as the conversation continues. However, this method gives us no way of weighing the relevance of this information over time. For example, it is possible that the most recent information is actually the least relevant to a future topic of conversation. Of course, this is not always the case, but without an intelligent way to determine relevance, we are often left guessing and, at best, creating manually-tuned nested conditionals within which we determine what information to store and what to discard. Unless we can map out the explicit conditions under which each fact should be remembered and referenced, we cannot incorporate information extracted from conversations over a relatively long period of time. Clearly this is an intractable and unscalable approach to the memory modeling task that cannot be easily or quickly abstracted to an arbitrary chatbot.
For this reason, we choose to take a predictive approach to the problem of episodic memory modeling. First, for each chatbot, we select certain topics or categories within which we would like the bot to achieve episodic memory. For example, it is important for the Poncho bot to have episodic memory of the most recent location that a user asked for. On the other hand, we may be somewhat indifferent towards Poncho’s ability to remember other facts about us — like what kind of car we drive, our favorite musicians, etc. Just as we might expect friends or acquaintances to remember certain facts about us but not others, we specifically choose the topics or subjects regarding which we would like our bots to exhibit episodic memory. Choosing the subjects that we expect our bot to remember is a decision that affects both the utility and personality of our bot. For example, intelligently remembering a user’s various usage preferences may improve the quality of services that the bot provides. On the other hand, remembering small details about a user — like their birthday or favorite color — may provide an element of surprise that makes the bot seem more like a human or friend.
Once we select the particular subjects within which we would like our bot to exhibit episodic memory, we extract those topics from each user message to the bot; a single user message may contain many of these “memory topics” or none of them. In the language of supervised machine learning, we can then use these sets of topics as our labels for the raw chat log data.
For example, we might expect that the Poncho bot has episodic memory about “locations” (i.e., if the user always asks about the weather in New York and Chicago) and about specific “weather” conditions (i.e., if the user always asks about rain or always asks about the humidity). In this case, we would have two topics to remember: “weather” and “location”. Any given query from the user can be about both of these topics (“Is it raining in New York?”), just one of them (“What is the weather like?”), or none of them (“Hi!”). We will call this last case “miscellaneous”; it includes interjections, as well as messages that imperfect topic extraction systems fail to flag for any appropriate topic (e.g., a location extractor which fails to identify “sf” as referring to the location “San Francisco, USA”). Therefore, any given message can be labelled with one of the following combinations of topics — “weather”, “location”, “weather-location”, “miscellaneous” — and we can use these combinations of topics as labels for our raw chat log data.
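To make this concrete, here is a minimal sketch of such a labelling step in Python. The keyword lists and helper functions (extract_topics, label) are illustrative stand-ins rather than Poncho’s actual extraction pipeline; a production system would use a proper geocoder for locations and a richer weather-term extractor.

```python
# Illustrative keyword lists; a real system would use a geocoder and a
# richer weather-term extractor.
WEATHER_TERMS = {"weather", "rain", "raining", "snow", "humidity", "sunny"}
KNOWN_LOCATIONS = {"new york", "chicago", "san francisco"}
TOPIC_ORDER = ["weather", "location", "miscellaneous"]

def extract_topics(message):
    """Return the set of memory topics mentioned in one user message."""
    text = message.lower()
    topics = set()
    if any(term in text for term in WEATHER_TERMS):
        topics.add("weather")
    if any(loc in text for loc in KNOWN_LOCATIONS):
        topics.add("location")
    # Messages matching no extractor fall into the catch-all class.
    return topics or {"miscellaneous"}

def label(message):
    """Collapse a topic set into one state label, e.g. 'weather-location'."""
    topics = extract_topics(message)
    return "-".join(t for t in TOPIC_ORDER if t in topics)

for query in ["Is it raining in New York?", "What is the weather like?", "Hi!"]:
    print(query, "->", label(query))
# Is it raining in New York? -> weather-location
# What is the weather like? -> weather
# Hi! -> miscellaneous
```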
Now, at every stage of the conversation, we ask: given what has already been said, can we predict what the user will talk about next? Mathematically, this entails building classifiers that take as input a labelled sequence of user queries within a single conversation and output a prediction for the next topic set. By forming an expectation of what will be said, we can accordingly form an expectation of what we should have remembered. At this stage, we have two options: (1) we can mine previous stages of the conversation (in reverse order) for the appropriate fact and attempt to incorporate this fact into our subsequent response, or (2) we can proactively intervene in the conversation and try to direct the conversation towards the predicted topic. Regardless, our expectation of the future informs our decision of which facts to retain from prior stages of conversation.
Episodic Memory Modeling with Finite State Machines
At betaworks, we have explored two distinct approaches towards creating the required classifier. In the first approach, we view the conversation as a finite state machine in which each possible set of memory topics is viewed as a possible state. The finite state machine transitions between the states corresponding to the sets of topics extracted from each user query in the conversation. For example, when interacting with the Poncho bot, a user might send the following sequence of queries: “Hi Poncho!”, “What’s the weather like in New York?”, and “Will it rain tomorrow?”
Accordingly, we would label these user queries in the manner described above: “miscellaneous”, “weather-location”, and “weather”.
This conversation can then be represented by a finite state machine that transitions from the “miscellaneous” state to the “weather-location” state and then to the “weather” state.
Finally, at every stage of the conversation, we must predict which state the user will transition to next. Therefore, at every stage, we feed in the series of topic combinations that have been traversed in the conversation thus far, and use this sequence to predict the next state, or topic combination, in the conversation.
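As a rough illustration of this view, the transitions of the finite state machine can be estimated with simple transition counts over the labelled state sequences. The conversations below are hypothetical, and the most-frequent-transition heuristic here is a stand-in for the classifiers discussed next.

```python
from collections import Counter, defaultdict

# Hypothetical labelled conversations; real state sequences come from the
# topic extractors applied to chat logs.
conversations = [
    ["weather-location", "weather", "miscellaneous"],
    ["weather-location", "weather-location", "weather"],
    ["miscellaneous", "weather-location", "weather"],
]

# Count observed transitions between consecutive states.
transitions = defaultdict(Counter)
for states in conversations:
    for prev, nxt in zip(states, states[1:]):
        transitions[prev][nxt] += 1

def predict_next_state(current):
    """Predict the most likely next state given only the current state."""
    if current not in transitions:
        return "miscellaneous"  # fall back to the catch-all state
    return transitions[current].most_common(1)[0][0]

print(predict_next_state("weather-location"))  # -> weather
```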
At this stage, there are a number of classifiers which can be trained to predict the next state in the conversation. The appropriate choice of classifier is generally dependent on the underlying distribution of conversational patterns and, accordingly, the nature of human interaction with the chatbot. Classifiers explored in this stage can include non-sequential classifiers like Naive Bayes models as well as sequential classifiers like Conditional Random Fields.
Our best model within this paradigm — which we consider our baseline model — was a Naive Bayes model which used the prior two states of the conversation to predict the next one; it exhibited a training accuracy of 53.9% and a holdout accuracy of 53.7%. Over various experiments, we found that the success of our model varied greatly based on small changes in the quality of our preceding topic models; this is an inherent challenge of purely topic-based approaches to memory modeling.
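A minimal sketch of such a baseline, assuming scikit-learn and hypothetical state sequences, might look like the following; the real model was trained on labelled Poncho chat logs.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical state sequences standing in for labelled chat logs.
sequences = [
    ["weather-location", "weather", "weather", "miscellaneous"],
    ["miscellaneous", "weather-location", "weather", "weather"],
]

# Each training example encodes the two preceding states as categorical
# features; the target is the state that actually followed.
X, y = [], []
for states in sequences:
    for i in range(2, len(states)):
        X.append({"prev_1": states[i - 1], "prev_2": states[i - 2]})
        y.append(states[i])

model = make_pipeline(DictVectorizer(), MultinomialNB())
model.fit(X, y)
print(model.predict([{"prev_1": "weather", "prev_2": "weather-location"}]))
```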
Query-based Classifiers for Memory Modeling
Unfortunately, the finite state machine approach to memory modeling can also be challenging for bots which human users tend to query in rich, complex sentences. (This is particularly true for bots like Poncho, which reply to users with full, conversational responses and accordingly encourage users to respond in kind.) The finite state machine approach forces us to use only the sets of topics extracted from each query when predicting the next state or set of topics. By reducing each query to a set of topic tags, this approach abstracts away much of the information encoded in the original natural language query. To mitigate this problem, we might be tempted to build increasingly complex topic extractors and to expand our topic set (which would in turn force us to build new, even more complex topic extractors). Depending on the topic sets, building effective topic extractors for increasingly specific topics can be an extremely challenging task and, moreover, one that still requires us to discard much of the information in the original query.
Therefore, we chose to circumvent this issue by creating classifiers which take user queries, or series of user queries, as input. We considered two specific implementations of this solution.
Neural Network Approach
In the first implementation, we considered a neural language modeling approach to predicting the next state of the conversation, using the method described by O. Vinyals et al. in “A Neural Conversational Model” [1]. This method uses recurrent neural networks to generate bot responses to user queries. We considered modifying it slightly to predict the next user query in the conversation; we would then extract the predicted next state by passing the predicted query through our topic extractors.
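A rough sketch of this idea, written here in PyTorch, might look like the following. This is not the exact model of [1]; it is a bare encoder-decoder skeleton repurposed to emit the next user query, with placeholder vocabulary size, dimensions, and dummy tensors.

```python
import torch
import torch.nn as nn

# Placeholder sizes; a real model needs tokenized chat logs, a training
# loop, and a decoding procedure for generation.
VOCAB, EMBED, HIDDEN = 1000, 64, 128

class NextQueryModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.encoder = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.decoder = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, history_tokens, next_query_tokens):
        # Encode the conversation so far; use the final encoder state to
        # initialize the decoder, which predicts the next user query.
        _, state = self.encoder(self.embed(history_tokens))
        decoded, _ = self.decoder(self.embed(next_query_tokens), state)
        return self.out(decoded)  # per-token logits over the vocabulary

model = NextQueryModel()
history = torch.randint(0, VOCAB, (1, 12))  # tokenized conversation history
teacher = torch.randint(0, VOCAB, (1, 8))   # shifted next-query tokens
print(model(history, teacher).shape)        # torch.Size([1, 8, 1000])
```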
We chose to shelve this implementation because the responses of neural networks are often difficult to decipher and audit; this is an especially challenging issue for branded chatbots, which must have easily manageable and tunable behavior that aligns closely with the product’s branding. (Those who are interested in such solutions to memory modeling should also consider the possibility that they may not have enough data for a successful neural memory model. Moreover, some topic sets or states may be rare enough in the dataset that a neural network approach would exhibit poor recall in these cases.)
Classical NLP Approach
Our second implementation was our most successful. We used traditional bag-of-words or TF-IDF methods (depending on predictive performance) to transform user queries into elements of a finite-dimensional vector space; these vectors were passed as inputs to traditional machine learning classifiers which predicted the next state of the conversation. Optionally, entire conversations — or components of them — can be concatenated into a single document before obtaining vectorized representations; those seeking to incorporate a temporal element may instead choose to concatenate the vector representations of consecutive components of conversations.
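A minimal sketch of this pipeline, assuming scikit-learn and using hypothetical queries and next-state labels, might look like the following; logistic regression is shown as one reasonable classifier choice, not necessarily the best one for a given dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical queries paired with the state of the *following* user query.
queries = [
    "Is it raining in New York?",
    "What is the weather like?",
    "Hi Poncho!",
    "Do I need an umbrella today?",
]
next_states = ["weather", "weather-location", "weather", "miscellaneous"]

# Vectorize raw queries with TF-IDF and train a classifier to predict the
# next conversational state.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(queries, next_states)
print(model.predict(["How humid is it in Chicago?"]))
```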
Classifiers trained using these vectors and the corresponding next conversational states achieved significantly better accuracy and easily bypassed the issue of topic extraction systems that abstract away too much data, with a training accuracy of 62.6% and a holdout accuracy of 61.7%.
We also measured the ability of our model to correctly label each individual state — that is, its recall in various label classes. We highlight a small sample of these results from experiments where the topic set was defined as ‘MISC’, ‘LOCATION’, ‘WEATHER’, ‘TIME’, ‘FOOD’, and ‘ACCESSORY’.
These results show that our model is extremely successful at predicting moments in which the user may wish to discuss topics outside of the prescribed memory state topics — that is, moments at which — assuming the success of our topic classifiers — the user wishes to converse with the bot as a friend without requesting its principal services. It is also evident that recall varies greatly between classes. This is partly due to the lack of data available in many classes — ‘MISC’ is by far the most common label in our dataset, followed by ‘WEATHER’ and ‘LOCATION’, and these classes exhibit the best recall. In comparison, the ‘ACCESSORY’ class — which is comparatively infrequent — provides far less data to the model and consequently has poor recall.
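For reference, per-class recall of this kind can be computed with scikit-learn’s classification report; the labels and predictions below are placeholders, not our actual holdout results.

```python
from sklearn.metrics import classification_report

# Placeholder holdout labels and model predictions, standing in for the
# real evaluation data; the report includes per-class recall.
y_true = ["MISC", "WEATHER", "LOCATION", "MISC", "ACCESSORY", "WEATHER"]
y_pred = ["MISC", "WEATHER", "MISC", "MISC", "MISC", "WEATHER"]

print(classification_report(y_true, y_pred, zero_division=0))
```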
Remaining Challenges
The methods discussed above can be used to develop interesting predictive solutions to the problem of memory modeling for bots. However, three distinct challenges remain, which we discuss briefly below:
Imbalanced Dataset
Distinct states or labels for our classifiers are determined by combinations of topics. This means that the number of possible states is generally much larger than the number of topics. It is possible that there are certain states — or combinations of topics — which are extremely rare in the dataset; for this reason, most classifiers will likely exhibit poor recall for these states. This was generally true in experiments using chat logs from chatbots which initially offered primarily transactional services.
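Standard mitigations apply; one minimal sketch, assuming a scikit-learn classifier, is to reweight classes inversely to their frequency.

```python
from sklearn.linear_model import LogisticRegression

# Reweight classes inversely to their training frequency so that rare
# states (i.e., rare topic combinations) contribute more to the loss.
model = LogisticRegression(class_weight="balanced")
```

Oversampling the rare states, or merging very rare topic combinations into a single catch-all label, are alternative strategies, though none fully substitutes for more data in the rare classes.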
Topic Extraction Systems
Certain topics can be much harder to extract from natural language data than others. Poor topic extraction systems can lead to noisy labelling of data; this in turn can generate poor predictive systems. The detrimental effect is particularly severe in memory modeling implementations which rely on the finite state machine approach, since such approaches are doubly reliant on effective topic extractors.
Shorter Conversation Sessions
We defined a “conversation” as a series of back-and-forths between a user and a chatbot in which subsequent user queries fell within a 15–20 minute time window. (Chatbot responses are usually immediate or near-immediate.) Depending on the particular chatbot being modeled, it is possible that such conversations are relatively short. (This is not necessarily an unwelcome quality: in certain bots, like an FAQ bot, shorter conversations may indicate better performance.) However, this may make sequential approaches to memory modeling — for example, through Conditional Random Fields — more challenging, since the relevant sequences may be quite short.
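For reference, this sessionization rule is straightforward to implement; below is a minimal sketch with a 15-minute gap threshold and illustrative timestamps.

```python
from datetime import datetime, timedelta

# Consecutive user messages more than SESSION_GAP apart start a new
# conversation. Timestamps here are illustrative.
SESSION_GAP = timedelta(minutes=15)

def split_into_conversations(messages):
    """messages: list of (timestamp, text) tuples sorted by time."""
    conversations, current, last_time = [], [], None
    for ts, text in messages:
        if last_time is not None and ts - last_time > SESSION_GAP:
            conversations.append(current)
            current = []
        current.append(text)
        last_time = ts
    if current:
        conversations.append(current)
    return conversations

msgs = [
    (datetime(2016, 8, 1, 9, 0), "What's the weather in Chicago?"),
    (datetime(2016, 8, 1, 9, 2), "Will it rain later?"),
    (datetime(2016, 8, 1, 13, 30), "Hi Poncho!"),
]
# Prints two conversations: the morning exchange and the afternoon greeting.
print(split_into_conversations(msgs))
```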
Takeaways
In order for bots to become satisfactory interlocutors, we must develop episodic memory models that adequately mimic the memory traits exhibited in human conversations. By pursuing a predictive approach, we avoid the generally tedious manual solutions to the problem. Arguably, our approach also strives to mimic some of the observable characteristics of episodic memory: humans retain certain crucial elements of conversation and, when prompted, recall these elements with some degree of certainty or uncertainty about their relevance to the topic at hand.
References
[1] Oriol Vinyals and Quoc Le. “A Neural Conversational Model” (2015). http://arxiv.org/pdf/1506.05869v3.pdf