There’s a lot of fluff surrounding chatbots, so I wrote this post to lay out the basics. I first review the theory of conversation to give us a sense of what we are aiming for. I then discuss three classes of chatbots. The simplest class is purposeless mimicry agents, which only provide the illusion of conversation. Members of this class include ELIZA and chatbots based on deep learning sequence-to-sequence models. The second and next most sophisticated class comprises intention-based agents such as Amazon’s Alexa and Apple’s Siri. These agents have a simple understanding and can do real stuff, but they generally can’t have multi-turn conversations. The third and most sophisticated class is conversational agents that can keep track of what has been said in the conversation and can switch topics when the human user desires.
Theory: What is conversation?
Conversation begins with shared reference. We point to objects in the world so that we and our conversational partner know that we are talking about the same things.
Children use pointing to communicate desire for objects and shared interaction before they can speak. As we learn language, words then point to shared ideas in our minds (Gärdenfors, 2014).
What things words point to becomes a shared convention over time through language games, as described by Wittgenstein in Philosophical Investigations (1958). Language games are interactions that shed light on a particular aspect of language. Through cooperation in shared action, such as building a hut, we settle on labels for objects. When we ask for a “beam” we nod approvingly when someone finally brings us the correct item.
Of course, language is more than labels. Our brains map the concepts behind these community conventions to personal sensations and actions. When someone says “beam,” we associate that with our experience with beams, and since people generally have similar experiences, we are able to understand each other. When someone says they hurt their back carrying a beam, we understand because we too have carried heavy things. The meaning is grounded in that experience (see Benjamin Bergen, Steven Pinker, Mark Johnson, Jerome Feldman, and Murray Shanahan).
When our experiences and language games don’t sufficiently overlap, we must negotiate meaning in the course of a conversation. Consider the image of the discourse pyramid below. When everything is understood, we are at the bottom level and can just give and receive instructions for cooperation. When something is not understood, there is a break and we have to do a coordination of inner worlds, and if that isn’t understood we have to do a coordination of meaning. Once we have acknowledgment of understanding at a particular level, we can go back down to the previous level. For example, we ask a guy to get us some fish. He doesn’t understand how, so we have to coordinate our inner worlds and explain that he can catch fish with a fishing pole. When he doesn’t even understand what a fishing pole is, we have to coordinate meanings and explain that a fishing pole is a tool that consists of a stick, string, and a hook.
In addition to the meanings of words and sentences, conversation itself has its own rules. Consider Grice’s (1975, 1978) conversational maxims:
Maxim of Quantity: Say only what is not implied. Yes: “Bring me the block.” No: “Bring me the block by transporting it to my location.”
Maxim of Quality: Say only things that are true. Yes: “I hate carrying blocks.” No: “I love carrying blocks, especially when they are covered in fire ants.”
Maxim of Relevance: Say only things that matter. Yes: “Bring me the block.” No: “Bring me the block and birds sing.”
Maxim of Manner: Speak in a way that can be easily understood. Yes: “Bring me the block.” No: “Use personal physical force to levitate the block and transport it to me.”
Breaking these rules is a way to communicate more than the meaning of the words. Wikipedia has a nice summary here. When we break a maxim, it is assumed that it is for some purpose. For instance, when I say “I love carrying blocks, especially when they are covered in fire ants,” I am breaking the maxim of quality to use sarcasm to communicate that I don’t like carrying these blocks. As another example, if someone asks me how good Bob was as an employee and I respond that he had nice hair, I am communicating an idea by breaking the maxim of relevance.
Systems for converting speech-to-text have recently gotten pretty good, for example here and here, and we will largely ignore that aspect in this post, but sometimes meaning isn’t even in the text but in how the words are spoken (called prosody). Consider the following three examples taken from Voice User Interface Design by Giangola and Balogh (2004). The italicized words are emphasized. Based on the emphasis, each example means something different (stated afterwards).
- You know. I don’t. [So don’t ask me.]
- You know. I don’t. [As a matter of fact, I really don’t.]
- You know I don’t. [You know that I don’t.]
Conversation is really hard. Yuval Noah Harari argues in his book Homo Deus that our ability to cooperate is what has allowed humans to take over the planet. Being able to converse with machines will allow us to further cooperate with this new kind of intelligence we have created. Let’s now turn our attention to the state of the art.
Practice: The current state of chatbots
We will look at the three classes of chatbots in turn: purposeless mimicry agents, intention-based agents, and conversational agents.
Purposeless mimicry agents
Purposeless mimicry agents give the appearance of conversation without understanding what is being said. We’ve all heard of ELIZA. ELIZA consisted of simple substitution rules to mimic a psychologist from the 1960s. This is the psychology that emerged after behaviorism and is often characterized as having the therapist repeat back the words of the patient. If the patient said, “My mother wants me to buy a bazooka,” the therapist might respond, “Tell me why your mother wants you to buy a bazooka.” ELIZA could have this type of conversation. You can find a Python implementation of ELIZA here.
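To make the idea of substitution rules concrete, here is a minimal sketch in Python. The two rules and the fallback reply are my own illustrations, not the rules from the original ELIZA script.

```python
import re

# Hypothetical rules in the spirit of ELIZA: (pattern, response template).
RULES = [
    (re.compile(r"my (.+) wants me to (.+)", re.IGNORECASE),
     "Tell me why your {0} wants you to {1}."),
    (re.compile(r"i feel (.+)", re.IGNORECASE),
     "Why do you feel {0}?"),
]


def respond(statement):
    """Return the first matching rule's response, or a generic prompt."""
    text = statement.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Substitute the captured words back into the template.
            return template.format(*match.groups())
    return "Please go on."


print(respond("My mother wants me to buy a bazooka."))
```

Nothing here knows what a mother or a bazooka is; the reply is produced purely by moving the patient's words into a template.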
Modern mimicry agents use deep learning to learn from example conversations. They train on a bunch of dialogs and learn to generate the next statement given the last statement. There are lots of dialog datasets out there. You can use examples of dialog from movie and TV subtitles, such as OpenSubtitles. One can also use the Ubuntu Dialogue Corpus, which consists of dialogs of people seeking technical support. Or one can mine Twitter and look for replies to tweets using the API.
The deep learning method used for this parroting is usually sequence-to-sequence models. A sequence-to-sequence model consists of an encoder and a decoder. The encoder converts a sequence of tokens, such as a sentence consisting of word tokens, into a single vector. The decoder begins with this vector, and it keeps generating tokens until it generates a special stop symbol. Note that the lengths of the source and target sequences don’t need to be the same. Both the encoding and decoding are done using recurrent neural networks (RNNs).
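The encode-then-decode loop can be sketched in plain Python. This is only an illustration of the data flow: the vocabulary, the hidden size, and the random weights are my own assumptions, the model is untrained, and so the generated tokens are meaningless. A real system would learn the weights from dialogs using a deep learning library.

```python
import math
import random

random.seed(0)
VOCAB = ["<stop>", "how", "are", "you", "i", "am", "great"]  # toy vocabulary (assumed)
HIDDEN = 4  # size of the single vector that summarizes the source sequence


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def make_rnn():
    # Untrained, randomly initialized weights; training is omitted.
    return {"w_in": rand_matrix(HIDDEN, HIDDEN), "w_rec": rand_matrix(HIDDEN, HIDDEN)}


EMBED = rand_matrix(len(VOCAB), HIDDEN)   # one vector per token
W_OUT = rand_matrix(len(VOCAB), HIDDEN)   # hidden state -> score per token
ENCODER, DECODER = make_rnn(), make_rnn()


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rnn_step(rnn, hidden, token):
    """One recurrent step: combine the current token with the previous hidden state."""
    x = EMBED[VOCAB.index(token)]
    a, b = matvec(rnn["w_in"], x), matvec(rnn["w_rec"], hidden)
    return [math.tanh(u + v) for u, v in zip(a, b)]


def encode(tokens):
    """Fold the whole source sequence into one vector."""
    hidden = [0.0] * HIDDEN
    for token in tokens:
        hidden = rnn_step(ENCODER, hidden, token)
    return hidden


def decode(hidden, max_len=10):
    """Generate tokens from the vector until the stop symbol (or max_len)."""
    output, token = [], "<stop>"  # "<stop>" doubles as the start symbol here
    for _ in range(max_len):
        hidden = rnn_step(DECODER, hidden, token)
        scores = matvec(W_OUT, hidden)
        token = VOCAB[scores.index(max(scores))]  # greedy choice
        if token == "<stop>":
            break
        output.append(token)
    return output


vector = encode(["how", "are", "you"])
reply = decode(vector)
```

Note that the source has three tokens while the reply can have any length up to `max_len`, which is the point about source and target lengths not needing to match.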
The initial big application for sequence-to-sequence models was language translation (Cho et al., 2014). For example, one would have a set of source sentences in English and their corresponding translations into Spanish. The model would then learn to encode the source sentence in English into a vector and then to decode that vector into the corresponding sentence in Spanish. To generate conversations, sequence-to-sequence models treat a statement as the source language, such as “How are you?” and they treat the response as the target language, such as “I am great!” An example is shown below.
The problem with sequence-to-sequence models is that they are devoid of meaning. They can work for just about any kind of problem that can be cast as translating one sequence into another, but they don’t make use of any special machinery or knowledge for language understanding. Another problem with using sequence-to-sequence models for chatbots (and using deep learning in general for chatbots) is that they are insensitive to specifics. If I ask it to buy me 14 chickens, it doesn’t treat the number 14 as particularly special, and it might be just as likely to buy me 140.
The most recent sequence-to-sequence chatbots use generative adversarial networks (GANs). GANs have a discriminator that tries to tell if the generated response was from a real conversation or generated by the model. GANs are all the rage in image processing, but they are still a work in progress for language.
Intention-based agents
Intention-based agents understand language as commands, and they use that understanding to perform tasks in the world. Examples of intention-based agents include Amazon’s Alexa, Google Home, Apple’s Siri, and Microsoft’s Cortana. Understanding what we say as a command language requires solving two problems:
- Identifying what the user wants the machine to do (the “intent”).
- Figuring out the details of the intent so the machine can take action.
For example, if we ask our assistant to play Steve Winwood, the assistant first needs to understand that we want it to play music (the intent), and then it must understand that, in particular, we want to hear Steve Winwood (the details).
Consider the example below about determining intent. When the user asks about chickens, the machine has to figure out which of its four capabilities matches the intent of the person.
The assistant can determine the intent using either keywords or text-based classification. To use keywords, you simply associate words and phrases with intents. To do text-based classification, you label a bunch of statements with the correct intents and then train a classifier over them. You can train a classifier using a bag-of-words representation with the Python library scikit-learn, as described here. If you have lots of labeled data, another option for learning intents is to use deep learning with a convolutional neural network (CNN) in TensorFlow. An implementation can be found here.
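The keyword approach is simple enough to sketch directly. The four intents and their trigger phrases below are hypothetical stand-ins for whatever capabilities a real assistant has.

```python
# Hypothetical intents and trigger phrases; a real assistant would define its own.
INTENT_KEYWORDS = {
    "order_groceries": ["get me", "buy", "grab me some", "order"],
    "play_music": ["play", "listen to"],
    "set_timer": ["timer", "remind me"],
    "get_weather": ["weather", "forecast"],
}


def detect_intent(statement):
    """Return the intent whose phrases appear most often, or None if none appear."""
    text = statement.lower()
    scores = {
        intent: sum(phrase in text for phrase in phrases)
        for intent, phrases in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(detect_intent("Get me fourteen chickens"))
print(detect_intent("Play Steve Winwood"))
```

Plain substring matching is brittle (“play” also matches “display”), which is one reason to move to a trained classifier once you have labeled statements.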
Once the agent has determined the intent (in our example, to order groceries), it needs to convert the statement details into a machine-readable form, such as a Python dictionary, as shown below.
Having the command in this kind of dictionary form is called frame-and-slot semantics, where the frame is the topic and the slots are the individual features and values. Once we have the machine-readable form, the computer can do with it what it wants, just like any other instruction. This is natural language understanding.
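As a rough sketch of what a frame might look like and how a program could act on it, consider the following. The key names (`intent`, `item`, `amount`) and the handler function are my assumptions for illustration.

```python
# A hypothetical frame for "Get me fourteen chickens".
frame = {
    "intent": "order_groceries",  # the frame (topic)
    "item": "chicken",            # slot
    "amount": 14,                 # slot
}


def order_groceries(item, amount):
    # Stand-in for a real call to a grocery-ordering service.
    return f"Ordering {amount} x {item}"


# Map each intent name to the function that carries it out.
HANDLERS = {"order_groceries": order_groceries}


def execute(frame):
    """Dispatch on the intent and pass the slots as keyword arguments."""
    slots = {key: value for key, value in frame.items() if key != "intent"}
    return HANDLERS[frame["intent"]](**slots)


print(execute(frame))
```

Because the amount is stored as the integer 14 rather than as a word in a sequence, the program treats it exactly.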
One way to do natural language understanding is to use context-free grammars (compositional semantics). Context-free grammars consist of production rules that have a single nonterminal on the left and a string on the right side that can consist of terminals and non-terminals. To each of these production rules, we add a semantic attachment. An example set of rules is shown below.
Nonterminals begin with ‘$’. The first rule says that an order consists of an intent to purchase and an item and amount of that item. The semantics of that rule are combined via a dictionary union of the semantics of $Purchase (s) and the semantics of $ItemAmount (s). The next rule says that an intent to purchase is determined by the function is_purchase, as shown below. (This, of course, is a compact way to describe a set of rules: one with “get me” on the right-hand side, another with “buy”, and another with “grab me some.”)
The next rule says that $ItemAmount consists of an $Amount and an $Item, which are defined in the following two rules. Those two rules make use of the functions get_item and get_number, shown in example form below.
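Here is one way the three functions and the dictionary union could look. The function names come from the text, but the bodies, the word lists, and the dictionary keys are my own guesses at a minimal version.

```python
# Illustrative vocabularies (assumed); a real grammar would cover far more.
PURCHASE_PHRASES = {"get me", "buy", "grab me some"}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "twelve": 12, "fourteen": 14}
ITEMS = {"chicken": "chicken", "chickens": "chicken", "egg": "egg", "eggs": "egg"}


def is_purchase(phrase):
    """Semantics for $Purchase, or None if the phrase is not a purchase request."""
    if phrase.lower() in PURCHASE_PHRASES:
        return {"intent": "order_groceries"}
    return None


def get_number(phrase):
    """Semantics for $Amount: accepts digits or a known number word."""
    word = phrase.lower()
    if word.isdigit():
        return {"amount": int(word)}
    if word in NUMBER_WORDS:
        return {"amount": NUMBER_WORDS[word]}
    return None


def get_item(phrase):
    """Semantics for $Item: maps singular and plural forms to one item name."""
    item = ITEMS.get(phrase.lower())
    return {"item": item} if item else None


def union(left, right):
    """Semantic attachment for the two-nonterminal rules: a dictionary union."""
    return {**left, **right}


# $ItemAmount -> $Amount $Item, then $Order -> $Purchase $ItemAmount
item_amount = union(get_number("fourteen"), get_item("chickens"))
order = union(is_purchase("Get me"), item_amount)
print(order)
```

Here the rules are applied by hand in the right order; finding that order automatically is the job of the parser.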
Each of those rules has an associated semantics at the end. We use parsing to apply those production rules to text to get semantics. Parsing is done using bottom-up dynamic programming. Dynamic programming consists of solving subproblems and then reusing those solutions over and over again. The parsing process builds up the parse by looking at pairs of words, and for each pair, it loops over all n possible words in between. Because the parsing algorithm deals with pairs of sub-parses, the production rules above each have either a single terminal or two non-terminals on the right-hand side. If your grammar is not like this (but still has a single nonterminal on the left-hand side of each production rule) this is not a problem; you can always convert a context-free grammar to this form (called Chomsky Normal Form).
The figure below shows the order the parsing algorithm follows as it considers pairs of words: from ‘Get’ to ‘Get’, then from ‘me’ to ‘me’, then from ‘Get’ to ‘me’, then from ‘fourteen’ to ‘fourteen’, and so on.
The next two figures show the production rules invoked and the semantics at each time point, respectively. For example, at box 8 we are looking at the words from ‘fourteen’ to ‘chickens.’ This corresponds to the rule with $ItemAmount on the left-hand side, which comes from box 4 (the parse on ‘fourteen’) and box 7 (the parse on ‘chickens’). Box 9 brings together box 8 and box 3 to invoke the rule with $Order on the left-hand side.
Of course, the parser also tries many other combinations that don’t fit any of the rules. In general, if there are n words in the sentence and R production rules in the grammar, it tries n*n boxes and looks at n boxes in between for each, trying to apply each of R rules. In total, the running time is on the order of n*n*n*R. Often, there can be many valid parses, and a scoring algorithm can be used to choose which one to use.
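The chart-filling loop can be sketched in a few lines of Python. This is a simplified stand-in, not the SippyCup implementation: the grammar is a tiny hypothetical one, and the lexical functions are allowed to match short phrases such as “get me”, which departs slightly from strict Chomsky Normal Form.

```python
from collections import defaultdict

# Lexical rules (assumed): category -> function from a phrase to semantics or None.
LEXICAL = {
    "$Purchase": lambda p: {"intent": "order_groceries"} if p in ("get me", "buy") else None,
    "$Amount": lambda p: {"amount": {"two": 2, "fourteen": 14}[p]} if p in ("two", "fourteen") else None,
    "$Item": lambda p: {"item": p.rstrip("s")} if p in ("chickens", "eggs") else None,
}
# Binary rules: (left child, right child) -> parent. Semantics is a dictionary union.
BINARY = {
    ("$Purchase", "$ItemAmount"): "$Order",
    ("$Amount", "$Item"): "$ItemAmount",
}


def parse(sentence):
    """Bottom-up chart parse; returns the semantics of $Order or None."""
    words = sentence.lower().split()
    n = len(words)
    chart = defaultdict(dict)  # (start, end) -> {category: semantics}
    for end in range(1, n + 1):                 # n * n spans in total ...
        for start in range(end - 1, -1, -1):
            phrase = " ".join(words[start:end])
            for category, fn in LEXICAL.items():
                semantics = fn(phrase)
                if semantics is not None:
                    chart[start, end][category] = semantics
            for mid in range(start + 1, end):   # ... each with up to n split points ...
                for (left, right), parent in BINARY.items():  # ... and R rules
                    if left in chart[start, mid] and right in chart[mid, end]:
                        chart[start, end][parent] = {
                            **chart[start, mid][left],
                            **chart[mid, end][right],
                        }
    return chart[0, n].get("$Order")


print(parse("Get me fourteen chickens"))
```

The three nested loops over `end`, `start`, and `mid`, with the rule loop inside, are where the n*n*n*R running time comes from. This sketch keeps only one parse per category and span, so it does no scoring of alternative parses.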
There is code available in Python for creating grammars and running the parser as part of the SippyCup library.
Conversational agents
Conversational agents expand on intention-based agents to have multi-turn conversations. To do this, they must keep track of the state of the conversation and know when the person wants to talk about something else.
Conversational agents aren’t yet common, but imagine an agent like the one I described here. The agent would stay with a person for their whole life. It would start out as a companion when the person was a child, where it would live as a cell-phone app on the parent’s device. It would have a cartoon face, and it would learn about the child and teach her about the world. For example, if the agent knew that the child loved giraffes, it could use that to teach her math. “If you have 7 giraffes and buy 3 more, how many would you have?” It could also help with cognitive biases and over-discounting the future. “If you eat your chocolate bear today, how are you going to feel tomorrow when you no longer have it?”
The developers of the app would be working furiously behind the scenes, and the agent would become more sophisticated as the child grew older. When the child became an adult, the agent would become her operating system. It could provide turn-by-turn directions in life, such as guiding her through how to fix a sprinkler system. The agent could then serve as her assistant when she got old. Imagine that she is standing in the kitchen and can’t remember how to make coffee. The app could use cameras in her home and could guide her through the process, allowing her to live independently longer.
Conversational agents such as this one need a dialog manager to handle long conversations. A dialog manager must take the human through all of the things the chatbot wants to talk about. For example, it could first want to teach the child math skills, and then it could want to talk about the solar system. The dialog manager also needs to recognize when she wants to talk about something else and queue the current topics for later.
RavenClaw from CMU is probably the best-known dialog manager (see Bohus and Rudnicky, 2009). RavenClaw consists of dialog agents, which are little programs organized hierarchically that correspond to different bits of conversation. For example, there could be a general dialog agent about cooking that wants to discuss multiple things related to preparing food.
RavenClaw has a dialog stack and an expectation agenda. The dialog stack is a stack of dialog agents to keep track of all the things the chatbot wants to talk about. The expectation agenda is a data structure to keep track of what the chatbot expects to hear. For example, if the child says “yes” she probably isn’t answering a question from thirty minutes ago like Rain Man; the answer probably corresponds to the last question asked. But if she says, “I remember! Saturn is the one with rings!” she may be talking about a previous topic, and the dialog manager needs to match the utterance against all of the expectations in the agenda to find the right one.
During dialog execution, RavenClaw alternates between an execution phase and an input phase. During the execution phase, it invokes the top dialog agent on the dialog stack and lets it talk. It also sets up what the machine expects to hear in the expectation agenda. During the input phase, RavenClaw processes what the person said and updates its knowledge, which it uses to determine which dialog agent can respond during the execution phase.
Consider the example below with a dialog stack and an expectation agenda.
- Chatbot just finished asking, “What is 4 + 5?”
- Child says: “Do you like Mr. Fluffles?”
- Chatbot responds: “Very nice. Is he your favorite?”
The chatbot is expecting to hear an answer to the math problem, which is why “9” is at the top of the expectation agenda. The child instead says something completely different, so the algorithm searches the expectation agenda for what she might be talking about. It finds that she is talking about toys. The chatbot then could move talking about toys to the top of the dialog stack, so it could generate an appropriate response, such as “Very nice. Is he your favorite?” The chatbot would also modify the expectation agenda to match that it expects to hear an answer to this new question.
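The bookkeeping in this example can be sketched with two Python lists. This is my own toy version and not RavenClaw’s actual data structures or API; the topic names and the matching tests are hypothetical.

```python
import re

# Hypothetical expectations: topic -> test for utterances that belong to it.
EXPECTATIONS = {
    "math": lambda text: re.fullmatch(r"\d+", text.strip()) is not None,
    "toys": lambda text: "mr. fluffles" in text.lower(),
    "solar_system": lambda text: "saturn" in text.lower(),
}


def handle(utterance, stack, agenda):
    """Find the topic the utterance matches and move it to the top of both structures."""
    for topic in agenda:                 # most expected topic is checked first
        if EXPECTATIONS[topic](utterance):
            if topic in stack:
                stack.remove(topic)
            stack.append(topic)          # top of the stack is the end of the list
            agenda.remove(topic)
            agenda.insert(0, topic)      # this topic is now the most expected
            return topic
    return None                          # nothing matched; the bot should ask again


stack = ["solar_system", "toys", "math"]    # "math" is on top: the bot just asked 4 + 5
agenda = ["math", "toys", "solar_system"]   # an answer like "9" is expected first
topic = handle("Do you like Mr. Fluffles?", stack, agenda)
print(topic, stack, agenda)
```

After the call, the math topic sits just below toys on the stack, so the chatbot can return to the unanswered math question once the toy conversation ends.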
RavenClaw isn’t based on machine learning, but we can alternatively use a learning-based approach to build conversational agents. We can treat creating a conversational agent as a reinforcement learning (RL) problem (Scheffler and Young, 2002). Reinforcement learning problems take place within a Markov decision process (MDP) consisting of a set of states, a set of actions, transition probabilities between states, and a reward function that provides a reward for being in a state s and taking an action a.
In the context of chatbots, states are things such as what the bot knows (questions it has answered), the last thing the bot said, and the last thing the user said. Actions are making particular statements, and reward comes from meeting a goal state, such as the child giving the correct answer to a math problem or, in a different domain, successfully completing a travel reservation.
We can use reinforcement learning to learn a policy that gives the best action a for being in state s. As one can imagine, learning a policy requires a lot of training, and it is difficult to pay people to sit and talk with the agent to generate enough experience, so researchers train their agents with simulated experience (Scheffler and Young, 2002).
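To give a feel for learning a policy from simulated experience, here is a small sketch using tabular Q-learning on a made-up travel-booking dialog. Q-learning is my choice of algorithm for illustration and is not necessarily what the cited work used; the states, actions, rewards, and the simulated user are all assumptions.

```python
import random

ACTIONS = ["ask_destination", "ask_date", "book"]


def simulated_user(state, action, rng):
    """Hypothetical simulator: returns (next_state, reward, done)."""
    has_destination, has_date = state
    if action == "book":
        if has_destination and has_date:
            return state, 10.0, True        # goal state: reservation completed
        return state, -5.0, False           # tried to book without the details
    understood = rng.random() < 0.8         # the answer is sometimes misunderstood
    if action == "ask_destination" and understood:
        has_destination = True
    if action == "ask_date" and understood:
        has_date = True
    return (has_destination, has_date), -1.0, False  # every turn has a small cost


def train(episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.2, seed=0):
    """Learn Q(state, action) by talking to the simulated user."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = (False, False)               # the bot knows nothing yet
        for _ in range(50):                  # cap on the number of turns
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)             # explore
            else:
                action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
            next_state, reward, done = simulated_user(state, action, rng)
            best_next = 0.0 if done else max(q.get((next_state, a), 0.0) for a in ACTIONS)
            old = q.get((state, action), 0.0)
            q[state, action] = old + alpha * (reward + gamma * best_next - old)
            state = next_state
            if done:
                break
    return q


def policy(q, state):
    """The learned policy: the best-valued action a for state s."""
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))


q = train()
print(policy(q, (True, False)), policy(q, (True, True)))
```

Even this toy problem takes thousands of simulated dialogs to learn, which hints at why paying people to generate the experience is impractical.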
There is an additional problem with using learning for chatbots: it is hard to know exactly what state the agent is in because of errors in speech-to-text or errors in understanding. For example, does the child want to talk about her toys, or is she telling the agent that it should be fluffier? This uncertainty calls for a Partially Observable Markov Decision Process (POMDP) (Young et al., 2013). In a POMDP the agent doesn’t know what state it is in and instead works with a distribution over the states it could be in. These types of agents are still experimental.
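Working with a distribution over states mostly comes down to Bayes’ rule. The sketch below shows only the observation step of a belief update (a full POMDP update would also account for how states change between turns), and the state names and probabilities are invented for illustration.

```python
def update_belief(belief, likelihood):
    """Bayes' rule: the posterior is proportional to prior times observation likelihood."""
    unnormalized = {state: belief[state] * likelihood[state] for state in belief}
    total = sum(unnormalized.values())
    return {state: p / total for state, p in unnormalized.items()}


# Before the child speaks, the agent has no idea which state it is in.
belief = {"talk_about_toys": 0.5, "agent_should_be_fluffier": 0.5}

# Hypothetical probability of hearing something like "Mr. Fluffles" in each state.
likelihood = {"talk_about_toys": 0.9, "agent_should_be_fluffier": 0.2}

posterior = update_belief(belief, likelihood)
print(posterior)
```

The agent never commits to a single state; it chooses its next action based on the whole posterior, which here leans toward toys without ruling out the other reading.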
Conclusion: How well does practice match theory?
Let’s compare the current state of chatbots with our theory. We started with shared reference, which is making sure we are talking about the same thing. Shared reference is pretty easy to implement as long as we limit the discussion to a set of known things. Likewise, the shared conventions for word meanings can be programmed into intention-based agents or conversational agents. We then hit a wall.
The meanings that we code in are not mapped to a grounded sense of sensation or action in the agent. When we refer to an object, the agent has never held the object or used it like we would in the real world. This means that its understanding will be limited. I mentioned previously the idea of someone hurting their back carrying a heavy beam. A computer’s possible understanding of this scenario is limited to logical inference, which leads to two disadvantages. The first is that logical inference is surprisingly hard. Inference itself is well understood, but it is hard to set up the rules to make all relevant inferences. The second disadvantage is that the agent’s understanding will be at a low resolution. A computer can only understand variables populated with values; it can’t feel the pain. I wrote a post about how we could tackle this problem by having agents simulate experience in our world, but we have a long way to go in this area.
More bad news is that the meanings that chatbots have are largely fixed, and so we currently can’t negotiate meanings with them in the discourse pyramid. There has been research into pragmatics, such as Grice’s maxims, but I don’t know of any work ready to be implemented on a general-purpose agent. The same is true for prosody. We have a long way to go.
In summary, the current state of the art is that we can build knowledge into chatbots that they can use to cooperate with us on tasks requiring minimal understanding. Building better chatbots will necessitate knowledge engineering and research into how agents can better follow and adapt to the subtleties of meaning and conversation.
If you prefer video, I cover much of the same content here.