Communication is hard (part 3)

Josh Ziegler
Navigating the Conversation
Apr 24, 2018
original image credit: Josh Reynolds/AP via NPR

Note: This is the third post in a series that details Pylon’s journey to develop a better conversational experience. To read part 1 and part 2, well, click on those links.

To recap the problem we’re trying to solve: The voice assistant ecosystem is new (at least, the democratized version of it is; conversational agents have been around for decades in the form of phone trees and automated travel agents). Everyone’s still figuring out best practices, and the major players in today’s voice assistant market have started developers down a path that rests on more or less sound linguistic footing. Their reams of documentation, though, sometimes give advice that oversimplifies the issues and can thus lead to maintenance problems for developers who want reproducible results across multiple “skills”. On the technical side, the mechanism for managing conversation state is largely left up to you as a skill developer*, and while that’s not necessarily a bad thing, it too can lead to maintenance problems: choosing a conversation model is an important decision, and it’s easy to get wrong once a conversation lasts more than a couple of exchanges.

The first model we used at Pylon to mitigate this latter issue was the finite state machine. It’s a time-honored construct that’s long been used for conversational assistants, but it’s a bit ill-matched to the speech recognition capabilities of today’s smart speakers and the expectations of users who are (not unjustifiably) beginning to believe that they live in the future. So, what’s the next step here?

Well, if you’ve been following these posts, you may have gathered that I have a certain fondness for academia. If you haven’t been following them, then…see the previous sentence. Given that, there are a few different ways I could choose to build a dialogue manager:

  • Some of the older dialogue systems (and even some not-so-old ones, like UMBRA) feed intents into a graph of first-order logic propositions and constraints, often represented in a language like Prolog; the next system prompt is then chosen by a form of constraint satisfaction. This isn’t a bad approach, and I haven’t fully ruled it out as an area of research. Prolog isn’t necessarily known for its performance, though, and response times and scalability are major concerns for systems that live in the cloud and have to (hopefully) support thousands or millions of users.
  • A lot of current research effort is being dedicated to — you guessed it — neural networks. These systems, however, are often aimed at general conversation (as opposed to task- or domain-driven conversation) and require large amounts of training data. Their results are also sometimes a bit suspect. In general, they’re not yet up to the task of consistently representing a brand’s image and personality, which is another of Pylon’s key goals.
  • There are approaches that straddle the line between the rigidity of a fully deterministic system like an FSM and the Wild West feeling of a black-box neural network. Some systems use probabilistic rules to determine the current state. This is another approach that we haven’t completely rejected, but, again, the probability calculations remove a bit more control from the developer (and conversation author) than we’d like. In addition, one of the foundational assumptions mentioned in that paper is that the probabilistic aspects of the system could fill in gaps created by imperfect speech recognition. Bad ASR creates noisy data, so to speak, and a bit of Markov modeling can smooth it out. ASR still isn’t perfect, but it’s improved greatly since that paper was written.

The big reveal

So if it’s not behind door 1, 2, or 3, what is Pylon’s approach? What if I told you we decided to model the conversation as a series of states, transitions between those states, and…OK, just kidding; it’s not an FSM. We’ve taken the understandability advantages of finite state machines and, with the help of some concepts from AI and linguistic theory, adapted them into a conversational paradigm built around frames and speech acts.

Frames

Credit for coining the term “frame” as a formal construct often goes to Marvin Minsky, who described its use in both visual and natural language processing (we’re obviously more interested in the latter here) as a kind of theoretical skeleton for building out larger units of meaning. The idea gained traction in the semantics community, where researchers (most notably Charles Fillmore) developed a form of linguistic analysis called “frame semantics”. Much effort has been poured into creating computational resources like FrameNet to classify words into the frames they evoke and, thus, the concepts related to them.

The term also has a preexisting meaning for dialogue systems, as Dan Jurafsky explains in these slides from a Stanford class that I linked in my last post.

The way we use frames at Pylon isn’t a strict interpretation of any of these traditions, though. In a Pylon assistant, a frame is a unit of conversation, roughly analogous to a collection of one or more states in a finite state machine. Each state, or node, inside a frame can be thought of as a “system intent”, or task the system wishes to accomplish; therefore, each node has one or more prompts (messages delivered in response to a user request) attached to it. These nodes are often linked to each other in a linear fashion, so that if a user says “next”, “keep going”, “tell me more”, etc., the system knows exactly where to go. Frames can be linked to each other in a similar way.
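To make that concrete, here’s a minimal sketch (in Python, which isn’t necessarily what Pylon uses) of how frames and their linearly linked nodes might be represented. The class and field names are illustrative assumptions, not Pylon’s actual API.

    # Illustrative sketch only: names and structure are assumptions, not Pylon's API.
    from dataclasses import dataclass, field
    from typing import Optional


    @dataclass
    class Node:
        """A single "system intent": a task the system wants to accomplish."""
        name: str
        prompts: list[str]               # messages delivered in response to a user request
        next_node: Optional[str] = None  # where "next" / "keep going" / "tell me more" leads


    @dataclass
    class Frame:
        """A unit of conversation: one or more nodes, usually linked linearly."""
        name: str
        nodes: dict[str, Node] = field(default_factory=dict)
        next_frame: Optional[str] = None  # frames can be chained the same way


    # A toy "search" frame with two linearly linked nodes.
    search = Frame(
        name="search",
        nodes={
            "ask": Node("ask", ["What would you like to drink?"], next_node="top_result"),
            "top_result": Node("top_result", ["How about a Tom Collins?"]),
        },
    )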

The main resemblance Pylon’s frames have to Minsky’s or Fillmore’s frames is in their relation to a conversation’s topic model. If a part of the conversation is focused on a specific entity (object), the nodes that discuss it with the user are typically grouped under a single frame, and that frame will be configured with the entity as its main topic. For example, The Bartender (live now on Alexa, Google Assistant, and Facebook) deals with cocktail recipes. There’s a part of the conversation dedicated to searching for a recipe. The Bartender asks the user what they’d like, performs a search and names the top result, and can also name subsequent results if the user doesn’t want what’s at the top of the list. This frame is named, somewhat predictably, “search”, and its main topic is “search results”, or a list of recipes retrieved from search. Modeling the topics alongside the conversation state like this allows us to do things like apply constraints to frames: if you haven’t decided which cocktail The Bartender should help you make, you can’t access the frame dedicated to guiding you through the steps of a recipe.
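For a feel of how a topic constraint like that could be expressed, here’s a hypothetical sketch; the requires field and the can_enter check are assumptions made for illustration, not a description of Pylon’s implementation.

    # Hypothetical topic-constraint sketch; the field names are assumptions.
    from dataclasses import dataclass, field


    @dataclass
    class TopicFrame:
        name: str
        main_topic: str
        requires: list[str] = field(default_factory=list)  # topics that must be filled before entry


    def can_enter(frame: TopicFrame, topics: dict) -> bool:
        """A frame is reachable only once all of its required topics have values."""
        return all(topics.get(t) is not None for t in frame.requires)


    search = TopicFrame(name="search", main_topic="search results")
    recipe = TopicFrame(name="recipe", main_topic="selected recipe", requires=["selected recipe"])

    topics = {"search results": ["Tom Collins", "Gin Fizz"], "selected recipe": None}
    assert can_enter(search, topics)      # searching is always allowed
    assert not can_enter(recipe, topics)  # no recipe walkthrough until a drink is chosen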

This is an organization you could achieve with an FSM, but we’re aiming for reproducibility here; now that this scaffolding is in place, we can spin up a new assistant very quickly compared to designing a new state diagram from scratch and hand-engineering the constraints related to each individual transition.

Speech acts

As I mentioned in a previous post, the theory behind intent classification (why you have to give Alexa/Google examples of phrases you expect the user to say and group them under named intents) can be traced further back to the idea of speech acts. The main insight is that when someone speaks a phrase or sentence, the words themselves carry meaning, but the utterance as a whole also has an intent that operates at a different level of meaning. Giving someone your email address, for example, communicates a piece of information, but (perhaps more importantly) it also communicates your permission for that person to have the information and your desire that they remember it.

There are various taxonomies of speech acts that academics have devised over the years, but there’s really only a small subset that we’ve found useful for conversational assistants:

  • navigate: Admittedly, this one doesn’t really show up in the literature, but we’ve found it a useful abstraction when dealing with conversational agents. When the assistant’s side of the conversation boils down to a choose-your-own-adventure script, many things that the user says can be interpreted as “go to page x” (to continue the metaphor). “Start over” navigates to the beginning of the conversation; “make this one” navigates to the frame that teaches you how to mix a cocktail, etc.
  • request: The user is asking for a piece of data. “I want a Tom Collins” is a request to search the recipe database for a drink with that name. This speech act is also used to handle requests like “what did you say again?” (a request to repeat the last prompt) and “I need some help” (which should be fairly self-explanatory).
  • inform: The user is giving us a piece of data that they want us to store. This can be used to store things like user contact info, preferences, etc.
  • accept/reject: These are used to handle proposals made by the system during the course of the conversation. If a node’s prompt ends with a yes/no question (e.g., “Does this sound like something you want to make?”), the results of a positive or negative answer are configured right alongside the prompt; otherwise, the contextual meaning of “yes” and “no” gets lost entirely.
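To make that vocabulary concrete, here’s a rough sketch of how this small set of speech acts might be represented and dispatched. The enum values mirror the list above; the handler logic and field names are assumptions for illustration only, not Pylon’s implementation.

    # Sketch only: the acts mirror the list above; the dispatch logic is hypothetical.
    from enum import Enum


    class SpeechAct(Enum):
        NAVIGATE = "navigate"  # "start over", "make this one"
        REQUEST = "request"    # "I want a Tom Collins", "what did you say again?"
        INFORM = "inform"      # "my email address is ..."
        ACCEPT = "accept"      # "yes", "sure"
        REJECT = "reject"      # "no", "not that one"


    def handle(act: SpeechAct, payload: dict, state: dict) -> str:
        if act is SpeechAct.NAVIGATE:
            state["position"] = payload["target"]      # jump to the named frame/node
            return f"(moved to {payload['target']})"
        if act is SpeechAct.REQUEST:
            return f"(looking up {payload['query']})"  # fetch the data the user asked for
        if act is SpeechAct.INFORM:
            state[payload["key"]] = payload["value"]   # remember what the user told us
            return "Got it."
        # accept/reject resolve against the proposal attached to the current node's prompt
        proposal = state["pending_proposal"]
        return proposal["on_yes"] if act is SpeechAct.ACCEPT else proposal["on_no"]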

These basic speech acts form the foundation for our intent naming scheme, which can be described roughly as act.frame.node or act.data; the former is used for navigation, the latter for requests and informative utterances. Again, note the reproducibility: We’re not deciding on an agent-by-agent basis which actions are supported and giving our intents names like PlayMusicIntent or OrderPizzaIntent (seriously, why has the common advice been to end your intent names with “intent”?); we have a set of well-known actions that carry meaning across different types of skills, and the rest of each intent name is explicitly tied to the conversation’s structure.
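As an illustration, here are a few hypothetical intent names under that scheme, along with a tiny parser; the specific names are invented for this example rather than taken from a real Pylon skill.

    # Invented example intent names following the act.frame.node / act.data scheme.
    EXAMPLE_INTENTS = [
        "navigate.search.ask",    # "let's find a drink": go to the search frame's opening node
        "navigate.recipe.steps",  # "make this one": walk through the chosen recipe
        "request.recipe",         # "I want a Tom Collins"
        "inform.email_address",   # "my email is ..."
    ]


    def parse_intent(name: str) -> dict:
        """Split an intent name into its speech act and its target (frame.node or data key)."""
        act, _, target = name.partition(".")
        return {"act": act, "target": target}


    assert parse_intent("navigate.search.ask") == {"act": "navigate", "target": "search.ask"}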

And that’s just about it. To summarize, we’ve replaced our previously ad hoc conversation state and intent model with a system that’s reproducible and sensitive to the linguistic context that sets interactions with a voice assistant apart from those with, say, a vending machine. Our conversations are composed of a collection of frames, each in turn a collection of nodes focused mainly on a single topic. Transitions among these nodes and frames are governed by a well-defined set of universally supported speech acts and constraints related to the topics covered in the conversation. It might be a mouthful to describe if you’re not already familiar with the terminology, but once you’re past that part, fitting a conversation flow in your head (and reading one on a screen) becomes much easier. What this model gives you is, well, a frame of reference for organizing your thoughts about your second, fifth, and tenth assistant model.

There are more configuration and customization features that set Pylon’s platform apart from some of the other organizational schemes we’ve seen in use, but I didn’t want this post to read too much like a documentation page. If you’re interested in learning more, though, we’d love to hear from you! Head over to our site and get in touch!

* A notable exception to this might be Microsoft’s newest development framework for Cortana skills, which does allow for persistent conversation storage. It has its own problems, though, and I don’t want to spend too much time on it here.
