Communication is hard (part 2)

or, How our first approach to conversation management caused as many problems as it solved

Josh Ziegler
Navigating the Conversation
10 min read · Jan 22, 2018


“Here I am, brain the size of a planet, and all you let me do is follow these silly rules.”

In my last post, I said that the current generation of voice assistant platforms (Alexa, Google Home, etc.) are optimized for “conversations” that are extremely basic. For example, “Order my favorite pizza from Domino’s” works great, while, “Help me find a great gift for Mom” does not. The promise of voice assistants — and the expectation of consumers — is that more difficult requests will be resolved by conversational agents. At Pylon, we’re building the technology that can enable these assistants to address more robust queries from consumers.

In this post, I want to walk you through our first attempt at Pylon to make a longer, more nuanced conversation work and some of the pitfalls we encountered with our initial architecture. This will get a little technical, but I’ll explain the concepts as I go, so don’t worry about getting lost if you’re not an AI researcher or software engineer.

It’s something of a long read, but the organizational approach I’m talking about here is a fairly popular one. If I’m going to try to poke holes in it (and believe me, I am), it deserves a full treatment. By the end, I’ll have set the stage for introducing Pylon’s new approach in a future post.

Inspiration + Instruction: How hard can it be?

When we started, research showed that 51% of Echos were in kitchens, so we figured a conversational agent that helps you get dinner on the table would be a good way to reach customers. Our assistant works with you to figure out what to cook and then walks you all the way through preparation. It’s called “Tasted”, and it’s currently available on Amazon’s Alexa Skills Store, Google Assistant, Facebook Messenger, and Slack. Here’s a video overview of the concept:

A conversation with Tasted consists of two main tasks: choosing a recipe and preparing it. When you’re actually building the assistant, though, these tasks have to be broken down into individual interactions:

  • Greet the user
  • Let them search for a recipe (or make a suggestion to start the conversation)
  • Present search results
  • Handle the user’s selection from those results
  • Give more detailed information about a recipe if the user requests it (ingredients, an overview of preparation, and special equipment required)
  • Walk them through each step in preparing the dish

In addition, the user may want a recipe’s ingredients sent via text message, or to save the recipe so they can make it later. After adding in support for some common phrases like ‘yes’, ‘no’, ‘next’, ‘go back’, etc., we’re somewhere around 15–20 different actions the user might want to tell us to perform (or, to use the industry term, 15–20 “intents”). Given the various information we need to collect from the user during the course of a conversation, there’s somewhere on the order of 10–15 distinct states that the system can be in at any given time (about to search for a recipe, presenting the results of a search, cooking a recipe, etc.).
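To make that inventory concrete, here’s a rough sketch of what it looks like written down as code. The names are invented for this post, not pulled from the actual Tasted implementation, and a real agent has more of both:

```python
from enum import Enum, auto

# Hypothetical state and intent names, for illustration only.
class State(Enum):
    WELCOME = auto()          # greeting the user
    AWAITING_SEARCH = auto()  # about to search for a recipe
    SEARCH_RESULTS = auto()   # presenting search results
    RECIPE_INFO = auto()      # ingredients, overview, special equipment
    COOKING = auto()          # walking through preparation steps

class Intent(Enum):
    SEARCH_RECIPE = auto()
    SELECT_RESULT = auto()
    GET_INGREDIENTS = auto()
    COOK_IT = auto()
    NEXT_STEP = auto()
    REPEAT = auto()
    GO_BACK = auto()
    YES = auto()
    NO = auto()
    TEXT_INGREDIENTS = auto()
    SAVE_RECIPE = auto()
```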

Currently, NLU platforms, like the one Amazon provides as part of Alexa or Google’s Dialogflow, don’t give you much help managing these different states. Amazon does have the concept of “session attributes” that can help you track state, but you’re required to set and read them entirely in code; the developer tools won’t help you with them. Dialogflow has “contexts” and “follow-up intents” that you can set in its UI, but you’re still forced to manage everything as if intents were the main component of the conversation, and keeping all the input/output contexts straight in your head becomes its own problem as your conversation gets more complex. Again, an “intent” is just a fancy word for what the user just asked for, and how you handle it depends on both what’s happened so far in the conversation and what the user wants to accomplish next.
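Here’s a minimal sketch of what juggling Alexa session attributes by hand looks like, assuming a raw IntentRequest and no SDK. The request and response shapes are the standard Alexa JSON envelope; the state name and intent name are hypothetical, not from our actual skill:

```python
# A minimal sketch, assuming a raw Alexa IntentRequest (no SDK). The state and
# intent names here are hypothetical, not from the real Tasted skill.
def handle_request(event):
    # Whatever we stashed in session attributes on the last turn comes back here.
    attributes = event.get("session", {}).get("attributes", {}) or {}
    state = attributes.get("state", "AWAITING_SEARCH")
    intent = event["request"]["intent"]["name"]

    # The platform doesn't know "state" exists; every transition is our problem.
    if state == "AWAITING_SEARCH" and intent == "SearchRecipeIntent":
        speech, new_state = "Here's what I found...", "SEARCH_RESULTS"
    else:
        speech, new_state = "Sorry, I didn't catch that.", state

    return {
        "version": "1.0",
        "sessionAttributes": {**attributes, "state": new_state},
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": False,
        },
    }
```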

It makes sense that NLU platforms don’t dictate how a conversational developer should manage a conversation’s state: It’s a hard problem for complex conversations, and it’s not much of an issue for simpler ones, so they just let the developer deal with it.

To recap: We have states, and we have intents that move us from state to state (saying “search for a chicken recipe” takes the user from the “about to search” state to the “here are your search results” state). This is starting to sound like a job for a finite state machine, right?

As it turns out, maybe it’s not. The reasons for that are a bit technical, though, so you might want to grab a whiskey (or a coffee) while I tell you why organizing our system this way was a costly mistake.

You *say* the states are finite, but they don’t feel like it

Just in case you were following right up until I said “finite state machine”, here’s a quick crash course. Finite state machines (FSMs) are common, relatively simple computational models consisting of states and transitions that connect those states; the states may all be interconnected, but they don’t have to be.

For example, a standard vending machine can be thought of as an FSM. It starts out in the “waiting for money” state; when a user inserts money, it transitions to the “waiting for a selection” state, and so on. If the user makes a selection before inserting money, no transition happens. Those are the basics — user actions create transitions, but not all transitions are valid, depending on the current state of the system.
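If you prefer code to prose, here’s a toy version of that vending machine: a handful of states and a transition table keyed by the current state and the user’s action (illustrative only):

```python
# A toy vending-machine FSM: states plus a transition table keyed by
# (current state, user action). Invalid actions simply don't transition.
TRANSITIONS = {
    ("WAITING_FOR_MONEY", "insert_money"): "WAITING_FOR_SELECTION",
    ("WAITING_FOR_SELECTION", "make_selection"): "DISPENSING",
    ("DISPENSING", "take_item"): "WAITING_FOR_MONEY",
}

def step(state, action):
    # If the pair isn't in the table, stay put (e.g. selecting before paying).
    return TRANSITIONS.get((state, action), state)

state = "WAITING_FOR_MONEY"
state = step(state, "make_selection")  # no money yet: still WAITING_FOR_MONEY
state = step(state, "insert_money")    # now WAITING_FOR_SELECTION
```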

A quick terminology note: When I say “FSM” in this post, I’m actually talking about nondeterministic finite automata (NFA), where a single state can have more than one possible transition for the same input. Our machines also lean heavily on transitions that loop a state back onto itself, because we need to support things like the user saying, “Could you repeat that?”. Here’s a picture of a relatively small NFA:

Image credit: Wikipedia

Managing a dialogue with one of these should be easy, right? Plenty of people have thought so; a quick Google search will lead you to a small army of tutorials and even a couple FSM libraries integrated with Alexa boilerplate to help you directly hook the FSM up to requests coming in from Amazon. In fact, if you hit upon just the right combination of search keywords, you’ll end up at this slide deck from a CS course taught by Dan Jurafsky, co-author of Speech and Language Processing, which is essentially the Bible of introductory NLP (and then some). The deck is a great overview of some popular dialogue agents throughout history and the basic concepts in play, but I mention it here because slide 13 nails the problem with FSMs in far fewer words than I’m using: “too limited”.

Let’s elaborate a bit (…more). There are different ways to deal with things like knowledge of the outside world, user input processing, etc, but the basic analogy between an FSM and a conversation is:

  • States => Things the system says
  • Transitions => Things the user says
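Translated into code, the analogy might look something like this. Again, these are hypothetical names and far fewer states and intents than a real agent needs:

```python
# The same transition-table idea, applied to a conversation: each state carries
# a prompt (what the system says), and intents are the edges between states.
PROMPTS = {
    "AWAITING_SEARCH": "What would you like to cook tonight?",
    "SEARCH_RESULTS": "I found a few recipes. Want to hear about the first one?",
    "RECIPE_INFO": "This one takes about thirty minutes. Ready to cook it?",
    "COOKING": "Step one: bring a large pot of salted water to a boil.",
}

TRANSITIONS = {
    ("AWAITING_SEARCH", "SearchRecipe"): "SEARCH_RESULTS",
    ("SEARCH_RESULTS", "SelectResult"): "RECIPE_INFO",
    ("RECIPE_INFO", "CookIt"): "COOKING",
    ("SEARCH_RESULTS", "Repeat"): "SEARCH_RESULTS",  # a self-loop
}
```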

You probably have a natural “flow” in mind for your conversation — a way that makes the most sense for users to interact with your agent. Do this, do that, do a third thing … DONE! It wouldn’t make sense for a user to say “let’s cook it” right after your system welcomes them, for example (“let’s cook what?”). So you only need to handle certain utterances at certain states, which saves you the work of making all those transitions…

Here’s the problem with that: You know what state the conversation’s in, but does your user? They don’t have access to the road map that is your state diagram, so it’s entirely possible they’ll say something that doesn’t “make sense”, perhaps because their brain is still processing what you said in the previous state, not the current one.

Also, people change their minds. A user could get all the way to cooking a linguini recipe and say “You know what? Find me something with penne instead.” It’s looking like we’re going to have one of those almost-fully-connected state machines after all; if we don’t, our conversation is going to feel rigid, and our agent’s going to seem dumb — or worse, unfriendly.

Before you know it, your conversation has gone from something like the neat little graph above to something more like this:

This is (part of) an early Tasted FSM

Notice how that’s messy enough that the visualization software gave up even trying to draw smooth lines for some of those transitions. It’s less “finite state machine” and more “flying spaghetti monster”. Of course, we’ve fixed this mess by now, but that’s a story for another post.

Wait … you want that intent to do something different?

Sure, you have clean, perfectly normalized database models backing all your prompts, transitions, conditions, and the back-end actions resulting from each intent. Inevitably, though, you’re going to have to change something about the conversation, and that means editing your user interactions via SQL queries. You could put a simple CMS interface in front of your database, but I wouldn’t recommend it. That’s likely to turn your visually informative state diagram into a spreadsheet, and you’re going to get a fresh tension headache from your eyes darting up and down trying to trace an imagined conversation flow. The system we started out on had a CMS by default, and we only used it once or twice for these very reasons.

Our primary editing interface ended up being several people making conversation suggestions to an incredibly patient, competent, handsome engineer (who’s totally not the one writing this post). That engineer then edited some bespoke YAML files that had been lovingly hand-crafted from colons, dashes, and whitespace while questioning all his life decisions. The YAML represented system speech (several variations for each state of the FSM, based on the conversation context/user profile/user device at runtime) and the results of each user intent for each state. The database models were extracted from it by a simple(ish) import process. By the time we finally decided we’d had enough of all this, the YAML had accreted into over 4,500 lines of unmaintainable horror, and we started thinking there had to be a better way.

Enough already!

Before we go into the solution (or, I should say, pitch to the post that will talk about the solution), let’s distill the problems we encountered using an FSM for conversation management:

Configuration gets redundant

Part of our configuration maintenance problem came from not realizing up front that almost all user intents need to be “allowed” at every state to accommodate the way human conversation actually works. We could have made a separate YAML file that listed such “global” intents, but then we’d have had to duplicate the state names all over the place as we discovered, for example, that we’d need to handle the phrase “cook it” differently at the search results state than at the recipe info state. Such a refactor might have helped maintainability, but it wouldn’t have done anything for the next problem.
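Here’s roughly what that “allow nearly everything everywhere” shape ends up looking like, written as Python for readability even though our real config was YAML (and with invented names):

```python
# Every "global" intent gets copied onto every state, and then the per-state
# exceptions start piling up anyway. Illustrative names only.
STATES = ["AWAITING_SEARCH", "SEARCH_RESULTS", "RECIPE_INFO", "COOKING"]

TRANSITIONS = {}
for state in STATES:
    TRANSITIONS[(state, "Repeat")] = state               # self-loop everywhere
    TRANSITIONS[(state, "Help")] = state                  # explain, then stay put
    TRANSITIONS[(state, "StartOver")] = "AWAITING_SEARCH"

# ...and the same utterance still needs different handling at different states:
TRANSITIONS[("SEARCH_RESULTS", "CookIt")] = "COOKING"    # "cook it" = the top result
TRANSITIONS[("RECIPE_INFO", "CookIt")] = "COOKING"       # "cook it" = this recipe
```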

Conversation isn’t linear

“Neither are FSMs”, you say. “Just look at that first picture.” That’s true, but FSMs are a better fit when you want to support only a limited number of transitions and actively forbid certain paths, pushing users down a mostly one-way flow. They’re a poor fit when you’re trying to cooperate with a fickle human on an evolving task, and trying to let that human more or less define that task (or think they are, at least).

Subtasks are unnatural

I haven’t really touched on this at all so far, but sometimes as a conversational developer you end up with interactions that involve more than one question and answer (a conversational “turn” in NLP parlance) but aren’t really part of the “main” conversation. Maybe these interactions are optional, maybe they’re required; either way, you want to end up back at the same state in the “main” flow when the tangential one is done. You’ll likely end up modeling this as some kind of stack, with the subtask being its own FSM. Voilà: mitosis for your maintenance problems.
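If you’re curious what that stack-of-FSMs workaround tends to look like, here’s a hypothetical sketch. It assumes a tiny FSM class with step() and is_finished() methods; none of this is our production code:

```python
# A hypothetical stack-of-FSMs sketch: pushing a subtask suspends the main
# conversation, and popping it resumes the main flow at the state we left.
class SubtaskFSM:
    def __init__(self, transitions, start, end):
        self.transitions, self.state, self.end = transitions, start, end

    def step(self, intent):
        self.state = self.transitions.get((self.state, intent), self.state)

    def is_finished(self):
        return self.state == self.end

class DialogueStack:
    def __init__(self, main_fsm):
        self.stack = [main_fsm]

    def push_subtask(self, sub_fsm):
        self.stack.append(sub_fsm)   # e.g. the "text me the ingredients" exchange

    def handle(self, intent):
        current = self.stack[-1]
        current.step(intent)
        # When a subtask wraps up, fall back to whatever was underneath it.
        if current.is_finished() and len(self.stack) > 1:
            self.stack.pop()
        return self.stack[-1].state
```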

It’s all in your head

Admittedly, a large part of the problem here is mental modeling — starting with an FSM as your architecture can steer you toward certain ways of thinking about your interaction and how it “should” work. You can preempt a lot of these issues if you know about them ahead of time and adjust your FSM to fit your conversation rather than the other way around. But we don’t think you should have to.

At Pylon, we opted for another approach: Start over. Read and re-read some of the academic work out there on dialogue management (surprise: There’s a lot of it, and it’s kind of popular right now). Reframe the conversation model, starting from the concept that the user can take any action at any time.

In the next post, we’ll spend a little more time talking about this new way of modeling a conversational agent. If you got all this way and were expecting the last section to solve all your problems, my apologies. As consolation, I offer you the sympathy of someone else in the same position and a commitment to be more constructive the next time around.

Of course, I’d be remiss if I didn’t mention (again) that this is a tough problem. If you’re a company looking to enter the space but don’t want to get bogged down in configuration management, shoot Pylon an email; we’d be happy to help.

Next: Communication is hard (part 3)
