Thinking for Voice: Design conversations, not logic

“Thinking for Voice” is about more than coding; it goes to the heart of how we design conversational experiences. Most programmers approach this task the way we’ve approached coding tasks before: by focusing on how we ask for information and then process it. But this is, in some ways, completely backwards. When we think for voice, we need to focus on the conversation above all. This article should help you understand why, and how to do it. Future articles will build on this to see how we turn that conversation design into code.


As coders, our habit (and often our training) is to look at problems and break them down into smaller parts. We also want to identify parts that are similar, so we can turn them into functions that will be called from different parts of our code. Often we’ll use logic flow diagrams to help figure out how our code should flow from one part to the other, what decisions we need to make at each point, and how to reply. We typically tie the replies we make to the logic and the common functions as well.

When building voice applications, this design model can get in the way. This is because human conversations don’t break down in the same sorts of ways, don’t necessarily have the same logic flows, and often have replies that need to be more varied and situational.

We’ll take a look at an example problem and how we might have classically approached designing it, and why it leads to problems with conversational design. We’ll then look at a better approach and see why it makes our design, and eventually our code, much easier. In future stories, we’ll show how to take this design and turn it into code.

Our requirements are fairly simple: an action that can provide information about a letter or a number. In this case, we’ll mostly be providing trivia, but in the future we’ll see how we can expand this to provide less trivial things.


Don’t go with the flow

Our first instinct would be to build a diagram that breaks it down into the essential components. Probably a flowchart like the one you see here. In it, we ask a few questions, make a few decisions based on user input, do a bit of logic to figure out what to say, say it, and then prompt the user for what to do next. I’ve circled a few bits that seem to logically break it down in terms of functions.

Just from looking at it, however, we can identify a few problems. Functions like N1 and L1, or N2 and L2, seem similar to each other, but they’re different enough that we might not be able to handle them with identical functions.

Even more confusing, as we begin to implement this, is that we’re going to send a reply to the user in N1, but then immediately send another reply as part of N2, before we wait for the user to say anything back to us. Our conversational tools don’t really let us do this, so we look for workarounds. Most developers latch on to a concept called a “Followup Intent”, since it sounds like it represents how we want the conversation to flow.

A similar problem rears its head when we think about how to handle the “No” response to N2 and L2. Both of them want to go to the same function (C1), but if we try to use a “Followup Intent” for them, we discover that we can’t have the same intent be the followup from two different ones. So we seek workarounds for this as well.

There are more subtle problems with our logic flow as well, ones we may not encounter until after we build it. What if the user says something nonsensical in C1, N2, or L2? How should we behave? If they keep saying something nonsensical, at what point should we give up? How do we draw that into the logic? And what counts as nonsensical, anyway? If we ask for a letter after we get a number, that seems like a pretty common thing for humans to say in conversation, but our logic can’t handle it. How about quitting? And our message in C1 says “welcome”; can we actually change that if the user returns to the question?

We can handle all of these, but to capture them, the logic diagram gets very, very complicated. Is there a simpler way to think about it? (Spoiler: yes.)


Designing the conversation

What if, instead of focusing on what our program does, we focus on what our users are saying when they talk with us? This conversation-first approach more easily uncovers some of the issues we saw at the end of the last section. Furthermore, it keeps things fairly simple, since our design can focus on one thing at a time.

We can restructure that flowchart as a number of conversations. We’re just going to go through a few conversations that are on the “happy path”, but keep in mind that a real design would have many more of these scripts exploring other things that might happen. Later on, we’ll annotate the conversation to see what we do at each point.

Script 1:

User:  Talk to number or letter
Agent: Welcome! Would you like to hear a number or letter?
User: How about a number
Agent: 1 is the loneliest number. Would you like another?
User: Yes
Agent: 2 is the only even prime. Care for another?
User: No
Agent: Ok, would you like to hear a number or letter?

Script 2:

User:  Talk to number or letter
Agent: Welcome! Would you like to hear a number or a letter?
User: Number, please
Agent: 42 is the answer. Another one, so to speak?
User: sure
Agent: 21 is blackjack. Care for another?
User: nope
Agent: Ok, care for a letter or a number?

Script 3:

User:  Talk to number or letter
Agent: Howdy! Would you like a letter or a number?
User: Let's go for a number
Agent: A googol is 10 to the 100th power. Ready for another?
User: How about a letter this time
Agent: The letters "A" and "I" are also words. Want another?

There are a few points to learn from this:

  • People respond in many different ways when asked a question. The tools we will use collect many sample phrases, and mix in some ML algorithms, to match what the user says against the ways they might reply.
  • In real conversations, users can take the conversation in any direction. So although we’ve asked a yes or no question, they may try to take the conversation in a whole different direction instead.
  • How we reply depends on two things: 
    (1) What state we are in and 
    (2) What the user says
  • A consequence of point (1) above is that we should keep track of the user’s state to determine what we say, and the new state becomes part of the reply, even if the user doesn’t see it.
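The points above boil down to a simple rule: a reply is a function of the current state and the matched intent. A minimal sketch of that idea might look like this (every name here — `pickReply`, `replyState`, the intent ids — is illustrative, not from any particular library):

```javascript
// A reply is chosen from the current state plus the matched intent.
// All names here are illustrative, not a real tool's API.
const responses = {
  "prompt": {
    "intent.welcome": "Welcome! Would you like to hear a number or letter?",
    "intent.no":      "Ok, would you like to hear a number or letter?"
  },
  "number": {
    // "any" stands in for intents where the state alone decides the reply.
    "any": "1 is the loneliest number. Would you like another?"
  }
};

function pickReply(replyState, intent) {
  const forState = responses[replyState] || {};
  // Prefer an intent-specific reply; fall back to the state-wide one.
  return forState[intent] || forState["any"];
}
```

Note that the same state (“prompt”) produces different wording depending on how we arrived there — exactly the “welcome” variation we wanted.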

We commonly break these conversations down by what the user says. These are called Intents, since they reflect what the user is intending to say or do, rather than the specific words they might use. Intents are never about what our code will do; instead, they focus on one of the most important points in any conversation: what the other person is saying to us.
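To make this concrete, many different phrasings from the scripts above all map to the same intent. Here is a toy illustration of that grouping — the phrase lists and the exact-match logic are hypothetical stand-ins, since real tools such as Dialogflow use ML to generalize beyond the sample phrases:

```javascript
// Illustrative only: several user phrasings all map to one intent.
const trainingPhrases = {
  "intent.number": ["How about a number", "Number, please", "Let's go for a number"],
  "intent.yes":    ["Yes", "sure", "yep"],
  "intent.no":     ["No", "nope", "not now"]
};

// A toy matcher. Real tools match fuzzily via ML, not by exact string.
function matchIntent(utterance) {
  const text = utterance.trim().toLowerCase();
  for (const [intent, phrases] of Object.entries(trainingPhrases)) {
    if (phrases.some(p => p.toLowerCase() === text)) return intent;
  }
  return "intent.fallback"; // nothing matched: a "nonsense" catch-all
}
```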

With these points in mind, let’s add a little more information to the conversation, breaking it down by what the user says at each stage. We will add the Intent that might get matched and then what our code would do — both in terms of the state set and the reply sent.

Script 1:

User:  Talk to number or letter 
Match: intent.welcome
Logic: Set replyState to "prompt"
Pick a response for the current replyState ("prompt")
and the intent that was matched ("intent.welcome")
Agent: Welcome! Would you like to hear a number or letter?
User:  How about a number 
Match: intent.number
Logic: Set replyState to "number"
Pick a response for the current replyState ("number")
Agent: 1 is the loneliest number. Would you like another?
User:  Yes 
Match: intent.yes
Logic: Pick a response for the current replyState ("number")
Agent: 2 is the only even prime. Care for another?
User:  No 
Match: intent.no
Logic: Set replyState to "prompt"
Pick a response for the current replyState ("prompt")
and the intent that was matched (not "intent.welcome")
Agent: Ok, would you like to hear a number or letter?

With this, we can see that our replies are based on a combination of the current state and the user’s intent. (Our state could be more complex, to keep track of what the user has heard, how many times they’ve visited, etc. This is very simplified.)

We also see that “yes” doesn’t change the state. It doesn’t need to; some of our processing doesn’t care which intent triggered it, and that’s ok. Other parts might vary the conversation slightly, for example by saying “welcome” only when we start the conversation.

Finally, we’ll also note that we’ve combined our replies (the information and the prompt) into a single response. This is how back-and-forth conversations typically work: we reply only when our conversational partner says something first.
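The annotated script can be sketched as one handler that performs both steps in order — possibly change the state, then pick a reply from the (possibly new) state and the matched intent. Everything here is a hypothetical sketch, not a real webhook API:

```javascript
// Hypothetical handler: step 1 updates state, step 2 picks the reply.
const facts = {
  number: ["1 is the loneliest number.", "2 is the only even prime."],
  letter: ['The letters "A" and "I" are also words.']
};

function handleIntent(session, intent) {
  // Step 1: some intents change replyState...
  if (intent === "intent.welcome" || intent === "intent.no") {
    session.replyState = "prompt";
  } else if (intent === "intent.number") {
    session.replyState = "number";
  } else if (intent === "intent.letter") {
    session.replyState = "letter";
  }
  // ..."intent.yes" deliberately changes nothing: stay in the current state.

  // Step 2: the reply comes from the current state plus the matched intent.
  if (session.replyState === "prompt") {
    return intent === "intent.welcome"
      ? "Welcome! Would you like to hear a number or letter?"
      : "Ok, would you like to hear a number or letter?";
  }
  const pool = facts[session.replyState];
  session.heard = session.heard || 0; // simplified: one shared counter
  return pool[session.heard++ % pool.length] + " Care for another?";
}
```

Notice that the reply is the single combined thing — a fact plus the next prompt — just as in the scripts.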

If we look at script 2, we’ll see it plays out identically:

User:  Talk to number or letter 
Match: intent.welcome
Logic: Set replyState to "prompt"
Pick a response for the current replyState ("prompt")
and the intent that was matched ("intent.welcome")
Agent: Welcome! Would you like to hear a number or a letter?
User:  Number, please 
Match: intent.number
Logic: Set replyState to "number"
Pick a response for the current replyState ("number")
Agent: 42 is the answer. Another one, so to speak?
User:  sure 
Match: intent.yes
Logic: Pick a response for the current replyState ("number")
Agent: 21 is blackjack. Care for another?
User:  nope 
Match: intent.no
Logic: Set replyState to "prompt"
Pick a response for the current replyState ("prompt")
and the intent that was matched (not "intent.welcome")
Agent: Ok, care for a letter or a number?

In fact, if you look at the “Match” and “Logic” parts, they are identical conversations as far as the program is concerned. The only difference is the exact words the user used and the exact phrases we replied with. They are, fundamentally, the exact same conversation.

How does this play out in script 3?

User:  Talk to number or letter 
Match: intent.welcome
Logic: Set replyState to "prompt"
Pick a response for the current replyState ("prompt")
and the intent that was matched ("intent.welcome")
Agent: Howdy! Would you like a letter or a number?
User:  Let's go for a number 
Match: intent.number
Logic: Set replyState to "number"
Pick a response for the current replyState ("number")
Agent: A googol is 10 to the 100th power. Ready for another?
User:  How about a letter this time 
Match: intent.letter
Logic: Set replyState to "letter"
Pick a response for the current replyState ("letter")
Agent: The letters "A" and "I" are also words. Want another?

Here, the user has suddenly requested we jump to an entirely different state. But that isn’t a problem: our program treats this the same as if they had requested that state from the prompting question, and the handler for that state reacts the same way.

So instead of having to build in many followup intents, we aim to capture what the user is saying, and then use our webhook to change the state based on that.


What happens next?

If we were doing a more full-featured design, we would add conversations that reflected other “not so happy path” elements such as:

  • What happens when the user asks to quit at any point?
  • What if they say something nonsensical? What if they keep saying nonsense?
  • If they don’t respond to us, how should we behave?
  • How do we handle running out of new factoids to talk about?
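Even these cases fit the same state-plus-intent pattern. For instance, repeated nonsense can be handled by a fallback intent whose handler counts attempts in the session state — a minimal sketch, with hypothetical names and an assumed limit of two reprompts:

```javascript
// Hypothetical fallback handling: reprompt a couple of times, then
// give up gracefully. The counter lives in session state like any
// other piece of conversation state.
const MAX_FALLBACKS = 2;

function handleFallback(session) {
  session.fallbacks = (session.fallbacks || 0) + 1;
  if (session.fallbacks > MAX_FALLBACKS) {
    session.done = true; // signal that the conversation should end
    return "Sorry, I'm having trouble. Let's try again later. Goodbye!";
  }
  return "Sorry, I didn't get that. Would you like a number or a letter?";
}

function handleMatchedIntent(session) {
  // Any successfully matched intent resets the fallback counter.
  session.fallbacks = 0;
}
```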

We might even consider additional features:

  • Can we add other categories of information besides letters and numbers?
  • Might users want to ask more information about our most recent result?

What we would see is that all these cases can be handled the same way as the cases we already reviewed, by possibly changing state and then doing something based on the current state and the Intent that was just triggered by the user.

Once we were done with the design, we would want to code this, of course! We’ll start to look into how we do this next.

Based on a question and answer at stackoverflow.com. My thanks to user Stormsson for the question and their permission to expand on it for this story.