Communication is hard (part 1)
Why it’s so tough to deliver a natural experience on voice assistant platforms
At Pylon ai, we’re working to build compelling experiences for the red-hot voice assistant ecosystem: Amazon’s Echo line, Google Home, Microsoft Cortana, and the flood of third-party devices trying to break into the market. We’re not alone in this; the manufacturers themselves are investing heavily in making their devices compatible with as many smart home gadgets as they can get their hands on, acquiring companies that build supporting technology, and partnering with content providers — even each other. There’s a burgeoning cottage industry supporting developers in this new space; Sayspring and VoiceLabs come to mind, but they’re not alone.
As we learned years ago when Apple debuted Siri on the iPhone, and are now learning all over again, these platforms tend to fall far short of what we expect from a conversational partner. When Siri first launched, it was mocked for its inability to understand requests that its human users found elementary. Apple has been steadily improving both voice recognition and request parsing, but that John Malkovich commercial might still be the closest we’ve gotten to a “conversation”.
In this series of posts, I’ll talk about where I think the platforms are currently dropping the ball and walk you through some of Pylon’s journey, missteps and all, to make it all work.
Do one thing, and do it mediocrely
The key problem I see in the current generation of tools from the voice platform leaders is that all they’ve really done is bring the telephone-based airline booking assistant from the 1970s and ’80s into your kitchen. Sure, now the travel agent can turn on your lights, play music, and set your thermostat; but for all the deep learning that’s gone into their ASR (automatic speech recognition) and NLU (natural language understanding) systems, they’re still aiming at the same target.
The “target” is the single-turn interaction. Ask a question (or give a command), get an answer, leave. Sure, sometimes you have to answer two or more questions to accomplish your goal (e.g., provide a restaurant, day, and time to book a reservation), but that’s still essentially a single interaction. This is perfectly fine for a lot of tasks, but designing the tooling and documentation to focus on this interaction model hamstrings the grand promise of a device that can understand human speech.
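To make the reservation example concrete, here is a minimal sketch of that form-filling pattern in Python. The field names and the BookingRequest class are hypothetical illustrations, not part of any platform's SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical fields for a restaurant booking; not taken from any real platform.
REQUIRED_FIELDS = ("restaurant", "day", "time")


@dataclass
class BookingRequest:
    """Collects the values needed to complete one booking."""
    values: Dict[str, str] = field(default_factory=dict)

    def missing(self) -> List[str]:
        """Return the required fields the user has not supplied yet."""
        return [name for name in REQUIRED_FIELDS if name not in self.values]

    def next_prompt(self) -> Optional[str]:
        """Return the next question to ask, or None once the form is full."""
        missing = self.missing()
        if not missing:
            return None
        return f"What {missing[0]} would you like?"


request = BookingRequest()
request.values["restaurant"] = "Luigi's"
print(request.next_prompt())  # asks for the day next
```

However many prompts it takes, the exchange ends as soon as the form is full, which is why it still counts as a single interaction.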
I’m not the first person on the Internet to bring this up (and I doubt the author of the Infermedica post quoted below was either), but I was complaining about it long before I read that post. Since these devices speak our language, we want them to understand us, and mostly they don’t. I do have a difference of opinion with the post’s author, though, and explaining it will help set the stage for some concepts I want to explore in future posts. The problem is really more a failure in the way these voice assistant developer platforms are presented than a flaw in the author’s interpretation of them; I don’t think the users deserve the blame here.
Intents and slots aren’t a bad model
Here’s the relevant section from the Infermedica post:
Even when using a “custom skill”, you’re forced to enumerate a fixed set of intents that you allow the user to express. An intent might be a desire to order coffee or find movie titles […] Some intents may have slots for parameters, such as a city name or type of coffee […] It’s hard to imagine a casual chat having this level of rigor.
First, this is framed as a complaint about Amazon’s developer setup, when it’s really a universal thing. You’ll run into the same conversational primitives when you’re designing an action for Google Assistant or a voice-enabled Cortana app.
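For readers who haven't met these primitives, here is a rough sketch of the intent-and-slot idea in Python. It is a toy under stated assumptions: the intent names, sample utterances, and the compile_sample and classify helpers are all made up for illustration, and real platforms use trained language-understanding models rather than the regular-expression matching used here.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Intent:
    name: str
    samples: List[str]  # sample utterances; {braces} mark slots


# Hypothetical intents, loosely based on the coffee and movie examples quoted above.
INTENTS = [
    Intent("OrderCoffee", ["order a {size} {drink}", "get me a {drink}"]),
    Intent("FindMovies", ["what movies are playing in {city}"]),
]


def compile_sample(sample: str):
    """Turn 'order a {size} {drink}' into a regex with one named group per slot."""
    parts = re.split(r"\{(\w+)\}", sample)
    # re.split with a capture group alternates: literal text, slot name, literal text, ...
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+?)"
        for i, part in enumerate(parts)
    )
    return re.compile(f"^{pattern}$", re.IGNORECASE)


def classify(utterance: str, intents: List[Intent]) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (intent name, slot values) for the first matching sample, else None."""
    for intent in intents:
        for sample in intent.samples:
            match = compile_sample(sample).match(utterance.strip())
            if match:
                return intent.name, match.groupdict()
    return None


print(classify("Order a large latte", INTENTS))
print(classify("Great weather we're having, eh?", INTENTS))
```

The second call falls through to None, since small talk matches none of the enumerated intents.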
Second, I don’t think it is all that hard to imagine, at least not once you have a little background information. The companies might not come out and give you a detailed breakdown of where all their ideas came from, but the basic framework of classifying user input (“utterances”) into more abstract forms can be found in dialogue system literature at least as far back as 1980, in a paper titled “Analyzing Intention in Utterances”. The authors of that paper (James Allen and Raymond Perrault) take a somewhat different approach to the classification than you’ll see in modern systems, though, and it’s a crucial difference.
Allen and Perrault talk about intents in terms of “speech acts”, a term coined and popularized by J. L. Austin and John Searle, respectively. The full theory is more nuanced than I’ll make it sound here, but the basic concept is easy enough to explain by example: saying “I christen this ship the S.S. Minnow” is not just uttering a collection of words; it’s an act that names a boat (and someone with the right authority saying that phrase is what actually performs the naming, but that’s a slightly different topic). Saying “What’s the weather like today?” is the act of making a request for information, and it also tells the listener the topic of that request. Saying “I don’t like tomatoes” is the act of informing the listener about a preference you have. Small talk like “Great weather we’re having, eh?” states a belief about the outside world, and in certain contexts it might imply a desire to make the person you’re talking to believe you’re friendly. It’s not too big of a stretch to cast even casual conversation in terms of how it communicates belief and desire via speech acts.
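One way to picture the examples above is as data: each utterance is tagged with the kind of act it performs, plus what the act is about. The Python sketch below is only an illustration. The three act labels are my own simplification, not Searle's taxonomy or Allen and Perrault's formalism, and the SpeechAct class and describe function are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Act(Enum):
    # Simplified, made-up labels for the kinds of acts in the examples above.
    REQUEST_INFO = "request-info"
    INFORM = "inform"
    DECLARE = "declare"


@dataclass(frozen=True)
class SpeechAct:
    act: Act      # what the speaker is doing by saying it
    topic: str    # what the act is about
    content: str  # the detail carried by the utterance


EXAMPLES = {
    "What's the weather like today?": SpeechAct(Act.REQUEST_INFO, "weather", "today"),
    "I don't like tomatoes": SpeechAct(Act.INFORM, "preference", "dislikes tomatoes"),
    "I christen this ship the S.S. Minnow": SpeechAct(Act.DECLARE, "ship name", "S.S. Minnow"),
}


def describe(speech_act: SpeechAct) -> str:
    """Summarize what the speaker did, based on the act type rather than the topic."""
    if speech_act.act is Act.REQUEST_INFO:
        return f"asks about {speech_act.topic} ({speech_act.content})"
    if speech_act.act is Act.INFORM:
        return f"tells the listener a {speech_act.topic}: {speech_act.content}"
    return f"makes it so: {speech_act.topic} is now {speech_act.content}"


for utterance, speech_act in EXAMPLES.items():
    print(f"{utterance!r} {describe(speech_act)}")
```

The handling branches on the kind of act, not on the subject matter, so the same three branches would cover a question about traffic or a stated preference for decaf.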
For the more rigorous, philosophical explanation of speech acts, see Searle’s famous paper “What is a Speech Act?” It’s a shame that this context has been lost, deliberately or not, because it leads to unfortunate advice like the following, straight out of Amazon’s documentation:
Intents represent the unique things that your skill is able to do. A skill for planning a trip might have five intents, for example PlanATrip, BookTheTrip, Stop, Cancel, and Help.
If you’re trying to have a conversation with the user, this isn’t a great way to organize it. We’ll get into why I think that in a future post, but suffice it to say that speech acts have something to do with it. They give you a set of primitive operations that you’ll miss if you design everything ad hoc, as Amazon’s documentation recommends, and they help you structure your thinking about the user interaction.
’Til next time
With that exposition in place, it’s time to move on to the real story here: How Pylon has embraced, massaged, and perhaps hyperextended the current generation of tooling in our quest to build a sentient android…I mean, help you cook dinner. In the next few posts, I’ll talk about our first conversational architecture, the lessons we learned developing it, and why we decided to start over from scratch.
- Allen, J. and Perrault, C. R. (1980). Analyzing Intention in Utterances. Artificial Intelligence, 15, 143–178.
- Searle, J. R. (1965). What is a speech act? In M. Black (Ed.), Philosophy in America. Ithaca: Cornell Univ. Press. Accessed at http://gbelic.org/courses/representation-language/readings/searle-what-is-a-speech-act.pdf.