A Conversational Design Primer
Looking to get started in voice or chatbot design? Learn the basic terminology and concepts to empower better design decisions.
As we used to say at Amazon, the tech industry has passed through the threshold of a “one way door” with regard to conversational experiences. Our industry will never look back at the world of purely point-and-click websites as the end-all and be-all of customer experiences.
My own path in the space of both voice UI and conversational design has been a long and winding road: from video games to CRM cloud services; from Cortana to Alexa. In 2017, I started a new chapter in that career, sharing the knowledge I’d gained along the way with others via workshops (Giving Voice to your Voice Designs) and talks (Blank Page to World Stage, The Future of Voice).
Now that I’m focused on my work as Principal Designer and owner of Ideaplatz, I’d like to share with you the introductory primer to the key concepts in the conversational design space: the primer I wish I’d had when starting out on this modern wave of conversationally focused experiences.
Not sure whether voice or conversation is right for you? You may want to start with one of my introductory Medium posts: Voice User Interface Design: New Solutions to Old Problems.
Let’s talk about the past
Up until the advent of dedicated voice experiences on the iPhone, conversational interfaces fell into one of two categories:
Dictation systems. Products like Dragon NaturallySpeaking, which required considerable training and were about transcription, not transactions. (This line has blurred in the ensuing decades.) These systems transformed the spoken word into digital text but were not optimized for taking action on it, and they required quite a bit of user-specific training to reach acceptable accuracy.
Grammar-based systems. Best suited to command-and-control scenarios, grammar-based systems know a fixed dictionary of terms and match speech to the closest option within that dictionary. Many early voice-enabled toys and video games, like Hey You, Pikachu! and Disney Friends (disclaimer: I was the Lead Producer on Disney Friends), were grammar-based.
The key shortcoming of grammar-based systems is that they are inherently unforgiving. If a customer changes the order of the words in their request in an unexpected way, or otherwise contravenes expectations, the system can’t adapt and will often take the wrong action as a result.
In many ways, these shortcomings mirror how we learn language as children. When we are young, we know only a few words, and words that sound like words we know may be miscategorized because we don’t yet know how to adapt.
Connectivity changes everything
Once cloud services became a reality, everything changed. You see, to go beyond the dictionary approach, we needed to teach systems how to extract meaning from words. Not just to match sounds to letters, but to apply the semantic rules within a chosen language to understand the difference between similar words, and to understand that different phrasings sometimes mean the same thing.
Consider this pair of examples:
- “Computer, turn on the lights in the play room.”
- “Computer, play ‘Turn It On Again’.”
A system that was just looking for words that sound similar might get confused here. “Turn”, “on”, and “play” are all key words present in both phrases. Think about how we as humans distinguish between the two. It’s actually pretty complicated, isn’t it? Part of it is ordering, part of it is additional context (like the word “room” — which might not be present if I asked for lights in the basement), and part of it is those tricky linguistic connective tissues like “in” and “the”.
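To see why sound-alike matching falls short, here’s a toy illustration (hypothetical code, not any real system) of a matcher that scores commands purely by shared words, ignoring order and context:

```python
# Toy illustration: a bag-of-words matcher can't tell these requests apart,
# because it ignores word order and grammatical context.

def shared_words(utterance: str, command: str) -> int:
    """Count the words an utterance shares with a known command phrase."""
    return len(set(utterance.lower().split()) & set(command.lower().split()))

commands = ["turn on the lights", "play a song"]

utterance = "computer play turn it on again"
scores = {c: shared_words(utterance, c) for c in commands}
# "turn on the lights" shares "turn" and "on", so it outscores "play a song" --
# a purely word-matching system would flip the wrong switch here.
print(scores)
```

The music request scores higher against the lighting command than the music command, which is exactly the kind of mistake semantic understanding exists to prevent.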
This is the difference between early voice recognition systems and a natural language recognition system — the ability to go beyond sound and understand the underlying meaning in a customer’s request. It’s a complicated problem, which is why we needed artificial intelligence to solve it.
The language behind natural language
If you’re coming from a world of more traditional, visually-oriented design (and almost all of us do) — working on conversational designs will mean familiarizing yourself with the terms of the trade. You may encounter a new type of collaborator in linguists or speech scientists, who are the individuals tasked with teaching your artificial intelligence solution about the semantic meanings specific to your product, service, or feature.
Utterances
The utterance is the “ground truth” about a customer’s request; it is the specific way in which a request is posed to the system. For chatbots, this is typically a text string; for voice-based systems, it may be helpful to think of this as the actual recording of the request. This utterance may include typos, grammatical errors, ambient noise, or interruptions — whatever actually happened at the time of the request.
Intents
Conversational designers use the term “intent” to signify the customer’s goal when making a request. Many utterances may correspond to a single intent. For example, a thermostat may have an intent representing a customer’s desire to make it incrementally warmer in the room. The following utterances could all be mapped to Thermostat/Warmer:
- Make it warmer in here.
- I’m cold!
- It’s cold in here.
- Turn up the heat.
Interaction designers and researchers are often responsible for examining potential customer intents and providing recommendations to speech scientists. During this process, the design team would also provide sample utterances like these for each intent to get things started.
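Here’s a minimal sketch of what that hand-off might look like in code: sample utterances grouped under intent names. The intent names and matching logic are hypothetical; a real NLU engine generalizes far beyond exact matches.

```python
# Hypothetical sketch: sample utterances mapped to intent names, the kind of
# starter data a design team might hand to speech scientists.
INTENT_SAMPLES = {
    "Thermostat/Warmer": [
        "Make it warmer in here.",
        "I'm cold!",
        "It's cold in here.",
        "Turn up the heat.",
    ],
    "Thermostat/Cooler": [
        "Make it cooler in here.",
        "I'm too hot.",
    ],
}

def lookup_intent(utterance: str):
    """Exact-match lookup; a real NLU engine generalizes beyond the samples."""
    for intent, samples in INTENT_SAMPLES.items():
        if utterance.strip().lower() in (s.lower() for s in samples):
            return intent
    return None

print(lookup_intent("I'm cold!"))  # Thermostat/Warmer
```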
Slots
A slot is essentially a conversational variable, for those of you with a programming background. For the rest of us, slots are parts of an utterance that we expect to vary from request to request. A common example is weather. Consider the following request:
Computer, what’s the weather going to be in Orlando on January 9?
In this example, most of the utterance is unlikely to vary much, though you might see discrepancies in ordering. But “Orlando” and “January 9” are slots: places where we expect the content to vary almost every time.
Intents often depend upon the content of a slot to complete the request. For a “Weather on specific day” intent, we would expect at minimum a date; and optionally a location we’re curious about.
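A minimal sketch of that required-versus-optional distinction, with hypothetical slot names, might look like this:

```python
# Hypothetical slot requirements for a "Weather on specific day" intent:
# the date slot is required, the location slot is optional.
REQUIRED_SLOTS = {"date"}
OPTIONAL_SLOTS = {"location"}

def missing_slots(filled: dict) -> set:
    """Return the required slots this request has not yet provided."""
    return REQUIRED_SLOTS - set(filled)

request = {"location": "Orlando"}  # the customer gave a city but no date
print(missing_slots(request))      # {'date'} -> we'd prompt for the date
request["date"] = "January 9"
print(missing_slots(request))      # set() -> ready to fulfill
```

When a required slot comes back empty, a well-designed system asks a follow-up question rather than guessing.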
Entities
This is where our industry terminology starts to feel a bit needlessly obtuse. Entities are a concept that almost everyone seems to understand, but no one can describe; maddening for those coming in from the outside.
IBM’s developer documentation defines entities as “usually a classification of objects aimed to help alert the response to an intent.” Which is… not particularly helpful.
Essentially, entities are a model of the concepts important to your product, and how those concepts relate to one another. You might start modeling your entities by drawing out a conceptual map of the terms your customers must deal with, and filling in the relationships and values.
In most systems, you can define your own entity types. For example, when I was prototyping a Microsoft Azure onboarding chatbot during my latest stint at Microsoft, one entity I defined was an “operating system”, and that entity could have values of “Windows,” “Linux”, or “Mac OS”.
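In code, a custom entity often boils down to canonical values plus the synonyms that resolve to them. This sketch is hypothetical, but it’s similar in spirit to custom entities in tools like LUIS or Dialogflow:

```python
# Hypothetical custom "operating system" entity: canonical values mapped to
# the synonyms a customer might actually type or say.
OPERATING_SYSTEM = {
    "Windows": ["windows", "win"],
    "Linux": ["linux", "ubuntu"],
    "Mac OS": ["mac os", "macos", "mac"],
}

def resolve_os(text: str):
    """Map a raw slot value to its canonical entity value, if any."""
    lowered = text.strip().lower()
    for canonical, synonyms in OPERATING_SYSTEM.items():
        if lowered in synonyms:
            return canonical
    return None

print(resolve_os("macOS"))  # Mac OS
print(resolve_os("BeOS"))   # None
```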
But in many cases, the values in our slots correspond to very well-understood entities, like time or city name. In other cases, the value maps to a massive catalog of slot values, like musical artists. In those cases, designers don’t usually model the entities themselves.
Slot Types (aka System Entities)
This concept goes by several names in the industry, but in short, a slot type is a hint to our natural language system to apply additional logic to the bit of the utterance in that slot. For example, Alexa allows you to give a slot the type AMAZON.DATE; any value in that slot is then interpreted using Amazon’s extensive experience with dates. Slot types are often very forgiving: in the date case, the system can handle a range of utterances like “January 9”, “January 9 2019”, “the 9th of January”, or just “January”.
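To get a feel for the normalization a date slot type performs, here’s a deliberately simplified sketch (nowhere near as robust as a production slot type) that maps a few spoken phrasings to a structured value:

```python
import re

# Simplified sketch of what a "Date" slot type does under the hood:
# normalizing several spoken phrasings into one structured value.
MONTHS = {m.lower(): i + 1 for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"])}

def parse_spoken_date(text: str):
    """Return (month, day, year), with None for any part left unspecified."""
    month = day = year = None
    for w in re.findall(r"[a-z0-9]+", text.lower()):
        if w in MONTHS:
            month = MONTHS[w]
        elif re.fullmatch(r"\d{4}", w):
            year = int(w)
        elif re.fullmatch(r"\d{1,2}(st|nd|rd|th)?", w):
            day = int(re.sub(r"\D", "", w))
    return (month, day, year)

print(parse_spoken_date("January 9"))           # (1, 9, None)
print(parse_spoken_date("The 9th of January"))  # (1, 9, None)
print(parse_spoken_date("January 9 2019"))      # (1, 9, 2019)
```

Note that all three phrasings resolve to the same structured date, which is exactly the forgiveness a good slot type provides.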
Every system comes with its own system entities or slot types, so your mileage may vary; there is no universal set of concepts. Dates, times, cities, colors, and numbers are some of the most common slot types. For further examples, start with this Amazon Slot Type reference or Dialogflow System Entities.
Prompts
A prompt is the text of a response to be delivered back to a customer conversationally on behalf of the system. “Prompt” sounds like it’s asking for something, but that’s not necessarily the case. Some systems use terms like “response” instead to avoid this issue. But note that we said text of a response. What if your response should be spoken?
Text to Speech
If you’re building a voice-enabled system, it’s a generally accepted best practice to ‘respond in kind’. That is, speak when you’re spoken to. But most prompts start out as text. Five years ago, most spoken prompts required a recording session with a voice-over artist, resulting in MP3s that could be played back. That approach doesn’t scale to a huge problem space, like covering all possible musicians and song titles.
Alexa, Google Home, and Cortana have all moved to using a text-to-speech system, or TTS. These used to sound very robotic, but proprietary advances in technology now allow these systems to generate arbitrary audio prompts very convincingly in real time, as long as there’s a functional Internet connection to transmit the resulting audio file.
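Most of these TTS engines accept SSML (Speech Synthesis Markup Language), which lets a prompt author control things like pauses in the generated speech. Here’s a minimal sketch; the helper function is hypothetical, but the `<speak>` and `<break>` tags are standard SSML:

```python
# Sketch: wrapping plain prompt text in a minimal SSML envelope.
# TTS engines like Alexa's accept SSML to control pacing and delivery.
def to_ssml(text: str, pause_ms: int = 300) -> str:
    """Wrap prompt text in SSML with a short leading pause."""
    return f'<speak><break time="{pause_ms}ms"/>{text}</speak>'

print(to_ssml("The weather in Orlando on January 9 will be sunny."))
```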
Conversation or not?
You’ll notice that I often specify “voice AND conversational” design, or differentiate between the two. This is because the two aren’t quite equivalent, at least as the industry sees them.
- Conversational design can apply to BOTH text-based chatbots AND voice user interfaces.
- Voice user interface design refers ONLY to experiences where the input (and usually output) is audio-based, or spoken.
An experience designed for voice can usually translate back to a traditional chat medium, but an experience built for chat is NOT necessarily going to succeed over voice. This is because these two modalities engage different parts of the human brain: visual memory and processing are fundamentally different from auditory memory and processing. A good foundation in cognitive psychology will go a long way for designers asked to straddle this divide.
This distinction is a large part of what I cover in my workshop, “Giving Voice to your Voice Designs”. It’s also why the Twitter hashtag #VoiceFirst has gained such traction. The movement isn’t about ONLY interacting via voice, so much as it is starting from the most difficult and restrictive interaction model, and moving out from there.
The systems behind conversational understanding
As someone who’s worked for years in spaces considered by the outside world as “artificial intelligence”, I’m often asked about the robot revolution. When will Skynet take over? I usually reply with the observation that I feel the singularity is overhyped at best. Most systems we perceive as a singular, unified intelligence (like Alexa and Cortana) are actually a series of disparate services on the Internet communicating in real time. If any link in this chain fails, our ability to understand and respond is limited, or completely removed.
So what are these disparate systems? Let’s dive in.
Automatic Speech Recognition (ASR)
For voice controlled systems, automatic speech recognition is the first and most rudimentary step in the process — not so much artificial intelligence as a processing step.
ASR systems take the spoken utterance from the customer (i.e., the waveform itself) and chop it up into individual segments called phonemes. A phoneme is defined by Merriam-Webster as “any of the abstract units of the phonetic system of a language that correspond to a set of similar speech sounds.”
ASR systems don’t understand sentence structure, but they do understand some basic fundamentals about their assigned language: for example, K and Z are unlikely to appear adjacent to each other in English text, so we can rule those guesses out.
The output of the ASR step is a first guess at the customer’s utterance. Since we don’t have the full context, this guess might change. But it’s enough to move on to the natural language understanding system.
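The phoneme-to-text step can be sketched as a (greatly simplified, entirely hypothetical) lookup against a pronunciation dictionary; real ASR systems use statistical models over vastly larger vocabularies:

```python
# Toy illustration of the ASR step: matching a recognized phoneme sequence
# against a pronunciation dictionary to produce a first-guess transcription.
# Phoneme symbols here are simplified ARPAbet-style labels.
PRONUNCIATIONS = {
    ("T", "ER", "N"): "turn",
    ("AA", "N"): "on",
    ("L", "AY", "T", "S"): "lights",
    ("P", "L", "EY"): "play",
}

def phonemes_to_words(phonemes):
    """Greedily consume phonemes, emitting the longest dictionary match."""
    words, i = [], 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):
            candidate = tuple(phonemes[i:i + length])
            if candidate in PRONUNCIATIONS:
                words.append(PRONUNCIATIONS[candidate])
                i += length
                break
        else:
            i += 1  # unrecognized phoneme: skip it and keep going
    return words

print(phonemes_to_words(["T", "ER", "N", "AA", "N", "L", "AY", "T", "S"]))
# ['turn', 'on', 'lights']
```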
Natural Language Understanding (NLU)
Natural language understanding systems are the real artificial intelligence behind your favorite conversational systems. NLU engines take text as input — either directly from chat, or the output from an ASR system if speech is involved.
From that starting point, a natural language understanding system attempts to map the utterance to an intent. Think of the NLU system as working to answer these three questions:
- What does the customer want to accomplish? (Intent)
- What’s unique about this request? (Slots)
- Is there anything in this request I need help understanding? (Entity Recognition)
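The answers to those three questions come back as structured data. This sketch hard-codes one example to show the shape of the result (the field names are hypothetical, not any vendor’s actual API):

```python
# Hypothetical sketch of the structured result an NLU engine hands back:
# the recognized intent, the slots that vary per request, and any slot
# values that still need a separate entity-recognition pass.
def understand(utterance: str) -> dict:
    """Hard-coded example output; a real NLU engine derives this statistically."""
    return {
        "utterance": utterance,
        "intent": "Weather/SpecificDay",
        "slots": {"location": "Orlando", "date": "January 9"},
        "needs_entity_recognition": [],  # e.g. a music-catalog lookup
        "confidence": 0.92,
    }

result = understand("what's the weather going to be in Orlando on January 9")
print(result["intent"], result["slots"])
```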
For voice recognition systems, sometimes the NLU system decides the ASR output doesn’t make sense… but it might be close. In these situations, NLU might send an utterance back to ASR with additional context to check a hypothesis. If you’ve ever used Siri and noticed that she erased her transcription of what you said and replaced it with something more accurate — this is what happened.
Entity Recognition (ER)
If our utterance contains a non-standard slot type, it might be processed by a separate entity recognition engine. For finite sets, that recognition may be trivial, but it is usually still a technically separate step.
In cases where our slot is expected to contain a reference to a giant catalog of possibilities, that piece of the utterance is often sent over to an entity recognition system. In some cases, these are run by different companies entirely. For example, Nuance Communications is a company that has helped many speech systems by providing an entity recognition service for musical requests. This is harder than it sounds, when you consider that the catalog of available music is literally getting larger day by day, and will continue to do so until the end of civilization. And don’t get speech scientists started on artist names like Ke$ha. Entity recognition engines often account for these sorts of challenges.
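A crude way to sketch catalog-style entity recognition is fuzzy string matching against normalized catalog entries, as below using Python’s standard-library difflib. (Real music-recognition services are far more sophisticated; this just illustrates why “Ke$ha” needs a normalized spelling behind the scenes.)

```python
from difflib import get_close_matches

# Sketch: fuzzy-matching a transcribed artist name against a music catalog.
# Stylized names like "Ke$ha" rarely survive transcription intact, so the
# catalog keys on a normalized spelling for each entry.
CATALOG = {"kesha": "Ke$ha", "genesis": "Genesis", "pink": "P!nk"}

def recognize_artist(transcribed: str):
    """Return the canonical catalog entry closest to the ASR transcription."""
    matches = get_close_matches(transcribed.lower(), CATALOG, n=1, cutoff=0.6)
    return CATALOG[matches[0]] if matches else None

print(recognize_artist("Keisha"))  # Ke$ha
print(recognize_artist("xyzzy"))   # None
```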
Intent Handling
This isn’t a formal system, per se, but I wanted to make the point that your system’s response to an intent is completely separate from the processing and identification of that intent.
Most conversational services out there focus on intent processing; the business logic to respond to an intent comes down to traditional programming, often hosted in a serverless solution like AWS Lambda or Azure Functions.
For example, my Trainer Tips skill has two main components: the input processing, via the Alexa Skills Kit; and the business logic, hosted on AWS Lambda, which generates the prompts for each intent and sends them to Alexa’s text-to-speech engine.
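The business-logic half can be sketched as a handler that receives an already-identified intent plus slots and returns prompt text for the TTS engine. The event shape and intent names below are illustrative, not the actual Alexa Skills Kit request format:

```python
# Hypothetical sketch of a Lambda-style intent handler: intent processing
# has already happened upstream; this code only decides what to say back.
def handle_intent(event: dict) -> dict:
    intent = event["intent"]
    slots = event.get("slots", {})
    if intent == "TrainerTips/GetTip":
        prompt = f"Here's a tip about {slots.get('topic', 'training')}."
    else:
        prompt = "Sorry, I don't know how to help with that yet."
    return {"speech": prompt}  # prompt text bound for the TTS engine

print(handle_intent({"intent": "TrainerTips/GetTip", "slots": {"topic": "Pikachu"}}))
# {'speech': "Here's a tip about Pikachu."}
```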
Next steps on Medium
With regard to conversational UI, I’m workshopping a few future pieces on the dangers of AI-powered experiences, conversational error patterns, and beyond.
A great way to get started with conversational interfaces is to build your own chat app for fun, and see what you learn. I’m hoping to put together a chatbot tutorial for designer/developers, likely using LUIS.ai and Azure Bot Service. For those less experienced on the development side, Alexa skills have spawned a cottage industry of helper apps, tutorials, and templates to get you on your way.
If you’re not yet ready to dive in headfirst but want more context, I have written a wide variety of Medium articles exploring voice and conversational design, as well as some posts about more general product design topics. Peruse them all at my profile page, and follow me to get updates when new articles are available.
Best of luck with wherever the conversation takes you. May the voice be with you.
Cheryl Platz is a world-renowned expert on voice user interfaces, conversational user interfaces, design for artificial intelligence, and design for large scale and complexity. Her insights have been cited by major outlets including BBC Radio, Wired, Forbes, and O’Reilly Media. As Principal Designer and owner at Ideaplatz, LLC, Cheryl brings her sought-after workshops, talks, and consulting to a global audience, sharing her insights from over 15 years of working at the cutting edge of user experience design.