How we designed it: the Google I/O ’18 Action for the Google Assistant
Welcome to a two-part post about designing and building the Google I/O ’18 Action for the Google Assistant, co-written with April Pufahl (@aprilpufahl). Check out the code here.
Making the most of an event like Google I/O is a challenge, whether you’re attending in person, joining an extended viewing party, or watching from home. We, the Actions on Google team, saw this as an opportunity to provide an assistive experience for event-goers and watchers alike. Like we did last year, the team wanted to build an Action for the Google Assistant that would make Google I/O ’18 easier for everyone. In this post, we highlight our design process for the Google I/O ’18 Action, including some of the key challenges we faced and decisions we made. In a following post, we’ll detail the technical implementation of the proposed design.
Conversation design is a powerful approach, but it’s not right for every use case. For example, dialog works well for the task of finding a restaurant’s business hours, but it feels clunky for browsing a dinner menu. Before starting, we wanted to ensure that conversation would add value for our users through its speed, simplicity, and ubiquity. Here’s why we thought dialog would be a good fit:
- Talking about events is intuitive. Users already have a mental model for talking to event planners and staff about what events are happening and where they’re located.
- Users can ask questions about an event. Questions let users shortcut to exactly what they want, whether that’s schedule information, session topics, or directions.
- Users need to multitask, especially when attending events. Their hands or eyes may be occupied, or they may be on the move.
- Users feel comfortable talking, or typing, about non-personal information, like asking for details about the event.
Nailing down the use cases
So, how did we start? As with any software project, we started by gathering requirements. We wanted to focus on use cases that
- cover most users, including attendees and non-attendees
- address users’ most frequently asked questions about I/O
- emphasize speed, particularly for attendees during the event
- respect the different goals users have before, during, and after an event
Trying to design for all of these goals was a massive challenge. We had to create a single solution for different use cases across multiple user profiles before, during, and after I/O. However, we didn’t want to sacrifice high quality for more features since a secondary goal was to open-source the Action for devs everywhere. In keeping with the tradition of the Android IOSched app, we hoped our solution could be a tool and model for Actions on Google developers.
To gain insight into users’ most frequently asked questions, we relied on two sources: 1) data from Dialogflow analytics for the I/O ’17 Action, and 2) talking to Googlers who’ve worked at I/O in previous years.
The I/O ’17 Action answered some basic questions about the keynote, location, etc. and allowed users to find sessions by area of interest. Our usage data showed that the session-finding feature was used quite a bit, with a relatively low fallback rate once users started down that dialog path. Most users asked about the keynote or the swag; they didn’t discover our other canned responses for announcements, I/O Extended, etc. Based on this, we wanted to focus on improving the user experience for browsing and refining sessions, and on enhancing the discoverability of the supported FAQs about the event.
Our talks with Googlers who worked at I/O revealed four major categories for questions asked by attendees:
- General navigation (e.g., Where’s the bathroom?)
- Personal navigation (e.g., Where’s my next session?)
- Event details (e.g., What time is lunch?)
- Location-specific event details (e.g., What’s the next session in this room?)
Based on this, we wanted to add support for navigation, making it as easy as possible to get around the event. We also wanted to connect to users’ schedules so they could access that navigation quickly and on the go. Each of these would require some compromise when it came to the actual implementation (as you’ll read about in part 2).
Given all this information, we created user personas and journeys that would guide our designs.
Creating a persona
Next, we focused on designing the front end of our Action — i.e., the persona users would interact with. The goal of creating a persona is not to trick the user into thinking they’re talking to a human, but simply to leverage the communication system users learned first and know best: conversation. We focused on the qualities we wanted users to perceive when talking to our Action — that is, someone who is
- practical and straightforward
- an I/O expert
We decided that a Google Developer Expert (GDE) is a great example of a character who embodies these characteristics. These are curious, inventive, enthusiastic folks the developer community looks to as models of openness and expertise. At this point, we made the mistake of preparing a page-long biography of the persona, including their life story, motivation, and technological preferences. This was overkill; a good persona should be simple enough to keep top-of-mind when writing dialog. So, we cut it down to a short paragraph. We wanted to evoke the sense of a game master by calling our persona the Keeper of I/O-Specific Knowledge. For new users, this name is spoken aloud as part of the greeting. Notice that the name has an acronym… KIOSK. Pretty punny, huh?
Next, we had to choose a voice to give life to this persona. Recording voice audio would have been ideal to improve the experience, but, considering the variation in prompt strings, this would have made implementation extremely tedious. Instead, we reviewed the available voices in the Actions on Google catalog. Of the available TTS voices for United States English, we ranked “Female 2” highest on the attributes “practical”, “techie” and “straightforward”.
Creating sample dialogs
Once we had a clear picture of who’s communicating (our users and our persona) and what they’re communicating about (our use cases), we could start writing the dialog. This let us experiment to get a quick, low-fidelity sense of the sound and feel of the interaction, without the technical distractions of code notation, complex flow diagrams, recognition-grammar issues, etc. We started with a spoken dialog in which our user (Anna) is currently at I/O and wants information about the remaining sessions that day.
First, we had to write the greeting in a way that 1) welcomed the user, 2) set expectations, and 3) let the user take control. Since the name “Google I/O ’18” doesn’t really give a sense of what our Action can do, we described it as a “launchpad”. Critically, we wanted to tailor the user experience for attendees and non-attendees since they have different goals in mind (e.g., getting directions versus watching the livestream). So we asked a narrow-focus question (in this case, a yes/no question). At this point, we did a little more to set expectations by leveraging our persona’s title (the Keeper of I/O-Specific Knowledge) as well as providing the user with some suggestions (manage schedule, find things to do, get directions) before asking a wide-focus question (which do you need?). From there, the user could choose one of the suggestions or shortcut to anything else our Action supports (e.g., keynote or browse sessions).
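The greeting flow described above can be sketched as a couple of small functions. This is a minimal illustration, not the Action’s actual code or copy — the prompt strings and function names here are assumptions for demonstration:

```javascript
// Illustrative sketch of the greeting logic: welcome the user, set
// expectations via the persona's title, then ask a narrow-focus (yes/no)
// question to split attendees from non-attendees. Strings are hypothetical.
function greeting(isNewUser) {
  const welcome = isNewUser
    ? "Welcome to your launchpad for all things Google I/O. " +
      "I'm the Keeper of I/O-Specific Knowledge."
    : "Welcome back to your launchpad for all things Google I/O.";
  // Narrow-focus question: a yes/no question tailors the rest of the dialog.
  return `${welcome} Are you one of the lucky attendees this year?`;
}

function menuPrompt(isAttendee) {
  // Suggestions set expectations before the wide-focus question.
  const suggestions = isAttendee
    ? ["manage my schedule", "find things to do", "get directions"]
    : ["keynote", "browse sessions", "watch the livestream"];
  return {
    speech: `I can help you ${suggestions.join(", or ")}. Which do you need?`,
    suggestions,
  };
}
```

The user can still shortcut past these suggestions to any supported intent; the suggestions only guide, they don’t constrain.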
In the rest of this sample dialog, we created menus to guide the user to the various use cases we wanted to support. This is helpful because users won’t automatically know what they can ask for when they start talking to our Action. At the same time, we didn’t want to overwhelm the user with too many options, so we tried to place them into intuitive categories. Instead of offering a flat menu of nine options, we grouped them into these three categories:
Learn from the experts
- Office hours
- App reviews
- After hours
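As a data structure, this grouping is just a shallow tree: three category names at the top level instead of nine flat options. The sketch below shows the idea; only the first category’s contents come from the post — the other two category names and their items are hypothetical placeholders:

```javascript
// Menu grouping sketch: nine options behind three category names.
// "Learn from the experts" is from the post; the other two categories
// are invented placeholders for illustration only.
const menu = {
  "Learn from the experts": ["Office hours", "App reviews", "After hours"],
  // Hypothetical groupings -- the post does not list the remaining categories.
  "Explore the content": ["Browse sessions", "Keynote", "Codelabs"],
  "Get around the event": ["Food", "Swag", "Directions"],
};

// The user first hears only the three category names, keeping the
// spoken prompt short, then drills into one category's options.
function topLevelOptions(menuTree) {
  return Object.keys(menuTree);
}
```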
We ended up creating 10 more sample dialogs, each of which highlighted a different set of use cases.
Fleshing out the design
Once we had a few sample dialogs, we started summarizing the high-level flow and logic of the conversation. In doing this, it became clear that the dialog structure was multidimensional — that is, variation in the dialog was dependent on not just one or two user traits, but all three of the following:
- Whether the user was new to the Action or a returning user
- Whether the conversation was taking place before, during, or after I/O
- Whether the user was attending I/O or not
This left us with something like the following:
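One way to see the size of that dialog space is to enumerate it. The sketch below is illustrative only (the variant labels are assumptions, not our actual state names), but it shows why designing across all three traits at once was such a challenge:

```javascript
// Sketch of the three dialog dimensions: user novelty (2 values),
// event phase (3 values), and attendance (2 values).
// Variant labels are illustrative, not actual state names from the Action.
function dialogVariant(isNewUser, phase, isAttendee) {
  const user = isNewUser ? "new" : "returning";
  const role = isAttendee ? "attendee" : "non-attendee";
  return `${user}-${phase}-${role}`;
}

// 2 x 3 x 2 = 12 combinations, each potentially needing tailored prompts.
const variants = [];
for (const isNew of [true, false]) {
  for (const phase of ["before", "during", "after"]) {
    for (const attending of [true, false]) {
      variants.push(dialogVariant(isNew, phase, attending));
    }
  }
}
```

Twelve combinations is small as state spaces go, but every prompt, menu, and error path potentially needed a variant for each one.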
Fundamentally, we accounted for the following use cases:
- Asking for general information about the event (date, location, keynote, etc.)
- Finding things to do or directions at the event (for attendees)
- Browsing the schedule
- Browsing one’s own schedule
The general event questions were fairly simple to handle. For each question about the keynote, swag, food, etc., we came up with a set of canned responses based on the date relative to the event. We made sure to listen to each response as rendered in text-to-speech (TTS) and used Speech Synthesis Markup Language (SSML) as needed. Typically, we added silence to make the phrase sound more natural, allowing our persona to speak in “breath units” like a person would.
For example, if the user asked anything about the clothing to wear at I/O, we designed a response along the following lines:
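The original post showed the response table as an image; the sketch below conveys the shape of it. The prompt text here is hypothetical — what matters is the structure: one canned response per event phase, with SSML `<break>` elements adding the pauses that make the persona speak in breath units:

```javascript
// Illustrative sketch (not the Action's actual copy) of a date-conditioned
// canned response to swag questions. SSML breaks add natural pauses.
function swagResponse(phase) {
  const prompts = {
    before: `You'll have to wait and see! <break time="500ms"/> Attendees pick up their swag at badge check-in.`,
    during: `Your swag is waiting at badge check-in. <break time="500ms"/> Don't forget to bring your badge.`,
    after: `Swag pick-up ended with I/O. <break time="500ms"/> I hope you got yours!`,
  };
  // Wrapping the prompt in <speak> tags marks it as SSML for the TTS engine.
  return `<speak>${prompts[phase]}</speak>`;
}
```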
Notice that there were multiple conditions considered for each prompt. Furthermore, when it came time to scale our design from a voice-only to a multimodal conversation, we were able to reuse our spoken prompts as the basis for the display prompts, condensing them whenever possible to optimize for scannability.
For the schedule-browsing use case, we were able to expand on the flow built for the I/O ’17 Action, allowing users to browse office hours as well. We also challenged ourselves to address the changing user needs before, during, and after I/O by offering a smarter browsing experience. For instance, before the event we might present all the options to the user. During the event, however, we might offer only the sessions remaining that day, letting users browse the next day’s sessions only after they’ve exhausted that first list.
We also tailored the experience depending on whether the user was interacting with the Assistant on a device with or without a screen. If the user were talking to Google Home, we’d only present six randomized topics (of 17 total) in the spoken prompt to avoid overwhelming the user. If the user were on their phone, we’d leverage the screen to show a list of all 17 topics in alphabetical order. Notice the different prompting and the use of more natural speech like “There’re” instead of “There are”. The design grew in ambition here, and we had to make compromises in implementation, but we had something that made sense for any user in any circumstance.
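The surface-dependent behavior above boils down to a simple branch on screen capability. In this sketch, the topic names are placeholders (the post only says there were 17), but the logic mirrors the design: a full alphabetical list on screens, a random sample of six in speech:

```javascript
// Sketch of tailoring the topic list to the surface. Topic names are
// placeholder guesses; only the count (17) comes from the post.
const ALL_TOPICS = [
  "Accessibility", "Android", "AR/VR", "Assistant", "Cloud", "Design",
  "Firebase", "Flutter", "IoT", "Location", "Machine Learning",
  "Open Source", "Payments", "Play", "Search", "Web", "YouTube",
];

function topicsForSurface(hasScreen) {
  if (hasScreen) {
    // On a phone, leverage the screen: show all 17 topics alphabetically.
    return [...ALL_TOPICS].sort();
  }
  // On a speaker, speak only six randomized topics to avoid a long,
  // overwhelming spoken prompt. (Naive shuffle; fine for a sketch.)
  const shuffled = [...ALL_TOPICS].sort(() => Math.random() - 0.5);
  return shuffled.slice(0, 6);
}
```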
The use cases for getting directions and browsing one’s own schedule were largely dependent on implementation, so they were hard to sketch out at this stage. There were dependencies on the Android IOSched app (if we decided to open it for users on Android) and on the scheduling backend. Given the preview state of Google Sign-In for the Assistant, we couldn’t use it for this Action. We left this section of the dialog open-ended until these factors became clearer.
Once we felt we had a good skeleton, it was time to design for the long tail. This meant handling as many cases as possible where things can go wrong. As you might imagine, it became pretty complicated. The design holds multiple logical decision points, error and rejection handling, menus within menus, and more. We considered every user response for every prompt, including an unknown response (or fallback), a rejection, an acceptance, a system failure, etc. Furthermore, we had to tailor each dialog path and prompt for each Assistant surface, from smart speakers to phones. We created specific prompts for “reentry” into a given state of dialog, where the user finds themselves at a decision point similar to one they’ve encountered before. Remember, the persona had to be woven throughout all of this.
We were able to leverage some of the same patterns throughout the Action. For instance, the error-handling strategy remained largely the same throughout the dialog. It was usually a variation of the following, adjusted based on the context of the last question that was asked:
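The escalation pattern can be sketched as a function of how many times the user has fallen back in a row. The exact wording below is illustrative, not the Action’s actual prompts, but it shows the shape: a gentle reprompt first, a persona-flavored joke second, and a graceful exit after repeated failures:

```javascript
// Illustrative sketch of an escalating fallback strategy. "4 0 4" is
// written as separate digits so TTS reads them individually rather
// than as "four hundred four".
function fallbackPrompt(attemptCount, reprompt) {
  if (attemptCount === 1) {
    return `Sorry, I didn't catch that. ${reprompt}`;
  }
  if (attemptCount === 2) {
    // A techie joke, while staying practical and straightforward.
    return `That's a 4 0 4: answer not found. ${reprompt}`;
  }
  // After repeated failures, exit gracefully instead of looping forever.
  return `Looks like we're having trouble. Let's try again later. Goodbye!`;
}
```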
Notice that we spelled out “4 0 4” as three numbers to ensure it would be rendered appropriately in TTS. Even when writing error prompts, we continued to write to our persona — including a techie joke while remaining practical and straightforward.
At this point, we had an ambitious and comprehensive design. We accounted for all intended use cases and as many edge cases as possible with some space for technical dependencies. Now, we had to actually build it…
Check out the second portion of this post, where we discuss the implementation of everything above. Curious to see how it ended up? You can try the Google I/O ’18 Action for the Google Assistant today! Just say “Hey Google, talk to Google I/O”. We’ve also open-sourced the project on GitHub here.