How we built it: the Google I/O ’18 Action for the Google Assistant

Welcome to Part 2 of a post about building the Google I/O ’18 Action for the Google Assistant. Check out the code here.

tl;dr

Dialogflow

  • See our Dialogflow agent here (zip up the directory to import into Dialogflow)
  • Use follow-up intents and contexts for sub-menus like browsing flows. Learn more here.
  • When doing this, be careful of how contexts commingle to create unexpected behavior. To be on the safe side, deactivate extraneous contexts when entering a sub-menu.
  • Unzip your Dialogflow agent before checking into version control. This allows for better code review and accountability.

Webhook

  • Create a re-usable data structure for your prompts, and organize them by intents (e.g. our directory, how intents were routed).
  • When using Cloud Functions, be aware of global scope caching. It’s useful when you want it, painful otherwise.
  • Actions are like any other software. Dream big, but start small and iterate.

In Part 1 of this post, we walked through the process of designing the Google I/O ’18 Action. This included brainstorming our use cases and persona to produce sample dialogs and early high-level flows like this:

So the question became how to get from there to here in Dialogflow:

Implementing the Dialogflow agent

Our herculean task was to implement the entire conversational design (i.e. all possible turns of dialog and the transitions between them) as a Dialogflow agent. At this point, we had a detailed design specification which included coverage for the long tail of ways a conversation can deviate from the most common paths as well as handling for errors and other unlikely or uncommon scenarios. Note that this required more than a simple flowchart. After all, directional relationships in flowcharts don’t necessarily reflect all the unique paths through a conversation. A flowchart more often represents the turns in a particular instance of a dialog. Though the turns progress in a certain order for that dialog, different users’ conversations may follow other paths that aren’t represented in the flowchart. Moreover, a flowchart does not always reveal how contexts create multiple conversation paths.

When porting this spec to Dialogflow, it’s important to consider what a Dialogflow “intent” actually is. It is a user-triggered state of the dialog: the user’s utterance matches one of the intent’s training phrases, and the agent returns a response. Each intent represents a single turn of dialog. The collection of intents in a Dialogflow agent forms a finite state machine, in which a sequence of states makes up a complete dialog. From any single state in the machine, the possible subsequent states are the intents whose input contexts are active (or which have no input contexts). Out of these possible state transitions, Dialogflow chooses the one whose training phrases most closely match the user’s utterance, provided the match is above a certain threshold.
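To make the state-machine view concrete, here is a minimal sketch of the first half of that matching step: working out which intents are even eligible given the active contexts. The data shape and the `directions` intent name are assumptions made for illustration, not Dialogflow’s real format, and the ranking by training phrases is not modeled.

```javascript
// Hypothetical, simplified model of intents; not Dialogflow's real data format.
const intents = [
  { name: 'directions', inputContexts: [] },
  { name: 'browse-topics', inputContexts: [] },
  { name: 'browse-topics-next', inputContexts: ['browse-topics-followup'] },
  { name: 'browse-topics-repeat', inputContexts: ['browse-topics-followup'] },
];

// An intent is a candidate when every one of its input contexts is active.
// Intents with no input contexts are always candidates.
function candidateIntents(allIntents, activeContexts) {
  const active = new Set(activeContexts);
  return allIntents
    .filter((intent) => intent.inputContexts.every((ctx) => active.has(ctx)))
    .map((intent) => intent.name);
}

// With no active contexts, only the top-level intents are eligible.
console.log(candidateIntents(intents, []));
// Once the follow-up context is active, the nested intents join them.
console.log(candidateIntents(intents, ['browse-topics-followup']));
```

Note that the top-level intents stay eligible in both cases, which is what lets a user ask for directions at any point.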

Identify the intents

So, what did all this mean for us? First, it meant distilling the entire conversational flow into either top-level or nested (i.e. follow-up) intents. Top-level intents could be matched at any time in a given dialog. For example, the user can technically ask about directions to a session at any point in the dialog. Although our conversational flowchart places this conversational state several levels deep in the dialog, there is actually no additional context required to provide directions to a user at the start, for instance. As it turns out, most intents can be top-level and provide more flexibility to users.

Follow-up intents were required for more involved sub-dialogs within the larger scheme of the conversation. Take, for example, the use case of finding a session topic to hear more about. This interaction starts with the browse-topics intent, which listens for the user to say things like the following:

  • “Find a session”
  • “Topics”
  • “What are the talks at I/O this year?”

For any of these phrases, we consider the user to be expressing the same underlying goal (intent). That is, they want to know the categories of sessions at the event. Giving them this list, however, gets complicated. On a speaker device, the list of topics is presented in segments to avoid overwhelming the user. Because of this, we needed an intent for the user to declare that they wanted the next set of options. The user also needs a way to ask to repeat a given set of options, as required in the design. This setup requires at least two follow-up intents to the browse-topics intent (one for requesting the next set of options, and one for repeating the current set of options).
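As a rough sketch of what the three intents have to do on the fulfillment side, the list can be treated as pages with an offset kept per conversation. The topic names, page size, and `state` object below are assumptions for illustration; the real Action reads its topics from the conference data.

```javascript
// Hypothetical topic names and page size, chosen only for illustration.
const topics = ['Android', 'Assistant', 'Cloud', 'Firebase', 'Web'];
const PAGE_SIZE = 2;

// Returns the slice of options that starts at `offset`.
function pageAt(options, offset, pageSize = PAGE_SIZE) {
  return options.slice(offset, offset + pageSize);
}

// browse-topics: start from the beginning of the list.
function browse(state) {
  state.offset = 0;
  return pageAt(topics, state.offset);
}

// browse-topics-next: advance by one page, if there is one.
function next(state) {
  if (state.offset + PAGE_SIZE < topics.length) {
    state.offset += PAGE_SIZE;
  }
  return pageAt(topics, state.offset);
}

// browse-topics-repeat: present the current page again.
function repeat(state) {
  return pageAt(topics, state.offset);
}

const state = {}; // stands in for per-conversation storage
console.log(browse(state));
console.log(next(state));
console.log(repeat(state));
```

In this sketch, asking for “next” on the last page simply repeats it; a real design would more likely tell the user there are no more options.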

Create the browsing flows

Given that there are multiple browsing flows in the Action (e.g., browsing for topics, browsing for sessions within a topic, etc.), one solution would have been to create a top-level “next option” intent to be shared by all of these flows. We decided against this for two reasons. First, separating these flows into multiple follow-up intents creates more conceptual separation. Second, handling a single “next option” intent would have required extra logic and the use of special contexts in Dialogflow to fetch the correct “next” set of options within a browsing flow. Instead, we duplicated the necessary follow-up intents for each browsing flow, resulting in sets of intents that looked like this:

Handle context conflicts

How do follow-up intents work? They define a directional relationship between intents through contexts. The follow-up intents define a context (the input context) which must be active in order for Dialogflow to match that intent. The top-level intent activates this same context (the output context) when it has been matched. In our case, the browse-topics-next and browse-topics-repeat intents require the browse-topics-followup context to be active in order to be matched. The browse-topics intent activates this context when the user says they want to browse the available topics.

The other key component to recognize is the clearing of the other browsing flow contexts (setting their lifespans to 0). This is a pattern used in each of the top-level browsing intents (browse-sessions and show-schedule). Why is this necessary? Well, let’s imagine the contexts were not cleared and the user engages in the following dialog (paraphrased):

What would happen next? Well, without each top-level intent clearing the contexts of the other browsing flows, it is not guaranteed that the user’s last query would match the browse-sessions-next intent. Instead, it might match the browse-topics-next intent and give the user the next set of topics even after they’ve already chosen a topic!

This is because both the browse-topics-followup and browse-sessions-followup contexts would be active. We prevent this by clearing the followup contexts of other browsing flows at each top-level browsing intent.
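The effect of clearing can be simulated in a few lines. This is a simplified model, not Dialogflow’s implementation: active contexts are a map from name to remaining lifespan, the lifespan values are illustrative, and the per-turn decrement of lifespans is left out.

```javascript
// Simplified model: active contexts are a map of name -> remaining lifespan.
// Applying an intent's output contexts sets each lifespan; 0 deactivates.
function applyOutputContexts(active, outputContexts) {
  const result = { ...active };
  for (const [name, lifespan] of Object.entries(outputContexts)) {
    if (lifespan > 0) {
      result[name] = lifespan;
    } else {
      delete result[name];
    }
  }
  return result;
}

// Lifespan values are illustrative assumptions.
const browseTopicsOutput = {
  'browse-topics-followup': 2,
  'browse-sessions-followup': 0,
};
const browseSessionsOutput = {
  'browse-sessions-followup': 2,
  'browse-topics-followup': 0, // clear the other browsing flow
};
const browseSessionsOutputWithoutClearing = { 'browse-sessions-followup': 2 };

// The user browses topics, then picks a topic and starts browsing sessions.
const afterTopics = applyOutputContexts({}, browseTopicsOutput);
const safe = applyOutputContexts(afterTopics, browseSessionsOutput);
const risky = applyOutputContexts(afterTopics, browseSessionsOutputWithoutClearing);

console.log(Object.keys(safe));  // only the sessions follow-up context remains
console.log(Object.keys(risky)); // both follow-up contexts are still active
```

In the `risky` case both “next” intents are eligible for the user’s next utterance, which is exactly the ambiguity described above.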

In developing the agent, we started by building out the top-level intents. These are the easiest to create and test, since there are no contexts to keep track of.

Collaborating on the agent (version control)

From the start, we could tell that developing this agent would require a heavy amount of collaboration. For that reason, we decided our only option was to include the agent in version control along with our backend logic. Moreover, the agent would live in our Git repo as an unzipped agent directory. Because of this, our code reviews could catch issues in any newly added intents or contexts. This was an important step in creating shared knowledge about the inner workings of the agent and nipping problems in the bud.

Implementing the webhook

Of course, our webhook powered this entire agent. We chose to use Cloud Functions for Firebase, as we do with all our samples, as it provides a very cost-efficient and straightforward webhook implementation for a service like ours. There’s no need to worry about scaling the service in times of extreme demand or paying for idle time when traffic calms. Moreover, it provides first-class integration with other Firebase services, like Firestore and Auth.

Organize the prompts

One of the first problems to solve in the logic was working out where the current date fell relative to the event during a given dialog. Our design provided a slightly different conversational experience for users depending on the current date (before, during, or after the event). To solve this, one of our first commits used the Actions on Google Node.js Client Library middleware function to pre-process the current user “phase” before every turn of dialog. This phase string value (either “pre”, “during” or “post”) was attached to the conv object that is passed to every intent handler. The prompts could then be chosen conditionally based on that value.
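A minimal sketch of that pre-processing step might look like the following. The event window is an assumption (I/O ’18 ran May 8 to 10, 2018; the exact boundaries and the UTC-7 offset are illustrative), and `getPhase` and `phaseMiddleware` are hypothetical names rather than the ones used in the repository.

```javascript
// Assumed event window: Google I/O '18 ran May 8-10, 2018. The exact
// boundaries and the UTC-7 offset used here are illustrative choices.
const EVENT_START = new Date('2018-05-08T00:00:00-07:00');
const EVENT_END = new Date('2018-05-11T00:00:00-07:00');

// Maps a point in time to one of the three phases named in the post.
function getPhase(now) {
  if (now < EVENT_START) return 'pre';
  if (now < EVENT_END) return 'during';
  return 'post';
}

// Middleware-style function: meant to run before each intent handler and
// attach the phase to the conversation object. `now` is injectable for tests.
function phaseMiddleware(conv, now = new Date()) {
  conv.phase = getPhase(now);
}

const conv = {}; // plain object standing in for the library's conv
phaseMiddleware(conv, new Date('2018-05-09T12:00:00-07:00'));
console.log(conv.phase); // during
```

With the client library, this would presumably be registered as `app.middleware((conv) => phaseMiddleware(conv))`, wrapping the call so that the library’s second middleware argument is not mistaken for the date.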

But how would we choose the appropriate prompt for a given phase? Certainly it would not have been wise to use a giant “if/else” within each intent to map the phase to a given prompt. Furthermore, the phase was not the only condition on which to choose a prompt in the design. Other conditions included:

  • Whether this was a returning user or a first-time user dialog
  • Whether they were on a screen or speaker device

For each of these conditions, the following had to be chosen and sometimes randomized:

  • The Simple Response/Basic Card/List elements of the response
  • Suggestion chips shown after the response
  • Fallback responses to use in case the following user utterance is not matched to a Dialogflow intent
  • No-input responses to use in the case that the user responds with silence on a speaker device

To account for all of this variation, it was clear that we needed a dedicated, flexible data structure for prompting. In the prompts directory of our source code, you can find this data structure represented for each of the major use cases of the Action (top-level questions and each browsing flow). The data structure was organized as follows:

All prompts were stored in the prompts directory, and the common parsing logic lived in a single JavaScript file.
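The repository’s actual format is not reproduced here. Purely as a hypothetical illustration of the idea, a prompt table can be nested by the conditions listed above (phase, first-time versus returning user, screen versus speaker), with a small shared function that picks one variant. All keys and strings below are invented.

```javascript
// Hypothetical prompt table for one intent. Keys and strings are invented
// for illustration and do not mirror the repository's actual format.
const welcomePrompts = {
  pre: {
    firstTime: {
      screen: ['Welcome! I/O starts soon. Want to browse topics?'],
      speaker: ['Welcome! I/O starts soon. You can ask me about topics.'],
    },
    returning: {
      screen: ['Welcome back! Ready to browse topics?', 'Good to see you again!'],
      speaker: ['Welcome back! What would you like to hear about?'],
    },
  },
  // 'during' and 'post' would follow the same shape.
};

// Picks one variant for the given conditions. `random` is injectable so the
// choice can be made deterministic in tests.
function selectPrompt(prompts, { phase, isFirstTime, hasScreen }, random = Math.random) {
  const byPhase = prompts[phase];
  if (!byPhase) throw new Error(`No prompts for phase "${phase}"`);
  const byUser = byPhase[isFirstTime ? 'firstTime' : 'returning'];
  const variants = byUser[hasScreen ? 'screen' : 'speaker'];
  return variants[Math.floor(random() * variants.length)];
}

console.log(selectPrompt(welcomePrompts, { phase: 'pre', isFirstTime: true, hasScreen: false }));
```

Keeping the variants as arrays is what makes randomization cheap: the intent handlers never branch on conditions themselves, they only pass the conditions in.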

In separating the logic and prompt string data, we also ended up splitting the logic across 3 major flows:

  • “Static” questions, mostly corresponding to top-level Dialogflow intents
  • Menus for browsing (either topics or sessions within a topic)
  • Accessing and browsing one’s schedule

Each of these became a separate sub-directory of the prompts directory, with a utils.js file encapsulating the application logic unique to each. The driver logic in the root app.js file routes intent handling to any of these utils.js files.
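The routing idea can be sketched as a lookup table from intent name to the flow that owns it. The flow objects below are stand-ins for the per-flow utils.js modules, and the `welcome` intent name is invented; the other names appear in this post.

```javascript
// Hypothetical handlers standing in for the per-flow utils.js modules.
const staticFlow = { handle: (intent) => `static:${intent}` };
const menuFlow = { handle: (intent) => `menu:${intent}` };
const scheduleFlow = { handle: (intent) => `schedule:${intent}` };

// Intent names mapped to the flow that owns them. 'welcome' is invented;
// the others appear in the post.
const routes = new Map([
  ['welcome', staticFlow],
  ['browse-topics', menuFlow],
  ['browse-topics-next', menuFlow],
  ['browse-topics-repeat', menuFlow],
  ['browse-sessions', menuFlow],
  ['show-schedule', scheduleFlow],
]);

// Dispatches an intent to its flow, failing loudly for unknown intents.
function route(intent) {
  const flow = routes.get(intent);
  if (!flow) throw new Error(`No handler registered for intent "${intent}"`);
  return flow.handle(intent);
}

console.log(route('browse-topics-next'));
```

With the client library, the same table could be walked once at startup to register each handler via `app.intent(name, handler)`.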

Get conference data: a lesson about Cloud Function caching

In order to fulfill any one of the browsing flows, we needed a dedicated module for fetching conference data. The conference data for the event is stored in a JSON file hosted on the cloud. So, what’s the easiest way to handle this? One option was to fetch the JSON data to the Cloud Function once on wake, store it in a global scope variable, pre-process it, and access it from there across future invocations. The major problem here is that the JSON file changes content quite a bit (sessions being added during development, livestream links being added during the event, etc), and the Cloud Function may stay awake for quite some time after waking. We needed a real-time solution.

Our first approach to fixing this was to create a global async function (Promise) that fetched, pre-processed, and returned the JSON data, and then call this function from the intent handling logic. The idea was that by declaring this as a reusable function, we could pull fresh data in every dialog. The interesting problem here was that global declarations are evaluated once, on wake, and because the result was a Promise, its resolved value was cached for all future calls. The issue? It created a lot of confusion as to why new data was not being shown during testing!
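One plausible way this pitfall shows up is sketched below, with a fake fetch in place of the real network call. All names are hypothetical. The key point is that anything evaluated at module scope runs once per warm function instance, and a settled Promise keeps returning the value it settled with.

```javascript
// Stand-in for the remote JSON file; `remoteVersion` simulates updates.
let remoteVersion = 1;
let fetchCount = 0;
function fetchConferenceData() {
  fetchCount += 1;
  return Promise.resolve({ version: remoteVersion });
}

// Pitfall: this runs once, when the module is loaded (on function wake).
// Every later invocation on the same warm instance reuses the settled Promise.
const cachedData = fetchConferenceData();

function staleHandler() {
  return cachedData; // never re-fetches
}

function freshHandler() {
  return fetchConferenceData(); // fetches on every call
}

async function demo() {
  remoteVersion = 2; // the remote file is updated after wake
  const stale = await staleHandler();
  const fresh = await freshHandler();
  console.log(stale.version, fresh.version); // logs: 1 2
}
demo();
```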

The solution, then, was to create a class called ConferenceData that is newly instantiated for each turn of dialog. This class fetches the data when it is first needed, pre-processes it (including clean up, de-duplication, etc), and caches it for the remainder of that function execution. This guaranteed fresh data at all times, and even lowered the response time for the user in cases where a single function execution accessed the data multiple times.
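A minimal sketch of such a class, assuming a fake fetch function and a made-up session shape (the real class in the repository does more pre-processing), could look like this:

```javascript
let fetchCount = 0;
// Stand-in for downloading the hosted JSON file.
function fetchRemoteJson() {
  fetchCount += 1;
  return Promise.resolve({ sessions: [{ id: 'a' }, { id: 'a' }, { id: 'b' }] });
}

// Keeps the first session seen for each id.
function dedupeById(sessions) {
  const seen = new Map();
  for (const session of sessions) {
    if (!seen.has(session.id)) seen.set(session.id, session);
  }
  return [...seen.values()];
}

class ConferenceData {
  constructor(fetcher = fetchRemoteJson) {
    this.fetcher = fetcher;
    this.dataPromise = null; // nothing is fetched until first use
  }

  // Fetches and pre-processes on the first call; later calls on the same
  // instance reuse the same Promise.
  load() {
    if (!this.dataPromise) {
      this.dataPromise = this.fetcher().then((raw) => ({
        sessions: dedupeById(raw.sessions),
      }));
    }
    return this.dataPromise;
  }
}

// One instance per turn of dialog, e.g. created at the top of a handler.
async function handleTurn() {
  const data = new ConferenceData();
  const first = await data.load();
  const second = await data.load(); // served from the instance cache
  return first === second ? first.sessions.length : -1;
}

handleTurn().then((count) => console.log(count));
```

Because the cache lives on the instance rather than in global scope, it disappears with the instance at the end of the turn, so a warm Cloud Function cannot serve stale data.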

An astute observer might recognize something wrong here. This solution meant that conference data was potentially fetched and pre-processed at each turn of dialog, which is not ideal given the relative cost of these operations. One easy fix would have been to store the JSON data in conv.data at the start of each dialog, but the data was simply too large to make sense in conv.data, which is stored as serialized string data.

The ultimate solution that we considered, but had to drop due to time constraints, was to build a backend system that would fetch, pre-process, and cache the conference data periodically. This would provide near real-time accuracy of the data and significantly lower latency by separating the application logic from the parsing logic. In any future project, I’d consider this a far higher priority task, as it directly affects the user experience. In our case, we had something that worked.

Cutting scope

So, what else did we leave out? As with any software project, we encountered all kinds of limitations, which resulted in cutting scope. For example, one feature that we cut for time was allowing the user to say whether or not a given set of options sounded interesting to them. The initial design called for dialogs like the following:

User: “What are all the Android talks?”
Action: “First there is Android Talk 1, then there is Android Talk 2. Do either of those sound good to you?”
User: “Yeah”
Action: “Okay, which one?”
User: “Let’s hear about…”

and

User: “What are all the Cloud talks?”
Action: “There is only one talk on Cloud. Do you want to hear about it?”
User: “Yeah”
Action: “Okay, that one…”

In the case where we present two or more options, the user can either choose one or indicate whether or not they are satisfied with those options. In the case where we present only one option, the user can simply confirm that they want to hear about it. However, we ended up not supporting this flow. Why? The use of “Yeah” has a collision. The same utterance can mean two different things: “tell me about that single session” when one option was presented, or “yes, one of those sounds good” when two were presented. Because of this, we would have needed to create a single Dialogflow intent that was trained with phrases like “Yeah”, “Yes”, etc., and build in application logic to determine how to respond based on the size of the last set of items presented.
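The application logic that would have been needed is roughly the following. This is a hypothetical sketch of a feature that was never built; `handleYes` and the returned action names are invented for illustration.

```javascript
// Hypothetical handler for a shared "yes" intent. `lastOptions` stands for
// the set of items presented in the previous turn.
function handleYes(lastOptions) {
  if (lastOptions.length === 0) {
    // Nothing was offered, so "yes" has no referent.
    return { action: 'fallback' };
  }
  if (lastOptions.length === 1) {
    // "Do you want to hear about it?" -> describe the only option.
    return { action: 'describe', session: lastOptions[0] };
  }
  // "Do either of those sound good?" -> ask which one.
  return { action: 'ask-which', choices: lastOptions };
}

console.log(handleYes(['Cloud Talk 1']));
console.log(handleYes(['Android Talk 1', 'Android Talk 2']));
```

Small as it looks, this also requires storing the last set of options between turns and testing every browsing flow against it, which is where the engineering time goes.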

These kinds of features add up in engineering time (and potential bugs introduced) and had to be dropped in order to build a more complete Action. This required prioritization and estimation, especially in balancing engineering time with the potential UX return.

QA and Release

After the initial development period, we began to conduct basic QA testing on the Action. This entailed running through comprehensive sets of dialogs and ensuring the experience made sense. Were we asking yes/no questions when it made sense? Did the SSML delivery of the prompts make sense? Was any of the speech too fast or too slow? Many of our changes at this point were tweaks of the prompts.

We also took this time to add implicit invocation intents, which increased discovery of the Action for any Google Assistant user looking to learn more about the event.

Bugs encountered

This was also a critical time for us to catch many of the bugs we hadn’t encountered during development. One of these was a failure to follow the data structure format in one of our prompts, causing the prompt parsing logic to crash. How did this play out for the user? After being greeted by the Action in one of the first dialog turns, if the user said something unrecognized by the Dialogflow agent, they received a static fallback message repeatedly. The intended behavior is to exit the conversation after 3 turns of unrecognized input, as per our design guidelines. Why was this happening? We forgot to wrap RichResponse messages in an array in some of our prompts. This was a great example of the relative brittleness of the prompt data structure and the corresponding parsing logic.
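One way to make this kind of parsing less brittle is to normalize prompts before using them. The sketch below is hypothetical and is not the fix applied in the repository: `asArray` tolerates a missing array wrapper, and `handleFallback` shows the intended exit after 3 unrecognized turns, with invented prompt strings and a plain `state` object standing in for per-conversation storage.

```javascript
const MAX_FALLBACKS = 3; // from the post: exit after 3 unrecognized turns

// Accepts either a single response or an array of responses, so a prompt
// that forgets the array wrapper no longer breaks parsing.
function asArray(value) {
  return Array.isArray(value) ? value : [value];
}

// `state.fallbackCount` stands in for per-conversation storage.
function handleFallback(state, fallbackPrompts) {
  const prompts = asArray(fallbackPrompts);
  state.fallbackCount = (state.fallbackCount || 0) + 1;
  if (state.fallbackCount >= MAX_FALLBACKS) {
    return { close: true, text: 'Sorry, I am having trouble. Goodbye.' };
  }
  // Use a different prompt on each consecutive fallback when available.
  const index = Math.min(state.fallbackCount, prompts.length) - 1;
  return { close: false, text: prompts[index] };
}

const state = {};
console.log(handleFallback(state, 'Sorry, what was that?'));
```

A real handler would also reset the counter whenever an utterance is matched successfully.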

There was yet another critical bug encountered through a brittle process (and a bit of foolishness). While keeping the unzipped Dialogflow agent in the Git repo was a great idea for code review, it led to some issues with deployment. The deployment process was as follows:

  1. Fetch latest code and Dialogflow agent from centralized Git repo to local codebase
  2. Deploy Cloud Functions code
  3. Zip Dialogflow agent directory
  4. Import agent .zip file into production Dialogflow console

Where did this break? Our local development copy of the Dialogflow agent sometimes contained more intents than the centralized version (intents introduced earlier in development but deleted since then). The Git fetch did not clear out these unnecessary intents, as they were not tracked. This meant that deployments to Dialogflow sometimes included extraneous intents! We noticed this when we saw too many intents in our production agent, and we fixed it by ensuring we started with a fresh download of the dialogflow-agent directory before every deployment.

Eventually, even with all of the hiccups and limitations, we produced an Action that we were proud of. Today, the Action has a 4.7 star rating on the Assistant directory with over 300 ratings. There is so much opportunity for all of you developers to build even better Actions on the platform, and we hope these posts serve as guiding lessons in achieving that goal.

Want to see everything we talked about here and even make some contributions of your own? Check out the code here.