Bringing Tabletop Audio to Actions on Google through media responses

Nick Felker
Google Developers
8 min read · Aug 5, 2019

Finding music that fits the mood is a challenge many game masters face, and a website called Tabletop Audio has been meeting this need with high-quality original tracks for a variety of genres and ambiances.

It was in my second year of college that my friends and I began playing tabletop RPGs. We came up with an elaborate story and tried to make our games as immersive as possible. The preparation for each session included looking for instrumental music from video games that would fit the mood of each story beat.

At my most recent meetup with my friends to play another board game, I brought my Google Home Mini to provide ambiance and to avoid awkward silences while we thought about our next move.

Building a voice experience for Tabletop Audio made a lot of sense, as it’s easy to start music through a voice command, and Tabletop Audio provides a good collection of audio to listen to based on the situation.

I created an Action for the Google Assistant, and you can now say “Talk to Tabletop Audio” to invoke it and get suggestions. You can then ask questions like “What songs are new”, or start playing a song by saying things like “Play Medieval Fair”. The project has also been set up to support invocation phrases, enabling users to start music with a single query, such as “Ask Tabletop Audio to play Medieval Fair”.

Smart display result for saying “Ask Tabletop Audio to play Medieval Fair”

This is built primarily with media responses, which let you return an audio file of any length along with track metadata. The Assistant begins playing the file and provides built-in playback controls, both on-screen and by voice. You can pause at any point by tapping the Google Home Mini, pressing the pause icon, or saying "Hey Google, pause". The developer doesn't have to write any media player logic.

Design

Before writing code, I thought about the different types of interactions that I would want a user to have with this Action, in order to design the conversation appropriately. I started by creating a series of sample dialogs. After I did this, I was able to begin creating intents with Dialogflow.

USER

Hey Google, talk to Tabletop Audio.

GOOGLE ASSISTANT

Sure, here’s Tabletop Audio.

TABLETOP AUDIO

Welcome! I can play a specific track, like ‘Wizard’s Lab’, or a genre, like ‘fantasy.’ What would you like?

USER

(Hmm, the party is currently camping.)

Play something with nature.

TABLETOP AUDIO

I found several tracks for the category nature. Here is one at random: Secret Garden.

(Track begins to play)

(A few minutes later)

USER

Hey Google, what song is playing again?

TABLETOP AUDIO

(Music pauses)

This is Secret Garden from Tabletop Audio.

(Music continues)

---

USER

Hey Google, talk to Tabletop Audio.

GOOGLE ASSISTANT

Sure, here’s Tabletop Audio.

TABLETOP AUDIO

Welcome! I can play a specific track, like ‘Wizard’s Lab’, or a genre, like ‘fantasy.’ What would you like?

USER

What tracks are new?

TABLETOP AUDIO

The last three tracks added to Tabletop Audio are: Wizard’s Lab, Robotics Lab, and Castle Jail. What do you want to listen to?

USER

Castle Jail

TABLETOP AUDIO

Here is Castle Jail.

(Music starts playing)

---

USER

Hey Google, talk to Tabletop Audio.

GOOGLE ASSISTANT

Sure, here’s Tabletop Audio.

TABLETOP AUDIO

Welcome! I can play a specific track, like ‘Wizard’s Lab’, or a genre, like ‘fantasy.’ What would you like?

USER

(Distracted)

No I don’t know where the car keys are

TABLETOP AUDIO

I can’t find a track with that description, but I found others like Cave of Time. You can also ask “What are the latest tracks?”. What would you like to listen to?

USER

Oh, um, play Cave of Time I guess

TABLETOP AUDIO

Here is Cave of Time.

(Music starts playing)

Media Responses

First, I needed access to the API for Tabletop Audio. I reached out to the owner with this idea, and he was happy to provide me with a JSON URL that contains an array of every track’s metadata. Once I had access to the API, I could query it for each track’s audio URL and related metadata such as the track name, genre, and image. To avoid making multiple queries to this API during a single session, I decided to cache the data when the user first invokes the Action. The Node.js Actions on Google library supports this with the conv.data object; this data is only available until the conversation ends.

As the fulfillment function was written in TypeScript, I created several interfaces to formally define the data structures I expected. This lets me take advantage of type safety through the compiler and autocompletion in my IDE.

I define the API response in TabletopAudioResponse, which has an array of TabletopAudioTrack objects. I store the API response and the currently playing track in my session data, which is represented by the TabletopAudioSession interface. Then I extend the default DialogflowConversation interface with these types. You can see the implementation in the code snippet below:
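Here is a sketch of what those interfaces might look like. The field names (track_title, link, and so on) are assumptions about the Tabletop Audio JSON rather than the confirmed schema, and the library binding is shown as a comment so the sketch stays self-contained:

```typescript
// Assumed shape of one track in the Tabletop Audio JSON feed; the real
// field names may differ.
interface TabletopAudioTrack {
  track_title: string;   // e.g. 'Secret Garden'
  track_genre: string[]; // e.g. ['fantasy']
  tags: string[];        // e.g. ['nature', 'calm']
  flavor_text: string;   // short description of the track
  link: string;          // URL of the audio file
  small_image: string;   // artwork for the media response
}

// The API returns an array of every track's metadata.
interface TabletopAudioResponse {
  tracks: TabletopAudioTrack[];
}

// Session data cached in conv.data for the lifetime of the conversation.
interface TabletopAudioSession {
  json?: TabletopAudioResponse;
  currentTrack?: TabletopAudioTrack;
}

// In the real project this parameterizes DialogflowConversation from the
// 'actions-on-google' package, roughly:
//   type TabletopAudioConversation =
//     DialogflowConversation<TabletopAudioSession>;
```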

To make sure I only call this API once per user per session, I created a function to fetch this data and save it in session data. You can see the implementation of the function below:
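A minimal sketch of that caching pattern follows. The fetcher is injected as a parameter so the sketch stays self-contained; in the real Action the function would call fetch (or a similar HTTP client) against the JSON URL and store the parsed result in conv.data:

```typescript
// Assumed shapes for the cached data (field names are guesses).
interface TabletopAudioResponse { tracks: { track_title: string }[]; }
interface SessionData { json?: TabletopAudioResponse; }

// Fetch the track list at most once per session: if conv.data already
// holds the JSON, skip the network call entirely.
async function loadTrackData(
  conv: { data: SessionData },
  fetchJson: () => Promise<TabletopAudioResponse>, // stands in for fetch(JSON_URL)
): Promise<void> {
  if (conv.data.json !== undefined) return; // already cached this session
  conv.data.json = await fetchJson();
}
```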

When I first prototyped this Action, I only called this function in my welcome intent. After further testing, I found that my Action would not work when I used an invocation phrase because the Action would skip the welcome intent and try to run the playback intent using data that was never fetched.

To make sure that this function would always run the first time that the Action started, I created a middleware function. You can see an implementation in the snippet below:
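The idea can be sketched as follows. In the real project this logic would be registered with app.middleware() from the actions-on-google library so it runs before every intent handler; here a plain wrapper function illustrates the same guarantee without the library:

```typescript
interface MiddlewareSession { json?: { tracks: unknown[] }; }
type MiddlewareConv = { data: MiddlewareSession };
type IntentHandler = (conv: MiddlewareConv) => void;

// Wrap any intent handler so the track data is always loaded first,
// including when an invocation phrase skips the welcome intent and jumps
// straight to playback.
function withTrackData(
  handler: IntentHandler,
  fetchJson: () => { tracks: unknown[] }, // stands in for the real API call
): IntentHandler {
  return (conv) => {
    if (conv.data.json === undefined) {
      conv.data.json = fetchJson();
    }
    handler(conv);
  };
}
```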

Now that I had a list of tracks, I was able to create a media response. I abstracted the logic to a separate function, as shown in the snippet below:
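A sketch of that helper is below. The returned object mirrors the shape of the library's MediaObject options (name, description, url, icon), built from assumed track fields; in the real Action it would be constructed with new MediaObject(...) from actions-on-google and passed to conv.ask:

```typescript
// Assumed track fields; the real Tabletop Audio schema may differ.
interface MediaTrack {
  track_title: string;
  flavor_text: string;
  link: string;
  small_image: string;
}

// Build a media response payload for a track. In the real Action:
//   conv.ask(new MediaObject({ name, description, url, icon }));
function trackToMedia(track: MediaTrack) {
  return {
    name: track.track_title,
    description: track.flavor_text,
    url: track.link, // the audio file the Assistant will stream
    icon: { url: track.small_image, alt: track.track_title },
  };
}
```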

My playback intent could then start playing this track:

Now that I had a prototype for playing back audio, it was time to implement a search capability.

Search

Each track has a title, a list of genres, and a list of tags. A user may want to find a track based on any of these parameters, and I wanted to make sure I could capture any of these values and scan the tracklist.

Using Dialogflow, I was able to create a set of training phrases to represent all of the possible ways that a user could search for a given track. I highlighted each searchable term within the phrase and marked it as a parameter with the type @sys.any. This entity type can capture any text, but it does mean you should add more training phrases to ensure that Dialogflow can identify which part of the phrase is the search term.

To make sure that this intent did not trigger too frequently, I changed its priority to Low by selecting the dot in the top-left corner of the page.

By passing the parameter search to my fulfillment, I added a simple search function. If a search query matched a track title, it would begin playing that track. Otherwise, if the user searched for a genre or tag, it would return a random result. I also added some sanitizing of track titles so that users didn’t have to worry about letter case or other symbols. You can see the implementation in the snippet below:
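A sketch of that search logic, with a hypothetical sanitize helper that lowercases text and strips punctuation so letter case and symbols don't matter; the track field names are assumptions:

```typescript
interface SearchTrack { track_title: string; track_genre: string[]; tags: string[]; }

// Normalize a title or query: lowercase, drop symbols, collapse spaces.
function sanitize(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim();
}

// A title match wins outright; otherwise a genre or tag match returns one
// track at random from the candidates.
function findTrack(tracks: SearchTrack[], query: string): SearchTrack | undefined {
  const q = sanitize(query);
  const byTitle = tracks.find((t) => sanitize(t.track_title) === q);
  if (byTitle !== undefined) return byTitle;
  const candidates = tracks.filter((t) =>
    [...t.track_genre, ...t.tags].some((label) => sanitize(label) === q));
  if (candidates.length === 0) return undefined;
  return candidates[Math.floor(Math.random() * candidates.length)];
}
```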

Each track is ten minutes long. When it ends, the Action receives a callback and asks the user what to play next. Since I know which track is currently playing, I can implement a repeat intent that starts it again from the beginning. You can see the implementation in the snippet below:
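Since the current track lives in session data, the repeat handler can be as small as this sketch; the conversation object is stubbed, and in the real Action the final response would be a MediaObject from actions-on-google:

```typescript
interface RepeatTrack { track_title: string; link: string; }
interface RepeatConv {
  data: { currentTrack?: RepeatTrack };
  ask(response: unknown): void;
}

// Replay whatever is stored in session data; if nothing has played yet,
// fall back to a prompt instead.
function repeatTrack(conv: RepeatConv): void {
  const track = conv.data.currentTrack;
  if (track === undefined) {
    conv.ask('Nothing is playing yet. What would you like to listen to?');
    return;
  }
  conv.ask(`Playing ${track.track_title} again`);
  conv.ask({ name: track.track_title, url: track.link }); // MediaObject stand-in
}
```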

As I began testing this, I saw that some users were not sure whether the Action was still active when they gave a follow-up query after music began to play.

Examining the history of interactions in the Dialogflow console, I saw several instances where someone would say “Ask Tabletop Audio for Outpost 31”. This would then look for a track literally called “Ask Tabletop Audio for Outpost 31” and then tell the user that nothing was found.

With this usage data, I was able to return to the list of training phrases and add several additional phrases to better specify how to extract the search query.

Polished Voice Design

Now that the playback and search were working well, it was time to add extra refinements to make the voice user interface better.

First, I decided to add two additional intents to help guide the user with answers to common questions. One intent answers which tracks are new, and the other explains what this Action can do.

I also took the time to differentiate the visual response from the audible response. On visual surfaces like phones, it’s easy to provide additional information through on-screen suggestion chips or cards. With an audio-only surface like a speaker, I provided a more verbose answer.

To do this, I replaced many of the strings in my conv.ask calls with a SimpleResponse object that has both speech and text properties. My Action speaks the speech aloud while showing the text on the screen. You can see an example of my updated welcome intent below:
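An example of the shape: the speech/text pair below follows the actions-on-google SimpleResponse options, though the exact wording here is illustrative rather than taken from the published Action:

```typescript
// Speech is read aloud on a speaker; text is shown on screens. In the
// real Action this object would be passed as:
//   conv.ask(new SimpleResponse({ speech, text }));
const welcome = {
  speech:
    "Welcome! I can play a specific track, like Wizard's Lab, " +
    'or a genre, like fantasy. What would you like?',
  text: 'Try saying "Play Wizard\'s Lab" or "Play something fantasy".',
};
```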

Conclusion

The Tabletop Audio Action is a real-world example of how you can build great audio experiences using media responses. With a few small design considerations, you can create a high-quality voice experience that also runs well on phones, smart displays, and many other Assistant surfaces.

To learn more about the implementation of this Action, you can view the source code on GitHub. To learn more about building conversational Actions for the Google Assistant, you can start reading our developer documentation. To learn more about designing high-quality conversational Actions, you can start reading our conversational design best practices documentation.

Want more? Head over to the Actions on Google community on Reddit to discuss Actions with other developers and share what you’ve built on Twitter with the hashtag #AoGDevs. Join the Actions on Google developer community program and you could earn a $200 monthly Google Cloud credit and an Assistant t-shirt when you publish your first app.
