SSML for Actions on Google

Published in Google Developers · 4 min read · Apr 18, 2017

An important part of designing great conversation actions for the Google Assistant is thinking about how you want them to feel and sound. If you’re creating a fun game, you might want to use a whimsical tone. If you’re building a news reader, you might want to use a more deliberate, serious tone.

Actions on Google lets you add audio to actions, which gives dimension to dialogs and a sense of atmosphere to the overall user experience.

<SPEAK>

To play audio as part of dialogs, actions support the Speech Synthesis Markup Language (SSML), a standard markup language for the generation of synthetic speech.

To use the SSML markup, start by wrapping your prompts inside <speak> tags:

<speak>Welcome to Number Genie!</speak>

Then use other SSML tags to add sounds and control the audio rendering. For example, to play an audio file:

<speak><audio src="https://.../meow.ogg"></audio></speak>

Since SSML is based on XML, special characters need to use XML escaping:

<speak>&quot;What have I learned?&quot; he asked.</speak>
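When prompts are built from dynamic text, the escaping can be done in code. The following is a minimal sketch, assuming a helper named escapeSsml (a hypothetical name, not part of any Actions on Google library) that replaces the five XML special characters with their standard entities:

```javascript
// Hypothetical helper (not a library function): escape the five XML special
// characters so that arbitrary text can be placed safely inside SSML.
function escapeSsml(text) {
  return String(text)
    .replace(/&/g, '&amp;')   // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

const quote = '"What have I learned?" he asked.';
const ssml = '<speak>' + escapeSsml(quote) + '</speak>';
console.log(ssml);
```

Escaping only the text, and not the SSML tags you add around it, keeps the markup itself intact.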

Take a Break

When designing a conversational flow, consider not just the words but also the pace of the dialog. This is especially important when your design calls for wordier prompts.

For our Interactive Fiction actions, we added SSML <break> tags between the sentences to better pace the storytelling:

<speak>Grunk think that pig probably go this way.<break time="800ms"/>It hard to tell at night time, because moon not bright as sun.<break time="800ms"/>There forest to east and north.</speak>
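If the sentences come from data rather than a hand-written string, the same pacing could be generated in code. This is only a sketch: paceSentences is a hypothetical helper name, and it assumes the sentences are already XML-escaped.

```javascript
// Hypothetical helper: join sentences with a fixed-length SSML pause.
// Assumes each sentence is already XML-escaped.
function paceSentences(sentences, pause) {
  const breakTag = '<break time="' + pause + '"/>';
  return '<speak>' + sentences.join(breakTag) + '</speak>';
}

const story = paceSentences([
  'Grunk think that pig probably go this way.',
  'It hard to tell at night time, because moon not bright as sun.',
  'There forest to east and north.'
], '800ms');
console.log(story);
```

Keeping the pause length as a parameter makes it easy to tweak the value while testing how the dialog sounds.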

An important aspect of our conversational design principles is to test your dialogs before you implement them, to get a feel for how they’ll sound for end users. Try reading them out loud to your colleagues, or use our web simulator. Keep tweaking the SSML markup values until you have the pacing just right.

The Cow Says Moooooo

Sound effects (SFX) are a very easy way to raise the production value of your action, especially when you implement games.

In our Number Genie action, we use sounds to give users fun feedback to let them know how well they are doing during the game:

A cold wind sound when the guess is very far from the answer,
A steam sound when the user is within 3 of the answer,
A steam sound with bells when the user is even closer,
A congratulatory sound when the user guesses the answer.
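The feedback rules above could be sketched as a small lookup function. Only the "within 3" threshold comes from the list; the cutoffs for "very far" and "even closer", the function name, and the sound labels are assumptions made for illustration, not the values used in Number Genie.

```javascript
// Illustrative thresholds: only "within 3" is taken from the description above.
const VERY_FAR = 50;   // assumption: what counts as "very far"
const VERY_CLOSE = 1;  // assumption: what counts as "even closer"

// Hypothetical helper: return a label for the sound to play, or null
// when the guess does not fall into any of the described cases.
function feedbackSound(guess, answer) {
  const distance = Math.abs(guess - answer);
  if (distance === 0) return 'congratulations';
  if (distance <= VERY_CLOSE) return 'steam_with_bells';
  if (distance <= 3) return 'steam';
  if (distance >= VERY_FAR) return 'cold_wind';
  return null;
}

console.log(feedbackSound(40, 42));
```

The returned label would then be mapped to the URL of a hosted audio file and placed in an <audio> tag.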

But where can you get sounds for your action? Well, the YouTube audio library provides over 5,000 free sounds that you can use in your own projects. We’ve picked our favorite short sounds for the actions sound library and hosted them for you on Google’s servers so you can reference them in your actions:

<speak>
<audio  
  src="https://actions.google.com/sounds/v1/alarms/alarm_clock.ogg">
</audio>
</speak>

P-R-O-N-U-N-C-I-A-T-I-O-N

In addition to supporting audio playback, SSML also gives you finer-grained control over how your prompts are pronounced, making your action’s responses seem more lifelike and appropriate for the kind of information provided to the user.

In particular, when you have to say numbers or dates, you can specify how you want the data to be interpreted. For example, if you want to say “12345” as “Twelve thousand three hundred forty-five”:

<speak><say-as interpret-as="cardinal">12345</say-as></speak>

Other interpretations for numbers, characters, dates, times and telephone numbers are also supported.
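When the value is only known at runtime, a small wrapper can produce the tag. This is a sketch with a hypothetical helper name, sayAs. The "characters" value used in the second call is, to the best of my knowledge, one of the interpretations listed in the SSML reference, so check the reference for the exact set your platform supports.

```javascript
// Hypothetical helper: wrap a value in an SSML <say-as> tag.
// Assumes the value contains no XML special characters.
function sayAs(interpretAs, value) {
  return '<say-as interpret-as="' + interpretAs + '">' + value + '</say-as>';
}

const cardinal = '<speak>' + sayAs('cardinal', 12345) + '</speak>';
const spelled = '<speak>' + sayAs('characters', 'SSML') + '</speak>';
console.log(cardinal);
console.log(spelled);
```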

Let’s Play

Now we can bring this all together by designing our own trivia game action. We’ll be using many of the SSML features to create the mood and SFX of a typical game show.

For our persona we want a game show host, so we pick the “male 2” voice from our list of voices available for actions.

Now, design the greeting for users of the action:

<speak>
<audio src="https://.../game_intro.ogg"/>
Let’s play the SSML Trivia Game!
Put on your game face.
Here comes your first question.
<break time="500ms"/>
Which one of these is the world’s tallest waterfall?
<break time="500ms"/>
Angel Falls
<break time="500ms"/>
Victoria Falls
<break time="500ms"/>
Or Niagara Falls
<audio src="https://.../ding.ogg"/>
</speak>

Note the use of the ‘ding’ sound to make it clear to the user that the question is complete and it’s the user’s turn. This is an example of an earcon (think ‘icon’ for ears): a distinct sound that provides feedback or conveys additional information.

Once the user provides an answer, the action can further recreate the ambiance expected for a typical game show with an audience-reaction sound:
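A game with many questions would probably build this markup from data. The sketch below assumes a hypothetical helper, questionSsml, and a placeholder earcon URL on example.com; neither comes from the Actions on Google libraries. It spaces the question and choices with 500ms breaks and ends with the ding.

```javascript
// Hypothetical helper: build the question part of the prompt.
// Assumes the question and choices are already XML-escaped.
function questionSsml(question, choices, dingUrl) {
  const pause = '<break time="500ms"/>';
  // Prefix the last choice with "Or", as in the hand-written prompt.
  const spoken = choices.map((choice, i) =>
    (i === choices.length - 1 ? 'Or ' + choice : choice));
  return '<speak>' + [question].concat(spoken).join(pause) +
    '<audio src="' + dingUrl + '"/></speak>';
}

const prompt = questionSsml(
  'Which one of these is the world’s tallest waterfall?',
  ['Angel Falls', 'Victoria Falls', 'Niagara Falls'],
  'https://example.com/ding.ogg' // placeholder URL, not a real hosted sound
);
console.log(prompt);
```

Putting the earcon last in the builder keeps the "your turn" signal consistent across every question.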

<speak>
<audio src="https://.../audience_reaction_correct.ogg"/>
You called it. Great job!
Here’s the next question.
...
</speak>

If you are using API.AI to develop your action, then you can use SSML in the text responses when creating an intent.

Or if you use fulfillment to dynamically generate the responses in code, then you can use SSML with the Node.js client library “ask” or “tell” methods:

assistant.tell('<speak>OK. See you next time!' +
   '<audio src="https://.../bye_sound.ogg"/></speak>');

Make sure you wrap the <speak> tags around the entire string, and not just a subset of it.
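One way to guard against partially wrapped strings is to normalize the response just before sending it. This is a sketch only: toSsml is a hypothetical helper, and it simply strips any stray <speak> tags before wrapping the whole string once.

```javascript
// Hypothetical helper: guarantee exactly one pair of <speak> tags
// around the entire response.
function toSsml(body) {
  // Remove any <speak> or </speak> tags that wrap only part of the string.
  const inner = body.replace(/<\/?speak>/g, '');
  return '<speak>' + inner + '</speak>';
}

const partial = 'OK. <speak>See you next time!</speak>';
const fixed = toSsml(partial);
console.log(fixed);
```

The result can then be passed to the “ask” or “tell” method in place of the raw string.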

Next Steps

I’ll leave it to you to design the rest of the game show dialogs. Start with the happy path and keep adding support for other typical user interactions. Use sound to delight and entertain your users!

After you have confirmed that your action meets the guidelines in the Actions on Google design checklist, submit it so that everybody can enjoy it.

Update: We’ve added support for custom marks to synchronize Interactive Canvas animations with SSML events.

Thanks to Nandini Stocker, Google’s Conversation Design Lead, for co-authoring this post.