Advanced SSML for Actions on Google

Published in Google Developers, Mar 16, 2018

Users are attracted to entertaining and engaging conversational experiences. One of the best ways to achieve that is to use music and sound effects.

But how can developers create high production value without their apps sounding like a bad case of elevator music?

You can achieve this with the advanced features of SSML for Actions on Google, and you don’t even need to be an audio designer. We’ll show you how, using design patterns from our own apps.

Actions are Wired for Sound

Previously, in SSML for Actions on Google and More SSML for Actions on Google, we announced the growing number of SSML elements supported for Actions on Google. In particular, we added support for an audio mixer, which lets multiple audio tracks overlap with TTS to create engaging atmospheres for your apps.

Last year, we published an extensive Actions sound library containing thousands of sound effects that Google hosts for free. Developers can use these sounds in their own apps by using the SSML <audio> tag.

By combining sound effects from the sound library with the audio mixer you can create rich audio experiences like this:

All of that was done in SSML! No audio editing or compositing required.

Let’s break it down in the next sections.

It Sounds Alright on Paper

Start by thinking about the atmosphere you want and then build up the audio design by considering each phase of the experience: the beginning, the middle, the end and a background sound.

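For this design pattern, here is the content for each phase:

Beginning: a crowd cheering
Middle: the TTS voice prompt
End: a team cheer
Background: a crowd celebrating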

You might also have noticed how these sounds overlap each other. To play audio in parallel, we use the <par> tag which is unique to Actions on Google:

<speak>
  <par>
    <media xml:id="crowd"/>
    <media xml:id="words"/>
    <media xml:id="cheer"/>
    <media xml:id="background"/>
  </par>
</speak>

Our first media element, which plays the crowd cheering, will start immediately while the other media elements will start relative to the earlier ones:

<media xml:id="crowd" soundLevel="5dB" fadeOutDur="1.0s">
  <audio src="https://actions.google.com/sounds/v1/crowds/battle_cry_high_pitch.ogg" clipEnd="3.0s">
    <desc>crowd cheering</desc>
    YEAH!
  </audio>
</media>

Some things we have customized for this sound:

We gave it an id: crowd (ids are used for relative timing between media elements)
We made it a bit louder by using the soundLevel attribute
The sound fades out for 1 second at the end using fadeOutDur
Only the first 3 seconds of the sound are played using clipEnd
We used <desc> to add a description that will only be displayed on devices that don’t play audio

The next media element will play the TTS voice prompt. We can control when each media element’s content is played by using the begin and end attributes to time the playback:

<media xml:id="words" begin="crowd.end-1.0s">
  <speak><emphasis level="strong">Great catch by Amendola! I can't believe he got both feet in bounds!</emphasis></speak>
</media>

Some things to note about this media element:

We gave it an id: words
The prompt starts playing 1 second before the end (-1.0s) of the crowd sound by using begin
The TTS prompt is emphasized to make it louder by using <emphasis>

But what about the background sound for the prompt? We are going to control that with its own media element:

<media xml:id="background" soundLevel="-5dB" begin="1s" end="words.end-0.0s" fadeInDur="2.0s" fadeOutDur="1.0s">
  <audio src="https://actions.google.com/sounds/v1/crowds/battle_crowd_celebration.ogg">
    <desc>crowd cheering</desc>
    YEAH!
  </audio>
</media>

Some things to note about this media element:

We gave it an id: background
We dialed down the sound by setting the soundLevel
The sound begins playing 1 second from the start by using begin
The sound ends when the TTS prompt ends (with id words) by using end
The sound fades in and fades out by using fadeInDur and fadeOutDur

The last media element is the team cheer:

<media xml:id="cheer" begin="words.end-1.0s" fadeOutDur="2.0s">
  <audio src="https://actions.google.com/sounds/v1/crowds/team_cheer.ogg" clipBegin="2.0s" clipEnd="6.0s">
    <desc>team cheer</desc>
    CHEER!
  </audio>
</media>

Some things to note about this media element:

We gave it an id: cheer
The sound begins playing 1 second before the end of the TTS prompt (with id words) by using begin
The audio is clipped so it starts 2 seconds into the track and ends at 6 seconds by using clipBegin and clipEnd
The sound fades out for 2 seconds by using fadeOutDur
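
To see how the relative begin and end values of these four elements chain together, here is a minimal Python sketch of the arithmetic. It is only an illustration, not how the platform schedules audio: the resolve helper is hypothetical, the 4-second TTS length is an assumption (the real length depends on the voice), and it assumes each clip is at least as long as its clipEnd value.

```python
import re

# Matches relative expressions such as "crowd.end-1.0s" or "words.end-0.0s".
_RELATIVE = re.compile(r"^(\w+)\.(begin|end)([+-]\d+(?:\.\d+)?)s$")

def resolve(expr, timeline):
    """Turn a begin/end expression into seconds from the start of <par>."""
    match = _RELATIVE.match(expr)
    if match is None:
        return float(expr.rstrip("s"))  # plain offset such as "1s"
    media_id, edge, offset = match.groups()
    begin, end = timeline[media_id]
    return (begin if edge == "begin" else end) + float(offset)

TTS_SECONDS = 4.0  # assumption: the real prompt length depends on the TTS voice

# crowd starts immediately and is clipped with clipEnd="3.0s"
timeline = {"crowd": (0.0, 3.0)}

words_begin = resolve("crowd.end-1.0s", timeline)
timeline["words"] = (words_begin, words_begin + TTS_SECONDS)

timeline["background"] = (
    resolve("1s", timeline),
    resolve("words.end-0.0s", timeline),
)

cheer_begin = resolve("words.end-1.0s", timeline)
# clipBegin="2.0s" and clipEnd="6.0s" leave 4 seconds of audio
timeline["cheer"] = (cheer_begin, cheer_begin + (6.0 - 2.0))

for media_id, (begin, end) in timeline.items():
    print(f"{media_id:<10} {begin:4.1f}s -> {end:4.1f}s")
```

Under these assumptions the prompt overlaps the last second of the crowd sound, and the team cheer overlaps the last second of the prompt, which is what produces the layered effect.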

But it feels like there is still something missing…this is a sports announcement, so where is the music intro? Well, we found a sound clip on Wikimedia by Kevin MacLeod called “Nowhere Land” that works well:

<media xml:id="intro" soundLevel="5dB" fadeOutDur="2.0s">
  <audio src="https://upload.wikimedia.org/wikipedia/commons/4/43/Nowhere_Land_%28ISRC_USUAN1600051%29.mp3" clipEnd="5.0s">
    <desc>news intro</desc>
    INTRO
  </audio>
</media>

Now adjust the crowd sound to start near the end of the intro sound:

begin="intro.end-1.0s"

Let’s listen:

Music to my ears! It sounds much more professional.

Another source of free music is the YouTube Audio Library, but you have to host the music yourself and comply with the license attribution (search for “news theme” for some alternatives to the music above).

Here is the completed SSML:

<speak>
  <par>
    <media xml:id="intro" soundLevel="5dB" fadeOutDur="2.0s">
      <audio src="https://upload.wikimedia.org/wikipedia/commons/4/43/Nowhere_Land_%28ISRC_USUAN1600051%29.mp3" clipEnd="5.0s">
        <desc>news intro</desc>
        INTRO
      </audio>
    </media>
    <media xml:id="crowd" soundLevel="5dB" fadeOutDur="1.0s" begin="intro.end-1.0s">
      <audio src="https://actions.google.com/sounds/v1/crowds/battle_cry_high_pitch.ogg" clipEnd="3.0s">
        <desc>crowd cheering</desc>
        YEAH!
      </audio>
    </media>
    <media xml:id="words" begin="crowd.end-1.0s">
      <speak><emphasis level="strong">Great catch by Amendola! I can't believe he got both feet in bounds!</emphasis></speak>
    </media>
    <media xml:id="background" soundLevel="-5dB" begin="1s" end="words.end-0.0s" fadeInDur="2.0s" fadeOutDur="1.0s">
      <audio src="https://actions.google.com/sounds/v1/crowds/battle_crowd_celebration.ogg">
        <desc>crowd cheering</desc>
        YEAH!
      </audio>
    </media>
    <media xml:id="cheer" begin="words.end-1.0s" soundLevel="0dB" fadeOutDur="2.0s">
      <audio src="https://actions.google.com/sounds/v1/crowds/team_cheer.ogg" clipBegin="2.0s" clipEnd="6.0s">
        <desc>team cheer</desc>
        CHEER!
      </audio>
    </media>
  </par>
</speak>
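
Because every begin and end value refers to another element by id, a mistyped or missing id is easy to introduce while editing. The following Python sketch, which uses only the standard library, checks that the SSML is well-formed and that every referenced id is defined. The undefined_references function and the trimmed sample are illustrative assumptions, not part of any Actions on Google tooling.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Captures the id in expressions like "intro.end-1.0s".
REFERENCE = re.compile(r"^(\w+)\.(?:begin|end)")

def undefined_references(ssml):
    """Return (media id, attribute, missing id) for each dangling reference."""
    root = ET.fromstring(ssml)  # raises ET.ParseError if not well-formed
    media = list(root.iter("media"))
    defined = {m.get(XML_ID) for m in media}
    problems = []
    for m in media:
        for attr in ("begin", "end"):
            match = REFERENCE.match(m.get(attr, ""))
            if match and match.group(1) not in defined:
                problems.append((m.get(XML_ID), attr, match.group(1)))
    return problems

# Trimmed sample: crowd refers to "intro", but no intro element exists.
SAMPLE = """<speak>
  <par>
    <media xml:id="crowd" begin="intro.end-1.0s">
      <audio src="https://actions.google.com/sounds/v1/crowds/battle_cry_high_pitch.ogg" clipEnd="3.0s"/>
    </media>
    <media xml:id="words" begin="crowd.end-1.0s">
      <speak>Great catch by Amendola!</speak>
    </media>
  </par>
</speak>"""

print(undefined_references(SAMPLE))
```

Running a check like this before pasting SSML into the simulator catches a dangling reference, such as a crowd sound that still points at an intro you removed.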

See how easy it is to use sounds to elevate the excitement of your app?

Well, let’s keep going and try out these design principles with other genres.

Boo!!

Don’t make a peep and listen to this horror story:

Ooh! So creepy! We’ve applied the same design principles, but added more layering of sounds to increase the emotional impact. For the TTS prompt, we lowered the voice pitch and slowed the speaking rate by using the <prosody> tag:

<media xml:id="intro" begin="sound5.end+1.0s">
  <speak>
    <prosody rate="slow" pitch="-1st">Come in!<break time="0.5s"/>Welcome to the terrifying world of the imagination.</prosody>
  </speak>
</media>
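
If the prompt text comes from your own data rather than a fixed string, characters such as & and < need to be escaped before they go inside the SSML. Here is a small Python sketch of one way to do that with the standard library. The prosody helper and its default values are hypothetical and simply mirror the attributes used above.

```python
from xml.sax.saxutils import escape, quoteattr

def prosody(text, rate="slow", pitch="-1st"):
    """Wrap text in a <prosody> element, escaping XML special characters."""
    return (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody>"
    )

print(prosody("Tricks & treats await."))
```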

Here is the complete SSML.

Bang, Bang!

Let’s make some noise and go out with a bang with a story about two brothers:

Sounds like such crazy fun! The TTS voice prompts are interspersed with various sound effects that emphasize the exciting story line. Each sound bite matches the fast pace of the action. We also play a background sound effect of rain falling. You can use repeatCount for looping shorter audio tracks:

<media xml:id="background" soundLevel="-5.0dB" repeatCount="10"
  fadeInDur="0.5s" fadeOutDur="2.0s" begin="0.0s" end="sound6.end+1.0s">
  <audio src="https://actions.google.com/sounds/v1/weather/rain_on_car_heavy.ogg"/>
</media>
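
In this example the end attribute decides when the rain stops, so repeatCount only has to be large enough to cover the scene. As a rough sketch of that arithmetic, the Python helper below computes the smallest whole repeat count for a given clip and scene length. The function name and the 10-second and 45-second figures are assumptions for illustration, and fades are ignored.

```python
import math

def min_repeat_count(clip_seconds, scene_seconds):
    """Smallest whole repeatCount whose loops cover scene_seconds."""
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    return max(1, math.ceil(scene_seconds / clip_seconds))

# Hypothetical numbers: a 10-second rain clip under a 45-second story.
print(min_repeat_count(10.0, 45.0))
```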

Here is the complete SSML.

Hit the Right Note

The design patterns we introduced can easily be expanded to various kinds of Actions. You can make your Actions sing a different tune to engage and entertain your users.

Our colleagues from the Google Creative Lab recently launched the Grilled Murder Mystery Action, where you play detective. They open sourced their Action and documented how they used SSML.

We’ve also just launched a new version of our sound library which now supports filtering and searching to help you find just the right ping, hiss or crash.

So, grab the SSML examples and start customizing them in our Actions Console simulator. We’d love to hear how you used the power of SSML in your apps.

Update: We’ve added support for custom marks to synchronize Interactive Canvas animations with SSML events.

Want More? Head over to the Actions on Google community to discuss Actions with other developers. Join the Actions on Google developer community program and you could earn a $200 monthly Google Cloud credit and an Assistant t-shirt when you publish your first app.

Written by Leon Nicholls