Creating immersive soundscapes with SSML and Actions on Google

Nick Felker
Published in Google Developers
Aug 14, 2018

Ever since I moved from New Jersey to California, I have been missing many things. Summers back home were full of nature. I could hear the chirping of crickets, the droning of cicadas, and the singing of birds. I wanted to build an Action for the Google Assistant that would not only play these sounds, but also would give me control over each one. I wanted to be able to add more cicadas, or maybe reduce the sound of crickets.

To create this project, I was able to take advantage of a unique feature of SSML in Actions on Google. The <par> tag represents a container in which multiple SSML elements play at the same time. This allowed me to play back a number of different audio files in parallel, each with its own parameters, which is exactly what I wanted.

I was able to find a number of sounds in the Actions on Google Sound Library, which makes it easy to search for and include different audio clips in your Actions. When users invoke the Action, they can ask to add or remove sounds from the soundscape, and the Action responds by generating the updated SSML and playing it back.

Development Process

The Action was designed in Dialogflow and makes requests to my Firebase function, which provides the user with the SSML. I created an entity to represent each sound that a user could access: cicadas, crickets, fireplace, and a handful of other handpicked items. I made sure to check the option called Allow automated expansion, which allows Dialogflow’s machine learning to expand the synonyms to include words that I did not define. For example, the fireplace entity could also be matched by campfire or other similar phrases that I had not considered.

I created another entity to represent the operations a user can perform on a sound: adding a new layer of that sound effect, removing a layer of that sound effect, or clearing all layers of that sound effect. A sketch of both entities is shown below.
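The actual entities live in the Dialogflow console, but their shape is easy to sketch in code. Here is a rough illustration as JavaScript objects, mapping each entity value to its synonyms; the exact names and synonym lists are my own approximation:

// Illustrative sketch of the two Dialogflow entities; the real ones
// are configured in the console, and these synonym lists are approximate.
const soundEntity = {
  cicadas: ['cicada', 'cicadas'],
  crickets: ['cricket', 'crickets'],
  fireplace: ['fire', 'campfire', 'bonfire'],
  // ...plus a handful of other handpicked sounds
};

const operationEntity = {
  add: ['add', 'more', 'increase'],
  remove: ['remove', 'fewer', 'reduce'],
  clear: ['clear', 'turn off', 'stop'],
};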

A user query like “add more cicadas” reaches my fulfillment with the parameters “add” and “cicadas”. There, the fulfillment increments the counter for cicadas in a map of sound counts. Finally, a function takes that map and generates the expected SSML. A minimal sketch of the handler is shown below.
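This sketch uses the actions-on-google client library; the intent name change_sounds and the parameter names operation and sound are illustrative, not necessarily the exact ones from my agent:

// A minimal sketch of the fulfillment handler. Sound counts persist
// in conv.data for the length of the conversation.
const { dialogflow } = require('actions-on-google');
const app = dialogflow();

app.intent('change_sounds', (conv, { operation, sound }) => {
  const counts = conv.data.counts || {};
  if (operation === 'add') {
    counts[sound] = (counts[sound] || 0) + 1;
  } else if (operation === 'remove') {
    counts[sound] = Math.max((counts[sound] || 0) - 1, 0);
  } else { // 'clear'
    counts[sound] = 0;
  }
  conv.data.counts = counts;
  conv.ask(generateSsml(counts)); // sketched later in this post
});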

For each counter corresponding to a sound, an audio clip is added to the parallel playback stream. Two layers of cicadas, for example, produce the following SSML:

<speak>
  <par>
    <media>
      <audio src='https://actions.google.com/sounds/v1/animals/cicada_chirp.ogg' />
    </media>
    <media>
      <audio src='https://actions.google.com/sounds/v1/animals/cicada_chirp.ogg' />
    </media>
  </par>
</speak>

For each instance of the sound, the volume can be adjusted through the soundLevel attribute, and each copy is time-shifted slightly through the begin attribute so the layers don’t align exactly:

<speak>
  <par>
    <media begin='0s'>
      <audio src='https://actions.google.com/sounds/v1/animals/cicada_chirp.ogg' soundLevel='6dB' />
    </media>
    <media begin='0.1s'>
      <audio src='https://actions.google.com/sounds/v1/animals/cicada_chirp.ogg' soundLevel='6dB' />
    </media>
  </par>
</speak>

Because an SSML response is limited to 120 seconds of audio, I repeat the clips for a specified duration using the repeatDur attribute. Then I append a spoken prompt asking the user whether they would like to change or repeat the soundscape. You can see an example of the entire SSML output below:

<speak>
  I've made that change.
  <par>
    <media repeatDur='105s' begin='0s'>
      <audio src='https://actions.google.com/sounds/v1/animals/cicada_chirp.ogg' soundLevel='6dB' />
    </media>
    <media repeatDur='105s' begin='0.1s'>
      <audio src='https://actions.google.com/sounds/v1/animals/cicada_chirp.ogg' soundLevel='6dB' />
    </media>
    <media repeatDur='105s' begin='0s'>
      <audio src='https://actions.google.com/sounds/v1/ambiences/fire.ogg' soundLevel='4dB' />
    </media>
  </par>
  <break time="350ms" />
  Do you want to change the sounds, or repeat this clip?
</speak>
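Putting the pieces together, the generator function might look like the following minimal sketch. The generateSsml name, the URL map, and the exact begin offsets and sound levels are illustrative:

// Each layer of a sound becomes one <media> element, slightly
// offset and boosted relative to the previous layer.
const SOUND_URLS = {
  cicadas: 'https://actions.google.com/sounds/v1/animals/cicada_chirp.ogg',
  fireplace: 'https://actions.google.com/sounds/v1/ambiences/fire.ogg',
  // ...one entry per supported sound
};

function generateSsml(counts) {
  let media = '';
  for (const sound of Object.keys(counts)) {
    for (let i = 0; i < counts[sound]; i++) {
      media += `<media repeatDur='105s' begin='${i * 0.1}s'>` +
        `<audio src='${SOUND_URLS[sound]}' soundLevel='${4 + 2 * i}dB' />` +
        `</media>`;
    }
  }
  return `<speak>I've made that change.<par>${media}</par>` +
    `<break time="350ms" />` +
    `Do you want to change the sounds, or repeat this clip?</speak>`;
}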

With the basic features working, I decided to add some features that would make the experience more conversational and useful. I added support for saving favorite soundscapes with user storage, so users can return to the Action later and pick up where they left off.

To do this, I serialize the map of sounds to a string and store it in user storage under a key that the user provides. Later, the user can ask the Action to load a soundscape with a given name, which deserializes that map and starts playing the SSML again. If they forget the soundscapes they’ve saved, they can ask for a list, which is built from the keys stored in user storage. I wrap each name in an emphasis tag and add a break after it so the user can better distinguish the choices, as shown in the snippet below:

app.intent(INTENT_LIST_PRESETS, (conv) => {
  let response = '<speak>';
  // Presets the user has saved in user storage
  const keys = Object.keys(conv.user.storage);
  if (keys.length) {
    response += 'You have saved the following presets: ';
    for (const presetName of keys) {
      response += `<emphasis level='strong'>${presetName}</emphasis><break time='300ms' />`;
    }
  }
  // PRESETS is a map of the default presets defined in my fulfillment
  response += 'I have also created some presets you can use, such as: ';
  for (const presetName of Object.keys(PRESETS)) {
    response += `<emphasis level='strong'>${presetName}</emphasis><break time='300ms' />`;
  }
  response += `<break time='300ms' /> Do any of these sound good to you?</speak>`;
  conv.ask(response);
});

The SSML response from this intent may look similar to this:

<speak>
  You have saved the following presets:
  <emphasis level='strong'>Hometown</emphasis>
  <break time='300ms' />
  <emphasis level='strong'>Night time</emphasis>
  <break time='300ms' />
  I have also created some presets you can use, such as:
  <emphasis level='strong'>Beach</emphasis>
  <break time='300ms' />
  <emphasis level='strong'>Camping</emphasis>
  <break time='300ms' />
  Do any of these sound good to you?
</speak>
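Saving and loading presets work the same way in reverse. Here is a minimal sketch, again with illustrative intent and parameter names:

// The map of sound counts is serialized to a string and stored in
// user storage under the name the user provides.
app.intent('save_preset', (conv, { presetName }) => {
  conv.user.storage[presetName] = JSON.stringify(conv.data.counts || {});
  conv.ask(`I've saved this soundscape as ${presetName}.`);
});

app.intent('load_preset', (conv, { presetName }) => {
  const saved = conv.user.storage[presetName];
  if (!saved) {
    conv.ask(`I couldn't find a preset named ${presetName}.`);
    return;
  }
  conv.data.counts = JSON.parse(saved);
  conv.ask(generateSsml(conv.data.counts));
});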

Improving the User Experience

This Action has a lot of features, and it could be complicated for new users to understand how everything works. Using the last seen property, I can give new users a more elaborate tutorial, while returning users can quickly get back to what they were doing, as shown in the snippet below:

app.intent('Default Welcome Intent', (conv) => {
  if (conv.user.last.seen) {
    conv.ask('Welcome back. Should we start with a saved preset, or create a new clip?');
  } else {
    conv.ask('Hello. I am your guide to creating soundscapes…');
  }
});

I also added suggestion chips, which show users the various things they can do in the Action, to improve the discoverability of these advanced features.
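With the client library, a chip only takes one line inside an intent handler; the labels below are illustrative:

// Attach suggestion chips to the response inside any intent handler.
const { Suggestions } = require('actions-on-google');

conv.ask(new Suggestions('Add cicadas', 'Save preset', 'List presets'));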

To summarize, the <par> tag enabled me to build and iterate on a rich user experience that lets the user individually control different sounds and play them all at once. The same techniques could be used to build immersive audio games or to add environmental noise as your Action returns a result. This can give your Action more personality by making it seem like it’s in a real place.

You can use the Actions on Google simulator to create and play back your own SSML: write an SSML response and then listen to it.

You can learn more about building conversational experiences at actions.google.com.

To learn more about incorporating SSML into your Actions, check out our previous articles about SSML.
