A Deep Dive Into SSML

Sam Ursu
Jan 29 · 15 min read

After I published my latest game, Starship Zumanji, I got a lot of very wonderful messages from fellow chatbot designers who wanted to know more about how I built it.

The short answer is: I used Chatfuel to design the game as well as build the Facebook version because, well, Chatfuel is a pretty darn awesome design tool. I then “translated” it to DialogFlow for the Google Assistant (“Action”) and Telegram versions. But the audio was all done with Google’s crippled version of SSML.

That’s why today, I wanted to drill deeper into how I used SSML to create what might generously be called a “rich audio” experience.

SSML Basics

SSML stands for “Speech Synthesis Markup Language,” which is a fancy way of saying “a few HTML-style tags for computers so they can render some audio elements.”

In other words, you don’t need to be a coder to work with SSML. If you can get your brain around writing <b>be bold!</b>, then you can handle SSML.

The official guide to all things SSML is here, but if you’re building a “voice bot” for Google (called an “Action”) or for Alexa (called a “Skill”), then you need to be aware that they each use a limited, restricted version of SSML.

The official page for Google’s “lite” version of SSML is here.

The official page for Amazon’s slightly kooky version of SSML is here.

Unfortunately, as I began working with Google Assistant last year, I discovered that not everything you can actually do with SSML on their system is listed in the documentation.

And some of their documentation is a little unclear or just plain wrong, especially with regard to the SSML functions that they’ve disabled.

Getting Started with SSML

For the purposes of this article, I’m going to assume that you’re working in the Google environment. Amazon has slightly different rules (such as not using the <speak> tag in their Java SDK), so always pay close attention to each platform’s requirements.

SSML is used when you want to tell a computer how to “render” audio, especially TTS (text-to-speech). Unfortunately, synthesized voices suffer from a lot of shortcomings. The biggest and most important one is that computerized voices have no emotion.

A TTS engine will read a sentence like “I am hopelessly and madly in love with you” and “I bought some bread at the store” in the exact same lifeless, dry way.

Therefore, unless you’re designing a voice bot to read the news, you’re going to need to use some SSML to help your spoken audio sound more interesting.

SSML lets you modify a few things, but the three most important are:

  • Speed (at which text is spoken)
  • Pauses (between audio/speech elements)
  • Timing (how multiple, overlapping audio elements are played)

In this article, I’ll show you precisely how to use your limited palette of SSML tags to wring the maximum amount of emotion out of computer voices.
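As a quick preview, here’s a minimal sketch of the first two controls on that list, speed and pauses (the sentences are placeholders; paste it into any SSML simulator to hear the effect):

```xml
<speak>
  <prosody rate="120%">I'm so excited that you're here!</prosody>
  <break time="800ms" />
  <prosody rate="80%" pitch="-2st">But I'm afraid I have some bad news.</prosody>
</speak>
```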

Mixing in Audio

The “secret sauce” to making a good, immersive voice bot experience is not just modulating the speaking voice with SSML but also mixing it with sound effects and background music/audio.

To get started with some background effects, go to the Google Sound Library where you can find a few hundred free files. You cannot download them, but conveniently, Google hosts the files, so you can “hot link” them directly in your voice bot/app.

In the Google Sound Library, type “background” in the search box and you’ll find one called Factory Background.

Here’s its URL: https://actions.google.com/sounds/v1/ambiences/factory_background.ogg

Now let’s add a simple bit of SSML to create an audio of a person speaking two sentences while “standing on” a factory floor:

<speak>
<audio src="https://actions.google.com/sounds/v1/ambiences/factory_background.ogg" />
I am in a factory! Yes, this is the factory where I work.
</speak>

Head on over to the Google Actions Simulator, click on the “Audio” tab, and paste in the code above. Then click “update and listen” to hear it.

The “Audio” tab of the Google Action simulator

Unfortunately, what you’ll discover is that the Factory Background file will first play from start to finish (it is 38 seconds long) and then the “person” will start to speak.

That’s no good!

What we really want is for the background audio to play for a second or two, and then have the person talk “over” the background sound.

But if you just reverse the order of the elements…

<speak>
I am in a factory! Yes, this is the factory where I work.
<audio src="https://actions.google.com/sounds/v1/ambiences/factory_background.ogg" />
</speak>

…then you get the same problem. The person will speak first, and only when they stop speaking will the background audio start.

The key is learning how to play them simultaneously.

The All-Important <Par> Function

The <par> tag is short for “parallel,” which makes sense once you see what it does.

All you really need to understand is that all of the elements located between the <par> tags can overlap one another rather than having to be played sequentially.

Let’s imagine that I want the Factory Background file to play for two seconds before the person begins to speak.

Here’s how you do it:

<speak>
<par>
<media xml:id="background">
<audio src="https://actions.google.com/sounds/v1/ambiences/factory_background.ogg" />
</media>
<media xml:id="talking" begin="background.begin+2s">
<speak>
<voice>
Here I am on the factory floor! Yes, this is where I work.
</voice>
</speak>
</media>
</par>
</speak>

Now, the Factory Background audio file starts playing immediately. Two seconds later, the person says their lines. And then the Factory Background audio file plays until it finishes.

This isn’t ideal, but it’s a lot closer to what we might want to accomplish.
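If you also want the background to stop shortly after the person finishes speaking (instead of playing out its full 38 seconds), one approach, sketched here, is to give the background <media> element an end attribute that references the talking element, plus a fade-out so it doesn’t cut off abruptly:

```xml
<speak>
<par>
<media xml:id="background" end="talking.end+2s" fadeOutDur="2s">
<audio src="https://actions.google.com/sounds/v1/ambiences/factory_background.ogg" />
</media>
<media xml:id="talking" begin="background.begin+2s">
<speak>
Here I am on the factory floor! Yes, this is where I work.
</speak>
</media>
</par>
</speak>
```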

The Essential Elements

Here’s the order of SSML tags:

<speak> = Goes at the very beginning and very end of your code block.

<par> = Goes before and after all of the audio elements to indicate that they can be overlapped rather than played sequentially.

<media> = Each individual audio or speech element is wrapped in this tag.

Make sure you close these three in the same order as you opened them!

<voice> = For the text-to-speech (TTS) elements, i.e. the computer “speaking aloud” text.

<audio> = Any pre-recorded audio file (could be music, sound effects, or even pre-recorded human speech).

Later, in this article, I’ll show you how to get really precise control over how all of these different elements can be mixed in order to create a very immersive experience.
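Putting that nesting order into a bare-bones skeleton looks like this (the ids and text are placeholders):

```xml
<speak>
  <par>
    <media xml:id="soundtrack">
      <audio src="https://actions.google.com/sounds/v1/ambiences/factory_background.ogg" />
    </media>
    <media xml:id="narration" begin="soundtrack.begin+1s">
      <speak>
        <voice gender="female" variant="1">
          The spoken (TTS) text goes here.
        </voice>
      </speak>
    </media>
  </par>
</speak>
```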

Voice Control

The “person” who does the speaking is customized by using the <voice> tag.

For Google, you’re limited to just four voices — two female, and two male.

If your locale settings are set to English with a USA locale, you’ll get the “2+2” package of two American female voices and two American male voices. Likewise for English with a UK, Australia, or India locale.

At this time, Google does not let you control (with SSML) which “2+2” locale-specific package of voices to use. This means that you cannot have an Australian man talking to an American man.

You choose your default locale and language when you create your Google “Action,” but if you create an “American” voice bot and your user is in Australia, the user will hear the Australian voices.

Unfortunately, while vanilla SSML does allow you to specify which accents/locale you want, Google has this feature turned off (for now — I’ve already put in a “request” for this feature to be turned on).

Note: Amazon, on the other hand, does let you use SSML to access any voice in their repertoire.

This means that all of your speaking elements are limited to two men and two women (or just 1 woman and 1 man in some languages) who are all native speakers of just one language.

If you type this:

<speak>
Hello there!
</speak>

Then you’ll get whatever the default voice is (for your Google Action).

You might think that choosing the Male 1 default voice for your Google Action will result in every user hearing the Male 1 voice, but this isn’t so. Google lets users override or customize their Action’s default voice, so if a user has chosen the Female 1 voice as their default, they will hear that female voice in your Action instead of the Male 1 voice you intended.

Luckily, Google does let you “override the (gender) override” and choose exactly which one of the four voices that you want to use.

Here’s how you do that:

<speak>
<voice gender="male">
I'm a man. Let's talk about things from a male perspective.
</voice>
<voice gender="female">
I'm a woman. I prefer to discuss things from a female perspective.
</voice>
</speak>

The above code will have the default Male voice read aloud the first two sentences while the default Female voice will read the last two sentences.

But what about two different men talking to each other?

<speak>
<voice gender="male" variant="1">
I'm a man. Let's talk about things from a male perspective.
</voice>
<voice gender="male" variant="2">
I'm also a man. But I'm a different man than you are!
</voice>
</speak>

Effectively, this means that you’re limited to four different native speaker voices.

For most voice apps/bots, that’s probably enough. But I always like to take things to the next level, which is why Starship Zumanji has no fewer than six different characters, each with a different voice! 😀

You can use SSML tags to further customize these 2+2 voices. For example, let’s imagine that you want a little boy to talk to an adult man:

<speak>
<voice gender="male" variant="1">
<prosody pitch="+5st">
Hello father!
</prosody>
</voice>
<voice gender="male" variant="1">
Hello my son.
</voice>
</speak>

In this case, the <prosody> tag is being used to raise the pitch of the first speaker. The “higher” the pitch, the younger the person will sound.

It’s not exactly perfect, but the two “people” now sound quite distinct even though they’re the exact same computer voice. Pretty cool, eh?

Likewise, you can create a creepy voice by lowering the pitch quite a lot:

<speak>
<voice gender="female" variant="1">
<prosody pitch="-9st">
Welcome to your worst nightmare!
</prosody>
</voice>
</speak>

Therefore, when creating your “voice” characters, play around with the male/female variants as well as the pitch until you find a signature sound for each one.

For example:

  • Fred = Male 1, normal pitch
  • Velma = Female 1, +1st (semitone) pitch
  • Daphne = Female 2, normal pitch
  • Shaggy = Male 2, +1st (semitone) pitch
  • The Museum Owner Who, In a Shocking Twist, Is Also the Mysterious Ghost = Male 2, -7st (semitones) pitch

This will help keep them all sounding distinct.
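Sketched in SSML, a character palette like the one above might look like this (the dialogue is placeholder text):

```xml
<speak>
  <voice gender="male" variant="1">
    Let's split up and look for clues, gang!
  </voice>
  <voice gender="female" variant="1">
    <prosody pitch="+1st">Jinkies! Has anyone seen my glasses?</prosody>
  </voice>
  <voice gender="male" variant="2">
    <prosody pitch="-7st">And I would have gotten away with it, too!</prosody>
  </voice>
</speak>
```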

Additional Audio

Let’s say that you want some cool background audio or Foley sounds or maybe even some music for your voice bot, but you can’t find what you’re looking for in Google’s Sound Library.

Obviously, you can find audio files in lots of places around the internet, but SSML requires that every audio file be accessible via a public URL.

For Starship Zumanji, I used a lot of sound effects and background sounds from the BBC Sound Effects Library because they’re available for free for non-commercial purposes.

Those files, however, can’t be “hot linked” directly, which means I needed to host them somewhere myself.

I hate the idea of paying money to host some audio files for a free game, so I decided to download the files that I needed from the BBC Sound Library and then host them on my personal Google Drive. I figured that was pretty good karma, especially for a Google “voicebot” game.

The problem, though, is that I needed to do two things:

  • Make each file publicly accessible to everyone (including people who haven’t logged into a Google account); and
  • Get a standard HTTP URL for each file.

Luckily, there’s a lovely website called GDurl that will let you do just that. Either connect your Google Drive (the easiest method) or paste in each file’s reference link one at a time in order to create a public URL for each audio file that you want to use.

Note: Your audio files won’t have a URL that ends in “.ogg” or “.mp3” but they will work (and play) just fine.

Google Drive does “throttle” your personal files if they get accessed too often, but that only seems to kick in when you truly get a LOT of traffic. I’ve certainly never encountered any problems.
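Once you have a public URL, a Drive-hosted file drops into SSML exactly like a Google-hosted one. In this sketch, the gdurl.com link is a hypothetical placeholder standing in for one of your own files:

```xml
<speak>
  <par>
    <media xml:id="bbc_sfx">
      <!-- Hypothetical GDurl link; the URL has no .ogg/.mp3 extension, but it still plays -->
      <audio src="https://gdurl.com/XXXX" />
    </media>
    <media xml:id="line" begin="bbc_sfx.begin+1s">
      <speak>This line plays over the Drive-hosted sound effect.</speak>
    </media>
  </par>
</speak>
```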

Putting Everything Together

Now you’re ready to put everything together in one lovely, complex mess of nesting SSML tags in order to amplify the emotional impact of your spoken audio.

Yahoo! 🎉

Audio

Audio files can be customized in several ways:

  • When they start playing (relative to other elements)
  • When they stop playing (relative to other elements)
  • The fade in duration (sound starts at zero and ramps up to full volume)
  • The fade out duration (sound fades from full volume to zero)
  • The volume (how loud/quiet they are)
  • How many times they are repeated (default, of course, is 1 time)
  • Their reference name
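Here’s what those customizations look like as attributes on a single <media> element (the values are illustrative; this fragment would sit inside a <par> block alongside a <media xml:id="talking"> element for it to reference):

```xml
<media xml:id="background"
       begin="0s"
       end="talking.end+2s"
       fadeInDur="2s"
       fadeOutDur="3s"
       soundLevel="-3dB"
       repeatCount="2">
  <audio src="https://actions.google.com/sounds/v1/ambiences/factory_background.ogg" />
</media>
```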

TTS Voice

The speaking (text-to-speech) voice can be customized in several ways:

  • Gender (male or female)
  • Variant (i.e. selecting which of the existing voices to choose from)
  • The pitch (how “high” or “low” the tone is)
  • Volume (how loud/quiet it is)
  • Rate (how fast/slow the voice speaks)
  • Break (inserting a pause before reading the next word, usually measured either in seconds (“s”) or milliseconds (“ms”))

I’ve already described how to select the gender and variant.

The other parts are done using the <prosody> tag.

Note: There are other voice tags that you can use such as <emphasis> and <say-as>, but I’ll skip these for now for simplicity’s sake.

<prosody> elements can be stacked together:

<prosody rate="90%" pitch="-1st" volume="110%">Here's some modified speech!</prosody>

Again, see the SSML documentation for exactly which parameters you can use for the <prosody> tag.

XML Names

It’s a really good idea to name all of your different <media> elements inside the <par> tags.

First, it helps you remember what each audio segment is about, which is especially useful if you’ve got files with numerical or nonsensical names or you’re going to play multiple audio files that overlap and you need to remember which one is which.

Secondly, and most importantly, it allows each <media> element to reference another <media> element.

For example, you can have one <media> element begin at a set time after/before a different <media> element starts/ends.

Here’s a sample:

<speak>
<par>
<media xml:id="DMV" fadeInDur="1s" fadeOutDur="3s" soundLevel="+1dB" end="talking.end+3s" repeatCount="1">
<audio src="https://actions.google.com/sounds/v1/ambiences/dmv_background_noise.ogg"></audio>
</media>
<media xml:id="talking">
<speak>
<voice gender="male" variant="1">
<break time="5s" />
<prosody rate="95%" pitch="-1st" volume="110%">
Wow, I cannot believe how long this line is. This is why I hate waiting at the DMV!
</prosody>
</voice>
</speak>
</media>
</par>
</speak>

In this case, the background DMV (in America, this refers to the government agency that issues driver’s licenses) audio file begins playing immediately.

It fades in over a period of 1 second and plays at 1 decibel louder than its original volume.

Meanwhile, the speaking part begins 5 seconds after the DMV background audio has started.

The Male 1 voice is selected for the speaking part. The <prosody> tags downshift the pitch one semitone, speak at 10% louder volume than standard, and speak at a speed that is 95% that of the standard speed.

After the speaking part is done, the DMV audio will continue to play for three more seconds, and then everything comes to an end.

Note: Setting the volume for an audio file and setting the volume for a spoken element are done differently!

By the way, you might have seen that there are two sets of nested <speak> tags in the above code snippet. That’s because the first <speak> tag alerts Google that it needs to start processing some SSML. The second <speak> tag tells Google that it’s time to actually speak aloud (in a human voice), otherwise known as TTS (text-to-speech).

Note: If you have more than one “person” speaking, you will need a separate set of nested <speak> tags for each one.
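For example, a two-person exchange needs two <media> elements, each with its own nested <speak>. Since Google treats the voice parts as all starting simultaneously, the second speaker is offset with a <break> (the dialogue is placeholder text, and the break duration is a guess you’d tune by ear):

```xml
<speak>
<par>
<media xml:id="speaker_one">
<speak>
<voice gender="male" variant="1">
How was your weekend?
</voice>
</speak>
</media>
<media xml:id="speaker_two">
<speak>
<voice gender="female" variant="2">
<break time="3s" />
Wonderful! I finally went hiking.
</voice>
</speak>
</media>
</par>
</speak>
```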

A Trip to the DMV!

Okay, now that you understand how all the basic elements come together, let’s build a little audio skit with what we’ve learned.

In this story, a man will be waiting in line at the DMV. While he’s waiting, he will have a short conversation with his (male) friend.

He will then hear a “bing” sound and his number called over the PA (UK: tannoy). After that, he will engage in a short conversation with a female employee that will make his friend laugh.

This means we need four voices:

  • The man (Male, Variant 1, -1 semitone pitch)
  • His male friend (Male, Variant 2)
  • The PA voice (Female, Variant 1, -5 semitones pitch)
  • The DMV employee (Female, Variant 2)

We also need three sound effects:

  • The DMV background noise
  • The “bing” sound
  • The sound of a man laughing

To make things ultra simple, all the sound effects will come from Google’s free sound library so that anyone reading this can literally copy+paste the code below and get a feel for how the parts work together.

Putting all of the above together, this is what you get:

<speak>
<par>
<media xml:id="DMV1" fadeInDur="1s">
<audio src="https://actions.google.com/sounds/v1/ambiences/dmv_background_noise.ogg"></audio>
</media>

<!-- ... the remaining <media> elements (the two friends' conversation, the "bing"
sound, the PA announcement, the DMV employee, and the man laughing) ... -->

</par>
</speak>

The Google Assistant test console allows you to download your audio masterpieces with one click of a button. The output format is a monaural MP3, which is conveniently a very small file (just a few kilobytes in most cases).

You can hear the above SSML code in its final form here: https://gdurl.com/KE40

A couple of notes:

  • I put a line break between each <media> element so that it’s easier for you to read. But these line breaks are not necessary.
  • The DMV background audio file is 73 seconds long, but the conversation lasts for 80 seconds, so I had to use the DMV audio twice. But I didn’t use repeatCount="2" because the second instance of the DMV audio needed to be treated differently: it ends prematurely, whereas the first instance plays through to the end.
  • While you can cue different audio files based on when a <media> element ends or begins, you cannot do the same for the <voice> speaking parts. Those are treated by Google as all loading simultaneously, so the only way to space them out is by using <break> tags and counting out the seconds from the start of the file.
  • Sometimes, Google’s Test Console goes bananas and doesn’t work, even if all of your SSML is perfectly legitimate. If you’ve been working/testing for a while and everything unexpectedly goes haywire, try closing the browser window and reloading it.
  • Although Google’s SSML webpage says that clipEnd and clipBegin are attributes that you can use, they do not work!!! Trust me, I learned this the hard way.
  • At the moment, it is impossible to start an audio file anywhere other than at its beginning. To end it prematurely, use an attribute like the one above (end="laughing.end+3s") to cut it off relative to when another <media> element starts or finishes.
  • While you can use SSML to choose how fast a voice element speaks, you cannot change how fast an audio file is played.
  • Feel free to use the Google Action test console to create a complex audio file and then download it for use in other integrations or platforms. That’s exactly how the Facebook and Telegram versions of Starship Zumanji are set up. The Google “Action” version renders the SSML on the fly, but the Telegram/FB versions just serve up a standard MP3 file.

Summary

When you see all the SSML typed out, it looks like a heck of a lot of work just for creating a few seconds of audio.

Trust me, I know!

But don’t let the length or the fact that it looks like a lot of “programming” gibberish intimidate you.

SSML lets you fine-tune the entire audio experience. You can precisely layer in background audio, Foley sound effects, and other kinds of audio in combination with your spoken voice parts to create something really unique.

It’s probably best to think of yourself as composing an entire orchestral work, writing each musical instrument’s part separately, and then choosing exactly how they will all go together for the “concert.”

And since there’s literally no way to get a TTS voice to express any emotion, you’ll have to rely on Foley sound effects, human sound effects (like crying, laughing, etc) and background audio to create some emotional tension.

Lastly, my recommendation is that you start with TTS only before you start adding in audio files. This will help you get the hang of pacing your audio to help the TTS voice(s) sound less monotonous. Practice speeding up parts of a sentence, adding pauses, and tweaking the pitch to make each voice sound a tiny bit more natural.
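As a practice exercise, here’s a TTS-only sketch that varies the rate, pitch, and pauses within a single voice (the sentences are placeholders; tweak the values and listen to how the delivery changes):

```xml
<speak>
  <voice gender="female" variant="1">
    <prosody rate="110%">I ran all the way to the station,</prosody>
    <break time="600ms" />
    <prosody rate="85%" pitch="-1st">but the train had already left.</prosody>
    <break time="1s" />
    <prosody pitch="+2st" volume="110%">Can you believe my luck?</prosody>
  </voice>
</speak>
```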


Have fun building your (voice) chatbots, everybody!

The Chatbot Guru

An expert’s view on the emerging field of digital assistants.

Written by Sam Ursu

The best way to reach me is by email at: samcelroman@gmail.com
