How to include MP3 audio files in your voice design

A quick guide to get you started with SSML

Published in Sayspring, Dec 18, 2017

Embedding mp3s into your voice UI is one of the most compelling capabilities that SSML (Speech Synthesis Markup Language) has to offer voice designers. Intro music, sound effects, and recorded voices are underused yet powerful elements of voice design projects. And best of all, they’re easy to implement.

Get started including audio in your voice design

1. Encode your mp3 to make it compatible.

Use converter software to re-encode your file as a valid MP3 (MPEG version 2) with a bit rate of 48 kbps and a sample rate of 16000 Hz.

Here's how to do that quickly with Audacity, which is free. These instructions come directly from the Amazon Dev Blog:

Open the file to convert.
Set the Project Rate in the lower-left corner to 16000.
Click File > Export Audio and change the Save as type to MP3 Files.
Click Options, set the Quality to 48 kbps and the Bit Rate Mode to Constant. This requires the LAME library, which can be found at: http://lame.buanzo.org/#lamewindl.
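If you would rather script the conversion than click through Audacity, the same settings can be sketched in Python. This is a minimal sketch that assumes the ffmpeg command-line tool (built with the LAME encoder) is installed; ffmpeg is not part of the instructions above, and the function names are hypothetical.

```python
import shutil
import subprocess


def build_ffmpeg_command(source, target):
    """Return an ffmpeg argument list that re-encodes `source` with the settings above."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the target if it already exists
        "-i", source,              # input audio file
        "-codec:a", "libmp3lame",  # encode with the LAME MP3 encoder
        "-b:a", "48k",             # bit rate of 48 kbps
        "-ar", "16000",            # sample rate of 16000 Hz
        target,
    ]


def convert(source, target):
    """Run the conversion; raises if ffmpeg is missing or the encode fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_ffmpeg_command(source, target), check=True)
```

Calling `convert("chime.wav", "chime.mp3")` would write the re-encoded file next to the original. MP3 files at a 16000 Hz sample rate fall under MPEG version 2, so no separate flag should be needed for that requirement, but it is worth confirming the output in your own tooling.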

2. Host your encoded mp3. Grab the link to the file.

You must host your MP3 at an internet-accessible HTTPS endpoint. The hosting domain must have a valid, trusted SSL certificate.

One option is Amazon Simple Storage Service (Amazon S3). It is part of Amazon Web Services and meets the requirements above.
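Before pasting a link into your design, a quick sanity check can catch the most common mistake, an `http://` link. The sketch below is an assumption-laden helper, not an official validator: the function name is hypothetical, the `.mp3` extension check is a convenience rather than a stated requirement, and it does not verify the SSL certificate itself.

```python
from urllib.parse import urlparse


def looks_like_hosted_mp3(url):
    """Return True if `url` is an HTTPS link with a host and a path ending in .mp3."""
    parts = urlparse(url)
    return (
        parts.scheme == "https"                   # plain http is not accepted
        and bool(parts.netloc)                    # a host name must be present
        and parts.path.lower().endswith(".mp3")   # assumed naming convention
    )
```

For instance, the S3 link used in this guide passes the check, while the same link with `http://` in front does not.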

3. Insert your sound using simple SSML.

Write your speech inside the <speak> tags. Embed your MP3 with an <audio> tag, using the file's link as the src attribute. Follow the lead of the example below.

EXAMPLE:
<speak> Thanks.
<audio src="https://s3-us-west-1.amazonaws.com/sayspring-prod/media/celtic-open-chime.mp3" />
Your deposit has been processed. What would you like to do next? </speak>

NOTE: If you're prototyping in Sayspring, you don't need to use <speak></speak> to insert audio. Only the <audio /> tag is necessary.
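If your responses are generated by code rather than typed by hand, the same markup can be assembled programmatically. Here is a minimal Python sketch; `build_response` is a hypothetical helper, and it assumes a response with exactly one clip between two pieces of speech.

```python
from xml.sax.saxutils import escape, quoteattr


def build_response(before, audio_url, after):
    """Wrap two pieces of speech and one audio clip in a <speak> element."""
    return "<speak>{} <audio src={} /> {}</speak>".format(
        escape(before),        # escape &, < and > in the spoken text
        quoteattr(audio_url),  # quote and escape the URL as an XML attribute
        escape(after),
    )


ssml = build_response(
    "Thanks.",
    "https://s3-us-west-1.amazonaws.com/sayspring-prod/media/celtic-open-chime.mp3",
    "Your deposit has been processed. What would you like to do next?",
)
print(ssml)
```

Escaping matters because a stray `&` or `<` in the spoken text would otherwise make the SSML invalid.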

The audio clip will now play as part of the response in your project.


Some important limitations to note:

You can use up to 5 audio tags in a single response.
The combined length of all the audio files in a response can't be more than 90 seconds.
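Both limits are easy to check before you ship a response. The sketch below is one possible approach in Python: the function name is hypothetical, the clip lengths are supplied by you (the code does not measure the files), and it assumes that a clip used twice counts twice toward the total.

```python
import xml.etree.ElementTree as ET

MAX_AUDIO_TAGS = 5        # limit on audio tags per response
MAX_TOTAL_SECONDS = 90    # limit on combined audio length per response


def check_audio_limits(ssml, clip_seconds):
    """Return a list of problems found in `ssml`; an empty list means none.

    `clip_seconds` maps each audio src URL to its length in seconds.
    A src missing from the mapping raises KeyError.
    """
    root = ET.fromstring(ssml)  # the SSML must be well-formed with a single root
    sources = [el.get("src") for el in root.iter("audio")]
    problems = []
    if len(sources) > MAX_AUDIO_TAGS:
        problems.append("too many audio tags: {}".format(len(sources)))
    total = sum(clip_seconds[src] for src in sources)
    if total > MAX_TOTAL_SECONDS:
        problems.append("audio too long: {} seconds".format(total))
    return problems
```

An empty list means the response is within both limits; otherwise each entry describes which limit was exceeded.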

Play audio as your entire response, or as an accompaniment to a voice response. The audio tag lets you include sound effects, earcons, and short pieces of music. If your brand has a particular voice, you can include recordings of it in your design.

Audio is a compelling and memorable way to brand a voice-first user experience.

Think of the NBC chimes, the McDonald’s “I’m Lovin’ It” jingle, or the Law & Order dun-dun. With this simple markup, any voice design can build a more emotional, more delightful, and more memorable brand.

Originally published at www.sayspring.com on December 18, 2017.