Tune by Example: How to Tune Watson Text to Speech for Better Intonations

Marco Noel
IBM Watson Speech Services
Apr 21, 2021 · 13 min read

Co-authored by Rachel Liddell and Marco Noel, Product Managers, IBM Watson Speech Services

Photo by Thomas Thompson on Unsplash

About two years ago, we introduced our latest generation of Text-to-Speech (TTS) Neural Voices, bringing state-of-the-art natural-sounding voices into voice solutions. With Speech Synthesis Markup Language (SSML) tags, you can tweak the speed, pitch, and pronunciation of the words and expressions spoken by the voice in your application.

Rhythm and intonation are the parts of a voice that make it sound truly natural. Using the existing SSML tags and phonemes to achieve them can be challenging and time-consuming, and the results might not even meet your expectations. To improve the experience of voice applications and meet the challenges of their development, we are introducing a new feature that makes our neural voices sound even more natural. On that note, let’s introduce Tune by Example.

Introducing Tune by Example

Tune by Example allows you to make more changes to Watson voices with less effort. It enables you to adjust Watson voices to achieve the exact cadence and intonation you want (for now, it is only available for our US English Enhanced Neural voices).

Photo by Obi Onyeador on Unsplash

This new TTS customization feature allows you to improve naturalness and specify speaking style. All you need is a recording of a phrase delivered how you want the Watson voice to speak it. The service then adjusts the synthesized voice to mimic the speaking style in the recording.

What can be “tuned”

Tune by Example is a simple and intuitive way to adjust the naturalness, intonation, and cadence of a voice. To improve the user experience on specific utterances, you might want to put particular intonations on certain words or at the end of a phrase. Here are some ways Tune by Example can be applied:

  • Emphasize the “or” or the “and” in a list. “Do you want sausage, peppers, AND pepperoni?” or “The deal includes two meat toppings OR a three vegetable toppings. Not both.”
  • Change the duration of a particular word or syllable of a word. For example, extend the second syllable of the word “hello” in “why hello there!” so it sounds like “why hellooo there! How can I help you?”
  • Adjust the degree of questioning or command in a sentence. For instance, you can make the question “Really?” sound more incredulous. Our voices already use an inquiring tone when there’s a question mark and an authoritative tone when there’s an exclamation mark, but Tune by Example allows you to amplify or reduce these effects.
  • Add upward or downward inflection to a phrase, such as adding downward inflection to the end of this phrase, “I have good news to give you, sir.”
  • Reduce pauses in a phrase. For example, tune a voice to read through a list more quickly, “We offer a high, low, and medium deductible plan.”

None of these examples can be applied using SSML alone. They require Tune by Example, which gives more control to the user to adjust the speaking style of Enhanced Neural Voices.

Benefits

Using Tune by Example, you can fine-tune phrases you plan to reuse in your Text to Speech implementation. For example, you can adjust the pitch and speaking rate of the greeting message for your virtual agent.

You can also implement the adjustments listed above without becoming an expert in SSML. That means members of your Customer Experience team can simply record, in their own voice, how they want the virtual agent to sound.

Finally, you gain increased control over the sound of Enhanced Neural Voices. For example, SSML cannot add an upward or downward inflection; changes like those listed above are only possible with Tune by Example.

How it works

In this section, we will walk you through the two ways to use Tune by Example:

  • Creating prompts without Speaker model training
  • Creating prompts with Speaker model training

To make things easier for you, we made a GitHub repository with bash scripts, curl commands, sample audio files, and sample text files that I will refer to throughout this article.

To get started, pull this repository locally, then update the files below. All the scripts use these common configurations:

cfg.sh:
>>>>> url="url_of_TTS_Cloud_instance" -- put the URL of your TTS Cloud instance
>>>>> voice=en-US_EmilyV3Voice -- comment or uncomment the voice you wish to use from the list

tts-credentials.txt:
>>>>> apikey:your_TTS_Cloud_API_Key_no_quotes -- keep "apikey:" and put the API key of your TTS Cloud instance after the colon, with no quotes
Photo by Tom Pottiger on Unsplash

Create your baseline

The first step is to identify the phrase(s) you want to tune. I created a text file called “test_utterances.txt” with the lines below

Do you want sausage, peppers, and pepperoni?
The deal includes two meat toppings or a three vegetable toppings. Not both.
why hello there!
why hello there! How can I help you?
Really?
I have good news to give you, sir.
We offer a high, low, and medium deductible plan.

For your baseline comparison, you want to generate each text utterance with an out-of-the-box neural voice. Let’s use the EmilyV3 voice. Run the following script

./1-SythTTS-Audio-Std-Baseline.sh test_utterances.txt

NOTE: you might see this error "grep: tts-custom-model.txt: No such file or directory" - don't worry about it for now, since we have not created a TTS custom model yet.
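If you are curious what the script does behind the scenes, each baseline file comes from a plain /v1/synthesize call with no customization. Below is a minimal sketch of that loop; the exact file naming and error handling in 1-SythTTS-Audio-Std-Baseline.sh may differ.

# Minimal sketch of the baseline synthesis the script performs (illustrative only).
# Assumes cfg.sh sets $url and $voice, and tts-credentials.txt holds a line "apikey:<your key>".
source ./cfg.sh
useCred=$(grep '^apikey:' tts-credentials.txt)   # "apikey:<your key>" for curl basic auth

while IFS= read -r line; do
  out="$(printf '%s' "$line" | tr -c '[:alnum:]' '_')-std.wav"   # derive a file name from the utterance
  curl -s -X POST -u "$useCred" \
       --header "Content-Type: application/json" \
       --header "Accept: audio/wav" \
       --data "{\"text\":\"$line\"}" \
       --output "$out" \
       "$url/v1/synthesize?voice=$voice"
done < test_utterances.txt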

Check and see if you have a folder created with the following eight files:

021821-141648test_utterances.txt-baseline\
AudioList.txt
Do_you_want_sausage__peppers__and_pepperoni_-std.wav
I_have_good_news_to_give_you__sir_-std.wav
Really_-std.wav
The_deal_includes_two_meat_toppings_or_a_three_vegetable_toppings__Not_both_-std.wav
We_offer_a_high__low__and_medium_deductible_plan_-std.wav
why_hello_there__How_can_I_help_you_-std.wav
why_hello_there_-std.wav

Then listen to each WAV audio file and note the tone and intonation. This is your baseline comparison.

Create your audio training data

Now let’s start by creating individual audio files, one for each utterance, with the intonations you wish to have. Each audio file must be in WAV format and must have a sampling rate of no less than 16 kHz. Audio quality is important in order to get optimal results.

The repository contains audio samples for each text utterance with an exaggerated intonation, and the pairs are saved in a text file named “test_utterances.edb” as shown below. This file will be used later with a bash script.

sausage_peppers.wav:Do you want sausage, peppers, and pepperoni?
toppings_not_both.wav:The deal includes two meat toppings or a three vegetable toppings. Not both.
hello_there.wav:why hello there!
hello_how_can_I_help.wav:why hello there! How can I help you?
really.wav:Really?
good_news.wav:I have good news to give you, sir.
high_low_medium_deductible.wav:We offer a high, low, and medium deductible plan.

For the speaker model experimentation, the repository contains a couple more audio files to train the speaker model feature. Just like the previous ones, these audio files must be in WAV format with a sampling rate of no less than 16 kHz, and must not exceed 60 seconds. The audio can contain anything; it is only used to capture the speaking style and intonations.

Photo by Standsome Worklifestyle on Unsplash

The first audio file contains exactly the same utterances as the file “test_utterances.txt” we used earlier to generate our baseline.

test_utterances.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz

The second audio file contains the general text below, read with a particular speaking style.

speaker_training.txt:
Hello there! Welcome to PizzaCo! How can I help you today?
Do you want sausage, peppers, and pepperoni on your pizza?
The deal includes two meat toppings or a three vegetable toppings. Not both.
We offer a high, low, and medium soft drink with your order.
Really? - This is very interesting...
Well, I have good news to give you, sir.
speaker_training.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz

Experimenting with prompts without a speaker model

The quickest way to use Tune by Example is to build a custom prompt directly from an audio file. No speaker model is required. A custom prompt behaves like “hard-coded” text with the new intonations. As you will see in the example below, it must be the only thing that appears in the synthesis request; you cannot include additional text. If the text changes, you must create a new prompt with the new text, then update the synthesis request with the new prompt id.

Create your TTS custom model and prompts

To configure the custom prompt, you first need to create a TTS custom model. This is the “container” in which the custom prompts will be stored in the TTS instance.

To create a TTS custom model named “Medium-TBE-Demo”, use the bash script below.

./2-createTTSCustomModel.sh Medium-TBE-Demo

The script generates a text file called “tts-custom-model.txt” containing the TTS customization id. This file will be used by the other bash scripts.

{"customization_id": "f62f748d-48b7-47b1-97e0-3a5312ce78d1"}

To get the status of the TTS custom model you just created, run the following script

./3-getTTSCustomModels.sh

You will get the following JSON information about the TTS custom model: the owner ID specific to your TTS instance, the customization ID that we will use later in the TTS “synthesize” API calls, the unique name, the description, the language used, and finally, the prompts it contains.

{
"owner": "abcd1234-ab12-ab12-ab12-abcdef1234567",
"customization_id": "f62f748d-48b7-47b1-97e0-3a5312ce78d1",
"created": "2021-02-18T19:24:38.498Z",
"name": "Medium-TBE-Demo Model",
"words": [],
"description": "Medium-TBE-Demo custom voice model",
"language": "en-US",
"last_modified": "2021-02-18T19:24:38.498Z",
"prompts": []
}

Creating prompts with your text utterances and matching audio

The next step is to create a prompt in your TTS customization by mapping a text prompt to a WAV audio file in which that prompt is spoken.

When running the following script

./4-createTTSPromptsNoSpeaker.sh test_utterances.edb

you should see the output below in your terminal. Note the status “processing”, meaning the training process has started. An entry like this appears for every prompt.

Creating prompt  sausage_peppers_noSpeaker  from  sausage_peppers.wav  audio file with text utterance  Do you want sausage, peppers, and pepperoni? ............
{
"prompt_id": "sausage_peppers_noSpeaker",
"prompt": "Do you want sausage, peppers, and pepperoni?",
"status": "processing"
}
...
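Each prompt is added through the Add custom prompt endpoint: a multipart request that pairs a small JSON metadata part (the prompt text) with the WAV file. Here is a hedged sketch of a single such call, assuming $customID holds the customization ID from tts-custom-model.txt

curl -X POST -u "$useCred" --form metadata="{\"prompt_text\": \"Do you want sausage, peppers, and pepperoni?\"}" --form file=@sausage_peppers.wav "$url/v1/customizations/$customID/prompts/sausage_peppers_noSpeaker"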

To check the status of your prompts, you run the script below:

./5-checkStatusTTSPrompts-NoSpeaker.sh test_utterances.edb

You should see the following results. Note the status is now “available”, meaning the training is complete and the prompts are ready to use. As with the previous output, an entry appears for every prompt.

Checking status from prompt  sausage_peppers_noSpeaker  .............
{
"prompt_id": "sausage_peppers_noSpeaker",
"prompt": "Do you want sausage, peppers, and pepperoni?",
"status": "available"
}
...
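If you prefer to query the service directly instead of using the script, the same status information should be available from the List custom prompts and Get a custom prompt endpoints

# List all prompts in the custom model
curl -X GET -u "$useCred" "$url/v1/customizations/$customID/prompts"
# Get the status of a single prompt
curl -X GET -u "$useCred" "$url/v1/customizations/$customID/prompts/sausage_peppers_noSpeaker"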

Testing the newly created prompts

To use a prompt in playback, you must provide the id of the prompt you wish to use (see the step above).

curl -X POST -u $useCred --header "Content-Type: application/json" --header "Accept: audio/wav" --data "{\"text\":\"<ibm:prompt id='good_news_noSpeaker'/>\"}" --output good_news.wav-tbe-noSpeaker.wav "$url/v1/synthesize?customization_id=$customID&voice=$voice"

Run the following script to synthesize the text utterances with your new Tune by Example prompts

./6-SythTTS-Audio-TbE-NoSpeaker.sh test_utterances.edb

You should see a new folder like the one below with your new audio files

021821-142948test_utterances.edb-tbe-noSpeaker\
good_news.wav-tbe-noSpeaker.wav
hello_how_can_I_help.wav-tbe-noSpeaker.wav
hello_there.wav-tbe-noSpeaker.wav
really.wav-tbe-noSpeaker.wav
sausage_peppers.wav-tbe-noSpeaker.wav
toppings_not_both.wav-tbe-noSpeaker.wav

Listen to the newly generated WAV files. You should notice a difference. You can complete as many of these cycles on as many phrases as you want. Note that it may take several attempts to get the tuning just right. This is an iterative process, just like all customizations.

Optional: Experimenting with a speaker model for optimal results

The speaker model trains the service on your speaking style to improve the output of Tune by Example. All recordings from a particular individual are associated with the same speaker model. Although this step is optional, you should create a speaker model with a sample of speech for the best results. If several people provide recordings, you will need a separate speaker model for each individual. It is a one-time training that improves the quality of all the prompts for that speaker.

Photo by Kelly Sikkema on Unsplash

Create a speaker model

A speaker model is created at the TTS instance level and is thus independent from TTS custom models. You can train multiple independent speaker models. You can link any speaker model (aka speaker style) to any custom prompt. The same speaker model can be associated with multiple prompts that are defined in different TTS custom models.

To create a speaker model, run the following script with the WAV file you created earlier containing your speaking style.

For this example, let’s use “test_utterances.wav”, which reads each utterance as-is with an exaggerated speaking style

./7-createTTSSpeakerModel.sh test_utterances.wav

The speaker model id is stored in a JSON file with the same name as the audio file. Since the script uses the base filename of the audio file, make sure it contains valid characters.

test_utterances.json:{"speaker_id": "9bfdd81c-d048-43bc-a12b-a33bd8498597"}
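The script wraps the Create speaker model endpoint, which takes the speaker name as a query parameter and the WAV file as the request body; a minimal equivalent call looks like this

curl -X POST -u "$useCred" --header "Content-Type: audio/wav" --data-binary @test_utterances.wav "$url/v1/speakers?speaker_name=test_utterances"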

Now, let’s create another speaker model, but this time with the general free-form text “speaker_training.wav”, read with another speaking style

./7-createTTSSpeakerModel.sh speaker_training.wav

This generates a separate speaker id

speaker_training.json:{"speaker_id": "e2dc9fe9-ff07-4d4b-8cd5-30760de9d909"}

To see a list of all the speaker models you have in your TTS instance, run the following script

./8-listTTSSpeakerModel.sh

You should see something similar to the following

{"speakers": [
{
"speaker_id": "9bfdd81c-d048-43bc-a12b-a33bd8498597",
"name": "test_utterances"
},
{
"speaker_id": "e2dc9fe9-ff07-4d4b-8cd5-30760de9d909",
"name": "speaker_training"
}
]}

Creating prompts with speaker model

Now that you have created your speaker models, you need to create a prompt in your TTS custom model, then map the speaker model to a text prompt with a WAV audio file containing the text prompt.

Run the following script with the EDB file as the first argument and, as the second argument, the JSON file with the speaker ID you wish to use

./9-createTTSPromptsSpeaker.sh test_utterances.edb test_utterances.json

You should see this output in your terminal. Notice the “speaker_id”, which means that the prompts are being trained with the speaker model. An entry appears for every prompt.

Creating prompt  sausage_peppers_test_utterances  from  sausage_peppers.wav  audio file with speaker id  9bfdd81c-d048-43bc-a12b-a33bd8498597  with text utterance  Do you want sausage, peppers, and pepperoni? ............
{
"speaker_id": "9bfdd81c-d048-43bc-a12b-a33bd8498597",
"prompt_id": "sausage_peppers_test_utterances",
"prompt": "Do you want sausage, peppers, and pepperoni?",
"status": "processing"
}
...
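The underlying call is the same Add custom prompt request sketched earlier; the only difference is that the metadata part now also carries the speaker_id. A hedged example of one such call

curl -X POST -u "$useCred" --form metadata="{\"prompt_text\": \"Do you want sausage, peppers, and pepperoni?\", \"speaker_id\": \"9bfdd81c-d048-43bc-a12b-a33bd8498597\"}" --form file=@sausage_peppers.wav "$url/v1/customizations/$customID/prompts/sausage_peppers_test_utterances"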

Let’s run the same script again with the same EDB file, but this time with the other JSON file and its speaker ID

./9-createTTSPromptsSpeaker.sh test_utterances.edb speaker_training.json

The results below show the other speaker ID and prompt names; again, an entry appears for every prompt.

Creating prompt  sausage_peppers_speaker_training  from  sausage_peppers.wav  audio file with speaker id  e2dc9fe9-ff07-4d4b-8cd5-30760de9d909  with text utterance  Do you want sausage, peppers, and pepperoni? ............
{
"speaker_id": "e2dc9fe9-ff07-4d4b-8cd5-30760de9d909",
"prompt_id": "sausage_peppers_speaker_training",
"prompt": "Do you want sausage, peppers, and pepperoni?",
"status": "processing"
}
...

To check the status of your prompts for the “test_utterances.json” speaker model, run the script below

./10-checkStatusTTSPromptsSpeaker.sh test_utterances.edb test_utterances.json

The status should now show as “available” for all prompts

Checking status from prompt  sausage_peppers_test_utterances  .............
{
"speaker_id": "9bfdd81c-d048-43bc-a12b-a33bd8498597",
"prompt_id": "sausage_peppers_test_utterances",
"prompt": "Do you want sausage, peppers, and pepperoni?",
"status": "available"
}
...

Run the script below to check the status for the “speaker_training.json” speaker model

./10-checkStatusTTSPromptsSpeaker.sh test_utterances.edb speaker_training.json

The status should now show “available”

Checking status from prompt  sausage_peppers_speaker_training  .............
{
"speaker_id": "e2dc9fe9-ff07-4d4b-8cd5-30760de9d909",
"prompt_id": "sausage_peppers_speaker_training",
"prompt": "Do you want sausage, peppers, and pepperoni?",
"status": "available"
}
...

Using the following scripts, you can pull more details about the speaker models, passing the corresponding JSON file as the argument

./11-listTTSSpeakerModelDetails.sh test_utterances.json
./11-listTTSSpeakerModelDetails.sh speaker_training.json

These details are stored in “test_utterances-details.txt” and “speaker_training-details.txt” for future reference. Open these files and review the information. The prompts should all show a status of “available” and are thus ready to be tested.
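These detail files come from the Get a speaker model endpoint, which lists the custom models and prompts associated with a speaker; for example

curl -X GET -u "$useCred" "$url/v1/speakers/9bfdd81c-d048-43bc-a12b-a33bd8498597"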

Testing the new prompts with the speaker model

Run the following scripts to synthesize the text utterances with your Tune by Example prompts, this time once for each speaker model training (“test_utterances.json” and “speaker_training.json”)

./12-SythTTS-Audio-TbE-Speaker.sh test_utterances.edb test_utterances.json
./12-SythTTS-Audio-TbE-Speaker.sh test_utterances.edb speaker_training.json

You should see the new folders below with your new audio files

021821-154949test_utterances.edb-tbe-test_utterances\
good_news.wav-tbe-test_utterances.wav
hello_how_can_I_help.wav-tbe-test_utterances.wav
hello_there.wav-tbe-test_utterances.wav
high_low_medium_deductible.wav-tbe-test_utterances.wav
really.wav-tbe-test_utterances.wav
sausage_peppers.wav-tbe-test_utterances.wav
toppings_not_both.wav-tbe-test_utterances.wav
021821-155008test_utterances.edb-tbe-speaker_training\
good_news.wav-tbe-speaker_training.wav
hello_how_can_I_help.wav-tbe-speaker_training.wav
hello_there.wav-tbe-speaker_training.wav
high_low_medium_deductible.wav-tbe-speaker_training.wav
really.wav-tbe-speaker_training.wav
sausage_peppers.wav-tbe-speaker_training.wav
toppings_not_both.wav-tbe-speaker_training.wav

Listen to the newly generated WAV files. You should notice a difference. You can complete as many of these cycles on as many phrases as you want. Note that it may take several attempts to get the tuning just right. As noted above, this is an iterative process, just like all customizations.

See detailed documentation around custom prompts and speaker models here.

Feature Limitations

Tune by Example is not a dynamic feature: it does not adjust automatically as the text changes. You must configure it before using it. The best use of this feature is with static content like a greeting, a disclaimer, an acronym or a product name, not with dynamic content like an account balance. The dollar amounts themselves will be different for each user. So, to tune them, you would need an infinite number of prompts, which would not be very efficient!

SSML is still a powerful tool! Keep using SSML for specific adjustments like pitch, speed, phonetic pronunciations, and text normalization.
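For example, a quick pitch or rate adjustment is still a one-line SSML change in the synthesis request. Here is a hedged sketch using the standard prosody element; the attribute values are purely illustrative

curl -X POST -u "$useCred" --header "Content-Type: application/json" --header "Accept: audio/wav" --data "{\"text\":\"<prosody rate='-10%' pitch='+5%'>Hello there! Welcome to PizzaCo!</prosody>\"}" --output welcome-ssml.wav "$url/v1/synthesize?voice=$voice"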

Tune by Example is only available for Enhanced Neural Voices for US English. You need to be using a TTS Standard plan to leverage the feature. Go and try it!

  • You can find the Watson Text to Speech documentation here. Find the Tune by Example documentation here.
  • You’ll need to set up an IBM Cloud account if you do not have one already. You can do so here.
  • To learn more about TTS, you can go through the TTS Getting Started video here.

Happy Tuning!
