Control digital voice speech rate and pitch using the Watson Text to Speech (TTS) library

Abhilasha Mangal
IBM Data Science in Practice
5 min read · Dec 21, 2023

In today’s digital environment, many consumers prefer to listen to digital text rather than read it, and text-to-speech (TTS) technology can help you turn digital text into audio. IBM Watson Text to Speech is a service that uses speech synthesis to convert text into natural-sounding speech in a range of languages, dialects, and voices. The service also includes a number of settings for modifying the characteristics of the synthesized speech, such as speech rate and voice pitch.

Text to Speech Dash app

IBM Watson’s text-to-speech model is built using machine learning techniques and deep neural networks, trained on large amounts of speech and text data. The model uses a combination of statistical methods and neural network architectures to generate natural-sounding speech from text inputs.

This blog gives an overview of how to convert text into speech and how to control speech rate and voice pitch using the Watson Speech libraries. Here, we install a single-container TTS service on a local machine with Docker. To learn more about using the single-container TTS service you can see here. After starting the Docker TTS service using the steps below, you can easily embed Watson TTS in your application.

1. Data Processing and EDA (Exploratory Data Analysis)

The speech synthesis service requires the input text to be sent as JSON. Raw text often contains characters, such as unescaped quotes, line breaks, and control characters, that are not valid inside a JSON string. Those characters must be replaced or escaped. You can use the following code snippet to clean your text before sending it in a synthesis request.

Data pre-processing
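A minimal sketch of such a cleaning step might look like the following. The function names clean_text and build_payload are illustrative and not part of any Watson library, and the choice of which characters to strip is an assumption. Escaping quotes and backslashes is left to json.dumps.

```python
import json
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize raw text before it is placed in a synthesis request."""
    # Turn every run of whitespace (newlines, tabs, ...) into one space.
    text = re.sub(r"\s+", " ", text)
    # Drop remaining control characters (Unicode category "Cc").
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # Removing a control character can leave two spaces side by side.
    return re.sub(r" {2,}", " ", text).strip()


def build_payload(text: str) -> str:
    """Return the JSON request body; json.dumps escapes quotes and backslashes."""
    return json.dumps({"text": clean_text(text)})


raw = 'She said "hello",\n\tthen left.\x00'
print(build_payload(raw))
```

Letting json.dumps build the body avoids hand-written replace chains for quotes and backslashes, so the cleaning function only has to deal with characters you do not want spoken at all.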

2. Text-to-speech synthesis

Use the Watson Text to Speech service’s POST /v1/synthesize method to request speech synthesis. To access this service, you can use the Python requests library. You can set the parameters and the synthesize endpoint as shown below.

Text-to-speech service
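As a rough sketch, the request could be assembled like this. The URL assumes the single-container service is listening on localhost port 1080, and the voice name assumes that en-US_AllisonV3Voice is installed in your container; adjust both to your setup. The helper names build_request and synthesize are hypothetical.

```python
import json

# Assumption: the local single-container TTS service listens on port 1080.
TTS_URL = "http://localhost:1080/text-to-speech/api/v1/synthesize"


def build_request(text, voice="en-US_AllisonV3Voice", accept="audio/wav"):
    """Return the headers, query parameters, and JSON body for a synthesis call."""
    headers = {"Content-Type": "application/json", "Accept": accept}
    params = {"voice": voice}
    body = json.dumps({"text": text})
    return headers, params, body


def synthesize(text, voice="en-US_AllisonV3Voice", url=TTS_URL, timeout=60):
    """POST the text to the service and return the raw audio bytes."""
    # Imported here so build_request stays usable without requests installed.
    import requests

    headers, params, body = build_request(text, voice)
    response = requests.post(
        url, headers=headers, params=params, data=body, timeout=timeout
    )
    response.raise_for_status()  # surface HTTP errors instead of saving bad audio
    return response.content
```

With the container running, synthesize("Hello from Watson") should return the audio as bytes. The Accept header selects the audio format, here WAV.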

After the POST request returns, you can save the audio output to a local directory or to the cluster.

Save speech data
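One way to write the response to disk, assuming the audio is already held as bytes (for example, the content of a requests response). The save_audio name and the output path are illustrative.

```python
from pathlib import Path


def save_audio(audio_bytes: bytes, path: str) -> Path:
    """Write synthesized audio bytes to a file, creating parent folders as needed."""
    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(audio_bytes)  # binary write; audio must not be decoded as text
    return out_path
```

For example, save_audio(response.content, "output/result.wav") would store a WAV response under the output folder.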

This audio output can be printed and played in a Python Jupyter Notebook.

Speech data output
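In a notebook, playback is typically done with IPython’s Audio display object. The sketch below assumes IPython is available (as it is in Jupyter) and accepts either the raw audio bytes or a path to a saved file. The helper names audio_kwargs and notebook_player are hypothetical.

```python
from pathlib import Path


def audio_kwargs(source):
    """Map audio bytes or a file path to the keyword IPython's Audio expects."""
    if isinstance(source, (bytes, bytearray)):
        return {"data": bytes(source)}
    return {"filename": str(Path(source))}


def notebook_player(source):
    """Return an inline audio player for use as the last expression in a cell."""
    # Imported lazily: IPython is normally only present in notebook environments.
    from IPython.display import Audio

    return Audio(**audio_kwargs(source))
```

Ending a notebook cell with notebook_player("output.wav") should render an inline player for the saved file.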

3. Modifying speech synthesis characteristics

The Watson Text-to-Speech Service includes query parameters that you can use to globally modify the characteristics of speech synthesis for an entire request.

3.1 Rate Percentage

You can use the rate_percentage query parameter to modify the rate of the synthesized speech for a voice. The parameter value can be an integer that represents the percentage change from the voice’s default.
For example:
1. Specify a signed negative integer to reduce the speaking rate by that percentage. For example, -10 reduces the rate by 10%.
2. Specify an unsigned or signed positive integer to increase the speaking rate by that percentage. For example, 10 and +10 increase the rate by 10%.
3. Specify 0 or omit the parameter to get the default speaking rate for the voice.
Using the following code, you can pass the rate_percentage value you need in the TTS service’s request parameters.

Rate percentage parameter
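The rules above can be sketched as a small helper that builds the query parameters. The rate_params name and the default voice are assumptions for illustration. The helper omits the parameter when the value is 0, which according to the rules above gives the same result as sending 0.

```python
from urllib.parse import urlencode


def rate_params(rate_percentage: int = 0, voice: str = "en-US_AllisonV3Voice") -> dict:
    """Build query parameters with an optional speaking-rate adjustment."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rate_percentage, bool) or not isinstance(rate_percentage, int):
        raise TypeError("rate_percentage must be an integer")
    params = {"voice": voice}
    if rate_percentage != 0:
        params["rate_percentage"] = rate_percentage  # e.g. -10 is 10% slower
    return params


# Show the query string for slower, default, and faster speech.
for rate in (-10, 0, 10):
    print(urlencode(rate_params(rate)))
```

The resulting dictionary can be passed as the params argument of requests.post alongside the JSON body.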

3.2 Pitch percentage

You can use the pitch_percentage query parameter to modify the pitch of the synthesized speech for a voice. Each voice has a preset, baseline pitch that corresponds to the tone it is designed to convey. The parameter accepts an integer that represents the percentage change from the voice’s default.
For example:
1. Specify a signed negative integer to reduce the pitch by that percentage. For example, -10 lowers the pitch by 10%.
2. Specify an unsigned or signed positive integer to increase the pitch by that percentage. For example, 10 and +10 increase the pitch by 10%.
3. Specify 0 or omit the parameter to get the default pitch for the voice.
Using the following code, you can pass the pitch_percentage value in the TTS service’s request parameters.

Pitch percentage parameter
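As an illustration, pitch can be layered onto an existing set of query parameters. This sketch assumes that pitch_percentage and rate_percentage can be sent in the same request. The add_pitch helper and the voice name are hypothetical.

```python
def add_pitch(params: dict, pitch_percentage: int) -> dict:
    """Return a copy of the query parameters with pitch_percentage set."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(pitch_percentage, bool) or not isinstance(pitch_percentage, int):
        raise TypeError("pitch_percentage must be an integer")
    updated = dict(params)  # copy so the caller's base parameters stay unchanged
    updated["pitch_percentage"] = pitch_percentage
    return updated


base = {"voice": "en-US_AllisonV3Voice", "rate_percentage": -10}
lower = add_pitch(base, -10)   # 10% lower pitch, 10% slower
higher = add_pitch(base, 10)   # 10% higher pitch, 10% slower
print(lower)
print(higher)
```

Returning a copy makes it easy to synthesize the same text at several pitch settings and compare the results by ear.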

4. Analysis of playback customer calls

Companies can use the TTS service as an out-of-the-box model to generate English voices without customization. To demonstrate that, you can pass multiple texts to the speech synthesis service and save each result as a separate audio file.

TTS app speech data output
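A batch run over several texts might be organized as below. The synthesize argument stands for any function that takes a string and returns audio bytes (for instance, a wrapper around the POST request). It is passed in so that the loop itself makes no assumptions about the endpoint. The file naming scheme is also an assumption.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def synthesize_batch(
    texts: Iterable[str],
    synthesize: Callable[[str], bytes],
    out_dir: str = "tts_output",
) -> List[Path]:
    """Synthesize each text and save it as a numbered WAV file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, text in enumerate(texts, start=1):
        path = out / f"utterance_{index:03d}.wav"
        path.write_bytes(synthesize(text))  # one request and one file per text
        saved.append(path)
    return saved
```

Calling synthesize_batch(["Thank you for calling.", "Please hold."], my_synthesize) would produce utterance_001.wav and utterance_002.wav in the tts_output folder, where my_synthesize is your own request function.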

Conclusion

This blog showed how you can use the Watson Speech library to convert text to speech and control the speech rate and pitch of the digital voice. To learn more about the TTS service, you can download the code from GitHub. One use case you could try is using this TTS service to build a mobile to-do app that allows users to capture voice memos to save as written to-do items, or a simple out-of-the-box model to capture English voices without any customization for further training purposes.

Embeddable AI
You can start your AI journey by browsing and building AI models through a guided wizard here. The IBM Build Lab team is here to work with you on your AI journey. For more information, see the Embeddable AI webpage.

You can also browse the collection of Embeddable AI self-serve assets at Tech Zone and on GitHub.
