Talk To Me (with TTS)

A general overview of 2023’s current neural text-to-speech solutions

Matthew A. Pagan
The Modern Scientist
8 min read · Jul 14, 2023


Philip K. Dick published his novel Clans of the Alphane Moon in 1964. In the story, Dr. Mary Rittersdorf, a psychologist, travels to a moon of Alpha Centauri with Daniel Mageboom, who tells her that he is an official representative of America’s propaganda production outfit.

Despite the numerous conversations she holds with him throughout the course of her mission, Rittersdorf fails to recognize that Mageboom is in fact a realistic android (a simulacrum) being operated remotely from Earth.

The plot of this classic sci-fi novel rests on a vision of robotic simulacra with speaking capabilities so flawlessly fluid that citizens of countries around the world are unable to detect that these eloquent vocalizations are mechanically synthesized rather than biologically produced.

Cover art by Davis Meltzer; c.1964; 1972; 2nd ACE printing

A brief history of synthesized speech

Today, state-of-the-art speech synthesis approaches similar levels of plausibility. Text-to-speech (TTS) is an old technology, and originally its dissimilarity to human speech was so jarring that hearing it was all but painful. Bell Labs demoed an early speech-synthesizing machine at the 1939 New York World’s Fair. Later, in 1961, Bell Labs computer scientist John Larry Kelly Jr. programmed the IBM 704 to sing Harry Dacre’s 1892 song “Daisy Bell (Bicycle Built for Two)”. Science fiction author Arthur C. Clarke witnessed this vocoder performance in person, and it inspired the scene in 2001: A Space Odyssey in which the supercomputer HAL 9000 sings the same song.

One of the most popular open-source speech synthesis programs, espeak, saw its initial release in 2006, and has been included by default for years in many Unix-like desktop operating systems, such as Debian Linux.
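For a sense of how this older generation of tools is invoked, here is a minimal sketch that shells out to the espeak command-line program (assuming it is installed) and writes a WAV file; the filename and sample sentence are arbitrary:

import subprocess

# Call the espeak CLI (e.g. installed via `apt install espeak`);
# -w writes the synthesized speech to a WAV file instead of playing it
subprocess.run(
    ["espeak", "-w", "hello_espeak.wav", "Hello from a formant synthesizer"],
    check=True,
)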

Such approaches to artificial speech generation relied on rule-based techniques such as formant synthesis and the concatenation of stored phoneme sounds. Known combinations of letters in the input text mapped to individual sounds, and application logic would string them together to simulate word-sounds.
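To make the mechanism concrete, here is a deliberately toy sketch of the concatenative idea, not any real engine’s code: each “phoneme” is a canned waveform (faked here with short sine tones), and the output is simply those units glued end to end:

import math
import struct
import wave

SAMPLE_RATE = 16000

def tone(freq, seconds=0.12):
    """A stand-in for a recorded phoneme: a short sine tone."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq * t / SAMPLE_RATE) for t in range(n)]

# Hypothetical "phoneme inventory": letters mapped to canned waveforms
PHONEMES = {"h": tone(200), "e": tone(300), "l": tone(250), "o": tone(350)}

# Concatenation step: string the stored units together in text order
samples = [s for ch in "hello" for s in PHONEMES.get(ch, [])]

with wave.open("concat_demo.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 16-bit samples
    out.setframerate(SAMPLE_RATE)
    out.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))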

A major breakthrough came in 2016 with a paper from Google DeepMind, titled “WaveNet: A Generative Model for Raw Audio” (arxiv), which demonstrated improved results in generative audio using deep learning techniques, employing models trained on large datasets of human speech recordings.

Since then, a number of consumer text-to-speech options leveraging deep learning techniques have become available, markedly improving the realism of the generated audio.

State of the Market

I’m a big fan of open-source solutions, when they exist. Unfortunately, a usable open-source, open-model text-to-speech tool that leverages deep learning has not really appeared yet. Open-source speech synthesis tools available from public git repositories, such as gnuspeech, still use a traditional concatenative approach (although research advancements have made these implementations more sophisticated, and thus improved the output).

Below I will explore features, drawbacks, and code snippets for the following neural TTS services (all of which are paid, but all of which also have free trial periods or free tiers):

  • Coqui
  • AWS
  • 11labs
  • Azure
Coqui

https://app.coqui.ai/studio

One library I have found, TTS by Coqui.ai, can be called from within a Python script or through a CLI tool. Their Python library calls an API on their servers, so although they have code on GitHub, the mechanics of how it works are a commercial trade secret. Nevertheless, they have released a number of open tools for training your own speech synthesis models, as well as some open research on neural text-to-speech.
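The CLI route is a one-liner. As a sketch, assuming the TTS package is installed and using the same example model as the Python script below, you can shell out to the tts command:

import subprocess

# Invoke Coqui's `tts` CLI (installed alongside the Python library) to write a WAV file
subprocess.run(
    [
        "tts",
        "--text", "Hello world!",
        "--model_name", "tts_models/en/ljspeech/glow-tts",
        "--out_path", "output.wav",
    ],
    check=True,
)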

Their Coqui Studio product provides a feature-rich web interface that allows for layering multiple voices on top of one another.

Price-wise, you need to buy a plan to go beyond the five minutes’ worth of free trial usage. Their cheapest plan is $20 USD per month, which gets you four hours of audio generation per month using their V1 model, or two hours via their XTTS model.

As of this writing, Coqui has fifty-eight different voices available as default choices in Coqui Studio, in addition to the ability for users to drop in their own custom voice models.

Below is some basic Python code using Coqui to generate an MP3 audio file:

from TTS.api import TTS
from pydub import AudioSegment

# Initialize the TTS with a specific model
tts = TTS(model_name="tts_models/en/ljspeech/glow-tts", progress_bar=False, gpu=False)

# Generate the speech and save it to a file
tts.tts_to_file(text="Hello world!", file_path="output.wav")

# Convert the WAV file to MP3 (pydub requires ffmpeg for MP3 export)
audio = AudioSegment.from_wav("output.wav")
audio.export("output.mp3", format="mp3")

Amazon Polly

https://aws.amazon.com/polly/

Amazon has their own neural speech generation tool available through AWS, called Amazon Polly.

You can generate speech programmatically or through their slick web GUI. Text longer than 3,000 characters can’t be downloaded directly from this page; instead, it must be synthesized to an S3 bucket and downloaded from there. If you already use AWS, you might know that their CLI tool is called aws and their Python library is boto3.
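For those longer texts, the programmatic equivalent is Polly’s asynchronous start_speech_synthesis_task API, which deposits the result in an S3 bucket. A minimal sketch, where the bucket name is a placeholder for one you own:

import boto3

polly = boto3.Session(region_name="us-west-2").client("polly")

# Asynchronous synthesis: Polly writes the finished MP3 to your S3 bucket
task = polly.start_speech_synthesis_task(
    Engine="neural",
    VoiceId="Matthew",
    OutputFormat="mp3",
    OutputS3BucketName="your-bucket-name",  # placeholder: an S3 bucket you own
    Text="A passage long enough to exceed the console's 3,000-character limit...",
)
print(task["SynthesisTask"]["TaskId"])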

I’ve included below some basic Python code for generating a neural Text-to-Speech mp3 file with AWS Polly:

import boto3

# Create an AWS Polly client
polly_client = boto3.Session(
    region_name='us-west-2'  # Replace this with a region that supports neural voices
).client('polly')

response = polly_client.synthesize_speech(
    Engine='neural',  # Request the neural engine rather than the standard one
    VoiceId='Emma',
    OutputFormat='mp3',
    Text='Hello World'
)

# The response contains the audio stream under the AudioStream key
audio_stream = response.get('AudioStream')

# Write the stream to an mp3 file, then close it
with open('hello_world.mp3', 'wb') as file:
    file.write(audio_stream.read())

audio_stream.close()

# Sanity check that the stream was closed
if not audio_stream.closed:
    print("Audio stream is not closed.")
else:
    print("Audio stream is closed.")

While adequate for most speech tasks, the selection of voices is limited to a fixed, fairly small set.

Amazon Polly is available in thirty-one different languages and language variants. These “variants” include seven different varieties of English: American, Irish, British, Indian, South African, New Zealand, and Australian. If you ever expect to look for a variety of authentic-sounding voices, you might find yourself dipping into one of these other dialects. Within American English, there are only two different adult male-sounding voices. Some of the AWS-provided voices with male names, such as “Kevin,” are designed as infantile-sounding child voices.
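If you want to check the current roster for yourself, Polly’s describe_voices call returns every available voice; here is a small sketch that lists the neural ones by language:

import boto3

polly = boto3.Session(region_name="us-west-2").client("polly")

# List every voice that supports the neural engine, grouped by language
voices = polly.describe_voices(Engine="neural")["Voices"]
for voice in sorted(voices, key=lambda v: v["LanguageCode"]):
    print(f'{voice["LanguageCode"]}: {voice["Name"]} ({voice["Gender"]})')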

AWS pricing for neural TTS is fairly affordable at $16 USD per one million characters. Their pricing page estimates that text of that length would produce 23 hours and 8 minutes of audio.
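That estimate works out to roughly seventy cents per hour of generated audio, as a quick back-of-the-envelope check shows:

# Back-of-the-envelope cost per hour of Polly neural audio
price_per_million_chars = 16.00        # USD
hours_per_million_chars = 23 + 8 / 60  # 23 hours 8 minutes

print(f"${price_per_million_chars / hours_per_million_chars:.2f} per hour")
# -> $0.69 per hour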

11labs

https://beta.elevenlabs.io/speech-synthesis

This company was only founded in 2022, but already it’s become a popular neural TTS solution. As far as pricing goes, 11labs has a free tier allowing users to generate audio for up to 10,000 characters per month. However, users on the free plan must credit 11labs in any generated audio content, whereas paid-plan users (starting at $5 USD per month) are not required to.

Here’s some basic boilerplate Python code to generate an MP3 file using the elevenlabs Python library:


from elevenlabs import generate, play

# Free-tier usage needs no API key; paid plans can set one:
# from elevenlabs import set_api_key; set_api_key("your-api-key")

# Generate audio
audio = generate(
    text="Hi! My name is Bella, nice to meet you!",
    voice="Bella",
    model="eleven_monolingual_v1"
)

# Play audio (requires a local audio utility such as ffplay)
play(audio)

# Save the raw MP3 bytes to a file
with open("output.mp3", "wb") as out:
    out.write(audio)

Their Speech Synthesis portal has an uncluttered web interface. Don’t let the simplicity deceive you though: 11labs has a number of features unmatched by the large cloud-provided generative services.

One thing that I see making 11labs a popular choice for text-to-speech is its VoiceLab, which gives users the ability to design custom voices by tweaking parameters for age, gender, accent, and accent strength.

Paid-plan users also have the option to “clone” an outside voice based on sample audio. If users upload an audio file of the person whose speech they want cloned, the Speech Synthesis portal will then offer a voice based on that sample for speaking the input text.
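The Python SDK exposes the same capability. A minimal sketch, assuming a paid plan, an API key, and a local recording of the voice to be cloned:

from elevenlabs import clone, generate, set_api_key

set_api_key("your-api-key")  # placeholder: a paid-plan API key

# Create a cloned voice from one or more sample recordings
voice = clone(
    name="My cloned voice",
    files=["voice_sample.mp3"],  # placeholder: your own recording
)

audio = generate(text="This should sound like the sample.", voice=voice)

with open("cloned.mp3", "wb") as out:
    out.write(audio)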

Azure Cognitive Services Speech

https://speech.microsoft.com/portal

Many people I’ve talked to prefer Microsoft Azure for neural Text-to-Speech. It requires a little more setup to get started, since Azure requires you to create a number of “resources” on the platform prior to allowing access to the audio generation screen.

Azure AI has a web interface that allows you to generate speech from text, one voice at a time, in the browser. I especially like that certain voices (though not all of them) let you modulate the mood of the spoken audio, so that you can make “Jenny” sound angry, cheerful, whispering, or shouting, among other tones of voice.

As of this writing, Azure offers twenty-eight different voices to choose from. Eight of those have customizable moods, called “speaking styles.”
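Programmatically, these speaking styles are selected with SSML through the mstts:express-as element. Here is a sketch, with placeholder credentials, that renders a line in Jenny’s cheerful style:

import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials: substitute your own key and region
speech_config = speechsdk.SpeechConfig(subscription="YourAzureSubscriptionKey",
                                       region="YourAzureServiceRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Wrap the text in SSML to pick a styled voice and a speaking style
ssml = """
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>
  <voice name='en-US-JennyNeural'>
    <mstts:express-as style='cheerful'>Hello, world!</mstts:express-as>
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()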

In terms of pricing, Microsoft Azure has a free-trial-type pricing for the first 12 months after your initial sign-up, much like AWS does. After that, standard pricing on Azure’s pay-as-you-go payment method costs $1 USD per month for every 1000 text records submitted for TTS.

Below I’ve written some example Python code to generate a short audio clip using Azure Cognitive Services:

from azure.cognitiveservices.speech import (
    SpeechConfig, SpeechSynthesizer, ResultReason, CancellationReason,
    SpeechSynthesisOutputFormat,
)
from azure.cognitiveservices.speech.audio import AudioOutputConfig

# Setup speech configuration
# Replace with your own subscription key and service region
speech_config = SpeechConfig(subscription="YourAzureSubscriptionKey", region="YourAzureServiceRegion")

# Ask for MP3 output (the default output format is WAV)
speech_config.set_speech_synthesis_output_format(
    SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3)

# Setup audio output configuration
audio_output = AudioOutputConfig(filename="hello_world.mp3")

# Create a speech synthesizer that writes to that file
speech_synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=audio_output)

# Synthesize the text
text = "Hello, world!"
result = speech_synthesizer.speak_text_async(text).get()

# Check result
if result.reason == ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized for text [{}]".format(text))
elif result.reason == ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech synthesis canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == CancellationReason.Error:
        if cancellation_details.error_details:
            print("Error details: {}".format(cancellation_details.error_details))
        print("Did you update the subscription info?")

Be aware that, as with all the cloud-provider-specific code snippets here, you will need to perform a number of setup tasks through their web interface, such as establishing a payment method, before any code will work.

Conclusion

Hopefully this overview has given you some idea of the options available to you for realistic text-to-speech generation. All of the platforms discussed use some form of neural-network training to produce human-sounding speech. In fact, there are even newer research breakthroughs in the field of neural speech synthesis which have yet to be implemented in publicly-available working demos.

If you have some text, and you need to generate audio for a project requiring plausible-sounding speech, then hopefully one of these platforms can meet your needs. While some of them are more time-consuming to get up and running than others, none of them require any coding ability. I’ve provided Python implementations for each as a bootstrap for those who prefer that route. All of these neural TTS providers have fully functional web interfaces that allow users to type the text they want spoken and download an audio file of the speech.
