Increasing the quality of text-to-speech audio

Nowadays, text-to-speech engines like Google WaveNet work impressively well, but they’re not without flaws. Here is what we do to make the automatic readout of ZEIT ONLINE’s articles sound more human-like.

Maria Gebhardt
Dec 2, 2021

Imagine the sentence

Leonhard Fuchs was seen onboard a Boeing 737 by DER Touristik on his way to attend the 25. Christmas dinner in the city of Merseburg, Germany. 

being read to you like this:

Original audio file of text to speech conversion via WaveNet

Would you comprehend the meaning? Probably, but it wouldn’t make for a natural listening experience.

Over the last couple of years, we’ve been working closely with ZEIT ONLINE to make their online content more accessible by building an audio infrastructure. It began with developing an application for smart assistants like Amazon Alexa and Google Assistant, which always provides the five most recent news items on ZEIT ONLINE as an audio read-out. Afterward, we set the ambitious goal of going beyond news, speech-synthesizing almost every article on their website into audio. So we developed a web application, codenamed Speechbert, that facilitates the creation of text-to-speech audio versions of articles. ZEIT ONLINE editors use Speechbert to manage, maintain and also improve the audio content generated alongside their articles.

The text-to-speech engine we use to create the audio files is Google’s WaveNet. We chose it because we judged it to be ahead of the competing engines from Amazon and IBM. Nonetheless, we were not happy with the general quality of the generated audio: these synthetic speech engines read plain text without taking context into account. That is fine for a short 45-second news item, but if you want to listen to a longer article, essay, commentary, or interview, good intonation is vital to actually comprehend what you’re listening to. So to provide a good listening experience, we needed to improve the quality of the WaveNet audio files.

For each article, Speechbert generates SSML (Speech Synthesis Markup Language). In other words, every text is converted from plain text into a markup language designed for speech synthesis, which gives instructions to WaveNet’s algorithms. By refining and enhancing the article’s SSML within Speechbert before it is sent to WaveNet, and by optimizing the audio files afterward, we can improve the quality significantly.
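A minimal sketch of the plain-text-to-SSML step, not Speechbert’s actual code. The function name `text_to_ssml` and the regex-based sentence split are assumptions made for illustration only; `<speak>`, `<p>` and `<s>` are standard SSML elements.

```python
import re
from xml.sax.saxutils import escape


def text_to_ssml(article_text: str) -> str:
    """Wrap plain text in <speak>, one <p> per paragraph, one <s> per sentence."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", article_text) if p.strip()]
    parts = ["<speak>"]
    for paragraph in paragraphs:
        # Naive stand-in for real language analysis:
        # split after ., ! or ? when followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        # escape() turns &, < and > into entities so the markup stays valid.
        body = "".join(f"<s>{escape(s)}</s>" for s in sentences if s)
        parts.append(f"<p>{body}</p>")
    parts.append("</speak>")
    return "".join(parts)


if __name__ == "__main__":
    print(text_to_ssml("Hello world. How are you?\n\nFish & chips!"))
```

Note that this naive split would wrongly end a sentence after “25.” in the sample sentence above, which is exactly the kind of case where a simple rule is not enough.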

What improvements have we introduced?

  1. Structuring the article: As a human reader, you automatically pause between paragraphs and sentences and respond to different punctuation marks. A text-to-speech engine, however, generally does not know where a sentence begins and ends. To get a more natural intonation, these boundaries have to be made explicit for the engine through SSML. That’s why we run the text through another Google service, the “Natural Language API”: you send a text and get it back in analyzed form (sentence beginning and end, predicate, object, mentioned persons, and so on). By providing WaveNet with the articles in this analyzed form, we can adjust, for example, where to pause and for how long; pauses between sentences get a different length than pauses between paragraphs. Additionally, WaveNet has a character limit per request, so we use our paragraph markers to slice the article into pieces, feed them to WaveNet individually, and stitch the resulting audio files back together using FFmpeg.
  2. Fixing mispronunciation: WaveNet has a default way of vocalizing and emphasizing specific words, regardless of the context or the language the word originates from (this is especially noticeable with names). So we built a Replacement Library that allows, among other things, manually defined mispronounced words to be replaced with a spelling that produces the correct pronunciation. It also tells the text-to-speech service when to pronounce acronyms as individual letters, or as a mix of letters and words, instead of as a single word.
  3. Filtering out invalid or hard-to-pronounce characters.
  4. Increasing the reading speed and pitch: Speeding up the readout and, in some cases, raising the pitch can make the speech sound more human-like.
  5. Creating higher-quality MP3s: WaveNet offers either heavily compressed audio files (32 kbps) or uncompressed 16-bit versions; serving the latter would send bandwidth and storage costs skyrocketing when handling numerous audio files. So instead of serving the heavily compressed files, Speechbert converts the uncompressed audio into a higher-quality MP3 (64 kbps) using FFmpeg, an open-source tool for processing audio and video files.
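The slicing and stitching described in points 1 and 5 could be sketched roughly as follows. This is an illustration under assumptions, not Speechbert’s implementation: the helper names, the `max_chars` value, and the file names are hypothetical, and the FFmpeg invocation assumes a build that includes the `libmp3lame` encoder.

```python
from typing import List


def chunk_paragraphs(paragraphs: List[str], max_chars: int) -> List[List[str]]:
    """Greedily group whole paragraphs so each group stays within max_chars.

    A single paragraph longer than max_chars still ends up alone in its own
    group; a real pipeline would have to split it further (e.g. by sentence).
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    size = 0
    for paragraph in paragraphs:
        if current and size + len(paragraph) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(paragraph)
        size += len(paragraph)
    if current:
        chunks.append(current)
    return chunks


def concat_list(audio_paths: List[str]) -> str:
    """Contents of the list file read by FFmpeg's concat demuxer."""
    return "".join(f"file '{path}'\n" for path in audio_paths)


def ffmpeg_concat_command(list_file: str, output_mp3: str, bitrate: str = "64k") -> List[str]:
    """Build the argument list that joins the parts and encodes one MP3."""
    return [
        "ffmpeg",
        "-f", "concat", "-safe", "0",  # read the inputs from the list file
        "-i", list_file,
        "-c:a", "libmp3lame",          # MP3 encoder
        "-b:a", bitrate,               # target audio bitrate
        output_mp3,
    ]


if __name__ == "__main__":
    print(chunk_paragraphs(["First paragraph.", "Second one.", "Third."], max_chars=30))
    print(concat_list(["part-0.wav", "part-1.wav"]))
    print(" ".join(ffmpeg_concat_command("parts.txt", "article.mp3")))
```

The command is returned as a list so it can be passed to `subprocess.run` without shell quoting issues. Grouping by whole paragraphs keeps each request's intonation intact, at the cost of not filling every request to the limit.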

So how would some of these improvements impact the sentence we heard at the beginning of this post? Let’s listen to it again:

Improved WaveNet audio file of the sample sentence

For those interested, the sentence that would go to WaveNet would now look somewhat like this:

<speak>
<s><sub alias="Laionaard">Leonhard</sub> <sub alias="Fooks">Fuchs</sub> was seen onboard a Boeing <say-as interpret-as="characters">737</say-as> by <say-as interpret-as="characters">DER</say-as> Touristik on his way to attend the <say-as interpret-as="cardinal">25.</say-as> Christmas dinner in the city of <sub alias="Mairseboorg">Merseburg</sub>, Germany.</s>
</speak>

Doesn’t that look quite confusing and technical? That’s why we equipped Speechbert with a user interface that (among other things) allows ZEIT ONLINE editors to easily manage and maintain the Replacement Library without specific technical knowledge.

Figure: The user interface of Speechbert’s Replacement Library, showing sample replacements and how to add new ones.
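Behind such an interface, a replacement library can be as simple as a lookup table that is applied to the text before it is sent to the engine. The sketch below is a hypothetical illustration, not Speechbert’s code: the table format and the function name `apply_replacements` are assumptions, while the aliases are taken from the sample sentence in this post.

```python
import re
from typing import Dict, Optional, Tuple
from xml.sax.saxutils import escape, quoteattr

# word -> (kind, alias). "sub" swaps in a phonetic spelling,
# "characters" asks the engine to spell the token out.
REPLACEMENTS: Dict[str, Tuple[str, Optional[str]]] = {
    "Leonhard": ("sub", "Laionaard"),
    "Fuchs": ("sub", "Fooks"),
    "Merseburg": ("sub", "Mairseboorg"),
    "737": ("characters", None),
    "DER": ("characters", None),
}


def apply_replacements(text: str, replacements: Dict[str, Tuple[str, Optional[str]]]) -> str:
    """Wrap every whole-word, case-sensitive match in the matching SSML tag."""
    if not replacements:
        return escape(text)
    # Longest entries first so a longer word wins over a shorter prefix.
    words = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    out, pos = [], 0
    for match in pattern.finditer(text):
        out.append(escape(text[pos:match.start()]))  # untouched text in between
        word = match.group(1)
        kind, alias = replacements[word]
        if kind == "sub":
            out.append(f"<sub alias={quoteattr(alias)}>{escape(word)}</sub>")
        else:
            out.append(f'<say-as interpret-as="{kind}">{escape(word)}</say-as>')
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(apply_replacements("Leonhard Fuchs flew on a Boeing 737 by DER Touristik.", REPLACEMENTS))
```

Matching is case-sensitive on purpose, so that the acronym “DER” is spelled out while the German article “der” is left alone.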

We’ve been receiving a lot of feedback since ZEIT ONLINE launched their audio offering. There is still a long way to go before text-to-speech read-outs equal the sound of human-read audio, but we’re humbled by the positive feedback we receive and are excited to be contributing to the accessibility and quality of online news media.

Are you a publisher or newspaper interested in this technology? We recently built a more generic, stand-alone version of the described audio pipeline that can be easily integrated with almost any CMS. Get in touch with us or ZEIT ONLINE; we’re looking forward to your inquiries.

As a non-sighted person, I have been using computers with screen readers since 1987. With all this experience in mind, I must extend you a compliment: the quality of your speech output is very good.

I just listened to an article about AI in the Digital section and above all I’m impressed with the synthesised voice. I am blind and usually use my own screen reader voice generator. The article readout feature [on ZON] has the advantage of providing just the article text itself.

To put it mildly, I find this automatic readout breathtaking! It’s so human-sounding that I thought it was human-read at first. I really like listening to audio articles and would be delighted if this feature could be expanded upon.

The ‘Article readout’ feature is great and I will definitely continue to use it. The quality is very good and when I was listening I couldn’t be sure whether it was a real or artificial voice.

diesdas.digital is a studio for strategy, design, and code in Berlin, featuring a multidisciplinary team of designers, developers, and strategists. We create tailor-made digital solutions with an agile mindset and a smile on our faces. Let’s work together!

