Five things we learned while building our new text-to-speech service

Published in NZZ Open, Apr 17, 2019

Would you rather read a German version? This way.

In March 2019, Neue Zürcher Zeitung launched its new text-to-speech service publicly. The audio player is an improved version of the beta player released in October of last year. Here are some things we learned during the process.

1. Google Wavenet is super clever. But when it comes to Swiss German, that’s just not enough

Google Wavenet, the text-to-speech engine we used to generate audio files, is a talented little thing when it comes to languages. So far, it speaks nine languages with a quality that sounds much more natural than other systems. It uses a neural network that has been trained with a lot of speech samples. This allows it to create audio waveforms from scratch that follow the tone sequence and structure of the samples. This works wonderfully for languages the engine already knows and is attuned to. When it comes to Swiss German words, however: not so wonderful.

With Neue Zürcher Zeitung being a Swiss newspaper and some words or names being derived from Swiss German (some strange dialect stuff) or French, our audio service tends to stumble over them. We can’t blame it: we’re the ones who told it to interpret everything in High German, after all.

We have taken these cases (which, incidentally, also affect words from other foreign languages) into account in our solution and equipped a middleware with a lexicon through which all words flow before they are converted into audio. For example, if a “Cervelat” (a Swiss sausage) is referenced in an article, it is converted to “Servellah” in our middleware, which we lovingly call “the Orator”.
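To make the idea concrete, here is a minimal sketch of such a lexicon pass in Python. It is not the Orator’s actual code: the function names are made up for illustration, and the only entry taken from this article is “Cervelat”. It assumes that entries are matched as whole words and replaced with a phonetic respelling before the text is sent to the speech engine.

```python
import re

# Illustrative lexicon. Only the "Cervelat" entry comes from the article;
# the real lexicon and its storage format are not shown here.
LEXICON = {"Cervelat": "Servellah"}


def build_pattern(lexicon):
    """Compile one regex that matches any lexicon entry as a whole word."""
    # Longest entries first, so multi-word entries win over their prefixes.
    words = sorted(lexicon, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in words)
    return re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)")


def apply_lexicon(text, lexicon=LEXICON):
    """Replace whole-word lexicon matches with their phonetic respelling."""
    if not lexicon:
        return text
    pattern = build_pattern(lexicon)
    return pattern.sub(lambda match: lexicon[match.group(1)], text)


print(apply_lexicon("Der Cervelat ist eine Schweizer Wurst."))
# prints: Der Servellah ist eine Schweizer Wurst.
```

Because the match is whole-word only, inflected forms such as a plural would need their own entries in this sketch.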

By now, our lexicon contains almost 12,000 entries: our editors’ author abbreviations, the correct pronunciation of a Vietnamese restaurant in Zurich (Co Tschin Tschin), the title of a Black Mirror episode (Ark Äyntschl), you name it.

Of course, we can probably never cover all the exceptions by ourselves — this is why we will introduce a feedback feature for our users in the future.

2. Make your architecture mix-and-match-friendly or don’t make an architecture at all

With that constant wind of change blowing in your face, there are a few things you can do to stay warm: Carry a backpack full of spare clothes. Mix and match. Layer, if needed. The same thing applies to text-to-speech services: In a changing industry with changing tools, needs and products, we needed to build a service that could easily be adapted to changing circumstances. So we opted for a structure whose individual parts can be swapped out.

An illustration of our structure by Niklaus Gerber

This way, we were able to move our service from Amazon Polly to Google Wavenet at short notice. The result is a service that has improved by quantum leaps. As Google Wavenet learns on its own, we expect the service to improve quickly. Another major advantage: We can roll out the audio feature in our other products such as the finance vertical “The Market” or our Sunday publication “NZZ am Sonntag” — all we need to do is attach their CMS endpoints.
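One way to picture this kind of mix-and-match setup is a service that only knows an interface, not a concrete engine or CMS. The Python sketch below is an assumption about the general pattern, not our actual code: `SpeechEngine`, `AudioService`, `fake_cms` and the two stub engines are hypothetical names, and the stubs make no real calls to Amazon Polly or Google Wavenet.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class SpeechEngine(Protocol):
    """Anything that can turn text into audio bytes."""

    def synthesize(self, text: str) -> bytes: ...


class StubPolly:
    """Stand-in for an Amazon Polly adapter; makes no real API call."""

    def synthesize(self, text: str) -> bytes:
        return b"polly:" + text.encode("utf-8")


class StubWavenet:
    """Stand-in for a Google Wavenet adapter; makes no real API call."""

    def synthesize(self, text: str) -> bytes:
        return b"wavenet:" + text.encode("utf-8")


@dataclass
class AudioService:
    # Hypothetical wrapper around a product's CMS endpoint.
    fetch_article: Callable[[str], str]
    # Any object with a matching synthesize() method fits here.
    engine: SpeechEngine

    def render(self, article_id: str) -> bytes:
        """Fetch the article text and hand it to the configured engine."""
        return self.engine.synthesize(self.fetch_article(article_id))


def fake_cms(article_id: str) -> str:
    """Pretend CMS endpoint that returns article text for an ID."""
    return f"Text of article {article_id}"


service = AudioService(fetch_article=fake_cms, engine=StubPolly())
audio_before = service.render("42")

# Switching engines is a one-line change; the rest of the service is untouched.
service.engine = StubWavenet()
audio_after = service.render("42")
```

In the same spirit, attaching another product would mean passing a different `fetch_article` function rather than rewriting the service.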

3. Some people love audio, others just don’t

There are people who love audio. They listen to their morning news through a podcast or on the radio. They don’t read books. And they can’t wait to finally get to listen to their newspaper.

And then, there are people who actually still really like to read their news. They really don’t need to be bothered with this audio business. They just. don’t. need. it.

We asked both user types to evaluate different text-to-speech engines and tested a section of text read by an actual human for comparison. The results weren’t really surprising regarding the preferred voice: Both groups rated the natural human voice the highest.

But the really exciting insight for us came when we listened to the reasoning behind their answers: People who already used a lot of audio were not really bothered by lower quality; they just wanted to be able to listen.

On the other hand, people who didn’t really use audio much up to this point said they probably wouldn’t use the service even if the voice were more natural.

Thus, our conclusion: Either you like and use audio, or you don’t. The quality of the voice doesn’t seem to have a relevant influence on usage.

4. How to make a written piece pleasing to listen to

You can imagine the elements coming out of our CMS like the pieces of one of those pre-fabricated houses which can be put together in lots of different ways.

An article that is published in its text form on our website is one house layout, and the one in MP3 format is a different house layout: Both have the same elements to pick from, but they are laid out in a different way.

For the audio version of the article, we put all the elements back into the article construction kit and took a fresh look at how our users would like an article read to them. Does it make sense to have the article read out in exactly the same structure a screen reader would follow, for example? Or would it be better to define a new order or to omit certain elements? We opted for the latter and defined audio templates, which make our solution unique.

To implement our templates, we took example articles and transcribed them into SSML.

In the templates, we determined which elements on the page should be read in which order, which ones should be omitted and which should be read at a different volume. So: read the headline first (just a tad louder), go on with the lead, pause for a bit, then read the author byline. If there’s a new chapter title, read that one just a tad louder again. The result was an article structure created especially for the audio experience.
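As a rough illustration of what such a template could look like, the Python sketch below assembles SSML in the order described above. It is a sketch under assumptions, not our real template: the function name, the `+2dB` volume boost and the 700 ms pause are invented values, while `<speak>`, `<p>`, `<prosody volume>` and `<break time>` are standard SSML elements.

```python
from xml.sax.saxutils import escape

# Assumed values for illustration; the real templates may use others.
LOUDER = "+2dB"
PAUSE = "700ms"


def louder(text):
    """Wrap text in a paragraph that is read slightly louder."""
    return f'<p><prosody volume="{LOUDER}">{escape(text)}</prosody></p>'


def paragraph(text):
    """Wrap text in a plain paragraph, escaping XML special characters."""
    return f"<p>{escape(text)}</p>"


def article_to_ssml(headline, lead, byline, sections):
    """Build SSML: headline (louder), lead, pause, byline, then sections.

    `sections` is a list of (chapter_title, body) pairs; a title of None
    means the body has no chapter title of its own.
    """
    parts = ["<speak>", louder(headline), paragraph(lead)]
    parts.append(f'<break time="{PAUSE}"/>')
    parts.append(paragraph(byline))
    for title, body in sections:
        if title:
            parts.append(louder(title))
        parts.append(paragraph(body))
    parts.append("</speak>")
    return "\n".join(parts)


print(article_to_ssml(
    "Five things we learned",
    "A short lead.",
    "By the audio team",
    [("First chapter", "Body text of the first chapter.")],
))
```

Elements that should be omitted from the audio version are simply never passed to the function, which keeps the template itself small.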

5. Many different player experiences might make your life a little difficult

One of many versions: The player in the app

When we set out to implement audio, we knew that we wanted to introduce it across all of our products: On desktop, in the app, and on tablets.

We realized relatively early on that we had to design and develop many, many different player variants: On the web version of Neue Zürcher Zeitung, for example, background playing was not possible with the current technology stack, so we left out the previous and next buttons. In turn, the app needed a minimized player with an integrated image to provide orientation. And then, of course, there were landscape versions that had more space to accommodate different sets of icons.

Sometimes we find ourselves dizzily wondering which version we are currently looking at and what the finely tuned little differences between all of them are, but then we remember: We’re creating the most pleasant experience for our users across our different products, and that’s something we’re pretty darn proud of.

The audio player experience on the digital NZZ (Web, Desktop)

You can download our app here.