Scaling Audio Service: How we launched a high-quality Text-To-Speech service at “Neue Zürcher Zeitung”
Today at Neue Zürcher Zeitung, a Swiss publisher of high-quality journalism for a German-speaking audience, we successfully launched a Text-To-Speech (TTS) audio player feature as a beta version in our web and mobile channels. It is a service for our on-the-go and audiophile users that makes consuming news in an alternative way simple and convenient.
How we defined the need for a Text-To-Speech service
With voice technologies like Amazon Alexa and Google Home emerging and becoming ubiquitous in our lives, media companies have spent the last few years shifting much of their effort towards delivering news as audio.
Our users have also been voicing their wish for more audio content. Classical podcast formats are no longer enough: users want to have the choice between reading and listening to an article. We started wondering how many articles should be transformed into audio. Should we have professional, native speakers cover a certain number of articles a day, as seen, for example, in “The Economist”? Or do our users have a greater need to access all of our articles as audio?
In user interviews, we found that our users wanted to choose freely which article to listen to. Thus, we decided to offer all published articles as audio. Given the large volume of text (NZZ publishes up to 200 articles per day on its digital entities), recording with professional, native speakers was not an option. The search for an automated and scalable conversion of text to audio began.
How we defined the business case
NZZ’s strategy is to stay profitable in the long term through digital subscriptions. With this goal in mind, NZZ’s digital product relaunched in November 2017. We introduced a two-stage funnel as a driver for conversion. In a first step, a user has to register to be able to access features such as our bookmark function or personalization service. In a second step, after users have consumed a certain amount of content, users are requested to sign up for a paid subscription.
By placing our new audio functionality in the first step of the funnel, we require users to register before they can use the new service. Situating the audio service behind the mandatory registration not only allowed us to integrate it seamlessly into our business model, but also makes it another trigger to drive conversion.
Mature design patterns in the audio domain
With state-of-the-art audio players such as those of Spotify, Apple Music, Acast or Sonos already heavily represented in our everyday life as apps and services, we decided that we didn’t have to re-invent the design of an audio player and that doing so would mostly confuse our users. We therefore drew on existing players for many of our design decisions. For new functionality, such as changing the playback speed, we designed iteratively, co-created with our users and tested different versions until we arrived at the design we have today. We will continue to iterate over the next months, as the use case is still new to the industry.
How we built the service with its unique structure
When it came to the actual conversion from text to speech, we looked into several TTS services, including IBM’s Watson, Amazon Polly, and Google WaveNet (just to name a few). To start, we worked with Amazon Polly.
Knowing that any TTS service on the market would develop rapidly over the coming months, we had to build an architecture flexible enough to absorb change (e.g. replacing the TTS engine). This need for flexibility is what led us to our unique structure: the text first runs through our self-built middleware, which we call Orator. There, abbreviations like “z. B.” are replaced with “zum Beispiel” (German for “for example”), and author shorthands like “boa.” are expanded to “Boas Ruh” (one of NZZ’s editors). The text is then transformed into SSML and sent through the TTS engine, which generates an MP3.
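To make the two middleware steps more concrete, here is a minimal Python sketch of what such a pre-processing stage could look like. It is not Orator's actual code: the lexicon structure, the function names `expand_abbreviations` and `to_ssml`, and the choice to wrap each paragraph in a `<p>` element are all assumptions made for illustration. Only the two lexicon entries come from the examples above.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical lexicon; the two entries mirror the examples in the article.
LEXICON = {
    "z. B.": "zum Beispiel",
    "boa.": "Boas Ruh",
}


def expand_abbreviations(text, lexicon=LEXICON):
    """Replace every known abbreviation with its spoken form."""
    # Try longer keys first so a long abbreviation wins over a shorter one.
    keys = sorted(lexicon, key=len, reverse=True)
    # (?<!\w) keeps us from matching in the middle of a word.
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(key) for key in keys) + r")"
    )
    return pattern.sub(lambda match: lexicon[match.group(1)], text)


def to_ssml(text):
    """Wrap plain text in a minimal SSML document, one <p> per paragraph."""
    paragraphs = [part.strip() for part in text.split("\n\n") if part.strip()]
    # Escape &, < and > so the article text cannot break the markup.
    body = "".join(f"<p>{escape(part)}</p>" for part in paragraphs)
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    raw = "Das gilt z. B. für Audio.\n\n(boa.)"
    print(to_ssml(expand_abbreviations(raw)))
```

Keeping the lexicon as plain data, separate from the SSML step, is one way to let editors extend the replacements without touching the code that talks to the TTS engine.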
SSML, the Speech Synthesis Markup Language, gives us a standardized method for controlling different aspects of speech synthesis output. For example, with SSML, one can alter prosody attributes such as rate, pitch, and volume, insert pauses of any length, change the speaking voice while reading, and control many other aspects of how the text is read by the synthetic voice. The great thing about SSML is that essentially the same input can be fed to any TTS engine: whether it’s Amazon Polly or Google WaveNet, they all follow the same commands, with some small exceptions. And because the output of our middleware lexica feeds into the SSML, any component of our structure can be replaced.
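The following Python sketch illustrates why SSML makes the engine replaceable. The `<prosody>` and `<break>` elements are standard SSML; everything else (the `TtsEngine` interface, the `FakeEngine` stand-in, and the helper names) is hypothetical and only meant to show the shape of such a design. A real adapter would call the vendor's API inside `synthesize` and return MP3 bytes.

```python
from typing import Protocol
from xml.sax.saxutils import escape, quoteattr


def with_prosody(text, rate="medium", pitch="medium"):
    """Wrap text in an SSML <prosody> element controlling rate and pitch."""
    return (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody>"
    )


def pause(milliseconds):
    """Return an SSML <break> element of the given length."""
    return f'<break time="{int(milliseconds)}ms"/>'


class TtsEngine(Protocol):
    """Hypothetical interface: anything that turns SSML into audio bytes."""

    def synthesize(self, ssml: str) -> bytes: ...


class FakeEngine:
    """Stand-in for a real engine adapter; returns tagged bytes, not MP3."""

    def __init__(self, name):
        self.name = name

    def synthesize(self, ssml):
        return f"{self.name}:{ssml}".encode("utf-8")


def article_to_audio(title, body, engine):
    """Build one SSML document and hand it to whichever engine is plugged in."""
    ssml = (
        "<speak>"
        + with_prosody(title, rate="slow")  # read the headline more slowly
        + pause(500)                        # short pause before the body
        + escape(body)
        + "</speak>"
    )
    return engine.synthesize(ssml)


if __name__ == "__main__":
    # Swapping engines changes one argument; the SSML stays the same.
    for engine in (FakeEngine("engine-a"), FakeEngine("engine-b")):
        print(article_to_audio("Titel", "Text des Artikels.", engine))
```

Because the SSML is built before the engine is chosen, the small vendor-specific exceptions can live inside each adapter rather than in the middleware.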
Just a few weeks before the beta launch of our audio player functionality, DeepMind’s WaveNet released its TTS service for the German language. Thanks to the flexible architecture, we were able to switch engines at short notice. The result was a markedly better service. We expect this rate of continuous improvement to accelerate further and the naturalness of the voice to increase substantially over the coming months.
How to activate the service during beta
As a registered user, you will need to activate the service while it is in beta. To do so on mobile (both iOS and Android), open the settings menu in the burger navigation, navigate to “Beta-Functionalities” and toggle the feature “listening to articles”. On the web, navigate to the menu and add the functionality in the settings (“Einstellungen”).
Acknowledgements: This implementation, from the idea to the beta prototype, was possible thanks to the outstanding performance of the product development team (Niklaus Gerber, Luisa Bider) and the technology department (Vovka Fertak and Notarmon Buchli). Thank you very much for your enthusiasm and energy over the last months.