How to create audiobooks and podcasts with Text-to-Speech

Laura Baiardi
Published in Women in Voice
10 min read · Feb 2, 2021

With the growing popularity of Voice Assistants and Voicebots, the public is probably ready to welcome Text-to-Speech (TTS) as a “narrative medium”, and one can bet that the audiobooks made with this technology will become increasingly popular.

Although speech synthesis has disadvantages compared to a natural voice, it is becoming increasingly difficult for listeners to tell a synthetic voice from a human one. With the new neural voices (NTTS), the gap has narrowed further, and listening to a story narrated by a synthetic voice is an increasingly enjoyable experience.

In December 2020, Google released 8 fiction and 12 non-fiction audiobooks made with Text-to-Speech, distributed through Google Play Public Domain. The texts available for free on Google Play Books include several classics, such as Daniel Defoe’s The Life and Adventures of Robinson Crusoe, Bram Stoker’s Dracula, Mary Shelley’s Frankenstein, and Charles Darwin’s On the Origin of Species. The service is currently in beta. Google Cloud’s service for creating audiobooks via Text-to-Speech will be available to publishers only. The official release is expected in 2021.

Using Text-to-Speech to edit written content

Using Text-to-Speech for fiction

I started taking an interest in Text-to-Speech in 2018. At that time, I was writing a Western story in Italian. I initially used Text-to-Speech to do a preliminary proofreading of my novel. With a voice player reading the text aloud, it was easier and faster to detect typos, misprints, repetitions and inconsistencies. The more confident I became with this new medium, the more I realized that its potential went beyond what I had imagined. I realized that, by using several female and male synthetic voices, I could have the dialogues recited. This helped me build fluid interactions between the characters in my story. Text-to-Speech has thus proved to be a source of inspiration and a great ally for writing fiction.

With Text-to-Speech, it is relatively easy to create audio files free of background noise and audiobooks narrated with almost perfect diction. Using this voice technology, I published a pilot episode based on my story, and I am currently working on a podcast series of my Western novel.

Benefits of Cloud services for Text-to-Speech

4 Cloud Services for Text-to-Speech

If you would also like to create a podcast or audiobook with speech synthesis, you can use several Cloud services, including Google Cloud, Amazon Polly, IBM Watson and Microsoft Azure. These are some of the main platforms that provide Text-to-Speech services. They are very affordable, so let’s look at what they offer in detail.

  1. IBM Watson: IBM’s Cloud platform has three subscription plans. The Lite plan is free for life, while the Standard and Premium plans have a fee. There are 11 supported languages and 27 voices in total, most of which also have a neural version that sounds more realistic. The synthetic voices offered by IBM Watson are, in my opinion, very close to the human voice. An interesting detail is the ratio between female and male voices on the different Cloud platforms. On IBM Watson, 59% of voices are female and 41% are male.
  2. Amazon Polly: Amazon Web Services’ Text-to-Speech service is free for the first year, after which the cost depends on the number of characters processed. There are 29 languages available, for a total of 64 voices. In some languages, including US English, the service supports neural voices. Interesting options are available, such as a bilingual voice (Hindi/English) and the “conversational” and “newscaster” styles. There are also 3 children’s voices. On Amazon Polly, 67% of voices are female and 33% are male.
  3. Google Cloud: Google’s Cloud service offers a 3-month free trial with a bonus credit. There are 42 supported languages, for a total of 140 voices. The synthetic voices come in two types: standard TTS and WaveNet. On the Google Cloud platform, 56% of voices are female and 44% are male.
  4. Microsoft Azure: Microsoft’s Cloud platform offers 12 months of free usage and a bonus credit; after a year, you move to a paid plan only if you activate the upgrade. Microsoft Azure supports 54 languages and variants and 206 voices in total, including 129 neural voices. For some voices, including US English ones, different speaking styles are available: “newscast”, “assistant”, “chat”, and “customer service”. On Microsoft Azure, 56% of voices are female and 44% are male.

To begin my experiments, I mainly used Amazon Polly, which is an Amazon Web Service. After the year of free use, I set a monthly limit with AWS Budgets to keep track of the costs of the service. I receive an immediate email alert when the limit is reached. Amazon Polly’s console can process up to 3,000 characters at a time, while saving the output to Amazon S3, an object storage service, lets you process up to 100,000 characters at a time. Amazon Polly’s console is very intuitive and its features are within the reach of those with no programming knowledge. The only language you will need to learn is Speech Synthesis Markup Language (SSML), an XML-based markup language that allows you to modify many important features to make the most of speech synthesis.
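If your chapter is longer than the console limit mentioned above, you can cut it into pieces before pasting it in. The following is only a minimal sketch in Python: the function name `split_text` is my own, and it counts every character (which is stricter than necessary if the service ignores SSML tags when counting), breaking at sentence ends wherever possible.

```python
import re


def split_text(text: str, limit: int = 3000) -> list[str]:
    """Split text into chunks of at most `limit` characters, breaking at sentence ends."""
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Last resort: a single sentence longer than the limit is cut hard.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


chapter = "The sun was setting over the prairie. " * 200
for number, chunk in enumerate(split_text(chapter), start=1):
    print(number, len(chunk))
```

Each chunk can then be processed on its own and the resulting audio files joined in order.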

Choosing a narrating voice and creating the characters with Amazon Polly

If you are working on a serial story or an audiobook, you will need to select a narrating voice, which will be the main voice of your product, the one that will give personality to the whole story. You should choose this voice with great care.

In addition, it is good practice to give a vocal characterization to the actors of your story. Will the character be young, adult or elderly? Will your character be shy and introverted, exuberant or authoritarian? The voice can say a lot about the personality of a character and can help give life to a story. Let your imagination run wild and create your characters by analyzing their psychology.

Using SSML tags, you can get different tones and shades from a synthetic voice. You can also create multiple characters from a single voice by adjusting its pitch and timbre. Amazon Web Services provides a guide that lists all the SSML tags available for the service. Through SSML tags, you can shape many aspects of the narration.

With the <phoneme> tag, you can also have foreign words pronounced as in their original language, to give life to a character of a foreign nationality. Amazon Polly, in fact, reads phonemes written in the International Phonetic Alphabet (IPA) according to the language of the selected voice. For example, if you use a French voice to read in another language, its pronunciation will have a strong French accent. You can also use the <phoneme> tag to change the pronunciation of individual words and get different inflections.
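As an illustration of the <phoneme> tag, here is a small Python sketch that builds the markup for you. The helper name `phoneme` is hypothetical; the tag itself takes an `alphabet` attribute (here `ipa`) and a `ph` attribute holding the pronunciation. The word "pecan" is just a sample.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(word: str, ipa: str) -> str:
    """Wrap a word in a <phoneme> tag carrying its IPA pronunciation."""
    # quoteattr adds the surrounding quotes and escapes the attribute value.
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'


ssml = "<speak>I would like some " + phoneme("pecan", "pɪˈkɑːn") + " pie.</speak>"
print(ssml)
```

The printed string can be pasted into the console with the SSML option selected.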

Once you have created a range of voices that describe the character, age, and physiognomy of your characters, you can start producing speech from your text.

Optimizing written text with SSML

If you enter plain text into the Amazon Polly console, you will find that the diction of the spoken output is too mechanical and unnatural. Amazon Polly automatically reads commas and periods as default breaks. Natural speech, however, does not necessarily follow these boundaries: it produces shorter or longer breaks depending on emphasis and breathing.

Since the narrating voice is the one that utters the most sentences, you will need to keep it from falling into a sing-song cadence, so it is important to give the speech a rhythm by manually inserting the breaks. You can shape the text entered in the console using the <break> tag, creating pauses that range from a few milliseconds to a few seconds. You can also emphasize some words with the <emphasis> tag, or vary the volume of the voice to accentuate the importance of an exclamation. Similarly, you can obtain whispered words or phrases. My advice, however, is never to start a text with these effects and, in general, to use them sparingly so that they do not lose their effect on the listener.
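The pauses, emphasis and whispers described above can be sketched with a few small Python helpers. The function names are my own, and the sample sentences are invented for illustration; the tags they produce are <break>, <emphasis> and the <amazon:effect name="whispered"> effect. Note that the sample starts with plain narration, not with an effect.

```python
from xml.sax.saxutils import escape


def pause(milliseconds: int) -> str:
    """A silent break of the given length."""
    return f'<break time="{milliseconds}ms"/>'


def emphasize(text: str, level: str = "moderate") -> str:
    """Stress a word or phrase; level is strong, moderate or reduced."""
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def whisper(text: str) -> str:
    """Have a word or phrase whispered."""
    return f'<amazon:effect name="whispered">{escape(text)}</amazon:effect>'


ssml = (
    "<speak>"
    "He pushed the saloon doors open." + pause(700)
    + "Nobody " + emphasize("moved", "strong") + "." + pause(400)
    + whisper("Not yet.")
    + "</speak>"
)
print(ssml)
```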

Reading speed is very important for an audiobook: the words need to be spoken at a suitably unhurried pace, to give the listener the opportunity to fully understand the meaning of the narration. With the <prosody> tag, you can slow down or speed up the speech flow.
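A minimal Python sketch of slowing the narration with <prosody> follows. The helper name `paced` and the default of 90% are my own choices; in Amazon Polly the `rate` attribute accepts keywords such as `slow` or a percentage where 100% is the normal speed, but it is worth checking the values against the AWS guide for your voice.

```python
from xml.sax.saxutils import escape


def paced(text: str, rate: str = "90%") -> str:
    """Wrap text in <prosody> so that it is read at the given rate."""
    return f'<prosody rate="{rate}">{escape(text)}</prosody>'


narration = "<speak>" + paced("The stranger rode into town at dusk.", rate="85%") + "</speak>"
print(narration)
```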

Another trick to make speech more natural is the <amazon:auto-breaths> tag, which introduces a breath instead of a simple pause during the narration. This tag adds automatic breaths that occasionally take the place of commas in the text. However, the result may be too predictable and mechanical, so it is preferable to add some breaths manually, with the <amazon:breath> tag, in the key passages of the speech. A breath should humanize the speech, falling at a point where the narrator would naturally take a longer pause to breathe. It is also important to avoid beginning a narration with a breath or a sigh, so as not to distract the listener.
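Placing breaths by hand can be sketched as follows in Python. The function `with_manual_breaths` is hypothetical: it takes a list of sentences and the positions after which a breath should fall, and it inserts an <amazon:breath> tag there. Because a breath is only ever added after a sentence, the narration never starts with one.

```python
from xml.sax.saxutils import escape


def with_manual_breaths(sentences, breath_after, duration="medium", volume="soft"):
    """Join sentences, adding a breath after those whose index is in breath_after."""
    breath = f'<amazon:breath duration="{duration}" volume="{volume}"/>'
    parts = []
    for index, sentence in enumerate(sentences):
        parts.append(escape(sentence))
        # No breath after the final sentence: there is nothing left to say.
        if index in breath_after and index < len(sentences) - 1:
            parts.append(breath)
    return "<speak>" + " ".join(parts) + "</speak>"


ssml = with_manual_breaths(
    ["She read the letter twice.", "Then she folded it.", "It was time to leave."],
    breath_after={1},
)
print(ssml)
```

For comparison, the automatic variant simply wraps the whole passage in <amazon:auto-breaths> and </amazon:auto-breaths>.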

Creating dialogues between different voices with Text-to-Speech

As for dialogues, Amazon Polly does not allow you to use multiple voices in the same console session. However, the problem can be solved by creating the different parts of a dialogue separately and assembling them later with audio editing software. For example, I use the free Audacity software.
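To keep track of who says what before assembling the audio, you can first split the script by speaker. The Python sketch below is only an illustration: the script format ("Name: line"), the `CAST` mapping and the file naming are my own assumptions, and Joanna and Matthew are simply two of the US English voices. It does not produce any audio; it prepares one SSML segment per line, numbered in order so the clips are easy to line up in the editor.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from characters to voice names.
CAST = {"NARRATOR": "Joanna", "SHERIFF": "Matthew"}


def plan_dialogue(script: str, cast: dict[str, str]) -> list[dict]:
    """Turn 'Name: line' rows into ordered segments, one voice per segment."""
    segments = []
    for line in script.strip().splitlines():
        if not line.strip():
            continue  # skip empty rows
        speaker, _, text = line.partition(":")
        speaker = speaker.strip().upper()
        segments.append({
            "file": f"{len(segments) + 1:03d}_{speaker.lower()}.mp3",
            "voice": cast[speaker],
            "ssml": f"<speak>{escape(text.strip())}</speak>",
        })
    return segments


script = """
Narrator: The saloon fell silent.
Sheriff: Who fired that shot?
Narrator: Nobody answered.
"""
for segment in plan_dialogue(script, CAST):
    print(segment["file"], segment["voice"], segment["ssml"])
```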

When creating a dialogue between two characters, remember that the sentences should be rather short, to prevent the flow of speech from taking on a repetitive cadence. Timing is very important, especially when it comes to a comical or a dramatic dialogue. Pauses play a key role here. The intonation of a word can often vary when it is followed by a punctuation mark. I therefore suggest that you experiment to achieve more natural and realistic effects, even raising or lowering the pitch of the voice depending on the expressiveness you want to give to some passages. You can influence the pitch of the voice with the <prosody> tag and the timbre with the <amazon:effect vocal-tract-length> tag.
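Combining pitch and timbre is one way to draw several characters out of a single voice. In the Python sketch below, the preset names and percentages are invented for illustration and would need tuning by ear; a longer vocal tract generally makes the speaker sound bigger and deeper, a shorter one smaller and lighter. These effects are meant for standard voices, and support may be more limited on neural ones, so check the AWS tag guide.

```python
from xml.sax.saxutils import escape

# Hypothetical character presets: pitch shift and vocal tract length change.
PRESETS = {
    "old_prospector": {"pitch": "-15%", "tract": "+12%"},
    "young_girl": {"pitch": "+10%", "tract": "-10%"},
}


def as_character(text: str, name: str) -> str:
    """Wrap a line in the pitch and timbre settings of a preset character."""
    preset = PRESETS[name]
    return (
        f'<amazon:effect vocal-tract-length="{preset["tract"]}">'
        f'<prosody pitch="{preset["pitch"]}">{escape(text)}</prosody>'
        "</amazon:effect>"
    )


print("<speak>" + as_character("There is gold in those hills.", "old_prospector") + "</speak>")
```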

A good practice is to insert the input into the console, edit it with SSML, then copy and paste it into a script editor such as Notepad++ and save it as an .xml file. By doing this, you will have a record of all the processed material, and you can easily recreate or edit any part of your text in the future.
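If you keep your SSML in files, a quick check for unclosed tags before saving can spare you some console errors. This is a rough Python sketch with names of my own choosing. One assumption to flag: a strict XML parser does not know the `amazon:` prefix, so the check declares a placeholder namespace on a temporary copy only; the saved file is left exactly as you wrote it.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def is_well_formed(ssml: str) -> bool:
    """Return True if every tag in the SSML is properly opened and closed."""
    # Placeholder namespace so tags such as <amazon:breath/> can be parsed.
    probe = ssml.replace("<speak>", '<speak xmlns:amazon="urn:placeholder">', 1)
    try:
        ET.fromstring(probe)
    except ET.ParseError:
        return False
    return True


def save_ssml(path: str, ssml: str) -> None:
    """Save the SSML to a file, refusing text with broken tags."""
    if not is_well_formed(ssml):
        raise ValueError(f"SSML for {path} has unclosed or mismatched tags")
    Path(path).write_text(ssml, encoding="utf-8")


print(is_well_formed('<speak>Wait.<break time="1s"/> Now go.</speak>'))
print(is_well_formed("<speak><prosody rate='slow'>Oops</speak>"))
```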

Here is an excerpt of text optimized for SSML, taken from chapter 1 of “The Mystery of Cloomber,” by Sir Arthur Conan Doyle.

Excerpt of text optimized for SSML, taken from chapter 1 of “The Mystery of Cloomber,” by Sir Arthur Conan Doyle

Limits and potential of Text-to-Speech for fiction

Creating an audiobook by optimizing text for SSML can be a laborious task. You will need to listen to the result several times to make sure that the words are spoken correctly and that pauses and effects are consistent with the narrative. Post-production work also requires patience and attention to detail. In my opinion, human intervention is necessary to give fluidity to the narrative and to restore to the speech a spontaneity that automation cannot match. Moreover, I believe that Artificial Intelligence can be an incentive to human creativity and that the work of people remains essential for an artistic result.

For the moment, I want to focus on optimizing text using SSML, and I would like to create an entire story with Amazon Polly. In general, few synthetic voices are available for the Italian language, compared to the number and variety of those in other languages. English is the privileged language in this respect. This may, in my view, be due both to the fact that English is spoken in many different countries and to commercial reasons. A further obstacle to overcome is the fact that the AWS service does not currently offer neural voices for my language, Italian. I hope and trust that it will do so in the future, and I am convinced that finding creative solutions to such problems is a constructive challenge that helps to put new strategies into practice. I see these technologies as an instrument of accessibility and inclusion, and as an incentive for creativity for everyone.

Conclusions

Making your own audiobooks with Text-to-Speech is an interesting way to develop creativity and give voice to your ideas. I think that in the future this skill will be more and more important in the working world.

By using Text-to-Speech, you can improve your creative writing skills. You learn many basic notions of phonetics and pay closer attention to the words in a text. In addition, the work develops attention to detail and precision in listening.

In conclusion, using Text-to-Speech strengthens language and logical skills, and it is an activity of great personal satisfaction. It is well worth a try.

Laura Baiardi
Women in Voice

Web writer, based in Italy. Passionate about creative writing and Art. I’m interested in new ways of communicating, especially voice tech.