Important Dataset Essentials for TTS Artificial Intelligence Model Training

Mehmet Büyükzincir
Turkcell
Mar 22, 2022

After long hours of working with Text to Speech (TTS), we have gathered some crucial tips for producing high-quality synthesized audio.

The quality of an artificial intelligence TTS model depends on the training data as much as on the neural network itself. The seven points below cover the consistency and conformity a dataset needs for training.

1 — Audio Data with Zero Noise:

Probably the most important step in preparing your own dataset is recording clear, noise-free audio. The last thing you want is breath and mouth noise in short pauses, because the TTS model picks up these little noises and synthesizes a hissing sound whenever there is a comma or pause in the input text.

Figure - 1

As can be seen in Figure-1 above, there are small noises (breathing, mouth noise, etc.) in the middle of the audio. If we train a TTS model on this kind of audio, it will learn that wherever there is silence, there should be hissing.

So, no part of any audio file should include background noise, humming, hiss, mouth smacking, or breathing sounds. Hence, the audio should be recorded in a professional studio with a professional voice artist.

Long sentences should be read in one breath. If we can attend the recording sessions, it is better still, so we can supervise and verify the dataset's audio on the spot.

2 — Audio Data Should Be Sampled at 22050 Hertz:

Professional studio recordings generally have a 48 or 44.1 kHz sampling rate. It is nice to have detailed, high-resolution audio data, but it would be very difficult and time-consuming for a neural network to learn from such voluminous data.

So before training the neural network, we need to down-sample the audio data.

At this point, the question is: how much should we down-sample it?

The generally accepted answer is 22 kHz (exactly 22050 Hz). But it can vary with the intended use of the TTS model; if it does not have to produce high-quality audio, the data can be down-sampled to 8 kHz.
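As a rough sketch of this down-sampling step (the use of SciPy's polyphase resampler is our own choice here, not a requirement of any particular TTS toolkit):

```python
import numpy as np
from scipy.signal import resample_poly

def downsample(audio: np.ndarray, orig_sr: int, target_sr: int = 22050) -> np.ndarray:
    """Resample a mono waveform with polyphase filtering (anti-aliased)."""
    gcd = np.gcd(orig_sr, target_sr)
    return resample_poly(audio, up=target_sr // gcd, down=orig_sr // gcd)

# Example: one second of 44.1 kHz audio becomes 22050 samples.
one_second = np.zeros(44100)
resampled = downsample(one_second, orig_sr=44100, target_sr=22050)
print(len(resampled))  # 22050
```

The same function handles 48 kHz studio masters, since the up/down factors are derived from the greatest common divisor of the two rates.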

3 — Gaussian Distribution of Sentence Durations:

The training dataset should contain short audio files, down to a minimum of about 1 second, which will typically be one- or two-word sentences.

It should also contain long audio files, up to a maximum of about 25 seconds. The counts of these files should follow a Gaussian distribution over duration: few very short and very long sentences, but many medium-length sentences, so that the distribution peaks at medium-length sentences.
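The duration distribution can be sanity-checked with a short script; the synthetic durations and the bucket boundaries below (1–5 s, 5–12 s, 12–25 s) are illustrative choices, not fixed rules:

```python
import numpy as np

# Hypothetical clip durations (seconds), roughly Gaussian around 8 s.
rng = np.random.default_rng(0)
durations = np.clip(rng.normal(loc=8.0, scale=3.0, size=1000), 1.0, 25.0)

# Bucket the clips into short (1-5 s), medium (5-12 s), long (12-25 s).
counts, _ = np.histogram(durations, bins=[1, 5, 12, 25])
n_short, n_medium, n_long = counts
print(n_short, n_medium, n_long)  # the medium bucket should dominate
```

On a real dataset you would compute `durations` from the audio files themselves (sample count divided by sample rate) instead of drawing them from a random generator.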

4 — Word Frequency Check:

The word-frequency spectrum should cover a variety of words. Apart from the language's most frequent words, the dataset should have a normalized distribution of words, and it should cover a variety of phonemes so that words with difficult pronunciations are represented.
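A minimal word-frequency check might look like the following; the transcript lines and the 10% dominance threshold are hypothetical:

```python
from collections import Counter

# Hypothetical transcript lines from the dataset.
transcripts = [
    "the quick brown fox jumps over the lazy dog",
    "she sells sea shells by the sea shore",
]

# Word frequencies across the whole corpus.
freq = Counter(word for line in transcripts for word in line.lower().split())

# Flag words that dominate the corpus (the 10% threshold is an arbitrary choice).
total = sum(freq.values())
dominant = [word for word, count in freq.items() if count / total > 0.10]
print(freq.most_common(3))
```

Words flagged as dominant would be candidates for rebalancing, while words or phonemes that never appear point at gaps to fill with new recordings.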

5 — Silences in the Audio Files:

In the training audio, there should be no silent parts in the middle of a recording, except at punctuation. The voice artist should not stop, pause, or lag (which causes silence in the audio) while vocalizing the text transcripts.

If we already have previously recorded training audio, we should clean the data of such silent parts.

The only allowed silences are at the beginning and at the end of an audio file, and they should be at most 600 milliseconds long.

Figure - 2

For example, Figure-2 is the output of applying this cleanup to the Figure-1 audio.
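A simple energy-threshold trimmer along these lines can enforce the 600 ms rule at the edges of a clip; the amplitude threshold is an assumption and would need tuning per recording setup (a library routine such as librosa's trim would serve the same purpose):

```python
import numpy as np

def trim_silence(audio: np.ndarray, sr: int, threshold: float = 0.01,
                 max_pad_ms: int = 600) -> np.ndarray:
    """Trim leading/trailing silence, keeping at most max_pad_ms of padding."""
    voiced = np.flatnonzero(np.abs(audio) > threshold)
    if voiced.size == 0:
        return audio  # nothing above the threshold; leave the clip untouched
    pad = int(sr * max_pad_ms / 1000)
    start = max(voiced[0] - pad, 0)
    end = min(voiced[-1] + 1 + pad, len(audio))
    return audio[start:end]

# Example: 1 s silence + 1 s tone + 1 s silence at 22050 Hz.
sr = 22050
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
audio = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])
trimmed = trim_silence(audio, sr)
print(len(audio) / sr, len(trimmed) / sr)  # edge silences shrink to <= 0.6 s
```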

6 — Pace of Speech in the Audio Files:

The pace of speech should match across all audio files: every transcript should be vocalized at the same pace during recording, since it directly affects the model's pace of speech. Also, if the transcripts are read in a very neutral way, the synthesized voice will sound robotic and emotionless. Therefore, the voice artist should emphasize the tone of each sentence.
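One way to verify pace consistency is to compare words-per-second across clips; the clip metadata and the 50% deviation threshold below are illustrative assumptions:

```python
# Hypothetical per-clip metadata: (transcript, duration in seconds).
clips = [
    ("good morning everyone", 1.5),
    ("the weather will be sunny tomorrow afternoon", 3.0),
    ("thank you for listening", 1.6),
]

# Speaking rate in words per second for each clip.
rates = [len(text.split()) / duration for text, duration in clips]
mean_rate = sum(rates) / len(rates)

# Flag clips whose pace deviates by more than 50% from the corpus mean.
outliers = [clip for clip, rate in zip(clips, rates)
            if abs(rate - mean_rate) > 0.5 * mean_rate]
print(round(mean_rate, 2), outliers)
```

Clips flagged as outliers would be candidates for re-recording at the reference pace.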

7 — Punctuation of Transcript Files:

There should be a period, question mark, or exclamation mark at the end of every sentence. Without end-of-sentence punctuation, the TTS model cannot predict when to end the sentence, and the synthesized audio is left with a trailing silence.

Also, for questions, the voice artist should emphasize the question mark; otherwise question sentences can sound like plain statements.
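A quick transcript lint for terminal punctuation could look like this (the example sentences are hypothetical):

```python
# Sentence-final punctuation the transcripts must end with.
SENTENCE_END = (".", "?", "!")

# Hypothetical transcript lines; the last one is missing its terminator.
transcripts = [
    "How are you today?",
    "The meeting starts at noon.",
    "Please close the door",
]

# Collect transcripts that would leave the model without an end-of-sentence cue.
bad = [t for t in transcripts if not t.rstrip().endswith(SENTENCE_END)]
print(bad)
```

Running such a check before training catches unterminated lines early, instead of discovering them as trailing silences in the synthesized audio.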

In conclusion, the audio and transcript data are the most important part of model training. The data should be very reliable and should not contain contradictions.

I couldn’t have written this article without the support of Tugce Kocak.

Regards.


Principal Artificial Intelligence and Machine Learning Engineer at Turkcell