Text to Speech with Real-time Voice Cloning

Allan Souza
Published in Sinch Blog
Aug 10, 2020 · 4 min read

Recently, chatbots have become part of many services in our daily lives. These bots can be built to answer a set of predefined questions or even to hold a humanlike conversation. On the one hand, they are very helpful in text-driven services such as virtual assistants, answering questions, making appointments, and confirming deliveries. On the other hand, their text-dependent nature imposes limitations on human interaction and accessibility.

To address these limitations, text-to-speech (TTS) systems have emerged: a technology that converts written language into human speech. TTS systems can therefore be used not only as human-technology interfaces to computer-based services, but also as an accessibility aid for visually impaired people. In this context, WAVY, a leading company specialized in improving customer experience through conversational systems based on artificial intelligence and chatbots, started a research effort to implement TTS systems and enable more efficient and inclusive services.

TTS systems are trained with datasets composed of text and audio pairs, so the system learns the sound (i.e., the waveform) of words, syllables, and letters. However, the resulting voice is the one present in the training dataset, which means that to produce a specific voice the TTS system needs to be trained with the target voice.

To overcome this drawback, voice cloning systems introduced methods to extract specific characteristics of a target voice (e.g., tone, accent) and apply them to the waveform of a different speech, making it possible to change the voice of the resulting audio.

These remarkable technologies became possible thanks to advances in deep learning models and high-performance computing, which paved the way to solve complex problems in reasonable time.

Motivated by combining these technologies, Jia et al. [1] proposed SV2TTS (Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis), a TTS system able to generate speech audio in the voices of different speakers, including those unseen during training.

SV2TTS Overview

The system is composed of three independently trained neural networks, illustrated in Figure 1: (i) a recurrent speaker encoder, trained on a speaker verification task using an independent dataset of noisy, untranscribed speech from thousands of speakers, which generates a fixed-dimensional embedding vector from only a few seconds of reference speech from a target speaker; (ii) a sequence-to-sequence synthesizer, based on Tacotron 2, that predicts a mel spectrogram from text, conditioned on the speaker embedding; and (iii) an autoregressive WaveNet vocoder, which converts the spectrogram into time-domain waveforms.

In summary, SV2TTS specifically addresses a zero-shot learning setting, where a few seconds of untranscribed reference audio from a target speaker are used to synthesize new speech in that speaker's voice, without updating any model parameters. For more details about the architecture and methods employed by SV2TTS, please refer to [1].

Demo: TTS with Real-Time Voice Cloning

Corentin Jemine developed a framework based on [1] to provide a TTS with real-time voice cloning. The framework is available in his GitHub repository with a pretrained model for the TTS.

To run the demo, we will use Google's Colab environment. Create a new Colab notebook and make sure the GPU is enabled by going to Runtime -> Change Runtime Type -> Hardware Accelerator -> GPU. Let's get started.

First, we need to clone the framework from Corentin Jemine's git repository:

!git clone https://github.com/CorentinJ/Real-Time-Voice-Cloning.git

Then, we need to enter the repository directory and install the dependencies of the framework:

%cd Real-Time-Voice-Cloning/
# Install dependencies
!pip install -q -r requirements.txt
!apt-get install -qq libportaudio2

Download the pretrained model and unzip it:

!gdown https://tinyurl.com/yyut93rd
!unzip pretrained.zip

To perform the voice cloning, we need to record a short audio clip as an example of the desired voice. We can record an audio sample from the browser using the following code:
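The gist embedded in the original post is not reproduced here, so below is a minimal sketch of a browser-based recorder for Colab. It relies on google.colab.output.eval_js and the browser's MediaRecorder API; the helper name record_audio, the fixed recording duration, and the output file name speaker_sample.wav are assumptions for illustration. The browser delivers WebM audio, which we convert to WAV with ffmpeg (available by default in Colab).

from base64 import b64decode
import subprocess

from google.colab import output
from IPython.display import Javascript, display

# JavaScript that records from the microphone for a given number of
# milliseconds and returns the audio as a base64 data URL.
RECORD_JS = """
const sleep = time => new Promise(resolve => setTimeout(resolve, time));
const blobToDataURL = blob => new Promise(resolve => {
  const reader = new FileReader();
  reader.onloadend = e => resolve(e.srcElement.result);
  reader.readAsDataURL(blob);
});
var record = async time => {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks = [];
  recorder.ondataavailable = e => chunks.push(e.data);
  recorder.start();
  await sleep(time);
  return new Promise(resolve => {
    recorder.onstop = async () => resolve(await blobToDataURL(new Blob(chunks)));
    recorder.stop();
  });
};
"""

def record_audio(seconds=5, filename="speaker_sample.wav"):
    """Record `seconds` of audio in the browser and save it as a WAV file."""
    display(Javascript(RECORD_JS))
    data_url = output.eval_js("record(%d)" % (seconds * 1000))
    webm_bytes = b64decode(data_url.split(",")[1])
    with open("recording.webm", "wb") as f:
        f.write(webm_bytes)
    # Convert the browser's WebM recording to WAV so the toolkit can read it.
    subprocess.run(["ffmpeg", "-y", "-i", "recording.webm", filename], check=True)
    return filename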

Next, we load the pretrained weights into the encoder, synthesizer, and vocoder:
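The original gist is missing here, so the following is a sketch based on the repository's inference modules (the same calls used by its demo_cli.py at the time of writing). The checkpoint paths assume the directory layout of the pretrained.zip archive downloaded above; adjust them if your archive differs.

from pathlib import Path

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Paths below assume the layout of the unzipped pretrained.zip archive.
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/logs-pretrained/taco_pretrained"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))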

With everything set up, we run the following code to record our own voice and perform the TTS with real-time cloning:
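Again, the embedded gist does not survive here, so this is a hedged sketch of the full pipeline: embed the recorded reference voice, synthesize a mel spectrogram from the input text, invert it with the vocoder, and play the result. It reuses the hypothetical record_audio helper defined above and the three models loaded in the previous step.

import numpy as np
from IPython.display import Audio, display

# Ask for the text to synthesize and record a short reference of your voice.
text = input("Text to synthesize: ")
reference_wav = record_audio(seconds=5)

# Extract the speaker embedding from the reference recording.
preprocessed_wav = encoder.preprocess_wav(reference_wav)
embedding = encoder.embed_utterance(preprocessed_wav)

# Synthesize a mel spectrogram from the text, conditioned on the embedding.
specs = synthesizer.synthesize_spectrograms([text], [embedding])
spec = specs[0]

# Invert the spectrogram into a time-domain waveform with the vocoder.
generated_wav = vocoder.infer_waveform(spec)
generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")

# Play the cloned voice in the notebook.
display(Audio(generated_wav, rate=synthesizer.sample_rate))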

During the execution, we first provide the input text to be synthesized. Then, we click the start recording button to record our own voice. Finally, the framework extracts the characteristics of the recorded voice and applies them to the synthesized speech. The resulting voice can be played by clicking the play button.

The complete code can be found in the following Colab notebook. To execute the code, go to the Runtime menu and click on Run all, or just press Ctrl + F9.

Final Remarks

Corentin Jemine also provides a tool with a graphical user interface, which can be downloaded from his repository as well. A demo video of the tool can be seen below.

Acknowledgments: We would like to thank WAVY for the financial support.

References

[1] Jia, Ye, et al. “Transfer learning from speaker verification to multispeaker text-to-speech synthesis.” Advances in Neural Information Processing Systems. 2018.
