The Intuition Behind Voice Cloning with 5 Seconds of Audio
A guide to the paper “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis”
Nobody wants to listen to a robotic text-to-speech (TTS) program drone on about whatever.
With the paper, Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis by Jia, Zhang, and Weiss et al., we don’t have to anymore.
The authors propose a new technique, often abbreviated SV2TTS (from “Speaker Verification to Text-To-Speech”), for taking a few seconds of a sample voice and then generating completely new audio in that same voice.
How do the authors do this?
Let’s take a look.
In this article, we will cover the intuition behind the voice cloning technique developed by Jia, Zhang, and Weiss et al. We will not be implementing any code this time.
To follow along, it helps to have:
- An understanding of mel spectrograms
- Intermediate knowledge of machine learning
The SV2TTS model is composed of three parts, each trained individually.
This allows each part to be trained on independent data, reducing the need to obtain high quality, multispeaker data.
The Speaker Encoder
The first part of the SV2TTS model is the speaker encoder.
The speaker encoder’s job is to take some input audio (encoded as mel spectrogram frames), of a given speaker, and output an embedding that captures “how the speaker sounds.”
The speaker encoder does not care about the words the speaker is saying or about any noise in the background; all it cares about is the voice of the speaker, e.g., high or low pitch, accent, tone, etc.
All of these features are combined into a low-dimensional vector, known formally as a d-vector, or informally as the speaker embedding.
As a result, utterances spoken by the same speaker will be close to each other in the embedding space, while utterances spoken by different speakers will be far apart.
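This closeness is typically measured with cosine similarity between d-vectors. Here is a minimal sketch using toy 3-dimensional vectors (real d-vectors are much larger, typically 256-dimensional); the vectors themselves are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "d-vectors" (hypothetical values, just for illustration).
speaker_a_utt1 = np.array([0.9, 0.1, 0.0])
speaker_a_utt2 = np.array([0.8, 0.2, 0.1])   # same speaker, different utterance
speaker_b_utt1 = np.array([0.0, 0.2, 0.9])   # a different speaker

same = cosine_similarity(speaker_a_utt1, speaker_a_utt2)
diff = cosine_similarity(speaker_a_utt1, speaker_b_utt1)
print(same > diff)  # same-speaker embeddings are more similar
```

A well-trained speaker encoder makes this gap between same-speaker and different-speaker similarity as large as possible.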
To learn to generate these embeddings, the authors describe the following process:
- First, the examples of speech audio are segmented into 1.6-second clips with no transcript and transformed into mel spectrograms.
- Then the speaker encoder is trained to take two audio samples and decide whether or not the same speaker produced them. As a byproduct, this forces the speaker encoder to create embeddings that represent how the speaker sounds.
The training process is similar to Siamese neural networks if you are familiar with them.
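The first step above, cutting audio into fixed-length clips, can be sketched as follows. This is a simplification with assumed values: I use a 16 kHz sample rate and non-overlapping windows, whereas practical implementations often use overlapping sliding windows.

```python
import numpy as np

SAMPLE_RATE = 16_000          # assumption: 16 kHz audio
CLIP_SECONDS = 1.6            # clip length from the paper's description
CLIP_SAMPLES = int(SAMPLE_RATE * CLIP_SECONDS)

def segment_utterance(waveform: np.ndarray) -> list:
    """Split a raw waveform into non-overlapping 1.6 s clips, dropping the remainder."""
    n_clips = len(waveform) // CLIP_SAMPLES
    return [waveform[i * CLIP_SAMPLES:(i + 1) * CLIP_SAMPLES]
            for i in range(n_clips)]

audio = np.random.randn(5 * SAMPLE_RATE)   # 5 seconds of fake audio
clips = segment_utterance(audio)
print(len(clips), len(clips[0]))  # 3 clips of 25600 samples each
```

Each clip would then be converted to mel spectrogram frames before being fed to the speaker encoder.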
The Synthesizer
The synthesizer is the part of SV2TTS that turns text input into mel spectrograms, which the vocoder later converts into sound.
The synthesizer takes a sequence of text, mapped to phonemes (the smallest units of human sound, e.g., the sound you make when saying ‘a’), along with the embedding produced by the speaker encoder, and uses the Tacotron 2 architecture to generate frames of a mel spectrogram recurrently.
To train the synthesizer (visual below):
- First, collect a sequence of phonemes and a mel spectrogram of a speaker uttering that sentence.
- Then, pass the mel spectrogram to the speaker encoder to generate a speaker embedding.
- Next, the synthesizer’s encoder concatenates its encoding of the phoneme sequence with the speaker embedding.
- The decoder and attention parts of the synthesizer then generate the mel spectrogram recurrently.
- Finally, the generated mel spectrogram is compared to the target to compute a loss, which is then optimized.
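The concatenation step above can be sketched in a few lines: the single speaker embedding is repeated across every encoder timestep and joined along the feature axis. The dimensions here are toy values I chose for illustration; the real model uses much larger sizes.

```python
import numpy as np

# Assumed toy dimensions (hypothetical, for illustration only).
N_PHONEMES = 12        # length of the input phoneme sequence
ENC_DIM = 8            # synthesizer encoder output size per phoneme
EMB_DIM = 4            # speaker embedding (d-vector) size

encoder_outputs = np.random.randn(N_PHONEMES, ENC_DIM)
speaker_embedding = np.random.randn(EMB_DIM)

# Broadcast the single speaker embedding across every encoder timestep,
# then concatenate along the feature axis. This is how the speaker
# identity is injected before the attention/decoder stage.
tiled = np.tile(speaker_embedding, (N_PHONEMES, 1))
conditioned = np.concatenate([encoder_outputs, tiled], axis=1)
print(conditioned.shape)  # (12, 12)
```

Because the same embedding is attached to every timestep, the decoder sees the speaker’s identity at every step of generation.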
The Vocoder
At this point, the synthesizer has created a mel spectrogram, but we still can’t listen to anything yet. To convert the mel spectrogram into a raw audio waveform, the authors use a vocoder.
The vocoder used here is based on DeepMind’s WaveNet model, which generates raw audio waveforms sample by sample and was at one point the state of the art for TTS systems; in SV2TTS it is conditioned on the mel spectrogram rather than on text.
If you are interested, I would recommend reading DeepMind’s blog post about WaveNet to learn more.
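The core idea behind WaveNet-style generation is autoregression: each audio sample is predicted from the samples that came before it. The toy sketch below captures only that idea; the linear “model” and its weights are stand-ins I made up, where the real WaveNet uses stacks of dilated convolutions and conditioning on the mel spectrogram.

```python
import numpy as np

# A toy autoregressive "vocoder": like WaveNet, it emits one audio sample
# at a time, each conditioned on the previously generated samples.
rng = np.random.default_rng(0)
RECEPTIVE_FIELD = 4
weights = rng.normal(size=RECEPTIVE_FIELD)   # hypothetical "learned" weights

def generate(n_samples: int) -> np.ndarray:
    audio = np.zeros(n_samples + RECEPTIVE_FIELD)
    for t in range(RECEPTIVE_FIELD, len(audio)):
        context = audio[t - RECEPTIVE_FIELD:t]          # previous samples
        audio[t] = np.tanh(weights @ context) + 0.1 * rng.normal()
    return audio[RECEPTIVE_FIELD:]

waveform = generate(100)
print(waveform.shape)  # (100,)
```

This sample-by-sample loop is also why WaveNet-style vocoders are slow at inference time, which motivated later, faster vocoder designs.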
And that’s it!
As a quick recap, here is the flow of the model:
- The speaker encoder listens to a given audio sample and generates an embedding
- The synthesizer takes a list of phonemes and the speaker embedding, then generates a mel spectrogram
- A neural vocoder converts the mel spectrogram into an audio waveform we can listen to
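The three-stage flow above can be summarized as a schematic pipeline. The classes and return values here are hypothetical placeholders, not the paper’s actual code; they only show how the pieces hand data to one another.

```python
# Schematic of the three-stage SV2TTS pipeline (placeholder classes).
class SpeakerEncoder:
    def embed(self, reference_audio):
        return "speaker_embedding"   # d-vector summarizing the voice

class Synthesizer:
    def synthesize(self, phonemes, speaker_embedding):
        return "mel_spectrogram"     # Tacotron 2 style generation

class Vocoder:
    def to_waveform(self, mel_spectrogram):
        return "waveform"            # WaveNet-style sample generation

def clone_voice(reference_audio, text_phonemes):
    embedding = SpeakerEncoder().embed(reference_audio)
    mel = Synthesizer().synthesize(text_phonemes, embedding)
    return Vocoder().to_waveform(mel)

print(clone_voice("5_second_sample.wav", ["HH", "AH", "L", "OW"]))
```

Note that only the reference audio determines the voice; the phonemes determine the words. That separation is what lets the model say arbitrary new sentences in a cloned voice.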
Measuring the Result Quality
We can’t innovate if we don’t have a way to measure how much better our systems are than previous techniques. Let’s take a look at how the authors solve this.
The authors use human raters to measure the naturalness and similarity of speech the model produces.
Naturalness measures how “human” the speech sounded, while similarity measures how similar the synthesized speech sounds to the original speaker.
The raters interacted with the audio samples through a simple rating GUI.
The technique of using crowdsourced ratings such as this is often referred to as Mean Opinion Score or MOS.
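An MOS is just the mean of those 1–5 ratings, usually reported with a confidence interval. A minimal sketch, with made-up ratings standing in for real crowdsourced data:

```python
import numpy as np

# Hypothetical crowdsourced ratings on the 1-5 MOS scale.
ratings = np.array([4, 5, 4, 3, 5, 4, 4, 5, 3, 4])

mos = ratings.mean()
# 95% confidence interval via a normal approximation; papers typically
# report MOS as "mean +/- margin".
margin = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))
print(f"MOS = {mos:.2f} +/- {margin:.2f}")
```

With many raters the margin shrinks, which is why MOS studies collect hundreds of judgments per system.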
The authors report the final results for naturalness in the paper.
If you would like the full analysis of this, I would recommend reading section 3.1 of the original paper. However, in summary:
- The proposed SV2TTS model achieves about 4.0 MOS on all datasets.
- LibriSpeech (one of the training datasets) is rated as less natural, as its transcripts lack punctuation, making it difficult for the model to learn pauses.
- The embedding table system uses a lookup table of speaker embeddings but otherwise has the same architecture. Because it has a lookup table, it cannot generalize, and therefore cannot be evaluated on unseen datasets.
- The naturalness of unseen and seen examples are incredibly similar, which is impressive.
And as for speech similarity, the authors report the following findings.
Again, you may want to read section 3.2 of the original paper for the full analysis. But in summary:
- Scores are again higher for VCTK (the other training dataset), reflecting the cleaner, more structured nature of that dataset.
- The SV2TTS model scores between “moderately similar” and “very similar” on the evaluation scale for unseen speakers.
- Although this is hard to measure mathematically, the model overall captures the characteristics of the speaker accurately.
That’s pretty much it!
The model can take only a few seconds of audio and generate realistic new audio in that voice.
You can find audio samples at https://google.github.io/tacotron/publications/speaker_adaptation/
Implementing and training the SV2TTS model from scratch is far beyond the scope of this article. Although there is no official implementation released to the public, there are a few robust implementations built by the community.
Corentin’s implementation has both a UI and a CLI for interactive audio generation. The UI allows you to record a voice sample yourself (or load audio from a file), type some text, then visualize the embeddings and generate novel audio.
You can watch his brilliant video on it here:
Results may vary as Corentin states on one of his GitHub issues, “My implementation of the synthesizer and of the vocoder aren’t that great, and I’ve also trained on LibriSpeech when LibriTTS would have been preferable.”
Nonetheless, the UI is designed brilliantly, and his implementation is the best publicly available that I have found.
You can read instructions on how to set up the environment on the project’s GitHub page.
SV2TTS can power hundreds of applications involving traditional TTS, including chatbots and virtual assistants like Siri and Cortana. It can also give people who have lost their voice or are unable to speak, e.g., due to ALS, the chance to connect with the world like they never could before.
But being able to clone a voice with only five seconds of input audio is dangerous too.
As this technology progresses, it will be entirely possible to fake a declaration of war or a threat from a political figure, indistinguishable from reality. Combined with Deepfakes, whole forged videos can be made of whatever one wants.
We cannot stop this technology from being developed; all we can do is develop accurate (and trustworthy) algorithms for detecting whether audio has been synthesized.
And as always, until next time,
Check out the original paper here: https://arxiv.org/pdf/1806.04558.pdf