Voice Cloning Using Deep Learning

Latest Technology Trends And Threats

Mohit Saini
The Research Nest
9 min read · Feb 6, 2020


I remember when voice assistants were first introduced on smartphones. Their capabilities were very limited back then, but the concept of talking to a phone got everyone excited. One of the limitations of early voice assistants was that they sounded very unnatural and robotic🤖. As time progressed, though, both the spoken and functional capabilities of these assistants evolved. Now the accent, tone, pitch, etc. of voice agents sound very natural. All of this has become possible because of advancements in artificial intelligence and text-to-speech over the years.

Neural networks can now take just a few seconds of your speech and generate entirely new, natural-sounding audio samples. What’s more, these synthetic voices may soon be indistinguishable from the original recordings. If you examine the many examples of voice cloning🔥, it’s easy to appreciate the breadth of what the technology can do, including switching the gender of a voice and altering accents and styles of speech.

Slowly, we are moving toward a voice-driven world. The consumption of audio content and voice-based automated services is on the rise. Many content creators are moving to platforms like SoundCloud and Amazon’s audiobook service, Audible. The shift can also be seen in how tech giants like Google, Amazon, Samsung, and Apple are investing heavily in their voice-based services, each frequently claiming to be better than its counterparts.

With these advancements, we will soon be able to customize the voices of various voice agents as we like. Imagine Morgan Freeman reading your shopping list, Amitabh Bachchan guiding you through traffic while you navigate, or even cloning the voice of a deceased loved one for your voice assistant (ok! that’s creepy) 😨. Actors will find it easier to dub their movies in different languages. People who have lost their voice can use this technology to communicate. It has also opened the door for companies such as Lyrebird to provide new services and products. They create voices for chatbots, audiobooks, video games, text readers and more, all powered by AI. Lyrebird claims to be able to clone a voice with just a minute of a person’s audio.

Now let’s go over some breakthrough developments in the field of voice cloning in recent years.

1. WaveNet 🌊

It was introduced in 2016 by Google-backed DeepMind in the paper WaveNet: A Generative Model for Raw Audio. Earlier text-to-speech (TTS) systems were largely based on concatenative TTS. In this approach, a very large database of short speech fragments is first recorded from a single speaker, and these fragments are then recombined to form complete utterances. The downside of this approach is that you need an entirely new database of audio samples if you want to make even minor tweaks to the voice, such as altering the emphasis or emotion. The audio produced this way also tends to sound unnatural, glitchy, and robotic.

On the other hand, WaveNet is a parametric TTS system, where all the information required to generate the audio is stored in the parameters of the model. You can control the characteristics of the speech via the inputs to the model. WaveNet generates its output by directly modeling the raw waveform of the audio signal, one sample at a time. The audio samples generated by WaveNet sound much more natural than those from concatenative TTS. I would suggest checking out the speech samples on DeepMind’s blog to hear the difference in naturalness between these approaches.

The main ingredient of WaveNet is a stack of dilated causal convolutions. The convolutional layers have increasing dilation factors that allow the receptive field to grow exponentially with depth and cover thousands of timesteps.

Basically, in a dilated convolution, the filter is applied over an area larger than its length by skipping input values with a certain step. This is similar to pooling or strided convolutions, but here the output has the same size as the input. So, there are no pooling layers in the network, and the output of the model has the same time dimensionality as the input.
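
To make this concrete, here is a minimal PyTorch sketch (not DeepMind’s implementation) of a stack of dilated causal convolutions. The gated activations and residual/skip connections of the real WaveNet are omitted for brevity; the point is simply that left-only padding keeps the model causal, while doubling dilations make the receptive field grow exponentially without shrinking the time dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left, so the output at time t
    never depends on inputs later than t."""
    def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))         # pad the past only -> causal
        return self.conv(x)

class DilatedStack(nn.Module):
    """Dilation doubles at each layer (1, 2, 4, ...), so the receptive field
    grows exponentially with depth while the time length stays the same."""
    def __init__(self, channels=32, layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [CausalConv1d(1 if i == 0 else channels, channels, dilation=2 ** i)
             for i in range(layers)])

    def forward(self, x):
        for layer in self.layers:
            x = torch.tanh(layer(x))
        return x

waveform = torch.randn(1, 1, 16000)              # one second of 16 kHz audio
out = DilatedStack()(waveform)
print(out.shape)                                 # time dimension is unchanged
```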

By modifying the speaker identity, WaveNet can be used to say the same thing in different voices. Similarly, additional inputs such as emotions or accents can be provided to the model to make the speech even more diverse and interesting. It can also mimic non-speech sounds, such as breathing and mouth movements. And since WaveNet can model any audio signal, it can even be used to generate music!

💥Check out this great blog to learn more about WaveNets.

💥TensorFlow implementation-

A drawback of autoregressive models like WaveNet is that they tend to learn local structure much better than global structure. It is more noticeable when modeling high-dimensional distributions.

MelNet is a generative model that addresses this problem using multiscale modeling. According to the paper, MelNet can capture the high-level structure of audio very well.

Here’s the research paper for your reference-

2. Deep Voice 🗣

Deep Voice is a TTS system developed by researchers at Baidu. Its first version, Deep Voice 1, was inspired by traditional text-to-speech pipelines. It adopts the same structure but replaces all the components with neural networks and uses simpler features. First, it converts the text to phonemes, and then an audio synthesis model converts the linguistic features into speech.

The latest version of the project is Deep Voice 3, which uses a fully-convolutional character-to-spectrogram architecture. This architecture enables fully parallel computation, so it trains faster than recurrent networks, and it is inspired by transformers (Vaswani et al.). Deep Voice 3 was the first TTS system to scale to thousands of speakers with a single model.

Deep Voice 3 Architecture

The architecture of Deep Voice 3 consists of three components:

  • Encoder: It is a convolutional neural network that converts the textual features to an internal learned representation.
  • Decoder: The decoder converts the learned representation coming from the encoder into a low-dimensional audio representation (Mel spectrograms). It consists of causal convolutions with multi-hop convolutional attention and generates its output in an autoregressive manner.
  • Converter: It is a fully-convolutional post-processing network that predicts the final vocoder parameters from the decoder’s hidden states. Unlike the decoder, it is non-causal and can therefore depend on future context information.

The overall objective function to be optimized is a linear combination of the losses from the decoder and the converter. The following vocoders can be used in the converter of Deep Voice 3: Griffin-Lim vocoder, WORLD vocoder, and WaveNet vocoder.
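
To see how the three components chain together, here is a deliberately over-simplified, runnable sketch. It is not Baidu’s code: the stand-in modules below drop the attention, causal convolutions, and autoregressive decoding, and every dimension and module definition is a made-up placeholder. It only illustrates the data flow and the combined decoder + converter objective.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Stand-in for the convolutional text encoder."""
    def __init__(self, vocab=40, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=5, padding=2)
    def forward(self, chars):                           # chars: (batch, time)
        h = self.embed(chars).transpose(1, 2)           # (batch, dim, time)
        return torch.relu(self.conv(h)).transpose(1, 2) # (batch, time, dim)

class MelDecoder(nn.Module):
    """Stand-in for the causal, attention-based decoder (here just a projection)."""
    def __init__(self, dim=64, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(dim, n_mels)
    def forward(self, enc):
        return self.proj(enc)                           # (batch, time, n_mels)

class Converter(nn.Module):
    """Stand-in for the non-causal post-processing network."""
    def __init__(self, n_mels=80, n_linear=513):
        super().__init__()
        self.proj = nn.Linear(n_mels, n_linear)
    def forward(self, mels):
        return self.proj(mels)                          # (batch, time, n_linear)

encoder, decoder, converter = TextEncoder(), MelDecoder(), Converter()

chars = torch.randint(0, 40, (2, 20))                   # dummy batch of character ids
target_mel = torch.randn(2, 20, 80)                     # ground-truth mel spectrogram
target_lin = torch.randn(2, 20, 513)                    # ground-truth vocoder features

mel = decoder(encoder(chars))
lin = converter(mel)

# overall objective: a weighted sum of the decoder loss and the converter loss
loss = nn.functional.l1_loss(mel, target_mel) + 1.0 * nn.functional.l1_loss(lin, target_lin)
loss.backward()
```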

In late 2018, the Deep Voice team released the paper Neural Voice Cloning with a Few Samples. It introduces a system based on Deep Voice 3 for the task of voice cloning. The system can clone a voice in a matter of seconds from very few samples; in fact, the average time for cloning is 3.7 seconds!

Baidu’s researchers adopted two approaches: speaker adaptation and speaker encoding. Both approaches can deliver good performance with minimal audio input, and both can be integrated into the Deep Voice system without degrading its quality. A small sketch contrasting the two follows the list below.

  • Speaker Adaptation: The objective of this approach is to fine-tune a trained multi-speaker model for an unseen speaker using a few audio-text pairs. Fine-tuning can be applied to either the speaker embedding or the whole model.
  • Speaker Encoding: The speaker encoding method directly estimates the speaker embedding from audio samples of an unseen speaker. A model like this does not require any fine-tuning during voice cloning. Thus, the same model can be used for all unseen speakers.
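
Here is a small, runnable sketch contrasting the two strategies. Everything in it (the linear “synthesizer”, the embedding size, the random data) is a hypothetical stand-in rather than Baidu’s actual models; it only illustrates that adaptation takes gradient steps for each new speaker, while encoding is a single forward pass.

```python
import torch
import torch.nn as nn

TEXT_DIM, EMB_DIM, MEL_DIM, AUDIO_DIM = 32, 16, 80, 200

# a frozen multi-speaker synthesizer: (text features + speaker embedding) -> mel frame
synthesizer = nn.Linear(TEXT_DIM + EMB_DIM, MEL_DIM)
for p in synthesizer.parameters():
    p.requires_grad_(False)

# a handful of text/audio pairs from the unseen speaker (random placeholders)
pairs = [(torch.randn(TEXT_DIM), torch.randn(MEL_DIM)) for _ in range(5)]

# --- 1. Speaker adaptation: fit only a new speaker embedding by gradient descent,
#        keeping the synthesizer frozen (fine-tuning the whole model is also possible)
embedding = torch.zeros(EMB_DIM, requires_grad=True)
opt = torch.optim.Adam([embedding], lr=1e-2)
for _ in range(100):
    for text_feat, target_mel in pairs:
        pred = synthesizer(torch.cat([text_feat, embedding]))
        loss = nn.functional.l1_loss(pred, target_mel)
        opt.zero_grad()
        loss.backward()
        opt.step()

# --- 2. Speaker encoding: a separately trained encoder predicts the embedding
#        directly from reference audio; no gradient steps are needed at cloning time
speaker_encoder = nn.Linear(AUDIO_DIM, EMB_DIM)            # pretrained in practice
with torch.no_grad():
    embedding = speaker_encoder(torch.randn(AUDIO_DIM))    # reference audio features
    cloned = synthesizer(torch.cat([torch.randn(TEXT_DIM), embedding]))
```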

💥Unofficial implementation-

💥Audio samples-

3. SV2TTS 🚀

The paper Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) describes a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voices of different speakers. What stands out about this system is that it can work in the zero-shot learning setting 👌🏻.

“This method aims to solve a task without receiving any example of that task at the training phase”

So, it can generate new speech in the voice of a previously unseen speaker, using only a few seconds of untranscribed reference audio, without updating any model parameters.

The SV2TTS system consists of three independently trained components. This allows each component to be trained on independent data, reducing the requirement of high-quality multispeaker data. The individual components are:

  • Speaker Encoder Network: The speaker encoder network’s job is to take audio of a given speaker as input and encode the characteristics of their voice into a low-dimensional vector embedding. It does not care about what the speaker is saying; all it cares about is how the speaker says it. The network is trained separately on the task of speaker verification, using a dataset of noisy speech from thousands of speakers. The resulting embeddings are then used to condition the synthesis network on a reference speech signal from the desired target speaker.
  • Synthesis Network: It is a Seq2Seq neural network based on Google’s Tacotron 2 that generates a Mel spectrogram from text, conditioned on the speaker embedding. The network is an extended version of Tacotron 2 that supports multiple speakers. The output is conditioned on the target voice by concatenating the speaker embedding with the synthesizer’s encoder output at each time step (see the sketch after this list).
  • Vocoder Network: The system uses WaveNet as a vocoder. It takes the Mel spectrograms generated by the synthesis network as input and autoregressively generates time-domain audio waveforms as output. The synthesizer is trained to capture all of the detail needed for high-quality synthesis of a variety of voices in the form of Mel spectrograms, which allows the vocoder to be built by simply training it on data from many speakers.
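
The conditioning step in the synthesizer is easy to picture in code. Here is a tiny sketch of it; the dimensions are illustrative placeholders, not taken from the paper’s implementation.

```python
import torch

batch, text_steps, enc_dim, spk_dim = 2, 50, 512, 256
encoder_outputs = torch.randn(batch, text_steps, enc_dim)   # text encoder output
speaker_embedding = torch.randn(batch, spk_dim)             # from the speaker encoder

# repeat the speaker embedding along the time axis and concatenate it
# with the encoder output at every timestep
conditioned = torch.cat(
    [encoder_outputs,
     speaker_embedding.unsqueeze(1).expand(-1, text_steps, -1)],
    dim=-1)
print(conditioned.shape)                                     # (batch, text_steps, enc_dim + spk_dim)
```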

As of now, the official implementation of SV2TTS is not available publicly, but there are other implementations built by the community. CorentinJ/Real-Time-Voice-Cloning💥 is the best implementation that is publicly available. The project was developed by Corentin Jemine, who received his Master’s in Data Science from the University of Liège. It comes with a UI that can be used to generate audio.
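
For reference, cloning a voice with this toolbox from a script looks roughly like the snippet below. It follows the structure of the repo’s demo_cli.py; the exact module layout may change between versions, and the checkpoint paths are placeholders for the pretrained models you download separately.

```python
from pathlib import Path
import soundfile as sf

# modules from the CorentinJ/Real-Time-Voice-Cloning repository
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# load the three independently trained components (paths are placeholders)
encoder.load_model(Path("path/to/encoder.pt"))
synthesizer = Synthesizer(Path("path/to/synthesizer/"))
vocoder.load_model(Path("path/to/vocoder.pt"))

# 1. embed a few seconds of reference speech from the target speaker
wav = encoder.preprocess_wav(Path("reference.wav"))
embed = encoder.embed_utterance(wav)

# 2. synthesize a Mel spectrogram for new text, conditioned on that embedding
specs = synthesizer.synthesize_spectrograms(
    ["This voice was cloned from a short sample."], [embed])

# 3. turn the spectrogram into a waveform with the neural vocoder
generated_wav = vocoder.infer_waveform(specs[0])
sf.write("cloned.wav", generated_wav, synthesizer.sample_rate)
```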

Here is a video tutorial of the toolbox:

Technology Misuse 👾

Although the concept of artificial voice is fascinating and has many benefits, we can’t deny that this technology is susceptible to misuse. In the past few years, we have seen how deepfakes have been used to spread misinformation and to create questionable content.

As voice cloning algorithms get better, it is becoming more and more difficult to discern what’s real and what’s not. Using voice cloning, people can be fooled into acting on something fake just because it sounds like it is coming from somebody real. For instance, it becomes easier for scammers to carry out phishing and spoofing attacks; things people never said could be pushed onto the internet in a planned manner for political gain; fake audio clips could be used to create unrest in society; and the list goes on.

According to research, the human brain does not register significant differences between real and artificial voices. In fact, it is harder for our brains to detect fake voices than to detect fake images. So, raising awareness that this technology exists will be the first step toward safeguarding listeners. Algorithms that can differentiate real voices from artificial ones should be developed alongside.

Owing to the ethical questions associated with this technology, many are skeptical about whether humans should even try to create such models. Some researchers have refrained from sharing their models publicly.

Nevertheless, the future appears uncertain. How will humanity use this technology, and what will transpire from it: a dystopia or a utopia?
