Published in The Research Nest

Voice Cloning Using Deep Learning

Latest Technology Trends And Threats

1. WaveNet 🌊

WaveNet was introduced in 2016 by Google's DeepMind in the paper WaveNet: A Generative Model for Raw Audio. Earlier text-to-speech (TTS) systems were largely based on concatenative TTS: first, a very large database of short speech fragments is recorded from a single speaker, and these fragments are then recombined to form complete utterances. The downside of this approach is that you need an entirely new database of audio samples to make even minor tweaks to the voice, such as altering the emphasis or emotion. The audio generated this way also tends to sound unnatural, glitchy, and robotic. WaveNet takes a different route: it is an autoregressive neural network that generates raw audio one sample at a time, using stacks of dilated causal convolutions to model thousands of samples of past context.
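The core trick can be sketched in a few lines: a causal convolution only looks backwards in time, and doubling the dilation at each layer makes the receptive field grow exponentially with depth. The function names and layer sizes below are illustrative, not DeepMind's implementation.

```python
# Minimal sketch of WaveNet's core idea: dilated causal convolutions
# whose dilation doubles at each layer. Only the causality and the
# receptive-field arithmetic are the point; weights are placeholders.

def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: output at t sees only x[t], x[t-d], ..."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - i * dilation
            if idx >= 0:          # causal: never look into the future
                acc += w * x[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Samples of past context visible to one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# One WaveNet block: dilations 1, 2, 4, ..., 512 with kernel size 2
dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(receptive_field(2, dilations))  # 1024 samples of context
```

Ten layers with kernel size 2 already see 1024 past samples, which is why the stacked-dilation design scales to raw audio where a plain convolution could not.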

2. Deep Voice 🗣

Deep Voice is a TTS system developed by researchers at Baidu. Its first version, Deep Voice 1, was inspired by traditional text-to-speech pipelines. It adopts the same structure but replaces each component with a neural network and uses simpler features. First, it converts the text to phonemes, and then an audio synthesis model converts the linguistic features into speech.
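The pipeline shape can be sketched as a chain of small functions. In the real system each stage is a trained neural network (and Deep Voice 1 also predicts durations and fundamental frequency); the tiny lookup table and dummy outputs here are stand-ins, not Baidu's models.

```python
# Hedged sketch of the Deep Voice 1 pipeline shape:
# text -> phonemes -> durations -> synthesized audio.
# Every "model" below is a placeholder; only the wiring is the point.

PHONEME_TABLE = {"hello": ["HH", "AH", "L", "OW"],
                 "world": ["W", "ER", "L", "D"]}

def grapheme_to_phoneme(text):
    # The real system uses a seq2seq G2P network; a dictionary stands in.
    return [p for word in text.lower().split()
            for p in PHONEME_TABLE.get(word, [])]

def predict_durations(phonemes):
    # Duration model: how many frames each phoneme lasts (dummy value).
    return [5 for _ in phonemes]

def synthesize(phonemes, durations):
    # Audio synthesis model (WaveNet-like in Deep Voice 1); emits a
    # placeholder waveform of 80 samples per frame.
    return [0.0] * sum(durations) * 80

phonemes = grapheme_to_phoneme("hello world")
audio = synthesize(phonemes, predict_durations(phonemes))
print(len(phonemes), len(audio))  # 8 3200
```

Keeping the classical stages but learning each one is what made the pipeline both familiar to TTS engineers and end-to-end trainable piece by piece.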

Deep Voice 3 Architecture
  • Encoder: It is a convolutional neural network that converts the textual features to an internal learned representation.
  • Decoder: The decoder converts the learned representation coming from the encoder into a low-dimensional audio representation (Mel spectrograms). It consists of causal convolutions with multi-hop convolutional attention, and it generates its output in an autoregressive manner.
  • Converter: A fully-convolutional post-processing network that predicts the final vocoder parameters from the decoder's hidden states. It is non-causal and can therefore depend on future context information.
  • Speaker Adaptation: The objective of this approach is to fine-tune a trained multi-speaker model for an unseen speaker using a few audio-text pairs. Fine-tuning can be applied to either the speaker embedding or the whole model.
  • Speaker Encoding: The speaker encoding method directly estimates the speaker embedding from audio samples of an unseen speaker. A model like this does not require any fine-tuning during voice cloning. Thus, the same model can be used for all unseen speakers.
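The encoder → decoder → converter flow above can be sketched at the shape level. The networks below are placeholders (no attention or convolutions are actually computed); only the wiring, and the causal vs. non-causal distinction, follow the description.

```python
# Shape-level sketch of the Deep Voice 3 data flow:
# encoder -> (autoregressive) decoder -> converter.
# Dimensions (hidden=4, mel bins=80) are illustrative.

def encoder(text_features):
    # Fully-convolutional in the real model; one learned vector per character.
    return [[0.0] * 4 for _ in text_features]          # (chars, hidden)

def decoder(encoder_hidden, n_frames):
    # Causal in the real model: each mel frame may depend only on
    # frames already generated. Here we just emit dummy frames/states.
    mels, states = [], []
    for _ in range(n_frames):
        states.append([0.0] * 4)    # decoder hidden state for this step
        mels.append([0.0] * 80)     # one 80-bin mel frame
    return mels, states

def converter(decoder_states):
    # Non-causal post-net: it receives the WHOLE state sequence at once,
    # so each output can depend on future context, unlike the decoder.
    return [[0.0] * 80 for _ in decoder_states]

hidden = encoder("hello")
mels, states = decoder(hidden, n_frames=10)
params = converter(states)
print(len(mels), len(params))  # 10 10
```

The design choice worth noticing is that only the converter is non-causal: the decoder must stay causal to generate autoregressively, while the post-processing network can afford to look both ways.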

3. SV2TTS 🚀

The paper Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) describes a neural network-based text-to-speech (TTS) system that can generate speech audio in the voices of different speakers. What makes this system stand out is that it works in the zero-shot learning setting 👌🏻.

“This method aims to solve a task without receiving any example of that task at the training phase”

So, it can generate new speech in the voice of a previously unseen speaker, using only a few seconds of untranscribed reference audio, without updating any model parameters.

  • Speaker Encoder Network: The speaker encoder network’s job is to take audio of a given speaker as input, and encode the characteristics of their voice into a low dimensional vector embedding. It does not care about what the speaker is saying, all it cares about is how a speaker is saying something. The network is trained separately on the task of speaker verification, using a dataset of noisy speech from thousands of speakers. The encodings are then used to condition the synthesis network on a reference speech signal from the desired target speaker.
  • Synthesis Network: It is a Seq2Seq neural network based on Google's Tacotron 2 that generates a Mel spectrogram from the text, conditioned on the speaker embedding. The network is an extended version of Tacotron 2 that supports multiple speakers. The output is conditioned on the target speaker's voice by concatenating their embedding with the synthesizer encoder output at each time step.
  • Vocoder Network: The system uses WaveNet as a vocoder. It takes the Mel spectrograms generated by the synthesis network as input and autoregressively generates time-domain audio waveforms as output. The synthesizer is trained to capture all the detail needed for high-quality synthesis of a variety of voices in the form of Mel spectrograms, which allows the vocoder to be built by simply training it on data from many speakers.
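The wiring of the three networks above can be sketched as follows. The key detail is the conditioning step: a fixed speaker embedding is concatenated onto the synthesizer's encoder output at every time step, which is what lets a single trained model speak in an unseen voice. All networks are placeholders; only the data flow follows the paper (the 256-dimensional embedding size matches the d-vectors used in SV2TTS, while the other sizes are illustrative).

```python
# Wiring sketch of the SV2TTS pipeline:
# speaker encoder -> synthesizer (conditioned per step) -> vocoder.

EMBED_DIM = 256   # d-vector size used in the paper

def speaker_encoder(reference_audio):
    # Trained separately on speaker verification; maps a few seconds of
    # audio to a fixed-size embedding, independent of what is said.
    return [0.1] * EMBED_DIM

def synthesizer(text, speaker_embedding):
    # Tacotron 2-style: encode text, then concatenate the speaker
    # embedding with each encoder output before decoding to mels.
    encoder_outputs = [[0.0] * 512 for _ in text]
    conditioned = [step + speaker_embedding for step in encoder_outputs]
    n_frames = 4 * len(conditioned)                 # dummy length model
    return [[0.0] * 80 for _ in range(n_frames)]    # mel spectrogram

def vocoder(mels):
    # WaveNet stage: mel frames -> waveform (dummy hop of 200 samples).
    return [0.0] * len(mels) * 200

embedding = speaker_encoder("a few seconds of reference audio")
mels = synthesizer("hi", embedding)
wav = vocoder(mels)
print(len(embedding), len(mels), len(wav))  # 256 8 1600
```

Because the speaker encoder is frozen at cloning time, swapping in a new voice is just computing a new embedding; no gradient step touches the synthesizer or vocoder, which is exactly the zero-shot property the paper highlights.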

Technology Misuse 👾

Although the concept of artificial voice is fascinating and has many benefits, we can't deny that this technology is susceptible to misuse. In the past few years, we have seen deepfakes used to spread misinformation and to create questionable content.


