Guide: The Rise of Voice Cloning Technology

Be Tech! with Santander
Feb 22, 2024 · 6 min read

By Vicente Motos.

Voice cloning is a field that has experienced significant advances in recent years, especially with Artificial Intelligence 🤖. The ability to realistically generate human voices using deep learning algorithms and models has led to the creation of various applications, from virtual assistants to automated narrators.


However, it has also created risks such as impersonation and phishing through increasingly realistic deepfakes, while a growing number of libraries and tools allow anyone to work on voice cloning easily.

From the moment we get up to the moment we go to bed 😴, a large part of our activity takes place online. We socialize, study, work, shop, and use our devices for all kinds of tasks. However, we sometimes move so fast that we don’t realize the risks we can face in our daily lives.

That is why at Santander, year after year, we are committed to helping customers, employees and society in general have a safer online life and enjoy all the opportunities of the digital world 🌍. Protecting our clients from digital attacks through awareness, training and information is a cornerstone of the work we do to promote healthy habits in the online world. To achieve this goal, we innovate with initiatives that range from a fiction podcast 🎧 to 🎾 Rafa Nadal’s cyber advice.

Convert Text to Speech and Voice to Text

To begin, let’s look at two very simple practical examples that reflect this reality. The first converts text to speech (TTS, or Text-To-Speech) with gTTS, which uses the Google API:

from gtts import gTTS

# Text you want to convert to speech
texto = "Hola, ¿cómo estás? Esto es un ejemplo de gTTS."

# Create a gTTS object (Spanish voice)
tts = gTTS(text=texto, lang='es')

# Save the audio file (gTTS generates MP3 audio)
tts.save("ejemplo.mp3")

And in the second example we will do the opposite, from voice to text, this time capturing our own speech from the 🎤 microphone:

import speech_recognition as sr

# Create a Recognizer object
recognizer = sr.Recognizer()

# Capture audio input from the microphone
with sr.Microphone() as source:
    print("Say something:")
    audio = recognizer.listen(source, timeout=5)

# Try to recognize the text
try:
    texto = recognizer.recognize_google(audio, language="es-ES")
    print(f"Recognized text: {texto}")
except sr.UnknownValueError:
    print("Could not recognize the audio")
except sr.RequestError as e:
    print(f"Error in the request to the Google API: {e}")
Voice Cloning

👀 Look at the ease and the possibilities this offers us: for example, we could combine both scripts with the openai library and build 👷 our own virtual assistant in no time, with all the power of ChatGPT.
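As a rough sketch of that idea (the model name gpt-3.5-turbo, the output file name and the OPENAI_API_KEY environment variable are assumptions on my part, not part of the examples above), the glue code could look something like this:

import speech_recognition as sr
from gtts import gTTS
from openai import OpenAI  # requires openai>=1.0 and an OPENAI_API_KEY environment variable

client = OpenAI()
recognizer = sr.Recognizer()

# 1. Listen to the user through the microphone
with sr.Microphone() as source:
    print("Say something:")
    audio = recognizer.listen(source, timeout=5)
question = recognizer.recognize_google(audio, language="es-ES")

# 2. Ask ChatGPT for an answer (model name is an assumption)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}],
)
answer = response.choices[0].message.content

# 3. Speak the answer back with gTTS
gTTS(text=answer, lang='es').save("respuesta.mp3")

It is only a minimal loop without error handling, but it shows how the three pieces fit together.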

But let’s continue with voice cloning…

Now what we will do is take another audio as input and try to synthesize it with the computer 🖥️. To do this, we will start from the following audio clip, obtained from one of the many websites that currently allow you to generate a personalized clip with the voice of your favorite character:

You recognized who it is, right? 😁

We will see below how easy it is to synthesize that voice with another script: basically, how you can reconstruct it from its mel spectrogram using the Griffin-Lim algorithm.

To introduce both concepts: on the one hand, the mel spectrogram groups frequencies in a way that better reflects how humans perceive differences in pitch; keep in mind that our audible range is fairly limited, from 20 Hz to 20,000 Hz.

We can visualize the spectrogram with this simple script:

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the audio signal
audio_path = "audio.wav"
audio, sr = librosa.load(audio_path, sr=None)

# Compute the mel spectrogram
mel_spectrogram = librosa.feature.melspectrogram(y=audio, sr=sr)

# Display the mel spectrogram (in dB)
librosa.display.specshow(librosa.power_to_db(mel_spectrogram, ref=np.max), sr=sr, y_axis='mel', x_axis='time')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.show()
Mel spectrogram of the voice sample

On the other hand, Griffin-Lim is a vocoder (voice encoder-decoder) that will help us reconstruct 🏗️ the audio signal from its spectrogram. It is based on signal-processing techniques with an iterative method: very roughly, it alternates between the short-time Fourier transform (STFT) and the inverse short-time Fourier transform (ISTFT). At each iteration, the inverse transform of the modified spectrogram is taken and used to update the phase estimate of the signal. This process is repeated until a convergence criterion is reached.
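To make that iteration more concrete, here is a rough sketch of the core Griffin-Lim loop over a linear magnitude spectrogram, written with NumPy and librosa (librosa already ships a ready-made librosa.griffinlim, and librosa.feature.inverse.mel_to_audio, used below, first inverts the mel filterbank and then runs essentially this same loop):

import numpy as np
import librosa

def griffin_lim(magnitude, n_iter=32, n_fft=2048, hop_length=512):
    # Start from a random phase estimate
    angles = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    for _ in range(n_iter):
        # ISTFT: rebuild a time-domain signal from the magnitude and the current phase
        signal = librosa.istft(magnitude * angles, hop_length=hop_length)
        # STFT: re-analyze that signal and keep only its phase
        rebuilt = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(rebuilt))
    return librosa.istft(magnitude * angles, hop_length=hop_length)

# Example usage (hypothetical file name):
# y, sr = librosa.load("audio.wav", sr=None)
# y_hat = griffin_lim(np.abs(librosa.stft(y, n_fft=2048, hop_length=512)))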

Let’s see Griffin-Lim better in practice:

import librosa
import librosa.display
import matplotlib.pyplot as plt
import soundfile as sf

# Load the audio signal
audio_path = "audio.wav"
audio, sr = librosa.load(audio_path, sr=None)

# Compute the mel spectrogram
mel_spectrogram = librosa.feature.melspectrogram(y=audio, sr=sr)

# Synthesize the audio signal back with Griffin-Lim
audio_synthesized = librosa.feature.inverse.mel_to_audio(mel_spectrogram, sr=sr, hop_length=512, n_fft=2048)

# Display the original signal
librosa.display.waveshow(audio, sr=sr)
plt.title('Original Signal')
plt.show()

# Display the signal synthesized with Griffin-Lim
librosa.display.waveshow(audio_synthesized, sr=sr)
plt.title('Signal Synthesized with Griffin-Lim')
plt.show()

# Save the synthesized signal as a WAV audio file
output_path = "audio_sintetizado_griffin_lim.wav"
sf.write(output_path, audio_synthesized, sr)

If you analyze the code a little, you will confirm that the script indeed loads an 🎛️ audio signal, computes its mel spectrogram, synthesizes a new audio signal using Griffin-Lim, and then displays and saves 💾 both the original and the synthesized signal.
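The script itself only plots and saves the signals; if you are working in a notebook, you can also listen to both directly with IPython:

import IPython.display as ipd

# Play the original and the Griffin-Lim reconstruction in a notebook cell
ipd.display(ipd.Audio(audio, rate=sr))
ipd.display(ipd.Audio(audio_synthesized, rate=sr))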

You will see how both signals are very similar:

Original signal (waveform)
Signal synthesized with Griffin-Lim (waveform)

And the result is also quite similar to our ears:

Sounds good, right? Well, that’s just with an algorithm created in 1984! 😳 You can imagine it is nothing compared with current approaches based on cutting-edge generative neural models, such as WaveNet, Tacotron or Deep Voice, or neural vocoders such as HiFi-GAN, MelGAN or WaveGlow.

With them we can train and/or use a model that takes audio and text as input and creates a hyper-realistic synthetic voice that sounds amazingly like the original.

There are many options, but in this article we are going to try ForwardTacotron, a model that combines elements of FastSpeech and Tacotron. ForwardTacotron can generate high-quality, natural speech 💬 in a single pass, using a duration predictor to align the text with the generated mel spectrograms.


To test it, I recommend using a Colab or Kaggle notebook.
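The snippets below assume you are running inside the ForwardTacotron repository with its dependencies installed; a typical notebook setup (the GitHub URL is my assumption of where the project lives, adjust it if the repository has moved) would be:

# Clone the ForwardTacotron code and install its dependencies
!git clone https://github.com/as-ideas/ForwardTacotron.git
%cd ForwardTacotron
!pip install -r requirements.txt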

You can download the pre-trained models with:

# Get pretrained models
!wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/ForwardTacotron/forward_step90k.pt
!wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/ForwardTacotron/fastpitch_step200k.pt

Then we simply load them:

# Load pretrained models
from notebook_utils.synthesize import Synthesizer
import IPython.display as ipd
synth_forward = Synthesizer(tts_path='forward_step90k.pt')
synth_fastpitch = Synthesizer(tts_path='fastpitch_step200k.pt')

And then we just have to try it:

# Synthesize with forward_tacotron and melgan (alpha=1.0)
input_text = 'Hello folks! this is a test for the San Expert community'
wav = synth_forward(input_text, voc_model='melgan', alpha=1)
ipd.Audio(wav, rate=synth_forward.dsp.sample_rate)

# Synthesize faster (alpha=1.2)
input_text = 'Hello folks! this is a test for the San Expert community'
wav = synth_fastpitch(input_text, voc_model='melgan', alpha=1.2)
ipd.Audio(wav, rate=synth_fastpitch.dsp.sample_rate)

# Synthesize with amplified pitch
input_text = 'Hello folks! this is a test for the San Expert community'
pitch_func = lambda x: x * 1.5
wav = synth_fastpitch(input_text, voc_model='melgan', alpha=1, pitch_function=pitch_func)
ipd.Audio(wav, rate=synth_fastpitch.dsp.sample_rate)

As you can see and hear, we can use different models and modify the speed 🏃 and pitch as we wish, and the resulting speech is much more natural. We load the model and, as if by magic 🧙, the text is reproduced, almost perfectly imitating the cloned voice.

This obviously has many ethical and legal implications: a model must always be trained with the consent of the person whose voice it reproduces, and it must always be used responsibly.

But this is just an introduction, barely scratching the surface… If you liked the article and are interested in training voice-cloning models, leave a comment and/or a clap for a second part that promises to be more than interesting 😜.

Before you go:

Clap if you liked it 👏, and comment and share this article so it reaches more of the community 🧞.

Would you like to be part of our technology project? Find our open vacancies worldwide here 👉 https://www.betechwithsantander.com/en/home
