Voice recognition VS audio deepfakes

How good is voice recognition at detecting cloned voices?

Jeremy K
The Pythoneers
10 min read · May 8, 2024


Image by author

Voice recognition is about authenticating users and checking whether two voices belong to the same person. It keeps getting better at picking up what makes a voice unique, thereby becoming more robust at recognizing a voice among thousands or millions of others.

At the same time, AI has gotten really good at making audio sound like someone else — a bit like magic. Models like coqui/XTTS-v2 can clone a voice with just six seconds of recording, turning anyone into a vocal chameleon faster than you can say “audio doppelgänger”. Although this caters to various use cases, such as building effective voice assistants or localizing generated content into multiple languages, it is also a bit scary, because the same capability falls into the category of audio deepfakes, blurring the line between what’s real and what’s not.

So what is this story about?

In a nutshell, we want to assess the robustness of state-of-the-art voice recognition models and figure out if voice recognition can spot the difference between real and cloned voices.

Let’s dive in.

Voice recognition with TitaNet-L

There are plenty of open-source models capable of extracting embeddings from a voice. We opted for NVIDIA’s TitaNet large model because it is not only small and fast but also performs very well.

In this section, we will delve into implementing and assessing voice recognition across multiple languages using TitaNet-L.

The Model

TitaNet-L (large) is a state-of-the-art neural network architecture specifically designed for voice recognition tasks. Developed by NVIDIA as part of their NeMo (NVIDIA Neural Modules) toolkit, TitaNet-L stands out for its remarkable efficiency and accuracy in processing voice data.

This model is trained on vast amounts of speech data, enabling it to extract intricate features from voice inputs across various languages and accents. Its large size allows for comprehensive coverage of acoustic features, ensuring robust performance even in challenging audio environments.

TitaNet’s architecture. Source: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_recognition/models.html

One of the main advantages of TitaNet-L is its small footprint, making it suitable for deployment on resource-constrained devices and opening up possibilities for real-time voice recognition applications. Moreover, the model is released under an Apache 2.0 license, granting users the freedom to modify, distribute, and sub-license it as per their requirements.

To learn more about TitaNet-L and its specifications, you can visit its official page in NVIDIA’s NGC model catalog.

Implementing voice recognition

Utilizing TitaNet-L for voice recognition is a straightforward process. Begin by computing the embeddings of two voice samples, then proceed to calculate their cosine similarity. The higher the similarity score, the more likely it is that the voices originate from the same user. Simple, right?

Voice recognition workflow

Let’s break down the process into easy-to-follow steps:

  1. Compute voice embeddings: use the model to extract embeddings from the voice samples you want to compare. These embeddings capture the unique characteristics of each voice, enabling comparison and analysis.
  2. Calculate the cosine similarity: once you have the embeddings for both voices, calculate their cosine similarity. This metric measures the similarity between two vectors, with values closer to 1 indicating greater similarity.
  3. Set a threshold: determine a suitable threshold for cosine similarity to classify voices as either the same user or different users. This threshold can vary depending on the specific application and desired level of accuracy.

By following these steps, you can implement voice recognition using TitaNet-L and unlock its powerful capabilities with ease.

One crucial thing though: the model expects WAV files with a sampling rate of 16kHz. In the following code snippet, we assume this is the case.

Code snippet for voice similarity
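
For readers who want to try it, here is a minimal sketch of this comparison with NeMo, assuming the toolkit is installed (for example via pip install "nemo_toolkit[asr]") and that both inputs are 16kHz mono WAV files; the file names and the 0.80 threshold are purely illustrative.

```python
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Load the pretrained TitaNet-L speaker verification model (downloaded on first use)
model = EncDecSpeakerLabelModel.from_pretrained("titanet_large")
model.eval()

def voice_similarity(wav_a: str, wav_b: str) -> float:
    """Cosine similarity between the TitaNet-L embeddings of two 16kHz WAV files."""
    emb_a = model.get_embedding(wav_a).squeeze()
    emb_b = model.get_embedding(wav_b).squeeze()
    return torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0).item()

if __name__ == "__main__":
    # Illustrative file names; threshold values are discussed in the evaluation below
    score = voice_similarity("sample_a.wav", "sample_b.wav")
    print(f"Similarity: {score:.3f} -> {'same speaker' if score > 0.80 else 'different speakers'}")
```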

Evaluation methodology

To assess TitaNet’s performance for voice recognition and determine the optimal threshold for distinguishing between voices of the same or different users, we employed the following methodology:

Dataset selection

The Common Voice 13 dataset emerged as a promising choice due to its extensive collection of audio recordings spanning 108 languages. To access it, users must accept its terms and authenticate with a Hugging Face token to download the data.

Language selection

For our evaluation, we opted to analyze three distinct languages: French, Italian, and Hindi. However, users are encouraged to select any other language should they wish to replicate the evaluation process.

Evaluation

For each language, we selected a maximum of 2000 audio files to streamline processing time and resource consumption. The evaluation proceeded as follows:

  1. Data preprocessing: we preprocessed the dataset to retain only audio recordings longer than 3 seconds but shorter than 30 seconds. This duration range was chosen to ensure that files are sufficiently long to capture voice characteristics without becoming overly lengthy and potentially degrading performance.
  2. Embedding extraction: we extracted embeddings from all selected files and organized them in a dictionary, associating each speaker ID with their respective embeddings.
  3. Cosine similarity calculation: we computed the cosine similarity between embeddings for all pairs of audio files. For comparisons involving the same speaker (identified by the speaker ID), we stored the similarity scores in an array labeled “positives.” Conversely, for comparisons between different speakers, the scores were stored in another array labeled “negatives.”
  4. Graphical representation: we generated two graphs to visualize the distribution of similarity scores for positives and negatives, as well as the performance metrics (Accuracy, Recall, Precision, and F1-score) across different threshold values. A condensed code sketch of steps 3 and 4 follows this list.
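
The snippet below is a rough illustration of steps 3 and 4 rather than the exact evaluation code: embeddings_by_speaker is a hypothetical dictionary mapping each speaker ID to a list of embedding vectors (the output of step 2), and scikit-learn is used for the metrics.

```python
import itertools

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics.pairwise import cosine_similarity

def split_scores(embeddings_by_speaker):
    """Step 3: cosine similarity for every pair of files, split into positives and negatives."""
    positives, negatives = [], []
    # Flatten to (speaker_id, embedding) pairs; embeddings are assumed to be 1-D NumPy arrays
    items = [(spk, emb) for spk, embs in embeddings_by_speaker.items() for emb in embs]
    for (spk_a, emb_a), (spk_b, emb_b) in itertools.combinations(items, 2):
        score = cosine_similarity(emb_a.reshape(1, -1), emb_b.reshape(1, -1))[0, 0]
        (positives if spk_a == spk_b else negatives).append(score)
    return np.array(positives), np.array(negatives)

def metrics_vs_threshold(positives, negatives, thresholds=np.arange(0.0, 1.0, 0.01)):
    """Step 4: accuracy, precision, recall and F1-score for each candidate threshold."""
    y_true = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
    scores = np.concatenate([positives, negatives])
    rows = []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
        rows.append((t,
                     accuracy_score(y_true, y_pred),
                     precision_score(y_true, y_pred, zero_division=0),
                     recall_score(y_true, y_pred),
                     f1_score(y_true, y_pred, zero_division=0)))
    return rows  # one (threshold, accuracy, precision, recall, F1) tuple per threshold
```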

Given the dataset’s diverse range of speakers, we anticipated a far larger number of negatives than positives. This is reflected in the graphical representations below, particularly in the low precision observed when the threshold approaches zero: at that point nearly every pair is classified as a match, so the numerous negative pairs become false positives and swamp the relatively few true positives, driving precision close to 0.

This rigorous evaluation framework provides insights into TitaNet’s performance and helps determine an optimal threshold for effective voice recognition across multiple languages.

Outcome of the evaluation

The outcome of our evaluation is the following:

French

  • Number of positives found: 751
  • Number of negatives found: 1,843,409
  • Number of audio files after filtering on length: 1,921
French — Metrics VS threshold and Distribution of positives and negatives by score range

The best F1-score (80%) is reached when the threshold is at 82%. The majority of positives fall between 80% and 94% similarity.

Italian

  • Number of positives found: 1,457
  • Number of negatives found: 1,887,139
  • Number of audio files after filtering on length: 1,944
Italian — Metrics VS threshold and Distribution of positives and negatives by score range

The best F1-score (90%) is reached when the threshold is at 82%. The majority of positives fall between 80% and 94% similarity.

Hindi

  • Number of positives found: 15,546
  • Number of negatives found: 1,683,700
  • Number of audio files after filtering on length: 1,844
Hindi — Metrics VS threshold and Distribution of positives and negatives by score range

The best F1-score (89.4%) is reached when the threshold is at 79%. The majority of positives fall between 75% and 92% similarity.

From the evaluation results across the three languages, several observations stand out:

  • Optimal threshold for maximum F1-score: the evaluation shows that the highest F1-score is achieved when the threshold is approximately 80% for all languages. This indicates that setting the threshold at this level yields the best balance between precision and recall in voice recognition tasks.
  • Limited number of negatives with high scores: another noteworthy finding is the presence of a small number of negatives with scores exceeding 75%. This suggests that while the model’s overall performance is commendable, there remain a few instances where accurately distinguishing between voices of the same speaker and different speakers is a challenge.

Performing a thorough evaluation is paramount to setting an effective threshold that strikes the right balance between correctly identifying true positives and minimizing false positives and false negatives. Moreover, fine-tuning the model, particularly on speech domains different from those used for training, offers an interesting opportunity to further enhance its performance.

Voice cloning with Coqui-XTTS-v2

Coqui-XTTS-v2 is a text-to-speech (TTS) model that takes voice generation a step further by enabling the seamless cloning of voices into various languages using just a quick 6-second audio clip, making voice cloning more accessible and efficient than ever.

The model comes with a bunch of impressive features:

  • Language support: it supports an extensive range of 17 languages, including English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and Hindi.
  • Voice cloning: with just a 6-second audio clip, users can effortlessly clone voices, transcending linguistic barriers.
  • Emotion and style transfer: the model allows for the transfer of emotions and styles during voice cloning, adding depth and nuance to synthesized speech.
  • Cross-Language cloning: users can seamlessly clone voices across different languages, enhancing versatility and adaptability.
  • High sampling rate: with a sampling rate of 24kHz, the model ensures high-fidelity audio output.

For more information about XTTS-v2, please refer to https://docs.coqui.ai/en/latest/index.html.

You can also try the model with this space: https://huggingface.co/spaces/coqui/xtts
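
As a quick taste of the library, here is a minimal cloning sketch, assuming the Coqui TTS package is installed (pip install TTS) and that reference.wav is a short clip (around 6 seconds) of the voice to clone; the sentence and file names are placeholders.

```python
from TTS.api import TTS

# Load the multilingual XTTS-v2 checkpoint (downloaded on first use)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice from the reference clip and synthesize a French sentence
tts.tts_to_file(
    text="Bonjour, ceci est un test de clonage de voix.",
    speaker_wav="reference.wav",  # ~6-second clip of the target voice
    language="fr",
    file_path="cloned_fr.wav",    # 24kHz output
)
```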

Generation of a cloned voices dataset

To generate a dataset of cloned voices, we implemented the following methodology:

  1. Using a Large Language Model (LLM) for text generation: we leveraged ChatGPT to generate 50 short sentences in French, Italian, and Hindi. These sentences served as the basis for synthesizing voice recordings in multiple languages.
  2. Selection of speaker IDs from Common Voice dataset: within the Common Voice dataset, we carefully selected 10 speaker IDs per language, ensuring that each participant had a minimum of 6 recordings (one reference and 5 others for comparison). This meticulous selection process allowed us to create a diverse pool of speakers for voice cloning, encompassing both male and female voices.
  3. Voice cloning with Coqui-XTTS-v2: for each selected speaker, we utilized Coqui-XTTS-v2 to generate 5 additional voice recordings. These recordings were synthesized using the sentences previously generated by ChatGPT, resulting in a dataset of cloned voices across multiple languages.
  4. Similarity computation: for each speaker, we computed the similarity between the probe voice (original recording) and the other recordings (both real and generated). This comparison allowed us to evaluate the extent to which the cloned voices resembled the original recordings.
  5. Graphical representation: we visualized the similarity scores on a graph, providing insights into the effectiveness of voice cloning across different languages and speakers. A code sketch of this loop follows the workflow figure below.
Workflow to generate the cloned voices dataset and compare it with real voices.
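
To make the workflow concrete, here is a rough sketch of the cloning and scoring loop for one speaker. It reuses the tts object and the voice_similarity helper from the earlier sketches; the sentence list and file names are placeholders, and resampling XTTS-v2’s 24kHz output down to the 16kHz expected by TitaNet-L is an assumption on my part.

```python
import librosa
import soundfile as sf

# Placeholder subset of the 50 ChatGPT-generated sentences
SENTENCES_FR = ["Bonjour, comment allez-vous ?", "Il fait beau aujourd'hui."]

def clone_and_score(speaker_id, reference_wav, real_wavs, language="fr"):
    """Clone one speaker with XTTS-v2 and score real and cloned clips against the reference."""
    results = {"real": [], "cloned": []}
    # Similarity between the probe (reference) and the speaker's real recordings
    for wav in real_wavs:
        results["real"].append(voice_similarity(reference_wav, wav))
    # Similarity between the probe and freshly cloned recordings
    for i, sentence in enumerate(SENTENCES_FR):
        out_24k = f"{speaker_id}_clone_{i}_24k.wav"
        out_16k = f"{speaker_id}_clone_{i}_16k.wav"
        tts.tts_to_file(text=sentence, speaker_wav=reference_wav,
                        language=language, file_path=out_24k)
        # Resample to 16kHz so the clip matches what TitaNet-L expects
        audio, _ = librosa.load(out_24k, sr=16000)
        sf.write(out_16k, audio, 16000)
        results["cloned"].append(voice_similarity(reference_wav, out_16k))
    return results
```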

Outcome

For each of the three languages, we prepared the same graphical representations as for the voice recognition evaluation. The following results were obtained:

  • French
French — Distribution of similarity scores for real and cloned voice / Metrics against thresholds

The best F1-score (81.1%) is reached when the threshold is at 85%. 18% of the generated cloned voices score above 85% similarity.

  • Italian
Italian — Distribution of similarity scores for real and cloned voice / Metrics against thresholds

The best F1-score (90.3%) is reached when the threshold is at 87%. 10% of the generated cloned voices score above 87% similarity.

  • Hindi
Hindi — Distribution of similarity scores for real and cloned voice / Metrics against thresholds

The best F1-score (89.8%) is reached when the threshold is at 87%. 10% of the generated cloned voices score above 87% similarity.

To prevent cloned voices from being authenticated as real ones while keeping the F1-score maximized, thresholds must be set noticeably higher than those found in the voice recognition evaluation, roughly 5 points above them. Furthermore, this evaluation highlights the striking similarity between cloned voices and real ones, which presents both potential advantages and real risks if exploited for nefarious purposes.

Conclusion

This article delved into voice recognition and voice cloning, exploring the evolving landscape of AI technologies in this domain. Through meticulous evaluation and analysis across multiple languages, we uncovered insights into the performance of voice recognition models like TitaNet-L and the implications of voice cloning using Coqui-XTTS-v2.

Key findings highlighted the importance of fine-tuning thresholds to strike a balance between precision and recall in voice recognition tasks. Moreover, the evaluation revealed the remarkable resemblance between cloned voices and authentic ones, underlining both the potential benefits and risks associated with this technology.

Moving forward, it is imperative for researchers, developers, and policymakers to continue advancing these technologies responsibly, considering ethical implications and safeguarding against potential misuse. By responsibly leveraging the capabilities of voice recognition and cloning, we can unlock a myriad of innovative applications, all while safeguarding privacy, bolstering security measures, and promoting fairness for all stakeholders.

I hope you enjoyed this story.

PS: source code for the evaluation of voice recognition can be provided upon request.
