How to Create a Realistic Human Voice Clone – Unparalleled Voice Cloning

MAKS
3 min read · Nov 23, 2022

Create a high-quality voice clone from a real human voice and use it in all your business and commercial applications.

Photo by Soundtrap on Unsplash

Peregrine™: A Generative Text to Speech Model by Play.ht

Unlike most standard speech synthesis machine learning models and Text to Speech APIs that are designed to trade quality and expressiveness for computing performance, we designed Peregrine from the ground up to generate the most expressive speech and imitate a human voice vividly.

Peregrine employs the same concept as large generative models such as DALL-E and GPT-2.

As a result, the ultra-realistic voices generated by Peregrine pick up the intricacies of human speech, be it emotion, tone, or even laughter, all learned in a self-supervised manner.

A Generative Audio approach with Peregrine

Built on this generative approach, our model, Peregrine, can not only speak in thousands of voices but has also learned the intricacies of human speech, such as emotion, tone, and even laughter, all in a self-supervised manner.

Beyond the large improvement in naturalness, a voice can be cloned from less than 30 seconds of recorded audio from a single speaker, with no transcript required, which takes the multi-speaker, multi-style capability of TTS-based applications to another level of performance.

And because it is a large language model, it can compress hundreds of thousands of voices into a few GB of knowledge, from which it can then generate a practically unlimited number of voice variations, emotions, and styles.

We believe it is a stepping stone in the field of AI Voice Generation and Voice Cloning.

The challenges of creating human-like Text to Speech

Text to speech (TTS) synthesizers have advanced greatly since the introduction of neural networks.

As a result, TTS systems are now able to synthesize high-quality speech in multiple languages, speakers, and styles.

However, despite these achievements, current TTS systems usually demand high-quality, studio-recorded, annotated audio from different speakers with different styles and emotions in order to meet the needs of commercial applications.

Furthermore, adding a new speaker to the model usually requires at least 30 minutes of clean, studio-recorded data with phonetic annotations.

And yet the synthesized speech still sounds mostly unnatural, because its prosody (tempo, rhythm, power) lacks expressiveness.

Our approach moves beyond the current technology by introducing a novel TTS method that synthesizes speech with a higher degree of realism, making it virtually indistinguishable from natural speech as spoken by humans.

To achieve this, we rely not on high-quality annotated data but on the audio itself, as it is naturally uttered.
