Microsoft’s VALL-E 2: First Time Human Parity in Zero-Shot Text-to-Speech Achieved

By Synced · Published in SyncedReview · 3 min read · Jun 12, 2024

Over the past decade, speech synthesis has seen significant breakthroughs, driven by advances in neural networks and end-to-end modeling. Last year, Microsoft introduced VALL-E, a neural codec language model capable of synthesizing high-quality personalized speech from just a 3-second recording of an unseen speaker. The model notably outperformed the state-of-the-art zero-shot text-to-speech (TTS) systems of the time.

Building on this progress, in the new paper "VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers," a Microsoft research team presents VALL-E 2, the latest advancement in neural codec language models. The work marks a milestone in zero-shot TTS synthesis by achieving human parity for the first time.

VALL-E 2, an evolution of its predecessor, employs a neural codec language modeling method for speech synthesis and introduces two significant enhancements: repetition-aware sampling and grouped code modeling.
