Microsoft’s SpeechX: A Leap in Versatile Generative Speech Synthesis

Synced
Published in SyncedReview
3 min read · Aug 23, 2023

Generative speech models that take audio-text prompts have enabled major advances in zero-shot text-to-speech (TTS) synthesis. Yet these models still struggle with the wider range of audio-text-based speech generation scenarios, particularly those that require transforming input speech.

To address these challenges, in the new paper "SpeechX: Neural Codec Language Model as a Versatile Speech Transformer", a Microsoft research team presents SpeechX, a versatile, robust, and extensible speech generation model capable of zero-shot TTS and various speech transformation tasks, handling both clean and noisy signals.

The proposed SpeechX builds on VALL-E, a Transformer-based neural codec language model that generates EnCodec neural codes conditioned on textual and acoustic prompts. More specifically, SpeechX uses an autoregressive (AR) model to output the neural codes of the first quantization layer of EnCodec and…
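To make the autoregressive step concrete, here is a minimal toy sketch of generating first-layer codes one at a time, conditioned on text tokens and an acoustic prompt. It is not SpeechX's or VALL-E's code: `toy_logits`, `generate_first_layer_codes`, the codebook size, and the end-of-sequence id are all hypothetical stand-ins, and greedy decoding is used only to keep the example deterministic.

```python
from typing import Callable, List, Sequence

CODEBOOK_SIZE = 8      # toy value (assumption); real codec codebooks are far larger
EOS = CODEBOOK_SIZE    # hypothetical end-of-sequence id, one past the last code


def toy_logits(text: Sequence[int], prompt: Sequence[int],
               generated: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for a trained Transformer.

    Returns one score per codebook entry plus one score for EOS.
    """
    context = list(text) + list(prompt) + list(generated)
    target = (sum(context) + len(generated)) % CODEBOOK_SIZE
    logits = [1.0 if code == target else 0.0 for code in range(CODEBOOK_SIZE)]
    # Favor EOS once as many codes as text tokens have been produced.
    logits.append(2.0 if len(generated) >= len(text) else -1.0)
    return logits


def generate_first_layer_codes(
    text: Sequence[int],
    prompt: Sequence[int],
    logits_fn: Callable[..., List[float]] = toy_logits,
    max_len: int = 50,
) -> List[int]:
    """Greedy autoregressive loop: each new code depends on all previous ones."""
    generated: List[int] = []
    for _ in range(max_len):
        logits = logits_fn(text, prompt, generated)
        next_code = max(range(len(logits)), key=logits.__getitem__)
        if next_code == EOS:
            break
        generated.append(next_code)
    return generated


codes = generate_first_layer_codes(text=[1, 2, 3], prompt=[4, 5])
```

The point of the sketch is only the control flow: the text and the acoustic prompt stay fixed as conditioning, while the sequence of generated codes grows by one entry per step and is fed back in at the next step.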


AI Technology & Industry Review — syncedreview.com | Newsletter: http://bit.ly/2IYL6Y2 | Share My Research http://bit.ly/2TrUPMI | Twitter: @Synced_Global