Microsoft’s SpeechX: A Leap in Versatile Generative Speech Synthesis
Generative speech models leveraging audio-text prompts have paved the way for exceptional advancements in zero-shot text-to-speech synthesis. Yet, these models still grapple with diverse challenges, particularly when tasked with transforming input speech across varied audio-text-based speech generation scenarios.
To address these challenges, in a new paper SpeechX: Neural Codec Language Model as a Versatile Speech Transformer, a Microsoft research team presents SpeechX, a versatile, robust, and extensible speech generation model capable of addressing zero-shot TTS and a variety of speech transformation tasks, handling both clean and noisy signals.
The proposed SpeechX is built upon VALL-E, a Transformer-based neural codec language model that generates EnCodec neural codes conditioned on textual and acoustic prompts. More specifically, SpeechX uses an autoregressive (AR) model to output the neural codes of the first quantization layer of EnCodec and…
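The AR stage described above can be pictured as a standard left-to-right decoding loop over codec tokens. The sketch below is a minimal, hypothetical illustration, not the actual SpeechX implementation: the `next_token` callable stands in for the real Transformer decoder, and the token encodings for the text and acoustic prompts are assumptions for the example.

```python
# Hypothetical sketch of autoregressive first-layer codec-token generation,
# loosely following the VALL-E-style AR stage described above.
# The "model" here is a stand-in scoring function, NOT the real SpeechX network.

from typing import Callable, List

def generate_ar_codes(
    text_prompt: List[int],          # text/phoneme token ids (assumed encoding)
    acoustic_prompt: List[int],      # first-layer EnCodec codes of the prompt speech
    next_token: Callable[[List[int]], int],  # stand-in for the Transformer decoder
    max_len: int = 50,
    eos: int = -1,
) -> List[int]:
    """Greedy AR decoding: each new first-layer code is conditioned on the
    text prompt, the acoustic prompt, and all previously emitted codes."""
    context = list(text_prompt) + list(acoustic_prompt)
    out: List[int] = []
    for _ in range(max_len):
        tok = next_token(context + out)
        if tok == eos:
            break
        out.append(tok)
    return out

# Toy stand-in "model": repeats the last code seen, then stops after a few steps.
def toy_model(seq: List[int]) -> int:
    return -1 if len(seq) > 6 else seq[-1]

codes = generate_ar_codes([10, 11], [200, 201], toy_model)
# → [201, 201, 201]
```

In the real system, the codes emitted by this AR stage would be decoded back to a waveform by the EnCodec decoder (after the remaining quantization layers are filled in).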