Microsoft’s SpeechX: A Leap in Versatile Generative Speech Synthesis
Generative speech models leveraging audio-text prompts have paved the way for exceptional advancements in zero-shot text-to-speech synthesis. Yet, these models still grapple with diverse challenges, particularly when tasked with transforming input speech across varied audio-text-based speech generation scenarios.
To address these challenges, in a new paper SpeechX: Neural Codec Language Model as a Versatile Speech Transformer, a Microsoft research team presents SpeechX, a versatile, robust, and extensible speech generation model capable of addressing zero-shot TTS and a variety of speech transformation tasks, handling both clean and noisy signals.
The proposed SpeechX is built upon VALL-E, a Transformer-based neural codec language model that generates EnCodec neural codes conditioned on textual and acoustic prompts. More specifically, SpeechX uses an autoregressive (AR) model to output the neural codes of the first quantization layer of EnCodec and…
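The AR stage described above can be pictured as a standard left-to-right decoding loop over codec tokens. The sketch below is a minimal, hypothetical illustration, not the actual SpeechX implementation: the `next_token` callable stands in for the real Transformer decoder, and the token encodings for the text and acoustic prompts are assumptions for the example.

```python
# Hypothetical sketch of autoregressive first-layer codec-token generation,
# loosely following the VALL-E-style AR stage described above.
# The "model" here is a stand-in scoring function, NOT the real SpeechX network.

from typing import Callable, List

def generate_ar_codes(
    text_prompt: List[int],          # text/phoneme token ids (assumed encoding)
    acoustic_prompt: List[int],      # first-layer EnCodec codes of the prompt speech
    next_token: Callable[[List[int]], int],  # stand-in for the Transformer decoder
    max_len: int = 50,
    eos: int = -1,
) -> List[int]:
    """Greedy AR decoding: each new first-layer code is conditioned on the
    text prompt, the acoustic prompt, and all previously emitted codes."""
    context = list(text_prompt) + list(acoustic_prompt)
    out: List[int] = []
    for _ in range(max_len):
        tok = next_token(context + out)
        if tok == eos:
            break
        out.append(tok)
    return out

# Toy stand-in "model": repeats the last code seen, then stops after a few steps.
def toy_model(seq: List[int]) -> int:
    return -1 if len(seq) > 6 else seq[-1]

codes = generate_ar_codes([10, 11], [200, 201], toy_model)
# → [201, 201, 201]
```

In the real system, the codes emitted by this AR stage would be decoded back to a waveform by the EnCodec decoder (after the remaining quantization layers are filled in).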