Speak a Foreign Language in Your Own Voice? Microsoft’s VALL-E X Enables Zero-Shot Cross-Lingual Speech Synthesis

Synced · Published in SyncedReview
4 min read · Mar 13, 2023

…robotic. In recent years, deep neural networks have dramatically transformed text-to-speech (TTS), enabling conditioning on factors such as stress and intonation to achieve higher-quality and much more humanlike results. Contemporary TTS models, however, still perform best when dealing with a specific speaker in a specific language. Cross-lingual speech synthesis, which aims to transfer the characteristics of a user’s voice from one language to another, has remained relatively underexplored. That just changed.

In the new paper Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling, a Microsoft research team presents VALL-E X, a simple yet effective cross-lingual neural codec language model that inherits strong in-context learning capabilities from the VALL-E TTS model and demonstrates high-quality zero-shot cross-lingual speech synthesis performance.

The team summarizes their main contributions as follows:

  1. We develop a cross-lingual neural codec language model VALL-E X with large multilingual multi-speaker…

AI Technology & Industry Review — syncedreview.com | Newsletter: http://bit.ly/2IYL6Y2 | Share My Research http://bit.ly/2TrUPMI | Twitter: @Synced_Global