Speak a Foreign Language in Your Own Voice? Microsoft’s VALL-E X Enables Zero-Shot Cross-Lingual Speech Synthesis

Published in SyncedReview, Mar 13, 2023

…robotic. In recent years, deep neural networks have dramatically transformed text-to-speech (TTS), enabling conditioning on factors such as stress and intonation to achieve higher-quality and much more humanlike results. Contemporary TTS models, however, still perform best when dealing with a specific speaker in a specific language. Cross-lingual speech synthesis, which aims to transfer the characteristics of a user’s voice from one language to another, has remained relatively underexplored. That just changed.

In the new paper Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling, a Microsoft research team presents VALL-E X, a simple yet effective cross-lingual neural codec language model that inherits strong in-context learning capabilities from the VALL-E TTS model and demonstrates high-quality zero-shot cross-lingual speech synthesis performance.

The team summarizes their main contributions as follows:

We develop a cross-lingual neural codec language model VALL-E X with large multilingual multi-speaker…

Written by Synced