Large-scale generative models like GPT and DALL-E have transformed natural language processing and computer vision. However, speech-generative models lag behind in terms of scale and task generalization. In this blog post, we introduce Voicebox, a highly versatile text-guided generative model for speech. Trained on over 50K hours of unfiltered and unenhanced speech, Voicebox is a non-autoregressive flow-matching model capable of infilling speech given audio context and text. Similar to GPT, Voicebox learns in context and can perform various tasks, including monolingual and cross-lingual zero-shot text-to-speech (TTS) synthesis, noise removal, content editing, style conversion, and diverse sample generation. Notably, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E in terms of intelligibility and audio similarity while being up to 20 times faster. Check out the demo at voicebox.metademolab.com.
Voicebox is a cutting-edge AI model designed specifically for speech generation. What sets Voicebox apart is its ability to perform a wide range of speech-related tasks, such as editing, sampling, and stylizing, without being explicitly trained for them. Through in-context learning, Voicebox can produce high-quality audio clips and edit pre-recorded audio while preserving the original content and style. It can also generate speech in six different languages.
Potential Applications:
Looking ahead, the emergence of multipurpose generative AI models like Voicebox holds tremendous promise. Imagine a future where virtual assistants and non-player characters in the metaverse possess natural-sounding voices, or where visually impaired individuals can have their written messages read aloud in the voices of their loved ones. Voicebox could also empower content creators by providing them with convenient tools for creating and editing audio tracks for videos, among countless other possibilities.
Versatility in Action:
Voicebox exhibits remarkable versatility, boasting a range of impressive capabilities:
- In-context Text-to-Speech Synthesis:
Given an audio sample as short as two seconds, Voicebox can match its audio style and use that style for text-to-speech generation. This allows a voice to be carried over into a variety of applications.
- Speech Editing and Noise Reduction:
Voicebox can recreate segments of speech that have been interrupted by noise, or replace misspoken words, without the need to re-record the entire speech. For instance, if a dog’s bark interrupts a specific section, users can simply crop that portion and instruct Voicebox to regenerate it, much like an eraser for audio editing.
- Cross-lingual Style Transfer:
Voicebox can also help bridge language barriers. Given a speech sample and a text passage in English, French, German, Spanish, Polish, or Portuguese, Voicebox can generate a reading of the text in any of those languages, even when the sample speech and the text are in different languages. This holds the potential to facilitate natural and authentic communication between people who speak different languages.
- Diverse Speech Sampling:
Having learned from a large and diverse set of data, Voicebox can generate speech that reflects real-world speech patterns in the six aforementioned languages.
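To make the crop-and-regenerate idea from the speech editing item above more concrete, here is a minimal Python sketch. It is not Voicebox's actual interface: `infill_mask`, `apply_infill`, the `generate` callback, and the rate of 100 frames per second are hypothetical names and values chosen purely for illustration. The sketch only shows how a noisy time span could be turned into a per-frame mask, so that a model regenerates the masked frames while the surrounding audio is kept untouched as context.

```python
import math
from typing import Callable, List, Sequence, Tuple


def infill_mask(num_frames: int, frame_rate_hz: float,
                noisy_span_s: Tuple[float, float]) -> List[bool]:
    """Return a per-frame mask: True = regenerate, False = keep as context.

    The span is given in seconds and widened outward to whole frames.
    """
    start_s, end_s = noisy_span_s
    if not 0.0 <= start_s < end_s:
        raise ValueError("span must satisfy 0 <= start < end")
    start = math.floor(start_s * frame_rate_hz)
    end = min(num_frames, math.ceil(end_s * frame_rate_hz))
    return [start <= i < end for i in range(num_frames)]


def apply_infill(frames: Sequence[float], mask: Sequence[bool],
                 generate: Callable[[Sequence[float], Sequence[bool]], Sequence[float]]
                 ) -> List[float]:
    """Replace only the masked frames with generated ones.

    `generate` is a hypothetical stand-in for an infilling model: it receives
    the frames and the mask and returns a full-length sequence of candidates.
    """
    if len(frames) != len(mask):
        raise ValueError("frames and mask must have the same length")
    generated = generate(frames, mask)
    return [g if m else f for f, m, g in zip(frames, mask, generated)]


# Example: 3 seconds of audio at an assumed 100 frames per second,
# with a dog bark between 1.0 s and 1.5 s.
frames = [float(i) for i in range(300)]
mask = infill_mask(num_frames=300, frame_rate_hz=100.0, noisy_span_s=(1.0, 1.5))

# Dummy "model" that fills every frame with silence (0.0); a real model
# would condition on the unmasked frames and on the text transcript.
edited = apply_infill(frames, mask, lambda f, m: [0.0] * len(f))
```

Keeping the unmasked frames exactly as they were is what makes the edit behave like an eraser: only the cropped portion changes, and everything around it serves as context for the regenerated audio.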
Conclusion:
Voicebox represents a significant step forward in generative AI research. The possibilities it opens up for audio editing, sampling, and styling are far-reaching. In the near future, this technology could let creators easily edit audio tracks, enable visually impaired individuals to hear written messages in familiar voices, and allow people to communicate in a foreign language using their own voice. We eagerly anticipate further advancements in the audio domain and the contributions that other researchers will make by building upon our work with Voicebox.