Meet EMO — Alibaba’s AI That Makes Pictures Into Videos
This actually reminds me of the photos in the Harry Potter universe.
Expressive Avatar Videos with Alibaba’s AI
Alibaba Group has pioneered groundbreaking technology with EMO: Emote Portrait Alive. This innovative project, developed by the Institute for Intelligent Computing at Alibaba, introduces an expressive audio-driven portrait-video generation framework. Let’s look at the capabilities, methods, and diverse applications of EMO.
Unfortunately, Medium does not let you embed videos directly (saving on hosting costs, of course), so if you want to see the actual videos with sound (which is the most important part), click on the article below, the link above, or the individual links I added for the most interesting examples.
Understanding EMO’s Methodology
Frames Encoding and Diffusion Process
The EMO framework consists of two crucial stages: Frames Encoding and the Diffusion Process. In the initial phase, Frames Encoding, features are extracted from a reference image and motion frames using ReferenceNet. The subsequent Diffusion Process involves a pretrained audio encoder, facial-region mask integration, and denoising operations through the Backbone Network. Attention mechanisms, Reference-Attention and Audio-Attention, ensure identity preservation and modulate the character's movements, while Temporal Modules manipulate the temporal dimension and adjust motion velocity.
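Alibaba has not released EMO's code, so as a rough mental model only, here is a toy sketch of that two-stage flow. Every function name, shape, and the "denoising" arithmetic below are placeholders I made up to mirror the description, not the actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def reference_net(ref_image, motion_frames):
    # Stage 1 (Frames Encoding): pool an identity feature from the
    # reference image and motion features from the preceding frames.
    # (Real ReferenceNet is a UNet-style network, not mean pooling.)
    return ref_image.mean(axis=(0, 1)), motion_frames.mean(axis=(1, 2))

def audio_encoder(waveform, n_frames):
    # Stand-in for the pretrained audio encoder: one embedding per
    # output video frame, derived from a slice of the waveform.
    return np.stack([c.mean(keepdims=True)
                     for c in np.array_split(waveform, n_frames)])

def generate_video(ref_image, motion_frames, waveform, n_frames=8, steps=4):
    ref_feat, motion_feat = reference_net(ref_image, motion_frames)
    audio_feat = audio_encoder(waveform, n_frames)            # (T, 1)
    # Stage 2 (Diffusion Process): start from noise and iteratively
    # denoise, conditioning on identity (Reference-Attention stand-in),
    # audio (Audio-Attention stand-in), and prior motion (Temporal
    # Module stand-in) via a toy weighted mix.
    latents = rng.standard_normal((n_frames, ref_feat.size))
    for _ in range(steps):
        latents = (0.4 * latents
                   + 0.3 * ref_feat                 # identity preservation
                   + 0.2 * audio_feat               # audio-driven movement
                   + 0.1 * motion_feat.mean(axis=0))  # temporal continuity
    return latents

frames = generate_video(
    ref_image=rng.random((64, 64, 3)),
    motion_frames=rng.random((4, 64, 64, 3)),
    waveform=rng.standard_normal(16000),
)
print(frames.shape)  # one latent vector per output frame
```

The point of the sketch is the data flow, not the math: a single still image plus an audio clip go in, and a sequence of frames conditioned on both comes out.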
If you want to stay connected with me and read my future articles, you can subscribe to my free newsletter. You can also reach out to me on Twitter, Facebook or Instagram. I’d love to hear from you!
Examples of what it does
Singing Avatars
EMO can generate vocal avatar videos with expressive facial expressions, varied head poses, and arbitrary durations. It effortlessly transforms static character images into dynamic, singing avatars.
- AI Mona Lisa singing Miley Cyrus — Flowers. Covered by YUQI
- AI Lady from SORA singing Dua Lipa — Don’t Start Now
In a disconcerting finding, a recent study by University College London (UCL) highlighted how hard it is for humans to detect deepfake speech: listeners identified it correctly only 73% of the time.
Multilingual Styles
EMO can deliver results in many languages, including singing! Whether Mandarin, Japanese, Cantonese, or Korean, it intuitively recognizes tonal variations in the audio to produce expression-rich avatars.
- AI Girl: David Tao — Melody. Covered by NINGNING (Mandarin)
- AI Ymir from AnyLora & Ymir Fritz: ‘進撃の巨人’ Ending Theme (Japanese)
- AI Girl: JENNIE — SOLO. Cover by Aiana (Korean)
Rapid Rhythm Synchronization
EMO ensures avatars synchronize with fast-paced rhythms, guaranteeing dynamic character animations that match even the swiftest lyrics.
- Leonardo Wilhelm DiCaprio: EMINEM — GODZILLA (FT. JUICE WRLD) COVER
- KUN KUN: Eminem — Rap God
Talking Portraits
Beyond singing, EMO accommodates spoken audio in various languages, animating portraits from diverse sources with lifelike motion and realism.
- Audrey Kathleen Hepburn-Ruston: Interview Clip
- AI Chloe: Detroit Become Human: Interview Clip
- Mona Lisa: Shakespeare’s Monologue II As You Like It: Rosalind “Yes, one; and in this manner.”
- AI Ymir from AnyLora & Ymir Fritz: NieR: Automata
Cross-Actor Performance
EMO also enables portraits of movie characters to deliver monologues or performances in different languages and styles, opening up a range of cross-actor applications.
- SongWen Zhang — QiQiang Gao: 《The Knockout》 Online courses for legal exams
- AI girl: Videos published by itsjuli4.
Check out the EMO research paper and project page here, where you can find all the examples.
Frequently Asked Questions
Q: What is EMO’s primary purpose?
A: EMO is designed for expressive audio-driven portrait-video generation, transforming static images into dynamic avatars for singing, talking, and performing.
Q: How does EMO handle different languages in singing avatars?
A: EMO intuitively recognizes tonal variations, supporting songs in various languages like Mandarin, Japanese, Cantonese, and Korean.
Q: Is EMO limited to singing avatars, or can it handle spoken audio as well?
A: EMO accommodates spoken audio in various languages, animating portraits from diverse sources with lifelike motion and realism.
I appreciate your time and attention to my latest article. Here on Medium and on LinkedIn I regularly write about AI, workplace, business, and technology trends. If you enjoyed this article, you can also find it on www.thereach.ai, a website dedicated to showcasing AI applications and innovations.