Meet EMO: Alibaba’s AI That Turns Pictures Into Videos

Sorin Ciornei
Published in thereach.ai · 5 min read · Feb 29, 2024


This actually reminds me of the photos in the Harry Potter universe.

Image generated with Midjourney, prompt: Audrey Hepburn with the Joker face paint, photography

Expressive Avatar Videos with Alibaba’s AI

Alibaba Group has unveiled EMO (Emote Portrait Alive), an expressive audio-driven portrait-video generation framework developed by its Institute for Intelligent Computing. Let’s look at EMO’s capabilities, methods, and diverse applications.

Unfortunately, Medium doesn’t support embedded videos (saving hosting costs, of course). Since the videos with sound are the most important part, click on the article below, the link above, or the individual links I added for the most interesting examples to see them.

Understanding EMO’s Methodology

Frames Encoding and Diffusion Process

The EMO framework consists of two stages: Frames Encoding and the Diffusion Process. In the first stage, Frames Encoding, ReferenceNet extracts features from a reference image and from motion frames. The second stage, the Diffusion Process, brings in a pretrained audio encoder, integrates a facial region mask, and runs denoising through the Backbone Network. Two attention mechanisms, Reference-Attention and Audio-Attention, preserve the character’s identity and modulate its movements, while Temporal Modules operate along the time dimension to adjust motion velocity.
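Alibaba has not released code for EMO, so to make the pipeline above concrete, here is a minimal PyTorch sketch of how those pieces might fit together. Everything in it, the layer choices, tensor shapes, and the DenoisingBackbone name, is an illustrative assumption based only on the description above (the facial region mask is omitted for brevity); it is not the actual implementation.

```python
import torch
import torch.nn as nn

class ReferenceNet(nn.Module):
    """Stage 1 (Frames Encoding): turn the reference image (and, in the real
    system, prior motion frames) into a sequence of identity features."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4), nn.SiLU(),
            nn.Conv2d(dim, dim, kernel_size=4, stride=4), nn.SiLU(),
        )

    def forward(self, img):                      # img: (B, 3, H, W)
        feats = self.conv(img)                   # (B, dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)  # (B, tokens, dim)

class DenoisingBackbone(nn.Module):
    """Stage 2 (Diffusion Process): one denoising step over noisy video
    latents, conditioned on reference and audio features."""
    def __init__(self, dim=64, audio_dim=32, heads=4):
        super().__init__()
        # Reference-Attention: keeps the avatar's identity consistent
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Audio-Attention: lets speech/singing drive expression and pose
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True,
                                                kdim=audio_dim, vdim=audio_dim)
        # Temporal module: mixes information across frames for smooth motion
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.out = nn.Linear(dim, dim)

    def forward(self, latents, ref_feats, audio_feats):
        # latents: (B, frames, tokens, dim); ref_feats: (B, ref_tokens, dim)
        # audio_feats: (B, audio_steps, audio_dim)
        B, F, T, D = latents.shape
        x = latents.reshape(B, F * T, D)
        x = x + self.ref_attn(x, ref_feats, ref_feats)[0]
        x = x + self.audio_attn(x, audio_feats, audio_feats)[0]
        # convolve along the frame axis, then add back as a residual
        t = x.reshape(B, F, T, D).permute(0, 2, 3, 1).reshape(B * T, D, F)
        t = self.temporal(t).reshape(B, T, D, F).permute(0, 3, 1, 2)
        return self.out(x.reshape(B, F, T, D) + t)  # predicted noise

# Toy usage with random data, just to show the shapes flowing through:
ref_feats = ReferenceNet()(torch.randn(1, 3, 256, 256))  # (1, 256, 64)
audio_feats = torch.randn(1, 50, 32)  # stand-in for a pretrained audio encoder
latents = torch.randn(1, 8, 256, 64)  # 8 noisy frames of 256 tokens each
print(DenoisingBackbone()(latents, ref_feats, audio_feats).shape)
# -> torch.Size([1, 8, 256, 64])
```

In the real system, a step like this would run many times inside a diffusion sampling loop, with the pretrained audio encoder supplying audio_feats from the input speech or song.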


Examples of what it does

Singing Avatars

EMO generates vocal avatar videos with expressive facial expressions, varied head poses, and arbitrary durations, effortlessly transforming static character images into dynamic, singing avatars.

For context on why this matters: a recent University College London (UCL) study found that people could identify deepfake speech only 73% of the time, a disconcerting result.

Multilingual Styles

EMO can deliver results in many languages, including singing! Whether the audio is in Mandarin, Japanese, Cantonese, or Korean, it picks up tonal variations and produces expression-rich avatars.

  • AI Girl: David Tao — Melody. Covered by NINGNING (Mandarin)
  • AI Ymir from AnyLora & Ymir Fritz: ‘進撃の巨人’ (Attack on Titan) Ending Theme (Japanese)
  • AI Girl: JENNIE — SOLO. Cover by Aiana (Korean)

Rapid Rhythm Synchronization

EMO keeps avatars in sync with fast-paced rhythms, producing dynamic character animations that match even the swiftest lyrics.

Check out the video with sound at the link above
  • KUN KUN: Eminem — Rap God

Talking Portraits

Beyond singing, EMO accommodates spoken audio in various languages, animating portraits from diverse sources with lifelike motion and realism.

Check out the videos with sound at the link above
  • AI Chloe: Detroit Become Human: Interview Clip
  • Mona Lisa: Shakespeare’s Monologue II As You Like It: Rosalind “Yes, one; and in this manner.”
  • AI Ymir from AnyLora & Ymir Fritz: NieR: Automata

Cross-Actor Performance

EMO also opens up cross-actor performance: portraits of movie characters can deliver monologues or performances in different languages and styles.

Check out the videos with sound at the link above
  • SongWen Zhang — QiQiang Gao: 《The Knockout》 Online courses for legal exams
  • AI girl: Videos published by itsjuli4.

Check out the EMO research paper and project page, where you can find all the examples.

Frequently Asked Questions

Q: What is EMO’s primary purpose?

A: EMO is designed for expressive audio-driven portrait-video generation, transforming static images into dynamic avatars for singing, talking, and performing.

Q: How does EMO handle different languages in singing avatars?

A: EMO intuitively recognizes tonal variations, supporting songs in various languages like Mandarin, Japanese, Cantonese, and Korean.

Q: Is EMO limited to singing avatars, or can it handle spoken audio as well?

A: EMO accommodates spoken audio in various languages, animating portraits from diverse sources with lifelike motion and realism.

I appreciate your time and attention to my latest article. Here on Medium and on LinkedIn I regularly write about AI, workplace, business, and technology trends. If you enjoyed this article, you can also find it on www.thereach.ai, a website dedicated to showcasing AI applications and innovations.

If you want to stay connected with me and read my future articles, you can subscribe to my free newsletter. You can also reach out to me on Twitter, Facebook or Instagram. I’d love to hear from you!


Sorin Ciornei
thereach.ai

Passionate about technology, nature, ecosystems and exceptional cuisine. Newsletter - https://t.co/YApNUM9Pjq