Meet EMO: Alibaba’s AI That Turns Pictures Into Videos

Sorin Ciornei
Published in thereach.ai · 5 min read · Feb 29, 2024


This actually reminds me of the photos in the Harry Potter universe.

Image generated with Midjourney, prompt: Audrey Hepburn with the Joker face paint, photography

Expressive Avatar Videos with Alibaba’s AI

Alibaba Group has unveiled EMO (Emote Portrait Alive), an expressive audio-driven portrait-video generation framework developed by its Institute for Intelligent Computing. Let’s look at EMO’s capabilities, methods, and diverse applications.

Unfortunately, Medium doesn’t support embedded videos (saving hosting costs, of course). Since the videos with sound are the most important part, click on the article below, the link above, or the individual links I added for the most interesting examples to see them.

Understanding EMO’s Methodology

Frames Encoding and Diffusion Process

The EMO framework consists of two stages: Frames Encoding and the Diffusion Process. In the first stage, Frames Encoding, ReferenceNet extracts features from a reference image and from motion frames. The second stage, the Diffusion Process, brings in a pretrained audio encoder, integrates a facial region mask, and runs denoising through the Backbone Network. Two attention mechanisms, Reference-Attention and Audio-Attention, preserve the character’s identity and modulate its movements, while Temporal Modules operate along the time dimension to adjust motion velocity.
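Alibaba has not released code for EMO, so to make the pipeline above concrete, here is a minimal PyTorch sketch of how those pieces might fit together. Everything in it, the layer choices, tensor shapes, and the DenoisingBackbone name, is an illustrative assumption based only on the description above (the facial region mask is omitted for brevity); it is not the actual implementation.

```python
import torch
import torch.nn as nn

class ReferenceNet(nn.Module):
    """Stage 1 (Frames Encoding): turn the reference image (and, in the real
    system, prior motion frames) into a sequence of identity features."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4), nn.SiLU(),
            nn.Conv2d(dim, dim, kernel_size=4, stride=4), nn.SiLU(),
        )

    def forward(self, img):                      # img: (B, 3, H, W)
        feats = self.conv(img)                   # (B, dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)  # (B, tokens, dim)

class DenoisingBackbone(nn.Module):
    """Stage 2 (Diffusion Process): one denoising step over noisy video
    latents, conditioned on reference and audio features."""
    def __init__(self, dim=64, audio_dim=32, heads=4):
        super().__init__()
        # Reference-Attention: keeps the avatar's identity consistent
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Audio-Attention: lets speech/singing drive expression and pose
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True,
                                                kdim=audio_dim, vdim=audio_dim)
        # Temporal module: mixes information across frames for smooth motion
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.out = nn.Linear(dim, dim)

    def forward(self, latents, ref_feats, audio_feats):
        # latents: (B, frames, tokens, dim); ref_feats: (B, ref_tokens, dim)
        # audio_feats: (B, audio_steps, audio_dim)
        B, F, T, D = latents.shape
        x = latents.reshape(B, F * T, D)
        x = x + self.ref_attn(x, ref_feats, ref_feats)[0]
        x = x + self.audio_attn(x, audio_feats, audio_feats)[0]
        # convolve along the frame axis, then add back as a residual
        t = x.reshape(B, F, T, D).permute(0, 2, 3, 1).reshape(B * T, D, F)
        t = self.temporal(t).reshape(B, T, D, F).permute(0, 3, 1, 2)
        return self.out(x.reshape(B, F, T, D) + t)  # predicted noise

# Toy usage with random data, just to show the shapes flowing through:
ref_feats = ReferenceNet()(torch.randn(1, 3, 256, 256))  # (1, 256, 64)
audio_feats = torch.randn(1, 50, 32)  # stand-in for a pretrained audio encoder
latents = torch.randn(1, 8, 256, 64)  # 8 noisy frames of 256 tokens each
print(DenoisingBackbone()(latents, ref_feats, audio_feats).shape)
# -> torch.Size([1, 8, 256, 64])
```

In the real system, a step like this would run many times inside a diffusion sampling loop, with the pretrained audio encoder supplying audio_feats from the input speech or song.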


Examples of what it does

Singing Avatars

EMO generates vocal avatar videos with expressive facial expressions, varied head poses, and arbitrary durations, effortlessly transforming static character images into dynamic, singing avatars.

For context on why this matters: a recent University College London (UCL) study found that people could identify deepfake speech only 73% of the time, a disconcerting result.

Multilingual Styles

EMO can deliver results in many languages, including singing! Whether the audio is in Mandarin, Japanese, Cantonese, or Korean, it picks up tonal variations and produces expression-rich avatars.

  • AI Girl: David Tao — Melody. Covered by NINGNING (Mandarin)
  • AI Ymir from AnyLora & Ymir Fritz: ‘進撃の巨人’ (Attack on Titan) Ending Theme (Japanese)
  • AI Girl: JENNIE — SOLO. Cover by Aiana (Korean)

Rapid Rhythm Synchronization

EMO keeps avatars in sync with fast-paced rhythms, producing dynamic character animations that match even the swiftest lyrics.

Check out the video with sound at the link above
  • KUN KUN: Eminem — Rap God

Talking Portraits

Beyond singing, EMO accommodates spoken audio in various languages, animating portraits from diverse sources with lifelike motion and realism.

Check out the videos with sound at the link above
  • AI Chloe: Detroit Become Human: Interview Clip
  • Mona Lisa: Shakespeare’s Monologue II As You Like It: Rosalind “Yes, one; and in this manner.”
  • AI Ymir from AnyLora & Ymir Fritz: NieR: Automata

Cross-Actor Performance

EMO also opens up cross-actor performance: portraits of movie characters can deliver monologues or performances in different languages and styles.

Check out the videos with sound at the link above
  • SongWen Zhang — QiQiang Gao: 《The Knockout》 Online courses for legal exams
  • AI girl: Videos published by itsjuli4.

Check out the EMO research paper and project page, where you can find all the examples.

Frequently Asked Questions

Q: What is EMO’s primary purpose?

A: EMO is designed for expressive audio-driven portrait-video generation, transforming static images into dynamic avatars for singing, talking, and performing.

Q: How does EMO handle different languages in singing avatars?

A: EMO intuitively recognizes tonal variations, supporting songs in various languages like Mandarin, Japanese, Cantonese, and Korean.

Q: Is EMO limited to singing avatars, or can it handle spoken audio as well?

A: EMO accommodates spoken audio in various languages, animating portraits from diverse sources with lifelike motion and realism.

I appreciate your time and attention to my latest article. Here on Medium and on LinkedIn I regularly write about AI, workplace, business, and technology trends. If you enjoyed this article, you can also find it on www.thereach.ai, a website dedicated to showcasing AI applications and innovations.

If you want to stay connected with me and read my future articles, you can subscribe to my free newsletter. You can also reach out to me on Twitter, Facebook or Instagram. I’d love to hear from you!


Sorin Ciornei
thereach.ai

Passionate about technology, nature, ecosystems and exceptional cuisine. Newsletter - https://t.co/YApNUM9Pjq