Meet EMO. Any Portrait Image Is Now Live.

EMO (Emote Portrait Alive), from Alibaba, generates expressive portrait videos from nothing more than a single image and an audio clip.

Mika.i Chak
3 min read · Mar 4, 2024

With a single portrait as the reference image, in any head pose, plus an audio file of speech or singing, you can now generate a talking-head video of any duration, complete with facial expressions (including micro-expressions) and natural head movements driven by the content of the audio.

Source: https://humanaigc.github.io/emote-portrait-alive/

Time for questions

The obvious questions: how does this method map audio cues to facial expressions? How does each frame transition from the previous one while still portraying consistent, intended expressions? How are facial distortions prevented? How is jittering between video frames removed? Does it support multiple languages?

So, how does it work?

First, several different techniques are combined: Stable Diffusion for image generation; ReferenceNet to keep the character in the video consistent with the reference image; a Backbone Network for denoising; Wav2Vec to extract acoustic features from the speech audio; and Temporal Modules, working together with the Face Locator and Speed Layers, to keep the generated frames continuous, feed the target head-motion speed into the generation, and make transitions between video clips seamless.
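To make the Wav2Vec step concrete, here is a minimal sketch of how frame-level acoustic features could be extracted with the open-source wav2vec 2.0 model from Hugging Face. This is not EMO's code (which is not public); the checkpoint name and the "speech.wav" path are placeholders for illustration only.

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Placeholder checkpoint and audio file; EMO's actual encoder and weights are not public.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sr = torchaudio.load("speech.wav")                     # (channels, samples)
waveform = torchaudio.functional.resample(waveform, sr, 16_000)  # wav2vec 2.0 expects 16 kHz

inputs = extractor(waveform.mean(dim=0).numpy(),                 # mix down to mono
                   sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    audio_features = encoder(**inputs).last_hidden_state         # (1, T, 768), roughly 50 frames/s

# These per-timestep embeddings are the kind of acoustic features that
# condition mouth shapes and head motion in the generated video.
```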

Second, there are two stages. In the first stage, ReferenceNet extracts features such as face shape, skin tone and hair style from the reference image.

Then, during the Diffusion Process stage, a pretrained audio encoder produces the audio embedding, essentially extracting acoustic features such as pitch and emotion that control mouth shapes and head movements. In the same stage, a Facial Region Mask is integrated with multi-frame noise to govern where key facial features such as the mouth, eyes and nose are generated. The Backbone Network then carries out the denoising operation. Within the Backbone Network, two forms of attention are applied: Reference-Attention and Audio-Attention. These mechanisms preserve the character’s identity and modulate the character’s movements, respectively.

Additionally, Temporal Modules operate along the temporal dimension, processing frames in groups to avoid jitter and flicker and to improve the cohesion and stability of the video. A Speed Control Layer adjusts the pace of head movement to match the audio input, preventing unnaturally fast or slow motion and ensuring natural, consistent movement.
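At heart, Reference-Attention and Audio-Attention are attention layers whose keys and values come from the reference-image features and the audio features. Below is a simplified, self-contained PyTorch sketch of what one such conditioned backbone block could look like; the class name, dimensions and wiring are my own illustration of the idea, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """Illustrative backbone block: self-attention over the frame latents,
    reference-attention for identity, audio-attention for motion.
    A sketch of the concept, not EMO's actual architecture."""
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ref_attn   = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, frame_tokens, ref_tokens, audio_tokens):
        # frame_tokens: (B, N, dim) noisy latent tokens for a group of frames
        # ref_tokens:   (B, M, dim) identity features from ReferenceNet
        # audio_tokens: (B, T, dim) acoustic features from the audio encoder
        x = frame_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Reference-attention: keeps the character consistent with the portrait.
        h = self.norm2(x)
        x = x + self.ref_attn(h, ref_tokens, ref_tokens, need_weights=False)[0]
        # Audio-attention: lets pitch and tone drive mouth shapes and head movement.
        h = self.norm3(x)
        x = x + self.audio_attn(h, audio_tokens, audio_tokens, need_weights=False)[0]
        return x

# Toy shapes just to show the call pattern.
block = ConditionedBlock()
frames = torch.randn(1, 12 * 64, 320)  # 12 frames x 64 latent tokens each
ref    = torch.randn(1, 64, 320)       # reference-image tokens
audio  = torch.randn(1, 50, 320)       # ~1 second of audio features, projected to dim
out = block(frames, ref, audio)        # (1, 768, 320)
```

In the full method, the Temporal Modules and Speed Layers would sit alongside blocks like this, attending along the frame axis and injecting a speed embedding so the head moves at a pace consistent with the audio and with neighboring clips.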

Source: https://humanaigc.github.io/emote-portrait-alive/

I’m only scratching the surface here; there is much more going on with the speed embeddings, the Temporal Modules and so on. With this method, content creation will soon feel even more “real”.

References:
https://arxiv.org/pdf/2402.17485.pdf
https://humanaigc.github.io/emote-portrait-alive/
