New Age Of AI Videos with OmniHuman-1

Feb 12, 2025


Chinese AI advancements are making waves this year: DeepSeek, Qwen, and now OmniHuman-1. This cutting-edge AI framework can generate incredibly realistic human videos using nothing more than a single image and a driving signal such as audio! While the model isn’t publicly available yet (only the research paper has been released), the potential is groundbreaking!

What is OmniHuman-1?

OmniHuman-1 is an AI framework that animates a single image into a lifelike video. By integrating multiple signals, such as audio, video snippets, text, and pose data, it generates realistic videos in which the animated subjects appear natural and expressive. It uses a Diffusion Transformer (DiT) architecture and an “omni-conditions” training strategy that gradually introduces more complex motion cues during training. This method allows the model to learn from large, diverse datasets, overcoming the limitations of previous approaches that struggled with data scarcity.
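Since OmniHuman-1 has no public release, here is a purely hypothetical sketch of what such an interface could look like in Python; the function name, parameters, and defaults are my own assumptions, not a real API.

```python
def animate_image(reference_image_path: str,
                  audio_path: str | None = None,
                  text_prompt: str | None = None,
                  pose_video_path: str | None = None,
                  num_frames: int = 120) -> list:
    """Return a list of rendered frames that animate the single reference image,
    driven by whichever optional signals (audio, text, pose) are supplied."""
    # Placeholder only: the real model, weights, and interface are not public.
    raise NotImplementedError("OmniHuman-1 has not been publicly released yet")
```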

What Makes OmniHuman-1 So Special?

OmniHuman-1 stands out because of its unique approach and impressive capabilities:

  • Perfect Lip Sync & Natural Movements: It excels at ensuring that the mouth movements perfectly match the spoken words (lip-sync) and that the body gestures look natural and fluid. This is crucial for applications such as creating virtual characters, educational videos, or entertainment content.
  • It’s Like Magic (But It’s AI): Imagine taking a photo of someone and, with the help of some audio or another video, making them come alive: speaking, singing, or even dancing. That’s what OmniHuman-1 does. The technique is called “multimodality motion conditioning”: it cleverly mixes different types of information (like a picture and a voice recording) to create realistic videos.
  • Works with All Kinds of Images: Whether you have a close-up portrait, a half-body shot, or a full-body image, it can handle it. It also works with different video formats, making it incredibly flexible for various content creation needs.
  • Stunningly Real: The videos produced by OmniHuman-1 are remarkably lifelike. Facial expressions, gestures, and overall synchronization are so accurate that it’s hard to believe they weren’t filmed.
  • Beyond Humans: It doesn’t just animate humans; it can also work with cartoons, animals, and even inanimate objects, opening up exciting possibilities for animators and game developers.

Basic Architecture Overview

The research paper highlights that OmniHuman-1 is built upon the MMDiT (Multimodal Diffusion Transformer) architecture, a powerful foundation used by existing DiT-based video generation models. But OmniHuman-1 doesn’t just use MMDiT; it significantly enhances it to handle the complexities of human animation. I will try to break it down into simpler terms for you guys:

1) MMDiT: This architecture is already designed to handle multiple types of data (multimodal), making it a perfect starting point. It uses a Transformer backbone in which tokens from different modalities, such as text and video latents, are processed together in one sequence so they can attend to each other, as in the sketch below.
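To make that concrete, here is a minimal PyTorch sketch of a joint-attention block where text tokens and video-latent tokens interact in a single sequence. The class name, dimensions, and layer choices are illustrative assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """Toy MMDiT-style block: text tokens and video-latent tokens are packed into
    one sequence so a single self-attention layer lets them interact."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Pack both modalities into one sequence: (batch, video_len + text_len, dim)
        x = torch.cat([video_tokens, text_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]            # joint self-attention across modalities
        x = x + self.mlp(self.norm2(x))
        return x[:, : video_tokens.shape[1]]     # keep only the updated video tokens


block = JointAttentionBlock()
video = torch.randn(1, 64, 512)   # e.g. 64 video-latent tokens
text = torch.randn(1, 16, 512)    # e.g. 16 text tokens
print(block(video, text).shape)   # torch.Size([1, 64, 512])
```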

2) Omni-Conditions: This approach allows the model to integrate various motion-related conditions during training. These conditions, summarized in a small container sketch after this list, include:

  • Text: Provides a general description of the desired action or scene.
  • Audio: Used for lip-syncing and capturing the rhythm and emotion of speech or music.
  • Pose: Provides precise information about body movements, either from a video or a pose estimation system.
  • Reference Image: This is the input image that provides the visual appearance of the person to be animated.
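To keep these four signals straight, here is a tiny illustrative container for them; the field names and tensor shapes are assumptions made for the sketch, not values from the paper.

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class OmniConditions:
    """Illustrative container for the four conditioning signals listed above."""
    text_prompt: Optional[str] = None               # coarse description of the action/scene
    audio_features: Optional[torch.Tensor] = None   # e.g. (frames, dim) wav2vec-style features
    pose_heatmaps: Optional[torch.Tensor] = None    # e.g. (frames, joints, H, W) joint heatmaps
    reference_image: Optional[torch.Tensor] = None  # (3, H, W) photo giving the subject's appearance

    def active(self) -> list[str]:
        """List which conditions are actually provided for a given sample."""
        return [name for name, value in vars(self).items() if value is not None]


sample = OmniConditions(text_prompt="a person talking", reference_image=torch.zeros(3, 256, 256))
print(sample.active())  # ['text_prompt', 'reference_image']
```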

3) How the Conditions are Injected: The paper describes specific techniques for incorporating each condition (a combined sketch follows this list):

  • Audio: Features extracted from the audio using a model called wav2vec are processed and fed into the Transformer blocks via cross-attention. This allows the audio information to directly influence the video generation process.
  • Pose: A pose guider encodes the pose information (usually in the form of heatmaps representing body joints) into features. These features are then combined with the video features and fed into the model.
  • Reference Image: Instead of adding extra components, OmniHuman-1 reuses its existing DiT backbone to encode the reference image. The image and video information are packed together and fed into the model, allowing them to interact through self-attention. A neat trick (modifying the 3D Rotational Position Embeddings, RoPE) is used to differentiate the image information from the video and keep the processing simple.
  • Text: Handled similarly to the original MMDiT text branch.
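Here is a toy PyTorch sketch of those injection paths: pose features added to the video latents, reference tokens packed into the same sequence for self-attention, and audio features entering through cross-attention. The RoPE modification is omitted, and every dimension, name, and layer choice is an assumption made for illustration, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class ConditionInjectionBlock(nn.Module):
    """Toy sketch of how the different conditions could reach one Transformer block."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, video_tok, ref_tok, audio_feat, pose_feat):
        video_tok = video_tok + pose_feat            # pose-guider output merged with video features
        x = torch.cat([ref_tok, video_tok], dim=1)   # reference tokens packed with video tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]           # appearance and motion interact via self-attention
        h = self.norm2(x)
        x = x + self.audio_cross_attn(h, audio_feat, audio_feat)[0]  # audio injected via cross-attention
        return x[:, ref_tok.shape[1]:]               # return the updated video tokens


blk = ConditionInjectionBlock()
out = blk(torch.randn(1, 64, 512),   # video-latent tokens
          torch.randn(1, 16, 512),   # reference-image tokens
          torch.randn(1, 32, 512),   # audio (wav2vec-like) features
          torch.randn(1, 64, 512))   # pose features aligned with the video tokens
print(out.shape)                     # torch.Size([1, 64, 512])
```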

4) The 3DVAE: Before the video data even reaches the Transformer, it’s processed by a causal 3DVAE. This component compresses the video into a more manageable representation (latent space), making the processing more efficient.
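A minimal sketch of the “causal” idea, assuming a simple stack of 3D convolutions: each frame’s encoding only looks at the current and past frames. Channel counts, strides, and shapes are made up for the example and are not the paper’s.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution padded only on the past side of the time axis, so a frame's
    encoding never depends on future frames."""

    def __init__(self, in_ch, out_ch, k=3, stride=(1, 2, 2)):
        super().__init__()
        self.pad_t = k - 1
        self.conv = nn.Conv3d(in_ch, out_ch, k, stride=stride, padding=(0, k // 2, k // 2))

    def forward(self, x):                          # x: (batch, channels, time, height, width)
        x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))  # pad the time dimension on the past side only
        return self.conv(x)


encoder = nn.Sequential(
    CausalConv3d(3, 64),                       # downsample H and W by 2
    nn.SiLU(),
    CausalConv3d(64, 8, stride=(2, 2, 2)),     # downsample T, H, and W by 2 into a compact latent
)
video = torch.randn(1, 3, 16, 64, 64)          # 16 RGB frames at 64x64
print(encoder(video).shape)                    # torch.Size([1, 8, 8, 16, 16])
```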

The Training Process

The paper outlines a progressive, multi-stage approach, guided by two key principles:

  1. Data that might be unsuitable for training a model solely on audio (e.g., due to poor lip-sync) can still be valuable for training with weaker conditions like text. This allows the model to learn from a much larger and more diverse dataset.
  2. Stronger conditions (like pose) can easily overpower weaker ones (like audio) during training. To prevent this, the training process carefully balances the ratio of different conditions, giving weaker conditions more weight to ensure they are learned effectively, as in the sketch after this list.
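Here is a toy sketch of that balancing idea: sampling which condition drives each training example according to fixed ratios that favour the weaker conditions. The ratio values are invented for the example, not taken from the paper.

```python
import random

# Weaker conditions (text, audio) are sampled more often than stronger ones (pose)
# so they are not drowned out during mixed-condition training. Ratios are illustrative.
CONDITION_RATIOS = {"text": 0.45, "audio": 0.35, "pose": 0.20}

def sample_condition_type() -> str:
    """Pick which driving condition the next training sample will use."""
    names, weights = zip(*CONDITION_RATIOS.items())
    return random.choices(names, weights=weights, k=1)[0]

counts = {name: 0 for name in CONDITION_RATIOS}
for _ in range(10_000):
    counts[sample_condition_type()] += 1
print(counts)  # weaker conditions dominate the sampled batches
```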

Training Stages:

  • Stage 0 (Pre-training): The model starts with general text-to-video and text-to-image training, building a foundation of general video generation capabilities.
  • Stage 1: Introduces image conditioning, using a mix of text and image data. Audio and Pose conditions are not introduced yet.
  • Stage 2: Adds audio conditioning, training with text, image, and audio. Pose is still excluded.
  • Stage 3: The final stage incorporates all conditions (text, image, audio, and pose), carefully balancing their ratios; a simple schedule sketch follows this list.
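The staged schedule could be written down as a simple config, as in the hypothetical sketch below. The flags mirror the stages above; the structure, names, and notes are my own illustration, not the paper’s training code.

```python
# Hypothetical staged-training schedule mirroring the stages above.
TRAINING_STAGES = [
    {"name": "stage0_pretrain", "conditions": {"text"},                           "note": "general text-to-video/image"},
    {"name": "stage1",          "conditions": {"text", "image"},                  "note": "add reference-image conditioning"},
    {"name": "stage2",          "conditions": {"text", "image", "audio"},         "note": "add audio conditioning"},
    {"name": "stage3",          "conditions": {"text", "image", "audio", "pose"}, "note": "all conditions, ratios balanced"},
]

def run_training(stages=TRAINING_STAGES):
    for stage in stages:
        active = ", ".join(sorted(stage["conditions"]))
        print(f"{stage['name']}: training with [{active}] - {stage['note']}")
        # ...the actual optimisation loop for this stage would go here...

run_training()
```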

This approach, combined with the omni-conditions strategy, enables OmniHuman-1 to learn from a vast amount of mixed data, resulting in its exceptional ability to generate realistic and diverse human videos.

Inference Strategies: Bringing it all together

The inference (usage) stage uses clever methods to manage the different inputs and outputs:

  • Classifier-Free Guidance (CFG) Annealing: This technique helps improve the quality of the generated video by balancing the influence of the conditions (like audio and text) and reducing unwanted artifacts (like wrinkles).
  • Long Video Generation: OmniHuman-1 can generate videos of arbitrary length by using the last few frames of a previous segment as “motion frames” to maintain consistency, as in the sketch below.
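Here is a small illustrative sketch of both ideas; the annealing schedule, guidance values, segment length, and motion-frame count are all invented for the example rather than taken from the paper.

```python
def cfg_scale(step: int, total_steps: int, start: float = 7.5, end: float = 1.5) -> float:
    """Linearly anneal the classifier-free-guidance scale over the denoising steps,
    so early steps follow the conditions strongly and late steps avoid over-sharpened
    artifacts (like exaggerated wrinkles)."""
    t = step / max(total_steps - 1, 1)
    return start + (end - start) * t

def generate_long_video(total_frames: int, segment_len: int = 48, motion_frames: int = 4) -> list:
    """Generate a long video segment by segment, seeding each new segment with the
    last few frames of the previous one to keep identity and motion consistent."""
    video: list = []
    while len(video) < total_frames:
        context = video[-motion_frames:]                           # "motion frames" carried over
        new_frames = [f"frame_{len(video) + i}" for i in range(segment_len - len(context))]
        video.extend(new_frames)                                   # context frames are reused, not re-added
    return video[:total_frames]

print([round(cfg_scale(s, 30), 2) for s in (0, 15, 29)])           # [7.5, 4.4, 1.5]
print(len(generate_long_video(200)))                               # 200
```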

Final Thoughts

OmniHuman-1 isn’t just hype; it’s a significant step forward in AI-driven video generation. I have tried to explain the basic architecture and training strategy laid out in the research paper. Can’t wait for its public release; it’s going to be exciting!

Stay tuned for further updates and get ready to witness your photos come alive in ways you never imagined!

Happy Coding!

