Mastering MIMO: MImicking anyone anywhere with complex Motions and Object interactions

Transform your videos by changing character appearance and motion

Antoine Delplace

In today’s deep dive, I’ll walk you through implementing the latest state-of-the-art method for altering character appearances or motions in video. This method, known as MIMO, comes from the Alibaba AI team. It builds on their previous work Animate Anyone, but advances further by incorporating 3D poses instead of just relying on 2D poses. This ensures better coherence in extreme movements and interactions with the environment.

In this article, I’ll cover three key areas to help you implement this method:

  1. Dataset Preprocessing
  2. Training Architecture
  3. Expected Results and Challenges

Dataset Preprocessing

MIMO requires a large dataset, particularly due to the data-hungry nature of the motion aspect. The authors use 5,000 character video clips sourced online, along with 2,000 additional character videos generated using the En3D model. Each clip is sampled, resized and center-cropped to 768x768 resolution with 24 frames for training.
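As a rough illustration of this clip preparation, here is a minimal sketch using torchvision (the library choice is mine; the sampling stride of 4 is the one mentioned later in the hardware section):

```python
# Minimal clip-preprocessing sketch: sample 24 frames, resize and center-crop to 768x768.
import torch
from torchvision.io import read_video
from torchvision.transforms import v2

def load_clip(path, num_frames=24, stride=4, size=768):
    frames, _, _ = read_video(path, output_format="TCHW", pts_unit="sec")  # (T, 3, H, W) uint8
    frames = frames[::stride][:num_frames]            # temporal sampling
    transform = v2.Compose([
        v2.Resize(size, antialias=True),              # shorter side -> 768
        v2.CenterCrop(size),                          # 768 x 768
        v2.ToDtype(torch.float32, scale=True),        # [0, 1] floats
    ])
    return transform(frames)                          # (24, 3, 768, 768)
```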

Here’s the step-by-step breakdown:

Human Detection

Human Detection by Detectron 2

We start by detecting the human in each video frame using the open-source model Detectron 2 by Meta, which provides human segmentation.
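Here is a minimal sketch of per-frame person segmentation with Detectron 2 (the Mask R-CNN config and the 0.7 score threshold are my own illustrative choices):

```python
# Per-frame person segmentation with Detectron2 (illustrative config).
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7
predictor = DefaultPredictor(cfg)

def person_masks(frame_bgr):
    """Return boolean masks for detected persons (COCO class 0) in one BGR frame."""
    instances = predictor(frame_bgr)["instances"]
    keep = instances.pred_classes == 0          # class 0 = person in COCO
    return instances.pred_masks[keep].cpu().numpy()
```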

Video Tracking

Video Tracking by SAM 2

Temporal coherence of the detected human is achieved with the open-source model Segment Anything Model 2 (SAM 2) by Meta. This results in a temporally consistent human mask throughout the video.
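A sketch of that propagation step with the SAM 2 video predictor (config and checkpoint paths are placeholders; check the sam2 repo for the exact names):

```python
# Propagate the first-frame human mask through the clip with SAM 2 (paths are placeholders).
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt")

def track_human(frames_dir, first_frame_mask):
    state = predictor.init_state(video_path=frames_dir)   # directory of extracted frames
    # Seed the tracker with the Detectron 2 mask from the first frame
    predictor.add_new_mask(state, frame_idx=0, obj_id=1, mask=first_frame_mask)
    masks = {}
    with torch.inference_mode():
        for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
            masks[frame_idx] = (mask_logits[0] > 0).cpu().numpy()  # binary human mask
    return masks
```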

Monocular Depth Estimation

Monocular Depth Estimation by Depth Anything V2

Next, we infer depth to differentiate the background scene from occluding foreground objects. For this, we use the open-source model Depth Anything by TikTok. Occluding objects are the segmented regions whose mean depth is smaller than that of the human layer, i.e. they sit closer to the camera than the character. The scene layer is then obtained by removing the human and occlusion layers from the original video.
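A sketch of that layer split (the logic is my own reading of the rule above; it assumes a depth map where smaller values mean closer to the camera, and per-object masks from the segmentation step):

```python
# Split a frame into human, occlusion and scene layers from depth + masks.
import numpy as np

def split_layers(frame, depth, human_mask, object_masks):
    """frame: HxWx3, depth: HxW (smaller = closer), masks: boolean HxW arrays."""
    human_depth = depth[human_mask].mean()
    occlusion_mask = np.zeros_like(human_mask)
    for m in object_masks:
        if depth[m].mean() < human_depth:       # object sits in front of the human
            occlusion_mask |= m
    scene_mask = ~(human_mask | occlusion_mask) # whatever is left is background scene
    human_layer = frame * human_mask[..., None]
    occlusion_layer = frame * occlusion_mask[..., None]
    scene_layer = frame * scene_mask[..., None]
    return human_layer, occlusion_layer, scene_layer
```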

3D Poses

3D Poses by 4D-Humans

To maintain coherence in extreme movements, the model uses the 3D poses of the detected character. We extract them using the open-source model 4D-Humans by Berkeley, which provides the 3D SMPL rig rotations. We then project the 3D joints onto the 2D image using the camera parameters inferred from the model.
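The projection itself is a standard pinhole-camera step; here is a minimal sketch (the focal length and principal point come from the camera estimated by 4D-Humans; the function name is mine):

```python
# Project camera-space 3D SMPL joints onto the image plane (pinhole model).
import numpy as np

def project_joints(joints_3d, focal, principal_point):
    """joints_3d: (N, 3) camera-space joints; returns (N, 2) pixel coordinates."""
    x, y, z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    u = focal * x / z + principal_point[0]
    v = focal * y / z + principal_point[1]
    return np.stack([u, v], axis=-1)
```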

Differentiable Rasterization

Differentiable Rasterization by Nvdiffrast

We use Nvdiffrast by Nvidia to interpolate and generate a continuous 2D feature map from the projected 3D pose data (vertex interpolation).
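A minimal sketch of that vertex interpolation with nvdiffrast (tensor shapes and the choice of per-vertex features are illustrative):

```python
# Rasterize the posed mesh and interpolate per-vertex features into a 2D map.
import nvdiffrast.torch as dr

glctx = dr.RasterizeCudaContext()

def pose_feature_map(clip_space_verts, faces, vertex_features, resolution=(768, 768)):
    """clip_space_verts: (1, V, 4), faces: (F, 3) int32, vertex_features: (1, V, C)."""
    rast_out, _ = dr.rasterize(glctx, clip_space_verts, faces, resolution=resolution)
    feat_map, _ = dr.interpolate(vertex_features, rast_out, faces)   # (1, H, W, C)
    return feat_map
```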

ID Encoders

Reposing examples from Animate Anyone

To preserve the character’s identity and appearance, we transform an input image of the character into an A-pose image with Animate Anyone, then encode its local and global features using ReferenceNet (from Animate Anyone) and CLIP image encoders, respectively.
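ReferenceNet has no official release, but the global branch can be sketched with an off-the-shelf CLIP image encoder from transformers (the model id is my own choice):

```python
# Global identity embedding of the A-pose image via a CLIP image encoder.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_vision = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def global_identity_embedding(a_pose_image: Image.Image) -> torch.Tensor:
    inputs = processor(images=a_pose_image, return_tensors="pt")
    return clip_vision(**inputs).image_embeds   # (1, 768) fed to cross-attention layers
```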

Video Inpainting

Video Inpainting by ProPainter

To recover the scene behind the removed humans and occlusion objects, we use ProPainter for video inpainting, so that the scene layer contains no residual artifacts around the mask boundaries.

Scene & Occlusion Encoder

Comparison of decoded images by Stable Diffusion

The scene and occlusion videos are encoded with a shared ft-MSE VAE from Stable Diffusion 1.5 (SD 1.5). These embeddings are then concatenated before entering the diffusion process.
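A sketch of this encoding step with diffusers (the channel-wise concatenation is my reading of "concatenated", not a detail confirmed by the paper):

```python
# Encode scene and occlusion clips with the shared SD 1.5 ft-MSE VAE, then concatenate.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def encode_clip(frames):                        # frames: (T, 3, H, W) in [-1, 1]
    return vae.encode(frames).latent_dist.sample() * vae.config.scaling_factor

# scene_latents = encode_clip(scene_frames)          # (T, 4, H/8, W/8)
# occlusion_latents = encode_clip(occlusion_frames)  # (T, 4, H/8, W/8)
# spatial_latents = torch.cat([scene_latents, occlusion_latents], dim=1)
```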

Note here that dataset quality is critical. It’s important to filter out low-quality videos, videos with no moving character, and clips with NSFW (Not Safe For Work) or offensive content. Diversity in terms of gender, ethnicity, body types, backgrounds, and lighting also needs to be ensured to reduce bias in the final results.

Training Architecture

With the dataset ready, let’s delve into the model architecture.

Training architecture by MIMO

Several components need to be trained:

Pose Encoder

Pose Guidance by ControlNet

The 2D rasterized images from the 3D joint projections are encoded into a latent space, making the diffusion model training more manageable. We use a convolutional architecture inspired by ControlNet, with 4×4 kernels and 2×2 strides over 16, 32, 64, and 128 channels to process the images (stacked along the time axis).
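A sketch of such an encoder under those hyper-parameters (the zero-initialized output projection and the 320 output channels, matched to the first SD 1.5 UNet block, are my own additions; matching the latent grid's factor-8 downsampling may require adjusting one stride):

```python
# ControlNet-style pose encoder: four 4x4 stride-2 convs over 16/32/64/128 channels.
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    def __init__(self, in_channels=3, out_channels=320):
        super().__init__()
        blocks, prev = [], in_channels
        for c in (16, 32, 64, 128):
            blocks += [nn.Conv2d(prev, c, kernel_size=4, stride=2, padding=1), nn.SiLU()]
            prev = c
        self.backbone = nn.Sequential(*blocks)
        self.proj = nn.Conv2d(prev, out_channels, kernel_size=3, padding=1)
        nn.init.zeros_(self.proj.weight)         # zero conv: starts as a no-op signal
        nn.init.zeros_(self.proj.bias)

    def forward(self, pose_maps):                # (B*T, 3, H, W) rasterized pose frames
        return self.proj(self.backbone(pose_maps))   # (B*T, 320, H/16, W/16)
```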

Diffusion Model

Diffusion Model by MIMO

The denoising process leverages a Unet with temporal attention similar to AnimateDiff and Animate Anyone. The noise, scene, and occlusion latents are merged with motion data via 3D convolution and fed into the Unet. Along the Unet blocks, self-attention layers handle local identity features, while cross-attention layers process global features.
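Here is one plausible reading of that fusion step, written as a hypothetical module (the exact wiring is not fully specified in the paper):

```python
# Hypothetical fusion of noisy, scene, occlusion and pose latents via a 3D convolution.
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    def __init__(self, latent_channels=4, pose_channels=320, unet_channels=320):
        super().__init__()
        in_ch = 3 * latent_channels + pose_channels   # noise + scene + occlusion + pose
        self.merge = nn.Conv3d(in_ch, unet_channels, kernel_size=3, padding=1)

    def forward(self, noisy, scene, occlusion, pose_feats):
        # all inputs: (B, C, T, H, W) with matching T, H, W
        x = torch.cat([noisy, scene, occlusion, pose_feats], dim=1)
        return self.merge(x)          # fed into the first block of the temporal UNet
```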

Loss Function

Denoising Process by Stable Diffusion

The output is a latent clip that can be decoded into a video via the SD 1.5 VAE decoder. The loss function minimizes the difference between predicted and actual noise in the latent clip at each diffusion step.
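A sketch of this noise-prediction objective with a DDPM scheduler from diffusers (denoiser stands in for the full temporal UNet with its conditioning inputs):

```python
# Standard epsilon-prediction training loss on the latent clip.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

def diffusion_loss(denoiser, clean_latents, conditioning):
    """clean_latents: (B, C, T, H, W) VAE latents of the ground-truth clip."""
    noise = torch.randn_like(clean_latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (clean_latents.shape[0],), device=clean_latents.device)
    noisy_latents = scheduler.add_noise(clean_latents, noise, timesteps)
    pred_noise = denoiser(noisy_latents, timesteps, **conditioning)
    return F.mse_loss(pred_noise, noise)
```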

Expected Results and Challenges

Despite a well-structured architecture, several challenges may arise during training:

Hardware Requirements

Large GPU VRAM is necessary to accommodate the model and the temporal latents. Because of its temporal nature, the data takes a lot of memory, so clip lengths are limited to 24 frames, often sampled at intervals of 4 (similar to AnimateDiff).

ReferenceNet Training

ReferenceNet Training by Animate Anyone

MIMO trains the ReferenceNet from scratch, without using the pre-trained weights of Animate Anyone. Some differences in the architecture (especially the A-pose reposer) may make training this component compulsory, so experimentation here may be necessary.

Code for Animate Anyone

While the official code for Animate Anyone isn’t yet available, unofficial repos from guoqincode, MooreThreads, and novitalabs can serve as starting points for adapting Animate Anyone into MIMO.

Motion Training

The most challenging aspect is training the denoising motion model. Animate Anyone uses a two-stage training process, where individual frames are trained first, followed by motion training. This approach may be needed for MIMO as well.

Example of character editing by MIMO

Thank you for reading! If I’ve missed anything or you have any thoughts, feel free to share them in the comments. I’ll be working on an unofficial implementation of MIMO in the coming weeks. Reach out if you’re doing the same or if you’d like to collaborate!

About the Author: I’m Antoine, a freelance CTO and AI expert. You can follow my work here and contact me via email!
