Decoding Meta Sapiens: A Human-Centric AI Model for Precision Tasks

Nandini Lokesh Reddy
12 min read · Sep 23, 2024


Meta has been a leader in developing models for images and videos, and now they’ve added something new: Meta Sapiens. Like “Homo sapiens” (humans), this model is all about humans. It’s designed to perform tasks related to humans, such as understanding body poses, recognizing body parts, predicting depth, and even determining surface details like skin texture.

Why Meta Sapiens Stands Out

In 2023–2024, many computer vision models have focused on generating realistic human images. Plenty of models exist for individual tasks like pose estimation and segmentation, but Meta's Sapiens is a single family of models designed specifically for human-centric tasks.

This blog explains how Meta created this unified model, the pros and cons, and how it compares to other models.

The Three Pillars of Meta Sapiens

Meta claims that a model for human-related tasks should meet these three key qualities:

  1. Generalization: This means the model works well in many different situations. For example, it can handle different lighting conditions, camera angles, and even various types of clothing.
  2. Broad Applicability: The model can do more than one thing. It can estimate poses, recognize body parts, and even predict how far something is from the camera, all without needing big changes.
  3. High Fidelity: It can create high-quality, detailed results. For example, if the task is generating a 3D model of a person, the results will look very realistic, with clear details like facial features and body shapes.

Breaking Down the Architectures:

Meta Sapiens uses some powerful techniques to achieve these tasks. Let’s look at a few of them in simple terms:

MAE (Masked Autoencoder): Think of this as a way to learn efficiently by using a puzzle. The model looks at an image with some pieces missing (like a puzzle with missing pieces) and tries to fill in the gaps. This makes the model better at understanding images and saves time during training. For example, if the model sees a person with a part of their arm missing in the image, it can guess what the arm should look like by understanding the rest of the image.
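To make the idea concrete, here is a minimal sketch of the masking step only — not Meta's actual pretraining code. The patch size and mask ratio are illustrative:

import torch

def random_mask_patches(image, patch_size=16, mask_ratio=0.75):
    # image: (3, H, W) tensor; split into non-overlapping patches
    c, h, w = image.shape
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    patches = patches.reshape(c, -1, patch_size, patch_size).permute(1, 0, 2, 3)  # (N, 3, p, p)

    # Randomly hide a fraction of the patches; the model must reconstruct them
    num_patches = patches.shape[0]
    num_masked = int(mask_ratio * num_patches)
    masked_idx = torch.randperm(num_patches)[:num_masked]
    visible = torch.ones(num_patches, dtype=torch.bool)
    visible[masked_idx] = False

    return patches[visible], masked_idx  # only the visible patches go to the encoder

# Example: a 1024x768 image gives (1024/16) * (768/16) = 3072 patches,
# of which only 25% are visible to the encoder.
visible_patches, masked_idx = random_mask_patches(torch.rand(3, 1024, 768))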

Using Keypoints and Segmentation: The model identifies 308 points on the human body, including hands, feet, face, and torso. It also knows about 28 different body parts, from hair to lips to limbs, making it very detailed. To train the model, Meta used real human scans and synthetic data, which helped it understand humans in great detail.

1. 2D Pose Estimation — Understanding Human Movement

This task is like giving the model a picture and asking it to guess where key body parts are. The model looks for things like the position of your eyes, elbows, knees, etc. For example, if you upload a photo of someone running, the model can accurately identify where their arms, legs, and head are in the image.

The process works by creating “heatmaps” that show the likelihood of a body part being in a specific spot. The model is trained to minimize errors by adjusting until its guesses (heatmaps) closely match the real positions of body parts.

Architecture:

  • Input: Image (I ∈ R^H×W×3, where H is height and W is width).
  • Step 1: Rescaling the Image — The input image is resized to a fixed height H and width W. This is done to standardize the input size across all images.
  • Step 2: Pose Estimation Transformer (P) — A transformer model processes the image to predict key point locations.

This involves:

Bounding Box Input: A bounding box is provided around the person in the image.

Keypoint Heatmaps: The model generates K heatmaps, where each heatmap represents the probability of a keypoint being at a certain position. For example, one heatmap for the right elbow, another for the left knee, and so on.

  • Step 3: Loss Function (Mean Squared Error) — The loss function used here is Mean Squared Error (MSE). The model compares the predicted heatmaps ŷ ∈ R^H×W×K with the ground-truth heatmaps y (rendered from the annotated keypoints) and calculates the difference using MSE (a minimal sketch follows this list):
    L_pose = MSE(y, ŷ)
  • Step 4: Encoder-Decoder Architecture — The pose estimation model uses an encoder-decoder setup. The encoder is initialized with weights from pretraining, while the decoder is initialized randomly. The entire system is then fine-tuned for the task of keypoint prediction.
  • Key Point Difference: Compared to previous models (which might only detect 68 facial points), Meta’s Sapiens model can detect up to 243 facial key points, capturing much finer details around the eyes, lips, nose, ears, and more.
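As referenced above, here is a minimal sketch of the training-side loss, assuming the annotated keypoints have already been rendered into Gaussian ground-truth heatmaps. Sapiens uses K = 308 keypoints; the batch and heatmap sizes below are illustrative:

import torch
import torch.nn.functional as F

def pose_loss(pred_heatmaps, gt_heatmaps):
    # pred_heatmaps, gt_heatmaps: (B, K, H, W) — one heatmap per keypoint
    # L_pose = MSE(y, y_hat), averaged over batch, keypoints, and pixels
    return F.mse_loss(pred_heatmaps, gt_heatmaps)

# Illustrative shapes: 1 image, 308 keypoints, 64x48 heatmaps
pred = torch.rand(1, 308, 64, 48)
gt = torch.rand(1, 308, 64, 48)
print(pose_loss(pred, gt))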

Implementation:

Download the checkpoint for the pose model and follow the steps below (the get_model_path helper is defined in the segmentation section):

TASK = 'pose'
VERSION = 'sapiens_1b'

model_path = get_model_path(TASK, VERSION)
print(model_path)

Define the pose-estimation helper functions for the Sapiens model:

import numpy as np
import torch
from PIL import ImageDraw

def get_pose(image, pose_estimator, input_shape=(3, 1024, 768), device="cuda"):
    # Preprocess the image
    img = preprocess_image(image, input_shape)

    # Run the model
    with torch.no_grad():
        heatmap = pose_estimator(img.to(device))

    # Post-process the output
    keypoints, keypoint_scores = udp_decode(heatmap[0].cpu().float().numpy(),
                                            input_shape[1:],
                                            (input_shape[1] // 4, input_shape[2] // 4))

    # Scale keypoints to original image size
    scale_x = image.width / input_shape[2]
    scale_y = image.height / input_shape[1]
    keypoints[:, 0] *= scale_x
    keypoints[:, 1] *= scale_y

    # Visualize the keypoints on the original image
    pose_image = visualize_keypoints(image, keypoints, keypoint_scores)
    return pose_image

def preprocess_image(image, input_shape):
    # Resize and normalize the image
    img = image.resize((input_shape[2], input_shape[1]))
    img = np.array(img).transpose(2, 0, 1)
    img = torch.from_numpy(img).float()
    img = img[[2, 1, 0], ...]  # RGB to BGR
    mean = torch.tensor([123.675, 116.28, 103.53]).view(3, 1, 1)
    std = torch.tensor([58.395, 57.12, 57.375]).view(3, 1, 1)
    img = (img - mean) / std
    return img.unsqueeze(0)

def udp_decode(heatmap, img_size, heatmap_size):
    # Simplified decoding: take the argmax of each heatmap.
    # The full UDP decode logic adds sub-pixel refinement.
    h, w = heatmap_size
    keypoints = np.zeros((heatmap.shape[0], 2))
    keypoint_scores = np.zeros(heatmap.shape[0])

    for i in range(heatmap.shape[0]):
        hm = heatmap[i]
        idx = np.unravel_index(np.argmax(hm), hm.shape)
        keypoints[i] = [idx[1] * img_size[1] / w, idx[0] * img_size[0] / h]
        keypoint_scores[i] = hm[idx]

    return keypoints, keypoint_scores

def visualize_keypoints(image, keypoints, keypoint_scores, threshold=0.3):
    draw = ImageDraw.Draw(image)
    for (x, y), score in zip(keypoints, keypoint_scores):
        if score > threshold:
            draw.ellipse([(x - 2, y - 2), (x + 2, y + 2)], fill='red', outline='red')
    return image

Load the input image:

from PIL import Image
from utils.vis_utils import resize_image

pil_image = Image.open('path/to/input/image')

if pil_image.mode == 'RGBA':
    pil_image = pil_image.convert('RGB')

resized_pil_image = resize_image(pil_image, (640, 480))
resized_pil_image

Input image source: Google

Output:

# `model` is the pose estimator loaded from the checkpoint path above
# (see the accompanying GitHub repo for the loader).
output_pose = get_pose(resized_pil_image, model)

2. Body-Part Segmentation — Understanding Human Shapes

In this task, the model classifies every pixel in an image, breaking it down into body parts like arms, legs, or face. For instance, if you upload a picture, the model can separate your face from your hair, and your hands from your arms. This helps in tasks like virtual try-on systems or animated characters.

Meta’s Sapiens model uses a huge vocabulary (28 body parts) to give detailed results. It goes beyond just arms and legs and can distinguish between upper and lower lips, teeth, or even fingers.

Architecture:

  • Input: Image (I ∈ R^H×W×3), similar to pose estimation.
  • Step 1: Encoder-Decoder Architecture — The body-part segmentation model follows the same encoder-decoder setup as pose estimation. The encoder extracts features from the input image, and the decoder converts these features into pixel-wise predictions.
  • Step 2: Pixel Classification — The model classifies each pixel of the image into one of C body part classes (e.g., head, arms, torso, etc.). For instance, C = 20 in standard segmentation, but Meta expands this to C = 28 with a more detailed vocabulary that includes distinctions like upper/lower lips, teeth, and tongue.
  • Step 3: Loss Function (Weighted Cross-Entropy) — The model is fine-tuned using a weighted cross-entropy loss, which compares the predicted body-part probabilities p̂ with the ground-truth labels p (a minimal sketch follows this list):
    L_seg = WeightedCE(p, p̂)
  • Step 4: Expanded Vocabulary & Resolution — The Sapiens model uses high-resolution images (4K resolution) and manually annotated over 100K images with these detailed body part labels. The segmentation vocabulary is much larger compared to previous models, giving it a more granular understanding of human body parts.
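As referenced above, here is a minimal sketch of a weighted cross-entropy loss over C = 28 body-part classes; the per-class weights below are placeholders:

import torch
import torch.nn.functional as F

def seg_loss(logits, labels, class_weights):
    # logits: (B, C, H, W) raw scores; labels: (B, H, W) integer class ids
    # L_seg = WeightedCE(p, p_hat): rarer / smaller parts can be up-weighted
    return F.cross_entropy(logits, labels, weight=class_weights)

C = 28  # body-part classes
logits = torch.randn(2, C, 256, 192)
labels = torch.randint(0, C, (2, 256, 192))
class_weights = torch.ones(C)  # placeholder weights
print(seg_loss(logits, labels, class_weights))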

Note: Despite the advances in body-part segmentation in Meta Sapiens, it still does not achieve the same level of precision as mask-based segmentation models like SAM or SAM2. These models provide more accurate and detailed masks, particularly for fine-grained object boundaries.

Implementation:

Load the segmentation weights and follow the steps:

import os

# SAPIENS_LITE_MODELS_PATH maps each task/version to its checkpoint path
# (defined in the accompanying repo).
def get_model_path(task, version):
    try:
        model_path = SAPIENS_LITE_MODELS_PATH[task][version]
        if not os.path.exists(model_path):
            print(f"Warning: The model file does not exist at {model_path}")
        return model_path
    except KeyError as e:
        print(f"Error: Invalid task or version. {e}")
        return None

# Example usage
TASK = 'seg'
VERSION = 'sapiens_0.3b'

model_path = get_model_path(TASK, VERSION)
print(model_path)

Define the segmentation function:

def segment(image):
    # transform_fn, run_model, visualize_mask_with_overlay, and LABELS_TO_IDS
    # come from the accompanying repo (see the sketch below for one possible shape).
    input_tensor = transform_fn(image).unsqueeze(0).to("cuda")

    preds = run_model(input_tensor, height=image.height, width=image.width)
    mask = preds.squeeze(0).cpu().numpy()

    mask_image = Image.fromarray(mask.astype("uint8"))
    blended_image = visualize_mask_with_overlay(image, mask_image, LABELS_TO_IDS, alpha=0.5)
    return blended_image
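The snippet above leans on helpers from the accompanying repo. For orientation only, here is one possible shape for transform_fn and run_model, assuming a loaded segmentation model named seg_model; the repo's actual implementation may differ:

import torch
import torch.nn.functional as F
from torchvision import transforms

# Hypothetical preprocessing: resize to the model input size and normalize
transform_fn = transforms.Compose([
    transforms.Resize((1024, 768)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[123.675 / 255, 116.28 / 255, 103.53 / 255],
                         std=[58.395 / 255, 57.12 / 255, 57.375 / 255]),
])

def run_model(input_tensor, height, width):
    # Run the (hypothetical) loaded model `seg_model`, upsample the logits
    # back to the original image size, and take the per-pixel argmax.
    with torch.no_grad():
        logits = seg_model(input_tensor)  # (1, C, h, w)
    logits = F.interpolate(logits, size=(height, width), mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)  # (1, H, W) class indices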

Load Input Image:

pil_image = Image.open('sapiens2.jpg')

if pil_image.mode == 'RGBA':
    pil_image = pil_image.convert('RGB')

resized_pil_image = resize_image(pil_image, (640, 480))
resized_pil_image

Output:

The outputs from Sapiens' segmentation here are not fully satisfactory: the body-part boundaries are not clearly delineated. On the same image, Meta's SAM (Segment Anything Model) produces cleaner masks.

3. Depth Estimation — How Far is That?

Depth estimation helps the model understand how far away different parts of the image are. It’s like giving the model the ability to tell what’s close and what’s far in a photo. For example, in a picture of a person standing near a car, the model can estimate how far the person is from the car, which is important for tasks like augmented reality.

Architecture:

  • Input: Image (I ∈ R^H×W×3).
  • Step 1: Encoder-Decoder Architecture — Similar to body-part segmentation, the encoder extracts features from the image, and the decoder predicts the depth of each pixel.
  • Step 2: Single-Channel Depth Map — The key difference for depth estimation is that the output channel is set to 1, which generates a depth map. This depth map (d̂ ∈ R^H×W) predicts how far each point in the image is from the camera.
  • Step 3: Loss Function (Regression) — The depth estimation task is treated as a regression problem. The model compares its predicted depth values (d̂) with the ground truth (d) and minimizes the difference using an L1 regression loss (a minimal sketch follows this list): L_depth = ||d − d̂||₁
  • Step 4: Training on Synthetic Data — To improve its depth predictions, Meta Sapiens uses synthetic human data, including 600 high-resolution 3D scans of human figures from RenderPeople. This allows the model to generate detailed and realistic depth estimates even in difficult scenarios.
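As referenced above, here is a minimal sketch of the L1 depth regression loss; the optional valid-pixel mask is an illustrative addition for pixels without ground-truth depth:

import torch

def depth_loss(pred_depth, gt_depth, valid_mask=None):
    # pred_depth, gt_depth: (B, 1, H, W); L_depth = ||d - d_hat||_1
    diff = (gt_depth - pred_depth).abs()
    if valid_mask is not None:
        # Average only over pixels that have ground-truth depth
        return diff[valid_mask].mean()
    return diff.mean()

pred = torch.rand(2, 1, 256, 192)
gt = torch.rand(2, 1, 256, 192)
print(depth_loss(pred, gt))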

Implementation:

Load the depth weights:

TASK = 'depth'
VERSION = 'sapiens_0.3b'

model_path = get_model_path(TASK, VERSION)
print(model_path)

Define the depth-estimation functions:

import cv2
import numpy as np
import torch
import torch.nn.functional as F

def get_depth(image, depth_model, input_shape=(3, 1024, 768), device="cuda"):
    # Preprocess the image
    img = preprocess_image(image, input_shape)

    # Run the model
    with torch.no_grad():
        result = depth_model(img.to(device))

    # Post-process the output
    depth_map = post_process_depth(result, (image.shape[0], image.shape[1]))

    # Visualize the depth map
    depth_image = visualize_depth(depth_map)

    return depth_image, depth_map

def preprocess_image(image, input_shape):
    # Resize, convert HWC -> CHW, reorder channels, and normalize
    img = cv2.resize(image, (input_shape[2], input_shape[1]), interpolation=cv2.INTER_LINEAR).transpose(2, 0, 1)
    img = torch.from_numpy(img)
    img = img[[2, 1, 0], ...].float()
    mean = torch.tensor([123.5, 116.5, 103.5]).view(-1, 1, 1)
    std = torch.tensor([58.5, 57.0, 57.5]).view(-1, 1, 1)
    img = (img - mean) / std
    return img.unsqueeze(0)

def post_process_depth(result, original_shape):
    # Check the dimensionality of the result
    if result.dim() == 3:
        result = result.unsqueeze(0)
    elif result.dim() == 4:
        pass
    else:
        raise ValueError(f"Unexpected result dimension: {result.dim()}")

    # Ensure we're interpolating to the correct dimensions
    seg_logits = F.interpolate(result, size=original_shape, mode="bilinear", align_corners=False).squeeze(0)
    depth_map = seg_logits.data.float().cpu().numpy()

    # If depth_map has an extra dimension, squeeze it
    if depth_map.ndim == 3 and depth_map.shape[0] == 1:
        depth_map = depth_map.squeeze(0)

    return depth_map

def visualize_depth(depth_map):
    # Normalize the depth map
    min_val, max_val = np.nanmin(depth_map), np.nanmax(depth_map)
    depth_normalized = 1 - ((depth_map - min_val) / (max_val - min_val))

    # Convert to uint8
    depth_normalized = (depth_normalized * 255).astype(np.uint8)

    # Apply colormap
    depth_colored = cv2.applyColorMap(depth_normalized, cv2.COLORMAP_INFERNO)

    return depth_colored

# Surface normals can also be approximated from the depth map via image gradients
def calculate_surface_normal(depth_map):
    kernel_size = 7
    grad_x = cv2.Sobel(depth_map.astype(np.float32), cv2.CV_32F, 1, 0, ksize=kernel_size)
    grad_y = cv2.Sobel(depth_map.astype(np.float32), cv2.CV_32F, 0, 1, ksize=kernel_size)
    z = np.full(grad_x.shape, -1.0)
    normals = np.dstack((-grad_x, -grad_y, z))

    normals_mag = np.linalg.norm(normals, axis=2, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        normals_normalized = normals / (normals_mag + 1e-5)

    normals_normalized = np.nan_to_num(normals_normalized, nan=-1, posinf=-1, neginf=-1)
    normal_from_depth = ((normals_normalized + 1) / 2 * 255).astype(np.uint8)
    normal_from_depth = normal_from_depth[:, :, ::-1]  # RGB to BGR for cv2

    return normal_from_depth

Load Input Image:

import cv2

# get_depth expects a cv2 (NumPy BGR) image rather than a PIL image
image = cv2.imread('/home/user/app/assets/frame.png')

# `model` is the depth model loaded from the checkpoint path above
depth_image, depth_map = get_depth(image, model)

Output:

surface_normal = calculate_surface_normal(depth_map)
cv2.imwrite("output_surface_normal.jpg", surface_normal)

# Save the results
cv2.imwrite("output_depth_image2.jpg", depth_image)

4. Surface Normal Estimation — Understanding Surfaces

This task lets the model figure out the 3D surface details of a human body, such as the angle or direction of a surface at each point. For instance, it can understand the curves of a person’s face or the angles of their arms and legs.

Architecture:

  • Input: Image (I ∈ R^H×W×3).
  • Step 1: Encoder-Decoder Architecture — Like the previous tasks, the normal estimation model uses an encoder-decoder framework. The encoder extracts features from the image, and the decoder is adjusted for normal prediction.
  • Step 2: Three-Channel Output for Surface Normals — For normal estimation, the decoder output channel is set to 3, corresponding to the xyz components of the normal vector. Each pixel gets an xyz value representing the direction the surface at that point is facing.
  • Step 3: Loss Function (L1 + Cosine Similarity) — The model uses a combination of an L1 loss and cosine similarity to compare the predicted normal vectors (n̂) with the ground-truth normals (n). The loss is calculated as (a minimal sketch follows this list): L_normal = ||n − n̂||₁ + (1 − n · n̂)
  • Step 4: Supervision from Synthetic Data — Just like depth estimation, normal estimation relies on synthetic human data for supervision. This allows the model to make accurate predictions of surface orientation, even in complex cases like curved body parts or extreme poses.
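As referenced above, here is a minimal sketch of this combined loss, assuming both predicted and ground-truth normals are per-pixel unit vectors:

import torch
import torch.nn.functional as F

def normal_loss(pred_normals, gt_normals):
    # pred_normals, gt_normals: (B, 3, H, W) unit vectors per pixel
    # L_normal = ||n - n_hat||_1 + (1 - n . n_hat)
    l1 = (gt_normals - pred_normals).abs().sum(dim=1).mean()
    cos = F.cosine_similarity(pred_normals, gt_normals, dim=1)  # (B, H, W)
    return l1 + (1 - cos).mean()

pred = F.normalize(torch.randn(2, 3, 256, 192), dim=1)
gt = F.normalize(torch.randn(2, 3, 256, 192), dim=1)
print(normal_loss(pred, gt))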

Implementation:

Load the normal weights:

TASK = 'normal'
VERSION = 'sapiens_0.3b'

model_path = get_model_path(TASK, VERSION)
print(model_path)

Define the normal-estimation functions:

import torch
import torch.nn.functional as F
import numpy as np
import cv2

def get_normal(image, normal_model, input_shape=(3, 1024, 768), device="cuda"):
    # Preprocess the image
    img = preprocess_image(image, input_shape)

    # Run the model
    with torch.no_grad():
        result = normal_model(img.to(device))

    # Post-process the output
    normal_map = post_process_normal(result, (image.shape[0], image.shape[1]))

    # Visualize the normal map
    normal_image = visualize_normal(normal_map)

    return normal_image, normal_map

def preprocess_image(image, input_shape):
    img = cv2.resize(image, (input_shape[2], input_shape[1]), interpolation=cv2.INTER_LINEAR).transpose(2, 0, 1)
    img = torch.from_numpy(img)
    img = img[[2, 1, 0], ...].float()
    mean = torch.tensor([123.5, 116.5, 103.5]).view(-1, 1, 1)
    std = torch.tensor([58.5, 57.0, 57.5]).view(-1, 1, 1)
    img = (img - mean) / std
    return img.unsqueeze(0)

def post_process_normal(result, original_shape):
    # Check the dimensionality of the result
    if result.dim() == 3:
        result = result.unsqueeze(0)
    elif result.dim() == 4:
        pass
    else:
        raise ValueError(f"Unexpected result dimension: {result.dim()}")

    # Ensure we're interpolating to the correct dimensions
    seg_logits = F.interpolate(result, size=original_shape, mode="bilinear", align_corners=False).squeeze(0)
    normal_map = seg_logits.float().cpu().numpy().transpose(1, 2, 0)  # H x W x 3
    return normal_map

def visualize_normal(normal_map):
    normal_map_norm = np.linalg.norm(normal_map, axis=-1, keepdims=True)
    normal_map_normalized = normal_map / (normal_map_norm + 1e-5)  # Small epsilon avoids division by zero

    # Convert to 0-255 range and BGR format for visualization
    normal_map_vis = ((normal_map_normalized + 1) / 2 * 255).astype(np.uint8)
    normal_map_vis = normal_map_vis[:, :, ::-1]  # RGB to BGR

    return normal_map_vis

def load_normal_model(checkpoint, use_torchscript=False):
    if use_torchscript:
        return torch.jit.load(checkpoint)
    else:
        model = torch.export.load(checkpoint).module()
        model = model.to("cuda")
        model = torch.compile(model, mode="max-autotune", fullgraph=True)
        return model

Input Image:

import cv2

# Load the model (pass True to use the TorchScript checkpoint)
normal_model = load_normal_model(model_path, use_torchscript=True)

# Load the image (cv2 BGR array, as expected by get_normal)
image = cv2.imread("/home/user/app/assets/image.webp")

# Run normal estimation
normal_image, normal_map = get_normal(image, normal_model)

Output:
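As in the depth section, the visualized normal map can be written to disk (the output filename is illustrative):

cv2.imwrite("output_normal_image.jpg", normal_image)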

Limitations of Meta Sapiens

Even though Meta Sapiens excels at understanding human-related tasks, it faces challenges in more complex scenarios. For example, when multiple people are standing close together (crowding) or when individuals are in unusual or rare poses, the model struggles to accurately estimate poses and segment body parts. Additionally, severe occlusion — when parts of the body are hidden — further complicates the model’s ability to deliver precise results.

Conclusion

Meta Sapiens represents a significant step forward in human-centric AI, offering robust capabilities across pose estimation, segmentation, depth prediction, and surface normal estimation. However, like many models, it still has limitations, particularly in crowded or highly complex scenes. As AI continues to evolve, future iterations of models like Sapiens are expected to address these challenges, bringing us closer to more accurate and reliable human-centric applications.

References:

  1. Sapiens: https://about.meta.com/realitylabs/codecavatars/sapiens/
  2. Github Code: https://github.com/NandiniLReddy/Meta-Sapiens/
  3. Sapiens HuggingFace: https://huggingface.co/facebook/sapiens
