Building Human Foundation Models with SMPL

Naureen Mahmood
Meshcapade
Jun 25, 2024

“You never change things by fighting the existing reality. To change something, build a new model that makes the existing model obsolete.” — Buckminster Fuller

Last month, at a talk at Stanford, our Chief Scientist, Dr. Michael Black, unveiled our vision of building the world’s first Human Foundation Agent (HFA): a foundation model built on the template of real people, so that it can see us, understand us, and, with that understanding, interact with us like a real person. We are building an autonomous virtual AI agent that can play, dance, learn and teach, just like us. And we’re building this using our 3D human parametric model, SMPL, as its foundation.

When we first created SMPL, the goal was to take the physical reality of humans in the real world, regardless of the modality in which we capture it, and convert it into a compressed, digitized format.

With SMPL we were building the 1s & 0s of humans for computers to understand us. With the HFA we are building on these 1s & 0s to teach computers to be like us.

1. SMPL is a compression algorithm.

For all the applications we use, the 3D attributes of a human are split into many separate parts: 3D shape, motion capture, body measurements, faces, hands, feet, emotions. So every field of research, every industry, and even different teams within the same company end up building their own “one part” representation of the human. The assumption is that since they only deal with, say, the body motion, or the body shape, or just the hands or the faces, they don’t need to model the full human. And then they’re stuck with that: their representation cannot encode any of the other details of the real, full human. In movies, where the full human body, motion, faces, expressions and behavior do have to be recreated, studios rely on artists manually and painstakingly building all of it over hours of work: some artists covering the body shape, others the skeletal structure, still others the motion and the expressions. It goes on. But this is obviously a very limited way to represent people, because faces, hands and body motion are not independent add-ons to a human being. Humans are not modular, and our bodies don’t function modularly.

SMPL replaces ALL of these separate representations of humans with a single representation. It encodes body shape, posture, motion, faces, expressions, soft tissue deformations, behavior, together with articulation for the body, hands and feet AND all of it in a format that’s differentiable so that it can be fit to any form of input or easily used to train neural networks.

The SMPL parametrization encodes 3D body shape, pose, soft tissue motion, hands, fingers, facial expressions — http://meshcapade.me/

And it does so with just 100 numbers. Every movement, every emotion, every body-shape variation you can imagine for any person can be encoded in just 100 numbers. That’s less than 1 KB of data. This is incredibly low dimensionality for training any AI model, be it a virtual AI agent or a robot that needs to interact with or move like humans.

It inverts the 3D reality of humans into a form that computers can “read”.
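As a rough back-of-the-envelope sketch of just how small this representation is (the 10 + 72 + 18 split below is illustrative, not the exact official SMPL layout, which varies across model variants):

```python
import struct

# Illustrative split into ~100 parameters (hypothetical layout):
# 10 shape coefficients, 72 body-pose values (24 joints x 3 axis-angle),
# plus 18 extra values standing in for hands/face in this toy example.
shape = [0.0] * 10
pose = [0.0] * 72
extras = [0.0] * 18
params = shape + pose + extras

# Serialize as 32-bit floats to see the storage cost of one body state.
payload = struct.pack(f"<{len(params)}f", *params)
print(len(params), "parameters,", len(payload), "bytes")  # 100 parameters, 400 bytes
```

At four bytes per float, one full body state fits in 400 bytes, comfortably under the 1 KB mentioned above.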

2. SMPL is the control signal for diffusion models.

Video diffusion models do not have an understanding of humans. They don’t really need one, unless you want some form of control over the motion of the humans generated in a video. Video diffusion models encode information into a “latent” space without needing to understand what that encoding means. The latent space is just a compression of all the information available to the diffusion model. It doesn’t have to differentiate between the kinds of information encoded; everything just needs a statistically coherent structure that the training process can learn from, and it has to be extremely compact. But that means there is no way to separate out and control the representation of a human in this statistical hodgepodge.

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance

SMPL is a statistical distribution, and the most compact representation of humans. Using SMPL in the encoding gives us control over that encoding, so we can drive the generated video with human motion.

It’s like having a joystick to control your character in the game-world of video diffusion models.
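To picture what “control signal” means here, a toy sketch of the interface (everything below is hypothetical; a real denoiser is a neural network, not a list concatenation):

```python
# Toy sketch of conditioning a denoising step on SMPL parameters.
# Hypothetical interface only: a real video diffusion model would feed
# the SMPL vector into a network as conditioning, not concatenate lists.

def denoise_step(latent, smpl_params, t):
    """Stand-in showing the interface: the SMPL parameters ride along
    as an extra conditioning input next to the latent and timestep."""
    return latent + [t] + smpl_params

latent = [0.1, 0.2, 0.3]   # toy compressed video latent
smpl_pose = [0.0] * 5      # a few toy SMPL pose values
out = denoise_step(latent, smpl_pose, t=0.5)
print(len(out))  # 9
```

The point is only that the pose vector enters the generation loop as a separate, controllable input, which is exactly what the unstructured latent space cannot offer on its own.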

3. SMPL is the label for all human behavior.

One of the major obstacles to training AI models to understand human behavior is simply that we don’t have enough words to describe the things we do. Everything in an AI model needs a “label” for the computer to build a semantic understanding from text, image or video input. For image and video models, text has been a sufficient “label” to help computers learn to generate different kinds of pictures and videos. But when it comes to precise control over the things we humans do, human language just doesn’t have enough labels.

The complexity of human pose — Polina Kovaleva

Quick: describe your exact pose at this moment in words. If you send that description to a friend over text, will they be able to replicate your exact pose? I doubt it.

We don’t need words to describe a lot of what we do because we are able to simply observe, understand and recreate the pose and movements using our own mental model of human embodiment and pose. At Meshcapade, we are solving the data & labelling problem for AI models using SMPL. We can embed complete human behaviors into AI training just using SMPL’s compact representation of humans.

→ The 100 parameters of SMPL give us the “language” of human behavior for training AI agents.
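As a minimal sketch of the idea (the file names and the flat 100-float label below are hypothetical placeholders), a dataset entry pairs each frame with a pose/shape vector instead of a word:

```python
# Hypothetical dataset pairing video frames with SMPL-parameter "labels".
frames = ["frame_000.png", "frame_001.png"]   # placeholder frame ids
smpl_labels = [[0.0] * 100, [0.01] * 100]     # one 100-float vector per frame

dataset = list(zip(frames, smpl_labels))
for frame, label in dataset:
    # A text label like "standing" cannot reproduce the exact pose;
    # the parameter vector can, to within the model's accuracy.
    print(frame, len(label))
```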

4. SMPL is the “canvas” for humans in 3D space.

What’s even more brilliant about SMPL is that it also enables 3D human information to be encoded on a consistent canvas. Consider pixels in an image — we always know pixel 0,0 will be the top left corner of an image (or bottom left, depending on your image renderer).

Pixels on a grid — wheatblog’s divArt: painting pixels with divs

For AI and machine learning, SMPL very quickly became the de facto standard for representing 3D humans, because anyone working with 3D human data, whether from 3D scans, 2D images or motion capture, needs this 3D “canvas”: one consistent representation of a human in 3D. In an image, a computer needs to understand the 2D grid on which point (0,0) is the top left of the canvas, (20,10) is 20 pixels from the left and 10 pixels down, and so on. With that, the computer can analyze the pixels across millions of images and correlate them with other pixels, or even with text. But 3D is far more complex. There are no consistent rows and columns. Before any of this data can be used for AI training, the computer needs a consistent way to identify that point (0,0,0) is, say, the tip of the right index finger of a human, while point (1,0,0) in the same 3D point cloud is the edge of a table. And so on.

3D points on a grid — SMPL

If you have no way to create consistency across all the possible 3D representations of the world, you end up redoing this work for each new 3D scan or motion dataset you get. This is why SMPL became so essential for training new models, whether neural nets or transformers. SMPL can simply be fit to any 3D representation: point clouds, discretized volumes, implicit surfaces, other graphics primitives (NURBS, CAD, etc.), or even images of people in different scenes, and it will recreate the human in 3D from such data. This gives everyone a consistent way to map their own data onto a known 3D canvas: the SMPL model. The field of human pose and shape estimation cannot move forward without a consistent 3D representation of humans, and this is why SMPL became an essential cornerstone for it.

SMPL also defines a volume, not just a surface. Implicit shape representations of humans use this to canonicalize 3D space by unposing it with SMPL, which is very useful. So SMPL does more than provide a consistent canvas (surface).

→ SMPL provides a consistent representation for 3D space on and inside the human.
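The practical payoff is fixed vertex correspondence. A sketch, assuming the standard SMPL body topology of 6,890 vertices (vertex i always lands on the same spot on the body, so two fitted meshes can be compared index by index):

```python
# Sketch: a consistent "canvas" via fixed vertex indexing.
# SMPL meshes share one fixed topology (6890 vertices in the body model),
# so vertex i always corresponds to the same location on the body.
NUM_VERTS = 6890  # SMPL body vertex count

def vertex_offsets(mesh_a, mesh_b):
    """Per-vertex displacement between two SMPL meshes in correspondence."""
    assert len(mesh_a) == len(mesh_b) == NUM_VERTS
    return [(ax - bx, ay - by, az - bz)
            for (ax, ay, az), (bx, by, bz) in zip(mesh_a, mesh_b)]

# Two toy meshes: identical except vertex 0 is shifted along x.
mesh_a = [(0.0, 0.0, 0.0)] * NUM_VERTS
mesh_b = [(0.1, 0.0, 0.0)] + [(0.0, 0.0, 0.0)] * (NUM_VERTS - 1)
diffs = vertex_offsets(mesh_a, mesh_b)
print(diffs[0])  # (-0.1, 0.0, 0.0): same index, same body point
```

Without this shared indexing, every new scan or point cloud would need its own correspondence step before any comparison is possible.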

5. SMPL is a 3D graphics primitive.

During grad school I learned about the primitives coded into 3D graphics programs. Planes, cubes, spheres, cones, cylinders, and of course the Utah teapot. This always struck me as such a cool little detail about how we create all of 3D graphics. Yes, it is triangles all the way down. But if you can turn a complex form into an equation, you can simply create a “primitive” for it in Maya (or Blender).

It’s an in-joke to hide the Utah teapot in animated movies and shorts. It was an easter egg in the Toy Story tea party scene.

With SMPL, we had done just that. We created a 3D primitive for representing humans in graphics engines, just like the cylinders, cubes and spheres you can make inside Maya or Blender. Say you create a sphere inside Maya. An equation built into Maya’s API describes the 3D world coordinates of the sphere’s center, its radius and the points on its surface. You change the sphere by changing the radius or the center. In the same way, SMPL lets you create a realistic human body in 3D with a single equation. Just like the sphere’s parameters, the parameters of the SMPL equation let you create different human body shapes and poses right inside a 3D environment.

→ It’s like any other equation for objects in 3D engines. It just creates a human instead of a sphere or a cube.
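To make the analogy concrete, here is a sketch (the `body_surface` stub is hypothetical; the real SMPL function maps its shape and pose parameters to a full posed mesh):

```python
import math

# Analogy sketch: a sphere primitive is an equation from parameters to
# surface points; SMPL is the same idea with body parameters instead.

def sphere_point(center, radius, theta, phi):
    """Classic parametric sphere: parameters -> a point on the surface."""
    cx, cy, cz = center
    return (cx + radius * math.sin(theta) * math.cos(phi),
            cy + radius * math.sin(theta) * math.sin(phi),
            cz + radius * math.cos(theta))

def body_surface(shape_betas, pose_thetas):
    """Hypothetical stand-in for the SMPL function: ~100 parameters in,
    a mesh of 6890 vertices out. Here we only mimic the interface."""
    return [(0.0, 0.0, 0.0)] * 6890  # placeholder vertices

p = sphere_point((0, 0, 0), 1.0, math.pi / 2, 0.0)
print(round(p[0], 6))  # 1.0, a point on the unit sphere along +x
```

Change the sphere’s radius and you get a different sphere; change SMPL’s shape and pose parameters and you get a different human, all inside the same 3D engine.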

SMPL inside Blender — https://github.com/Meshcapade/SMPL_blender_addon

Further reading:

These are some of the papers from just this year’s CVPR 2024 that use SMPL for 3D human pose estimation, motion generation, hand-object interaction, and even training humanoid robots.

Here are the X links with additional explainers in tweet form:

Also, check out Meshcapade’s Science publications page which contains a lot more publications (past and present) from our team about topics related to humans in 3D: https://meshcapade.com/science
