MediaPipe blendshape coefficients visualization

Pavel Korobov
Studio Neiro AI
Jan 26, 2024

In this article, we’ll discuss how to use MediaPipe’s blendshape coefficient estimation and how to use the estimated coefficients to animate a blendshape 3D face model.

At Neiro.AI we create digital avatars that can speak and express themselves with perfectly synchronized facial features. One of the many tools that we use in our work is MediaPipe, a great library made by Google. And among a wide range of different functionalities, such as facial landmark detection or pose estimation, it supports ARKit-compatible blendshape coefficient estimation.

But first of all, what is a blendshape and why would we need it?

Let’s take a look at the image below. We see 3D models of a neutral face and a face with its mouth wide open.

Source: https://arkit-face-blendshapes.com/

But how can we get faces with intermediate jaw openness? Let’s just blend the vertices of these two face shapes!

And we can easily generalize this to the case of 52 expressions (as in ARKit). We just fix a neutral expression and add weighted differences between each other expression and the neutral one. The weight of each difference should lie in the [0, 1] interval.

So with only 52 facial shapes and a neutral one, we can represent a broad continuum of other expressions using blending. For example, we can get a face with its left eye closed and mouth wide open, etc.
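As a minimal sketch of this blending (with made-up toy arrays rather than real ARKit meshes), the blended face is just the neutral vertices plus a weighted sum of per-shape offsets:

import numpy as np

# toy example: V vertices with xyz coordinates (real ARKit meshes have thousands)
neutral = np.zeros((4, 3))              # neutral face vertices, shape (V, 3)
shapes = np.random.rand(52, 4, 3)       # 52 expression shapes, shape (52, V, 3)
weights = np.zeros(52)                  # blendshape coefficients in [0, 1]
weights[25] = 0.7                       # e.g. the jaw-open shape at 70%

# blended face = neutral + sum_i weights[i] * (shapes[i] - neutral)
blended = neutral + np.tensordot(weights, shapes - neutral, axes=1)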

Some other examples of ARKit blendshapes from arkit-face-blendshapes.com

Blendshapes are widely used in animation and can also be very beneficial for avatar technologies as a compact descriptor of facial motion. Let’s finally see how we can get these coefficients from real images.

First of all, you’ll need to install MediaPipe and download the face landmarker model:

pip install mediapipe

wget -O face_landmarker_v2_with_blendshapes.task -q https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task

Several lines to initialize the model:

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

base_options = python.BaseOptions(model_asset_path='face_landmarker_v2_with_blendshapes.task', delegate="CPU")
options = vision.FaceLandmarkerOptions(base_options=base_options,
                                       output_face_blendshapes=True,
                                       output_facial_transformation_matrixes=True,
                                       num_faces=1)
detector = vision.FaceLandmarker.create_from_options(options)

Let’s now look at one video from the popular HDTF dataset used by many digital avatar works.

You can find it here. Let’s extract frames from it:

import cv2
import numpy as np

cap = cv2.VideoCapture("resources/WDA_BarackObama_000.mp4")
frames = []
ret = True
while ret:
    ret, frame = cap.read()
    if not ret:
        break
    # OpenCV reads frames as BGR; MediaPipe expects RGB
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frames.append(frame)
cap.release()

Then let’s take a close look at one frame from it and visualize the scores:

frame_idx = 100
image = mp.Image(
    image_format=mp.ImageFormat.SRGB, data=frames[frame_idx]
)
detection_result = detector.detect(image)
frame_scores = np.array([blendshape.score for blendshape in detection_result.face_blendshapes[0]])
blendshape_names = [blendshape.category_name for blendshape in detection_result.face_blendshapes[0]]
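
The plotting code is not included in the snippet above; a simple horizontal bar chart like the following sketch (using matplotlib, with an arbitrarily chosen figure size) reproduces the kind of figure captioned below:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 12))
ax.barh(blendshape_names, frame_scores)
ax.set_xlabel("score")
ax.invert_yaxis()  # keep MediaPipe's ordering from top to bottom
plt.tight_layout()
plt.show()
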
Blendshape coefficients for the chosen frame

Yes, we can definitely see that some coefficients match the facial expression. The match is especially clear for the coefficients responsible for brow movements, and it seems the model correctly figured out that the jaw is slightly open.

However, this list of coefficients is still hard to interpret (at least for me). Does it really represent this facial expression well? And, as you can imagine, it would be even harder to tell whether it captures the dynamics of the facial features well throughout the entire video.

Now we will extract all the coefficients from the video. As the coefficients are a bit noisy, we’ll smooth them; you can experiment with the smoothing window if you want.

def smooth_array(arr, window=3):
    # moving average over time, applied independently to each coefficient
    kernel = np.array([1] * window) / window
    smoothed_arr = np.ones_like(arr)

    for i in range(arr.shape[1]):
        smoothed_arr[:, i] = np.convolve(arr[:, i], kernel, mode='same')
    return smoothed_arr

video_coeffs = []

for frame in frames:
    image = mp.Image(
        image_format=mp.ImageFormat.SRGB, data=frame
    )
    detection_result = detector.detect(image)
    frame_coeffs = np.array([blendshape.score for blendshape in detection_result.face_blendshapes[0]])
    video_coeffs.append(frame_coeffs)

video_coeffs = np.stack(video_coeffs, axis=0)
# the coefficients may be a bit noisy, so we smooth them before saving
video_coeffs = smooth_array(video_coeffs)
np.save("video_coefficients.npy", video_coeffs)
np.save("blendshape_names.npy", blendshape_names)

Animation with Blender

Now we are finally ready to animate a blendshape 3D model!

Don’t worry, I’ve already found and prepared a suitable one. You can find it on GitHub, along with the code used in this article: https://github.com/mynalabsai/blendshapes-visualization/blob/master/resources/blendshapes.blend

Just open this Blender project, where I have already prepared the lights and a camera for rendering. It should look something like this:

The Blender project with the blendshape 3D model

If you select the face of the model and then open the “Object Data Properties” tab on the right, you’ll be able to see the blendshape (or shape key, as it is called in Blender) coefficients of the model.

You can even play around with the values of some shape keys and watch the expression change; the quick sketch below shows how to do the same from Blender’s Python console. We will then set all the shape keys from the extracted coefficients programmatically.
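
A tiny sketch for that (it assumes the face object is the active object and uses the ARKit-style key name jawOpen, which this model’s shape keys follow):

import bpy

# drive one ARKit-style shape key of the active face object
obj = bpy.context.active_object
obj.data.shape_keys.key_blocks["jawOpen"].value = 0.8  # assumes a "jawOpen" key exists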

The following script will help us with it:

import bpy
import os
import numpy as np
from pathlib import Path

cur_dir = "/specify/a/path"
os.chdir(cur_dir)

coeffs = np.load("video_coefficients.npy")
blendshape_names = np.load("blendshape_names.npy")

# reference the active object
o = bpy.context.active_object

output_dir = Path("blender_renders")
output_dir.mkdir(exist_ok=True)

# reset all shape keys (skipping the neutral one at index 0)
for k in blendshape_names[1:]:
    o.data.shape_keys.key_blocks[k].value = 0

for i in range(coeffs.shape[0]):
    cur_scores = dict(zip(blendshape_names[1:], coeffs[i][1:]))
    for k, v in cur_scores.items():
        o.data.shape_keys.key_blocks[k].value = v

    bpy.context.scene.render.filepath = str(output_dir / f"out_{i}.png")
    bpy.ops.render.render(write_still=True)

Copy it into the Text Editor and run the script (a Text Editor area should be open by default in this project, but you can also switch to it in Blender with Shift + F11). Don’t forget to set the cur_dir variable to the directory where you put the .npy files.

Let’s now combine the frames that appeared in the blender_renders folder using ffmpeg.

ffmpeg -start_number 0 -i blender_renders/out_%d.png out.mp4

And stack this new video with the original one.

ffmpeg -i WDA_BarackObama_000.mp4 -i out.mp4 -filter_complex hstack=inputs=2 stacked_result.mp4

And here’s what we get! As you can see, the animation, especially the lip sync, looks quite decent.

Rendering with PyTorch3D

We can also render the animation manually with Python. This time, we’ll only need Blender to extract the shape keys from the model. All you need to do then is apply the simple blending formula we discussed at the beginning of this article to the vertices of your model.

Select the whole model’s head (with hair), put this script into Blender’s text editor, and run it.

# based on this script: https://gist.github.com/versluis/2e092b466b989b1e91f316599bcce016
# adapted for Blender 4.0

import bpy
import os
from pathlib import Path

cur_dir = "/specify/a/path"
os.chdir(cur_dir)

# Reference the active object
o = bpy.context.active_object

# CHANGE THIS to the folder you want to save your OBJ files in
exportPath = "shape_keys"
Path(exportPath).mkdir(exist_ok=True)

# Reset all shape keys to 0 (skipping the Basis shape at index 0)
for skblock in o.data.shape_keys.key_blocks[1:]:
    skblock.value = 0

# Iterate over shape key blocks and save each as an OBJ file
for skblock in o.data.shape_keys.key_blocks[1:]:
    skblock.value = 1.0  # Set shape key value to max

    # Set OBJ file path and export OBJ
    objFileName = skblock.name + ".obj"  # File name = shape key name
    objPath = os.path.join(exportPath, objFileName)
    bpy.ops.wm.obj_export(filepath=objPath, export_selected_objects=False, export_materials=False, export_uv=False, apply_modifiers=True)

    skblock.value = 0  # Reset shape key value to 0

# Export the neutral shape (all shape keys at 0)
objPath = os.path.join(exportPath, "_neutral.obj")
bpy.ops.wm.obj_export(filepath=objPath, export_selected_objects=False, export_materials=False, export_uv=False, apply_modifiers=True)

You should get a set of .obj files, one per shape key. On macOS, you can even preview the resulting 3D shapes in the .obj files.

I suggest using PyTorch3D for rendering. To ease the installation of PyTorch3D and the other libraries we need, you can just recreate my environment from environment.yml:

conda env create -f environment.yml

Let’s import everything and set up the camera for rendering. By the way, if you want to find out more about PyTorch3D, I recommend having a look at their great tutorials: https://pytorch3d.org/tutorials/

import torch
import numpy as np
from tqdm.notebook import tqdm
import imageio
import matplotlib.pyplot as plt
from pytorch3d.io import load_obj
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, look_at_view_transform,
    RasterizationSettings, MeshRenderer, MeshRasterizer,
    SoftPhongShader, PointLights, Textures
)
from pathlib import Path

device = torch.device("cuda")
verts_neutral, faces, _ = load_obj("resources/shape_keys/_neutral.obj")

verts_rgb = torch.ones_like(verts_neutral)[None]  # (1, V, 3)
tex = Textures(verts_rgb=verts_rgb.to(device))

# Set up the camera for rendering

distance = -0.3   # distance from the camera to the object
elevation = 0.0   # angle of elevation in degrees
azimuth = 0.0     # no rotation, so the camera is positioned on the +Z axis

lights = PointLights(device=device, location=[[0, 0.0, -3.0]])

# Get the position of the camera based on the spherical angles
R, T = look_at_view_transform(distance, elevation, azimuth, device=device)
cameras = FoVPerspectiveCameras(device=device, R=R, T=T, znear=0.1)
raster_settings = RasterizationSettings(
    image_size=512,
    blur_radius=0.0,
    faces_per_pixel=1,
)

renderer = MeshRenderer(
    rasterizer=MeshRasterizer(
        cameras=cameras,
        raster_settings=raster_settings
    ),
    shader=SoftPhongShader(
        device=device,
        cameras=cameras,
        lights=lights
    )
)

And this is how we get the rendered neutral expression:

flame_mesh = Meshes(
    verts=[verts_neutral.to(device)],
    faces=[faces.verts_idx.to(device)],
    textures=tex
)

target_images = renderer(flame_mesh, cameras=cameras)
render_np = target_images[0][..., :3].detach().cpu().numpy()

_, ax = plt.subplots()
ax.imshow(render_np)
ax.axis(False);

Let’s now compute the differences between the neutral expression vertices and all the others:

shape_keys_path = Path("resources/shape_keys")
shape_deltas = dict()

for key in blendshape_names:
    cur_path = shape_keys_path / (key + ".obj")
    cur_verts, _, _ = load_obj(cur_path)
    shape_deltas[key] = cur_verts - verts_neutral

Below is a function to render a sequence of vertices:

def render_verts(verts_sequence, faces, texture, output_file="output.mp4"):
    writer = imageio.get_writer(output_file, fps=25, macro_block_size=1)

    for verts in verts_sequence:
        # create a mesh and render it
        flame_mesh = Meshes(
            verts=[verts.to(device)],
            faces=[faces.verts_idx.to(device)],
            textures=texture
        )
        rendered_frame = renderer(flame_mesh, cameras=cameras)
        rendered_frame_np = rendered_frame[0][..., :3].detach().cpu().numpy()

        # write the rendered frame into a video
        writer.append_data(np.uint8(rendered_frame_np * 255))

    writer.close()
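
The final step is not shown in the original snippets, so here is a minimal sketch of how the per-frame vertices could be assembled with the blending formula from the beginning of the article and passed to render_verts. It reuses verts_neutral, shape_deltas, blendshape_names, faces, and tex defined above; the output file name is arbitrary:

# load the smoothed per-frame coefficients extracted earlier
video_coeffs = np.load("video_coefficients.npy")  # shape: (num_frames, 52)

verts_sequence = []
for frame_coeffs in video_coeffs:
    # blended vertices = neutral + sum_i w_i * (shape_i - neutral)
    verts = verts_neutral.clone()
    for name, weight in zip(blendshape_names, frame_coeffs):
        verts = verts + float(weight) * shape_deltas[name]
    verts_sequence.append(verts)

render_verts(verts_sequence, faces, tex, output_file="pytorch3d_render.mp4")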

And we get results similar to the previous ones:

As you can see, MediaPipe predicts blendshape coefficients quite well, and now you can animate blendshape 3D models with it. Thank you for reading this article!
