Head Pose Estimation with MediaPipe and OpenCV in JavaScript

Susanne Thierfelder
5 min read · Nov 15, 2022


In this blog post, I demonstrate how to estimate the head pose from a single image using MediaPipe FaceMesh and OpenCV in JavaScript. Check out my demo on CodePen!

Landmarks for pose estimation

Currently, the MediaPipe JavaScript Solution API does not include the ‘enableFaceGeometry’ option, which would allow obtaining the pose of each detected face automatically (see issue #2673).

In this article, I will look at the principles of pose computation from 3D-2D point correspondence, explain the relevant OpenCV algorithms, talk about landmarks in MediaPipe, and finally show you how to display the pose of the face.

Roll, pitch, and yaw angles for head pose estimation

Steps to estimate the face’s yaw, pitch, and roll angles in a given image:

1) Find face landmarks using MediaPipe ‘FaceMesh’
2) Produce a rotation vector with the OpenCV.js solvePnP function
3) Pass the rotation vector to the OpenCV Rodrigues function to get a rotation matrix
4) Finally, decompose the rotation matrix to get the Euler angles

The pose computation problem consists of solving for the rotation and translation that minimize the reprojection error from 3D-2D point correspondences (OpenCV, see also publication). The reprojection error is a geometric error corresponding to the image distance between a projected point and a measured one (see Wikipedia).
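As a concrete illustration, the mean reprojection error over a set of correspondences is just the average pixel distance between projected and measured points. A sketch in plain JavaScript (the function name is my own, not an OpenCV API):

```javascript
// Mean reprojection error: average Euclidean distance in pixels
// between projected points and their measured image counterparts.
// Both arguments are arrays of [u, v] pixel coordinates.
function meanReprojectionError(projected, measured) {
  let sum = 0;
  for (let i = 0; i < projected.length; i++) {
    const du = projected[i][0] - measured[i][0];
    const dv = projected[i][1] - measured[i][1];
    sum += Math.hypot(du, dv);
  }
  return sum / projected.length;
}
```

solvePnP searches for the rotation and translation that make exactly this quantity small.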

MediaPipe vs OpenCV coordinate systems

cv.solvePnP: Finds an object pose from 3D-2D point correspondences. This function returns the rotation R and the translation vector t that transform a 3D point expressed in the world frame into the camera frame:

Pose computation overview (OpenCV)
Points expressed in the world frame are first transformed into the camera frame and then projected onto the image plane [u,v] using the perspective projection model Π and the camera intrinsic parameter matrix A (also denoted K in the literature).

The camera intrinsic parameters include the focal length f, the optical center c, also known as the principal point, and the skew coefficient (read more here).
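Since we skip a calibration step, the intrinsics have to be approximated. A common choice in head-pose demos, and an assumption here rather than a calibrated value, is focal length ≈ image width, principal point at the image center, and zero skew:

```javascript
// Approximate 3x3 intrinsic matrix A (often called K), returned as a
// flat row-major array suitable for building a cv.Mat later:
//   [ f  0  cx
//     0  f  cy
//     0  0  1 ]
function approximateCameraMatrix(width, height) {
  const f = width;       // rough focal length guess, in pixels
  const cx = width / 2;  // principal point x
  const cy = height / 2; // principal point y
  return [
    f, 0, cx,
    0, f, cy,
    0, 0, 1,
  ];
}
```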

We need an array of 3D object points in the object coordinate space (the landmarks of MediaPipe’s reference model), an array of corresponding image points (the 2D landmarks from multiFaceLandmarks), the intrinsic camera matrix, and the distortion coefficients (for simplicity we assume there is no distortion).

Here we use ‘SOLVEPNP_ITERATIVE’ as the pose computation method, which uses a non-linear Levenberg-Marquardt minimization scheme. For non-planar objectPoints, the initial solution needs at least 6 points and is obtained with the Direct Linear Transformation (DLT) algorithm.

The output brings points from the world coordinate system to the camera coordinate system:

  • Output rotation vector (see Rodrigues)
  • Output translation vector

Rodrigues’ rotation formula is an efficient algorithm for rotating a vector in space, given an axis and angle of rotation (Wikipedia).

cv.Rodrigues: Converts a rotation matrix to a rotation vector or vice versa. Here we pass the rotation vector to get a rotation matrix.
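What cv.Rodrigues computes in the vector-to-matrix direction can be sketched directly from the formula R = I + sin(θ)·K + (1 − cos(θ))·K², where K is the skew-symmetric matrix of the unit rotation axis (`rodriguesToMatrix` below is an illustrative helper, not part of OpenCV.js):

```javascript
// Rodrigues' formula: convert a rotation vector rvec (unit axis scaled
// by the angle in radians) into a 3x3 rotation matrix, row-major.
function rodriguesToMatrix(rvec) {
  const theta = Math.hypot(rvec[0], rvec[1], rvec[2]);
  if (theta < 1e-12) return [1, 0, 0, 0, 1, 0, 0, 0, 1]; // identity
  const [kx, ky, kz] = rvec.map(v => v / theta); // unit rotation axis
  const c = Math.cos(theta), s = Math.sin(theta), v = 1 - c;
  // Expanded form of c*I + v*k*k^T + s*[k]_x
  return [
    c + kx * kx * v,      kx * ky * v - kz * s, kx * kz * v + ky * s,
    ky * kx * v + kz * s, c + ky * ky * v,      ky * kz * v - kx * s,
    kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v,
  ];
}
```

For example, a rotation vector [0, 0, π/2] (a quarter turn about the z-axis) yields a matrix that maps the x-axis onto the y-axis.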

cv.RQDecomp3x3: this function is not available in this OpenCV.js build, so we calculate the Euler angles by hand (see rotationMatrixToEulerAngles).

import math

# Decompose a rotation matrix R into Euler angles x, y, z (radians).
sy = math.sqrt(R[0,0] * R[0,0] + R[1,0] * R[1,0])

# The decomposition is singular when sy is near zero (gimbal lock).
singular = sy < 1e-6

if not singular:
    x = math.atan2(R[2,1], R[2,2])
    y = math.atan2(-R[2,0], sy)
    z = math.atan2(R[1,0], R[0,0])
else:
    x = math.atan2(-R[1,2], R[1,1])
    y = math.atan2(-R[2,0], sy)
    z = 0
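The snippet above is the Python reference implementation (see LearnOpenCV); since the demo runs in JavaScript, the same decomposition can be sketched as follows, taking the rotation matrix as a row-major flat array (`rotationMatrixToEulerAngles` here is an illustrative helper, not an OpenCV.js function):

```javascript
// Decompose a row-major 3x3 rotation matrix R (flat array, indexed as
// R[row * 3 + col]) into Euler angles [x, y, z] in radians, mirroring
// the Python version above.
function rotationMatrixToEulerAngles(R) {
  const sy = Math.hypot(R[0], R[3]); // sqrt(R[0][0]^2 + R[1][0]^2)
  const singular = sy < 1e-6;        // gimbal-lock check
  let x, y, z;
  if (!singular) {
    x = Math.atan2(R[7], R[8]);  // atan2(R[2][1], R[2][2])
    y = Math.atan2(-R[6], sy);   // atan2(-R[2][0], sy)
    z = Math.atan2(R[3], R[0]);  // atan2(R[1][0], R[0][0])
  } else {
    x = Math.atan2(-R[5], R[4]); // atan2(-R[1][2], R[1][1])
    y = Math.atan2(-R[6], sy);
    z = 0;
  }
  return [x, y, z];
}
```

With this convention, x, y, and z correspond roughly to the pitch, yaw, and roll of the head.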

For each face, MediaPipe FaceMesh returns a bounding box of the detected face and an array of 468 keypoints. Each keypoint, or facial landmark, has a 2D (x, y) or 3D (x, y, z) coordinate location of a facial feature, such as the corners of the lips and eyes, points on the eyebrows, irises, and face contours, and intermediate points on the cheeks and forehead.

MediaPipe FaceMesh landmarks with yellow landmarks used for pose estimation (see here)

For the keypoints, x and y represent the actual keypoint position in the image pixel space. z represents the depth with the center of the head being the origin, and the smaller the value the closer the keypoint is to the camera. The magnitude of z uses roughly the same scale as x.

The indices of the landmarks we are interested in are 1, 33, 263, 61, 291, and 199 which are evenly distributed on the face.
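Selecting those landmarks from a FaceMesh result and converting them to pixel space can be sketched like this (the helper name is my own; multiFaceLandmarks coordinates are normalized to [0, 1], so we scale by the image size):

```javascript
// Landmark indices used for pose estimation: nose tip, outer eye
// corners, mouth corners, and a chin-area point.
const POSE_LANDMARK_INDICES = [1, 33, 263, 61, 291, 199];

// FaceMesh landmarks are normalized to [0, 1]; convert the selected
// ones to pixel coordinates, flattened as [x0, y0, x1, y1, ...] so
// they can be fed into a cv.Mat later.
function imagePointsFromLandmarks(faceLandmarks, width, height) {
  const points = [];
  for (const i of POSE_LANDMARK_INDICES) {
    points.push(faceLandmarks[i].x * width, faceLandmarks[i].y * height);
  }
  return points;
}
```

Here `faceLandmarks` stands for one entry of results.multiFaceLandmarks.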

To display the pose of the face as a coordinate system on the nose tip, we project the nose landmark onto the image plane.

cv.projectPoints: Projects 3D points in world coordinates onto the image plane using our previously calculated rotation R (via Rodrigues), translation vector t, camera matrix K, and distortion coefficients.
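Ignoring distortion, as we do throughout, what cv.projectPoints does per point can be sketched in a few lines: transform the point into the camera frame with R and t, then apply the intrinsics and divide by depth (`projectPoint` is an illustrative helper, not an OpenCV.js function):

```javascript
// Project one 3D world point p = [X, Y, Z] into pixel coordinates
// [u, v]. R is a row-major 3x3 rotation matrix, t = [tx, ty, tz],
// and K is a row-major 3x3 intrinsic matrix. Distortion is assumed
// to be zero.
function projectPoint(p, R, t, K) {
  // Camera-frame coordinates: Xc = R * p + t
  const xc = R[0] * p[0] + R[1] * p[1] + R[2] * p[2] + t[0];
  const yc = R[3] * p[0] + R[4] * p[1] + R[5] * p[2] + t[1];
  const zc = R[6] * p[0] + R[7] * p[1] + R[8] * p[2] + t[2];
  // Perspective divide, then intrinsics: u = fx * x/z + cx, etc.
  const u = K[0] * (xc / zc) + K[2];
  const v = K[4] * (yc / zc) + K[5];
  return [u, v];
}
```

A point on the optical axis projects exactly onto the principal point; points off-axis are shifted in proportion to x/z and y/z.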

Finally, we draw a line from the 2D nose tip to the projected nose tip.

Final pose estimation result with MediaPipe landmarks in white, the estimated pose as a coordinate system on the nose, and landmarks projected using the estimated pose in cyan.

References

  1. MediaPipe documentation: https://google.github.io/mediapipe/
  2. MediaPipe FaceMesh solution: https://google.github.io/mediapipe/solutions/face_mesh
  3. Nicolai Nielsen’s demo in Python: https://github.com/niconielsen32/ComputerVision/blob/master/headPoseEstimation.py
  4. Mike C.: a head pose estimation Cycle.js demo app using opencv.js and tensorflow.js’ posenet: https://codesandbox.io/s/008olz2wmn?file=/src/index.js:1273-1277
  5. Camera calibration: https://fr.mathworks.com/help/vision/ug/camera-calibration.html
  6. Learn OpenCV: https://learnopencv.com/head-pose-estimation-using-opencv-and-dlib/
  7. Grab the code from here: https://codepen.io/Susanne-Thierfelder/pen/yLEOQXq
