3D Human Pose Classification using Mediapipe and PointNet

Wesley
11 min read · Apr 26, 2022


Introduction

Given the exponential increase in bandwidth, the processing power of consumer devices, and camera quality over the past decade, video has rapidly become a preferred medium of communication. It has found increasing use in many areas, ranging from entertainment and lifestyle to security, communications and healthcare.

In applications where humans and their actions need to be identified, pose estimation models are an invaluable component of image processing. These models detect humans in an image and output keypoints that highlight specific areas of interest such as the head, hands, elbows, shoulders, hips, legs and feet, enabling the classification of positions and actions based on the relative position and configuration of these extracted features. These toolkits have led to the development of many types of applications, ranging from exercise posture evaluation to sign language recognition and body gesture control.

Recently, a few pose estimation models have extended their landmark predictions into 3D space. This additional step normalises the orientation of the subject within the image and allows for more robust downstream analysis of the captured landmarks. An issue that plagued applications relying only on 2D landmarks was that downstream classification models could fail if the camera was tilted or the subject was not facing the required direction: features such as angles and distances calculated from the 2D landmarks would no longer resemble what the model was trained on, and the predictions would suffer. 3D landmarks offer an additional layer of detail that removes the directional and orientational ambiguity inherent in 2D landmark models.

3D landmarks as generated by the Mediapipe model

With 3D landmarks, engineered features such as angles and distances can be derived and used in classification models. However, extensive domain knowledge is needed to identify the important features, and they can be tedious to handcraft for every application. A deep learning approach that accepts the raw 3D landmarks as inputs would streamline and accelerate the development pipeline. The PointNet model is capable of processing 3D point clouds while preserving the spatial features of the input data. Applied in a yoga pose assessment pipeline, it has proven to be an effective and efficient downstream classification model for 3D landmarks derived from pose estimation models. In this article, I will describe and explain the architecture required to perform 3D human pose classification using PointNet as a deep learning approach. The various steps required for model training and prediction are outlined below with code examples.

3D Pose Estimation Model — BlazePose (Mediapipe)

There are many pose estimation libraries that facilitate the prediction of 3-dimensional landmark data, such as OpenPose, PoseDetection and DensePose. In this article, we will use the BlazePose model, available through Mediapipe's Pose solution, for the necessary landmark detections.

Compared to other state-of-the-art models that require powerful desktop environments, the BlazePose model is able to provide real-time landmark predictions on mobile phones with CPU inference. The model is also capable of detecting a total of 33 keypoints, 16 more than the standard 17 keypoints defined by the Common Objects in Context (COCO) topology. These include keypoints for the face, fingers and feet, and the additional keypoints enable the capture of more intricate semantic information not possible with other models. The combination of these two capabilities makes BlazePose the preferred model for many real-time video applications, such as tracking facial expressions and hand gestures, as well as for fitness applications like sports, dance and yoga trackers.

The 33 landmarks defined by the BlazePose model are indicated in the following figure:

33 output landmarks specified by the BlazePose model

The code examples in this article use all 33 landmarks for classification. If a specific application only needs a certain subset of keypoints (e.g. hand keypoints for sign language classification), only the relevant keypoints need to be extracted from the images and passed to the model for training, as sketched below.
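As a minimal sketch of how such a subset could be selected, the relevant rows can simply be sliced out of the (33, 3) landmark array before training. The indices below are illustrative (intended to correspond to the hand-related landmarks in the figure above) and should be verified against your own application's needs:

# Illustrative only: indices of the hand-related BlazePose landmarks
# (wrists, pinkies, index fingers, thumbs); verify against the landmark figure above
HAND_LANDMARK_IDS = [15, 16, 17, 18, 19, 20, 21, 22]

def select_keypoints(landmarks, keypoint_ids=HAND_LANDMARK_IDS):
    # Slice a (33, 3) landmark array down to the keypoints of interest
    return landmarks[keypoint_ids, :]

# e.g. hand_landmarks = select_keypoints(np.array(landmark_list))  # shape (8, 3)

If a subset is used, the model's input shape (set further below as keras.Input(shape=(33, 3))) would need to be changed to match the reduced number of keypoints.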

Deep Learning Based Classification Model — PointNet

Once the landmarks have been identified in the images, a common technique for classifying poses is to engineer various features from the generated landmarks, such as angles or distances between the various points. These features are then used as inputs to downstream classification models to distinguish between poses. A disadvantage of this technique is that relevant domain expertise is required to identify the discriminatory features that define each type of pose. The application of deep learning methods can potentially improve overall classification accuracy by identifying patterns and features that domain experts may have missed.
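For illustration, one such hand-engineered feature might be the angle at a joint, computed from three of the 3D landmarks. The helper below is a hypothetical sketch rather than part of any particular library:

import numpy as np

def joint_angle(a, b, c):
    # Angle (in degrees) at point b, formed by the segments b->a and b->c,
    # where a, b and c are 3D landmark coordinates (e.g. shoulder, elbow, wrist)
    ba = np.array(a) - np.array(b)
    bc = np.array(c) - np.array(b)
    cosine = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc) + 1e-8)
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

# e.g. elbow_angle = joint_angle(shoulder_xyz, elbow_xyz, wrist_xyz)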

Until recently, deep learning approaches for 3D data have been limited by the lack of effective techniques for representing the data. The PointNet model, developed by Charles Qi and team at Stanford University in 2017, provides a deep learning framework capable of effectively handling 3D geometric data such as point clouds and meshes. It is a unified architecture that takes point clouds directly as input and outputs either class labels for the entire input, or per-point segment/part labels for each point of the input.

Though the model is frequently used for dense point cloud representations obtained from depth cameras or 3D scanners, involving hundreds or thousands of points, it can also effectively handle classification tasks involving small point clouds such as the 3D landmark coordinates generated by the BlazePose model. The PointNet model is permutation invariant, which allows it to handle the unstructured nature of point cloud data; transformation invariant, which enables it to handle classification and segmentation tasks despite rotations and translations; and able to capture interactions and relative positions between the various points. This preserves its robustness in classification tasks on 3D landmark outputs, given that the landmarks could be rotated or angled differently depending on the position of the camera and/or the facing direction of the subject within the scene.

PointNet model architecture

The PointNet architecture shown above takes 3D point clouds as input, applies a series of input and feature transformations, and aggregates the resulting point features by max pooling. The aggregated features are then passed through a multi-layer perceptron to generate output class scores.

The following sections will explain the framework used to prepare the data for training and subsequent deployment of the model.

Prerequisite Package Versions

Python = 3.7 (Recommended version when Mediapipe is used in conjunction with OpenCV)

OpenCV >= 3.0

Mediapipe >= 0.8.9.1

Tensorflow >= 2.4.1

Necessary Imports

import os

import cv2
import imutils
import matplotlib.pyplot as plt
import mediapipe as mp
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles

Preparation of Training Data

To train the model, a dataset of images of the respective poses will need to be collected. The images should be sorted into train and test sets, as well as into respective folders representing each of the classes.

- “…/train/class_1/”
- “…/train/class_2/”
- “…/train/class_3/”
- “…/test/class_1/”
- “…/test/class_2/”
- “…/test/class_3/”

Successful pose classification requires that the subject in the picture can be detected by the BlazePose model. To ensure that a correct and accurate set of landmarks is identified, the following conditions should be met:

- The image should be cropped such that only the subject of interest is present in the frame.
- The subject should be reasonably visible (not camouflaged or partially obscured from view)
- All representative landmarks required for classification should be visible in the frame (e.g. hands for gesture analysis, legs for dance move classification)

Once the data has been prepared, the following code can be used to populate the lists of files and class labels for downstream processing.

train_points = []
train_labels = []
test_points = []
test_labels = []
class_map = {}

# Prepare the training image file list and class labels
train_image_folder_name = '.../train'
train_image_file_list = []
train_image_class_list = sorted(os.listdir(train_image_folder_name))
for i, subfolder in enumerate(train_image_class_list):
    class_map[i] = subfolder
    file_list = sorted(os.listdir(train_image_folder_name + '/' + subfolder))
    for file in file_list:
        train_image_file_list.append(train_image_folder_name + '/' + subfolder + '/' + file)
        train_labels.append(i)

print("Train Gallery: %d classes, %d images" % (len(train_image_class_list), len(train_image_file_list)))

# Prepare the test image file list and class labels
test_image_folder_name = '.../test'
test_image_file_list = []
test_image_class_list = sorted(os.listdir(test_image_folder_name))
for i, subfolder in enumerate(test_image_class_list):
    file_list = sorted(os.listdir(test_image_folder_name + '/' + subfolder))
    for file in file_list:
        test_image_file_list.append(test_image_folder_name + '/' + subfolder + '/' + file)
        test_labels.append(i)

print("Test Gallery: %d classes, %d images" % (len(test_image_class_list), len(test_image_file_list)))
print(class_map)

Next, run the following block of code to populate the train_points list with the x, y and z coordinates for all the training images. Repeat the code with the test_image_file_list to populate the test_points list.

The code appends a 'nil' placeholder when the BlazePose model fails to detect a person in the image, which indicates that the image is not usable and will need to be removed from the dataset.

The output of the BlazePose model is a protocol buffer (protobuf) message containing the x, y and z coordinates, along with a visibility score, for each of the 33 landmarks. To convert the data into a format suitable for TensorFlow, we loop through the x, y and z coordinates in the message and aggregate the landmarks into a list.

# Extract keypoints from each input file and append to the pre-generated list of points
with mp_pose.Pose(
        static_image_mode=True, min_detection_confidence=0.5, model_complexity=1) as pose:

    for item in train_image_file_list:

        frame = cv2.imread(item)

        # Recolour image to RGB for processing
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image.flags.writeable = False

        # Pose detection using model
        results = pose.process(image)

        try:
            landmarks_3d = results.pose_world_landmarks.landmark

            landmark_list = []
            for lm in range(33):
                landmark_list.append([landmarks_3d[lm].x, landmarks_3d[lm].y, landmarks_3d[lm].z])

            train_points.append(np.array(landmark_list))
        except AttributeError:
            # No person detected in this image
            train_points.append(['nil'])
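Once this block has been run for both the training and test images, any 'nil' entries (and their corresponding labels) should be dropped before building the datasets, since every point cloud passed to TensorFlow must have the same (33, 3) shape. A minimal filtering sketch, with a helper name of my own choosing:

# Drop entries where pose detection failed ('nil' placeholders), together with their labels
def drop_failed_detections(points, labels):
    kept_points, kept_labels = [], []
    for pts, lbl in zip(points, labels):
        if isinstance(pts, np.ndarray) and pts.shape == (33, 3):
            kept_points.append(pts)
            kept_labels.append(lbl)
    return np.array(kept_points), kept_labels

train_points, train_labels = drop_failed_detections(train_points, train_labels)
test_points, test_labels = drop_failed_detections(test_points, test_labels)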

Next, the labels are one-hot encoded, the data is augmented to improve the robustness of training, and the point clouds and their respective labels are converted into tensors for compatibility with the TensorFlow model.

def augment(points, label):
    # Jitter points with small random noise
    points += tf.random.uniform(points.shape, -0.005, 0.005, dtype=tf.float64)
    # Shuffle the order of the points (PointNet is permutation invariant)
    points = tf.random.shuffle(points)
    return points, label

test_labels_onehot = np.array(pd.get_dummies(test_labels))
train_labels_onehot = np.array(pd.get_dummies(train_labels))

train_dataset = tf.data.Dataset.from_tensor_slices((train_points, train_labels_onehot))
test_dataset = tf.data.Dataset.from_tensor_slices((test_points, test_labels_onehot))

# BATCH_SIZE is defined in the next section together with the model parameters
train_dataset = train_dataset.shuffle(len(train_points)).map(augment).batch(BATCH_SIZE)
test_dataset = test_dataset.shuffle(len(test_points)).batch(BATCH_SIZE)

Building and Training the Model

The following parameters and functions need to be specified before the model can be built. The NUM_CLASSES variable should be set to the number of classes present in the data. The BATCH_SIZE variable indicates the number of samples processed per training step (it is also used when batching the datasets above), and should be balanced against the total number of images in the dataset as well as the number of classes.

NUM_CLASSES = 4    # Specify number of classes as required by your application
BATCH_SIZE = 128   # Specify batch size for training

# Convolution and dense layers of the T-net
def conv_bn(x, filters):
    x = layers.Conv1D(filters, kernel_size=1, padding="valid")(x)
    x = layers.BatchNormalization(momentum=0.0)(x)
    return layers.Activation("relu")(x)

def dense_bn(x, filters):
    x = layers.Dense(filters)(x)
    x = layers.BatchNormalization(momentum=0.0)(x)
    return layers.Activation("relu")(x)

# Regularizer function for the MLP
class OrthogonalRegularizer(keras.regularizers.Regularizer):
    def __init__(self, num_features, l2reg=0.001):
        self.num_features = num_features
        self.l2reg = l2reg
        self.eye = tf.eye(num_features)

    def __call__(self, x):
        x = tf.reshape(x, (-1, self.num_features, self.num_features))
        xxt = tf.tensordot(x, x, axes=(2, 2))
        xxt = tf.reshape(xxt, (-1, self.num_features, self.num_features))
        return tf.reduce_sum(self.l2reg * tf.square(xxt - self.eye))

# T-net layers for PointNet
def tnet(inputs, num_features):

    # Initialise bias as the identity matrix
    bias = keras.initializers.Constant(np.eye(num_features).flatten())
    reg = OrthogonalRegularizer(num_features)

    x = conv_bn(inputs, 32)
    x = conv_bn(x, 64)
    x = conv_bn(x, 512)
    x = layers.GlobalMaxPooling1D()(x)
    x = dense_bn(x, 256)
    x = dense_bn(x, 128)
    x = layers.Dense(
        num_features * num_features,
        kernel_initializer="zeros",
        bias_initializer=bias,
        activity_regularizer=reg,
    )(x)
    feat_T = layers.Reshape((num_features, num_features))(x)
    # Apply affine transformation to input features
    return layers.Dot(axes=(2, 1))([inputs, feat_T])

Once all the required functions have been loaded, the model can be built with the following code:

inputs = keras.Input(shape=(33, 3))

x = tnet(inputs, 3)
x = conv_bn(x, 32)
x = conv_bn(x, 32)
x = tnet(x, 32)
x = conv_bn(x, 32)
x = conv_bn(x, 64)
x = conv_bn(x, 512)
x = layers.GlobalMaxPooling1D()(x)
x = dense_bn(x, 256)
x = layers.Dropout(0.3)(x)
x = dense_bn(x, 128)
x = layers.Dropout(0.3)(x)

outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs, name="pointnet")
model.summary()

To train the classification model, a suitable loss function and evaluation metric need to be selected. Categorical cross-entropy is an appropriate loss function for multi-class classification tasks. As for the evaluation metric, accuracy is suitable when the classes are balanced, while F1 score or precision are better choices when the classes are imbalanced.

model.compile(loss="categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=0.001),
              metrics=["accuracy"])
model.fit(train_dataset, epochs=20, validation_data=test_dataset)
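For imbalanced classes, the compile call can track additional metrics. As a sketch, the built-in Keras precision and recall metrics (thresholded at 0.5 on the softmax outputs) can be added alongside accuracy; an F1 score would then need to be computed from the two, or supplied by an external metric implementation:

# Alternative compile call for imbalanced classes: also track precision and recall
model.compile(loss="categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=0.001),
              metrics=["accuracy",
                       keras.metrics.Precision(name="precision"),
                       keras.metrics.Recall(name="recall")])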

Visualising Predictions

Once the model has been successfully trained, the following code can be used to test it by plotting a sample of test point clouds alongside their predicted and true labels.

data = test_dataset.take(3)

points, labels = list(data)[1]
points = points[:8, ...]
labels = labels[:8, ...]
labels = pd.DataFrame(labels.numpy()).idxmax(axis=1)

# Run test data through the model
preds = model.predict(points)
preds = tf.math.argmax(preds, -1)  # retrieve class with highest probability

points = points.numpy()

# Plot points with predicted class and true label
fig = plt.figure(figsize=(15, 10))
for i in range(8):
    ax = fig.add_subplot(2, 4, i + 1, projection="3d")
    ax.scatter(points[i, :, 0], points[i, :, 1], points[i, :, 2])
    ax.set_title(
        "pred: {:}, label: {:}".format(
            class_map[preds[i].numpy()], class_map[labels[i]]))
plt.show()

Demo Application — Generating Predictions from Webcam Feed

The trained model can be applied to downstream applications as required. The following code provides an example of generating detections from a webcam feed, with predictions displayed over the video stream in real time.

cap = cv2.VideoCapture(0)

with mp_pose.Pose(min_detection_confidence=0.8, min_tracking_confidence=0.7, model_complexity=0) as pose:
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        # Recolour image to RGB
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image.flags.writeable = False

        # Make detection
        results = pose.process(image)

        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

        # Extraction of landmarks
        try:
            landmarks_3d = results.pose_world_landmarks.landmark

            landmark_list = []
            for lm in range(33):
                landmark_list.append([landmarks_3d[lm].x, landmarks_3d[lm].y, landmarks_3d[lm].z])

            preds = model.predict(tf.expand_dims(np.array(landmark_list), 0))
            preds = tf.math.argmax(preds, -1)[0].numpy()

            # Display prediction on the webcam image
            cv2.putText(image, "pred: {:}".format(class_map[preds]),
                        (50, 70),  # Change values based on webcam input resolution
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2, cv2.LINE_AA)
        except AttributeError:
            # No person detected in this frame
            pass

        # Render detections
        mp_drawing.draw_landmarks(image, results.pose_landmarks, mp_pose.POSE_CONNECTIONS,
                                  mp_drawing.DrawingSpec(color=(20, 250, 20), thickness=2, circle_radius=2),
                                  mp_drawing.DrawingSpec(color=(20, 20, 250), thickness=2, circle_radius=2))

        cv2.imshow('Webcam Feed', image)

        # Press q to end the stream
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break

cap.release()
cv2.destroyAllWindows()

Conclusions

The above application combines the BlazePose 3D pose estimation model with the PointNet deep learning classification model to provide accurate, real-time pose predictions. Because the subject's landmarks are predicted in normalised 3D space, this workflow has an edge over conventional classification models that rely on 2D landmark data, and can lead to improved model robustness and accuracy. The workflow is easily customisable and can be integrated into a variety of applications depending on user requirements.

Acknowledgements

Nicolai Nielsen for a comprehensive step-by-step PointNet implementation tutorial made available on YouTube and GitHub. Portions of the code provided in this article have been adapted from his implementation example referenced below.

Professors Tian Jing and Tan Jen Hong from the National University of Singapore's Institute of Systems Science for their coverage of various aspects of Intelligent Sensing Systems.

References

For more information on BlazePose and other pose estimation models, the following links provide additional context:

  1. https://www.analyticsvidhya.com/blog/2022/01/a-comprehensive-guide-on-human-pose-estimation/
  2. https://google.github.io/mediapipe/solutions/pose.html
  3. https://cocodataset.org/#keypoints-2020
  4. https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html

For more information about the PointNet model used in this workflow, the following links explain the specifics of the model in more detail:

  1. https://arxiv.org/pdf/1612.00593.pdf
  2. https://github.com/charlesq34/pointnet
