AI Dance based on Human Pose Estimation

Devashi Choudhary
Published in Nerd For Tech · Nov 21, 2020

A Human Pose Skeleton represents the orientation of a person in a graphical format. Essentially, it is a set of coordinates that can be connected to describe the pose of the person. Each coordinate in the skeleton is known as a part (or a joint, or a keypoint). A valid connection between two parts is known as a pair (or a limb). A sample human pose skeleton is shown below.

In this article, we will look at how to use a deep neural network model for performing Human Pose Estimation in OpenCV.

Table of Contents

  1. Datasets
  2. Model Architecture
  3. Experiments and Results

Datasets

For a long time, Human Pose Estimation was a challenging problem because of the lack of high-quality datasets. Nowadays, every AI challenge needs a good dataset to make progress. In the last few years, challenging datasets have been released, which have made it easier for researchers to solve the problem efficiently.

Some of the datasets are:

  1. COCO Keypoints challenge
  2. MPII Human Pose Dataset
  3. VGG Pose Dataset
  4. SURREAL (Synthetic hUmans foR REAL tasks)
  5. UP-3D

For this article, we have used the COCO dataset for Human Pose Estimation.

Model Architecture

OpenPose first detects parts (keypoints) belonging to every person in the image, followed by assigning parts to distinct individuals. Shown below is the architecture of the OpenPose model.

Flowchart of the OpenPose architecture.

The model takes as input a color image of size w × h and produces, as output, the 2D locations of keypoints for each person in the image. The detection takes place in three stages:

Stage 0: The first 10 layers of the VGGNet are used to create feature maps for the input image.

Stage 1: A 2-branch multi-stage CNN is used, where the first branch predicts a set of 2D confidence maps (S) of body part locations (e.g. elbow, knee, etc.), and the second branch predicts a set of 2D vector fields (L) of part affinities, which encode the degree of association between parts. Confidence maps and part affinity maps for a keypoint are shown below.

Stage 2: The confidence and affinity maps are parsed by greedy inference to produce the 2D keypoints for all people in the image.

Steps involved in human pose estimation using OpenPose. (Source)

Experiments and Results

In this section, we will load the trained model to perform Human Pose Estimation on a single person for simplicity. Here are the steps:

Download the model weights from here.

Load the network: We are using models trained on the Caffe deep learning framework. A Caffe model has two files, listed below; a minimal loading sketch follows the list.

  1. a .prototxt file, which specifies the architecture of the neural network.
  2. a .caffemodel file, which stores the weights of the trained model.
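A minimal sketch of loading the network with OpenCV's DNN module; the file paths are assumptions based on the standard COCO OpenPose release, so adjust them to the files you downloaded.

```python
import cv2

# Paths to the downloaded Caffe model files (names are assumptions;
# change them to match the files on your machine).
protoFile = "pose/coco/pose_deploy_linevec.prototxt"
weightsFile = "pose/coco/pose_iter_440000.caffemodel"

# Load the trained network into OpenCV's DNN module.
net = cv2.dnn.readNetFromCaffe(protoFile, weightsFile)
```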

Read Image and Prepare Input to the Network: The input frame that we read using OpenCV has to be converted to an input blob (as Caffe expects) so that it can be fed to the network. This is done using the blobFromImage function, which converts the image from OpenCV format to Caffe blob format. First, we scale the pixel values to lie in (0, 1). Then we specify the dimensions of the network input. Finally, we specify the mean value to be subtracted, which is (0, 0, 0).
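A minimal sketch of this step, assuming a hypothetical input image single.jpg and a 368×368 network input size (neither value is fixed by the article):

```python
# Read the input frame (file name is a hypothetical example).
frame = cv2.imread("single.jpg")
frameHeight, frameWidth = frame.shape[:2]

# Network input size; 368x368 is a common choice for the COCO model.
inWidth, inHeight = 368, 368

# Scale pixels to (0,1), resize to the network input size,
# subtract a mean of (0,0,0), and keep the BGR channel order.
inpBlob = cv2.dnn.blobFromImage(frame, 1.0 / 255, (inWidth, inHeight),
                                (0, 0, 0), swapRB=False, crop=False)
net.setInput(inpBlob)
```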

Make Predictions and Parse Keypoints: Once the image is passed to the model, predictions can be made. The output is a 4D matrix whose dimensions are listed below; a parsing sketch follows the list.

  1. The first dimension is the image ID (in case you pass more than one image to the network).
  2. The second dimension indicates the index of a keypoint. The model produces confidence maps and part affinity maps, which are all concatenated. For the COCO model, this consists of 57 parts: 18 keypoint confidence maps + 1 background + 19×2 part affinity maps.
  3. The third dimension is the height of the output map.
  4. The fourth dimension is the width of the output map.
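A minimal parsing sketch, assuming the frame and net variables from the previous steps and a hypothetical confidence threshold of 0.1:

```python
output = net.forward()                  # shape (1, 57, H, W) for the COCO model
H, W = output.shape[2], output.shape[3]

nPoints = 18        # the COCO model predicts 18 keypoints
threshold = 0.1     # confidence threshold (an assumed value)
points = []

for i in range(nPoints):
    probMap = output[0, i, :, :]                # confidence map of keypoint i
    _, prob, _, point = cv2.minMaxLoc(probMap)  # location of the maximum

    # Scale the detected point back to the original image size.
    x = int(frameWidth * point[0] / W)
    y = int(frameHeight * point[1] / H)

    points.append((x, y) if prob > threshold else None)
```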

Draw Skeleton: Once we have the keypoints, we can draw the skeleton by simply joining the pairs.
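A sketch of the drawing step, using a partial list of COCO keypoint pairs for illustration (the full model defines more pairs):

```python
# A few of the COCO keypoint pairs that form limbs (partial list for illustration).
POSE_PAIRS = [(1, 2), (1, 5), (2, 3), (3, 4), (5, 6), (6, 7),
              (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13), (1, 0)]

for partA, partB in POSE_PAIRS:
    if points[partA] and points[partB]:
        cv2.line(frame, points[partA], points[partB], (0, 255, 255), 2)
        cv2.circle(frame, points[partA], 4, (0, 0, 255), thickness=-1)

cv2.imwrite("output_skeleton.jpg", frame)
```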

The output of the above code is shown below.

The code is available at github.com/Devashi-Choudhary/AI-Dance-based-on-Human-Pose-Estimation. For any questions or doubts, feel free to contact me directly at github.com/Devashi-Choudhary.

References

It’s always good to give references

  1. Openpose
  2. Dancing-AI
