Comparison of Human poses with PoseNet

Priyanka Garg
May 12, 2019 · 6 min read


Photo by Indian Yogi (Yogi Madhav) on Unsplash

What is Pose Estimation?
Pose estimation refers to the set of computer vision techniques that identify human figures in images or videos and detect the positions of key body parts (the joints).

In other words: for a given image of a person, 2D pose estimation lets us map the positions of their elbows, shoulders, knees, ankles, etc. on a 2D graph. This could, in turn, let us predict whether the person is standing, walking or dancing.

What is PoseNet?

PoseNet is a machine learning model that estimates human pose in real time. It is an open-source project that makes it possible to estimate human poses with just JavaScript.

Now developers can easily build software related to movement and poses without having to be experts in ML or deep learning. PoseNet abstracts away low-level computer vision techniques and takes away the overhead of setting up complicated back-ends and APIs. Anyone with a decent webcam or phone now has access to this technology. The applications are endless: interactive devices that respond to the body, augmented reality, animation, fashion, fitness and more.

Importing the TensorFlow.js and PoseNet Libraries

Getting access to this awesome library is as easy as:

  1. installing with npm:
npm install @tensorflow-models/posenet

and importing using es6 modules:

import * as posenet from '@tensorflow-models/posenet';

const net = await posenet.load();

2. or getting the bundle in the html page:

<html>
  <body>
    <!-- Load TensorFlow.js -->
    <script src="https://unpkg.com/@tensorflow/tfjs"></script>
    <!-- Load PoseNet -->
    <script src="https://unpkg.com/@tensorflow-models/posenet"></script>
    <script type="text/javascript">
      posenet.load().then(function(net) {
        // posenet model loaded
      });
    </script>
  </body>
</html>

Single-person Pose Estimation

PoseNet can be used to estimate either a single pose or multiple poses. I am using the single-pose detector. When an input image is fed to it, we get a pose confidence score, keypoint positions, and keypoint confidence scores.

I have used only the single-pose estimation algorithm because it is the simpler and faster of the two. Its disadvantage is that, when capturing input from the webcam, we have to make sure there is only one person in the frame. If a second person appears in the frame, that person's joints may also be taken into account, corrupting the estimate.

The inputs for the single-pose estimation algorithm are:

  • Input image element
  • Image scale factor — a lower scale factor increases speed at the cost of accuracy. For our purposes we don’t have to worry about speed.
  • Flip horizontal — Depends on the webcam you are using. If your webcam flips images, set this to true.
  • Output stride — Must be 32, 16, or 8. Defaults to 16. Internally, this parameter affects the height and width of the layers in the neural network. At a high level, it trades off accuracy against speed: the lower the output stride, the higher the accuracy but the slower the speed; the higher the output stride, the faster the speed but the lower the accuracy. The best way to see its effect on output quality is to play with the single-pose estimation demo.

The outputs for the single-pose estimation algorithm:

  • Pose — an object that contains an array of keypoints and a confidence score for each detected person. (Since we are using single-pose estimation, we get only one confidence score.)
  • Pose confidence score — the overall confidence in the estimation, on a scale of 0.0 to 1.0. We can use it to filter out low-confidence estimates.
  • Keypoint — the part of the body such as the nose, left knee, right foot, etc. It contains both a position and a keypoint confidence score.
  • Keypoint Confidence Score — the confidence that the estimated keypoint position is accurate, on a scale of 0.0 to 1.0.
  • Keypoint Position — 2D x and y coordinates in the original input image where a keypoint has been detected.

PoseNet currently detects 17 keypoints:

  • 5 of them are facial points (both eyes, both ears and the nose)
  • 12 of them are the various joints of the body (shoulders, elbows, wrists, hips, knees and ankles, all in pairs)
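
For reference, the part names that come back for these keypoints look like this (as listed in the library’s documentation):

// The 17 keypoint names reported by PoseNet: 5 facial points and 12 body joints.
const POSENET_PARTS = [
  'nose', 'leftEye', 'rightEye', 'leftEar', 'rightEar',
  'leftShoulder', 'rightShoulder', 'leftElbow', 'rightElbow',
  'leftWrist', 'rightWrist', 'leftHip', 'rightHip',
  'leftKnee', 'rightKnee', 'leftAnkle', 'rightAnkle',
];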

This short code block shows how to use the single-pose estimation algorithm:
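A minimal sketch, assuming an image element with id 'pose' on the page and the API version described above, where the scale factor, flip flag and output stride are passed as separate arguments:

const imageScaleFactor = 0.5;
const flipHorizontal = false;
const outputStride = 16;
// Hypothetical image element; any <img>, <video> or <canvas> works as input.
const imageElement = document.getElementById('pose');

posenet.load().then(function(net) {
  return net.estimateSinglePose(imageElement, imageScaleFactor, flipHorizontal, outputStride);
}).then(function(pose) {
  console.log(pose); // keypoints and confidence scores for the single detected person
});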

An example of a part of the pose object looks like the following:
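The exact numbers will of course vary from image to image; the structure is along these lines:

{
  "score": 0.32,
  "keypoints": [
    {
      "position": { "x": 253.7, "y": 76.3 },
      "part": "nose",
      "score": 0.997
    },
    {
      "position": { "x": 253.1, "y": 71.1 },
      "part": "leftEye",
      "score": 0.985
    }
    // ... 15 more keypoints
  ]
}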

I wanted to compare this benchmark pose of one person with an image of a second person trying to imitate her. The pose of the imitation looks like this:

I found this article (reference 1 below) by Jane Friedhoff and Irene Alvarado, Creative Technologists at Google Creative Lab.

I am going to explain in simple words how they have compared poses.

Pose matching: using cosine similarity

PoseNet returns the x and y position of each keypoint in relation to the input image, plus an associated confidence score.

We have to convert the incoming JSON into an array of keypoint coordinates (17 pairs of x and y coordinates). The confidence scores are ignored here.
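
A minimal sketch of that conversion, assuming `pose` is the object returned by estimateSinglePose:

// Flatten the 17 keypoints into one array of 34 numbers:
// [x1, y1, x2, y2, ..., x17, y17]. The per-keypoint confidence
// scores are dropped at this stage.
function toPoseVector(pose) {
  return pose.keypoints.reduce(
    (vector, keypoint) =>
      vector.concat([keypoint.position.x, keypoint.position.y]),
    []
  );
}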


The 17 keypoints are converted into a single vector in a high-dimensional space. This vector is then compared with the vector computed from our benchmark image.

The direction of these vectors is an indication of the similarity of the poses. Vectors pointing in similar directions represent similar poses, while vectors pointing in fairly different or opposite directions represent different poses.

A visual depiction of cosine similarity, via Christian Perone.
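
As a quick sketch, the cosine similarity of two equal-length vectors is their dot product divided by the product of their lengths:

// Cosine similarity: 1 means same direction, 0 orthogonal, -1 opposite.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}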

To take into account the variation in the width/height of the images, and the position of the person within the image, we need to:

  1. Resize and scale: crop the image to the person's bounding box coordinates (that is, crop out everything but the person) and scale each image to a consistent size.
  2. Normalize:
    Quoting the authors of Move Mirror:

We further normalized the resulting keypoints coordinates by treating them as an L2 normalized vector array.

What does that mean?
We saw that the direction of the vectors is what matters, not their magnitude. To normalize a vector is to take a vector of any length and, while keeping it pointing in the same direction, change its length to 1, turning it into what is called a unit vector. This means we ignore the size of the picture, while keeping the direction of the vector created by the pose inside that image.

A vector scaled with L2 normalization. Image source : here

The two steps described above can be thought of visually as follows:

Steps taken to normalize Move Mirror data. Image source : here
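
A rough sketch of these two steps, assuming we approximate the person's bounding box by the bounding box of the detected keypoints (the hypothetical helper normalizePose below is just for illustration):

// Turn PoseNet keypoints into a normalized 34-element pose vector.
function normalizePose(keypoints) {
  // Step 1 — resize and scale: shift coordinates to the keypoints'
  // bounding box and rescale so every pose lives in a unit square.
  const xs = keypoints.map(k => k.position.x);
  const ys = keypoints.map(k => k.position.y);
  const minX = Math.min(...xs);
  const minY = Math.min(...ys);
  const width = Math.max(...xs) - minX;
  const height = Math.max(...ys) - minY;
  const scaled = keypoints.reduce((vector, k) => vector.concat([
    (k.position.x - minX) / width,
    (k.position.y - minY) / height,
  ]), []);

  // Step 2 — normalize: treat the 34 numbers as one vector and
  // L2-normalize it (divide by its length so it becomes a unit vector).
  const norm = Math.sqrt(scaled.reduce((sum, v) => sum + v * v, 0));
  return scaled.map(v => v / norm);
}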

After calculating the cosine similarity, we use the formula D = sqrt(2 × (1 − cosine similarity)) to arrive at a Euclidean distance (this holds because the pose vectors have been L2-normalized). Euclidean distance is a measure of how different two data samples, represented as vectors, are.

They have provided JavaScript code for this in their article.
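
A minimal sketch of the same calculation, assuming the two pose vectors are the L2-normalized 34-element arrays produced by the normalization step above:

// For two L2-normalized vectors, the cosine similarity is just their
// dot product, and the Euclidean distance between them is
// sqrt(2 * (1 - cosineSimilarity)).
function cosineDistanceMatching(poseVector1, poseVector2) {
  const cosineSimilarity = poseVector1.reduce(
    (sum, value, i) => sum + value * poseVector2[i],
    0
  );
  return Math.sqrt(2 * (1 - cosineSimilarity));
}

The lower the returned distance, the closer the match between the two poses.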

Euclidean distance becomes important because two vectors pointing in similar directions but spaced far apart can still represent poses that are fairly dissimilar.

For example, when I compared two very similar poses by two different people, the output I got was:

cosine similarity: 0.991559292031104
euclidean distance: 0.12992850317690846

This indicates that the poses are fairly similar. However, when I compared two different poses by the same person, the output I got was:

cosine similarity: 0.8941908753361582
euclidean distance: 0.46001983579807043

Since the Euclidean distance here is high, I can tell that the poses are fairly different, even though the cosine similarity suggests they are related.

With this, we are all set to start comparing poses and to apply this knowledge to an endless number of applications.

References:

  1. https://medium.com/tensorflow/move-mirror-an-ai-experiment-with-pose-estimation-in-the-browser-using-tensorflow-js-2f7b769f9b23
  2. https://medium.com/tensorflow/real-time-human-pose-estimation-in-the-browser-with-tensorflow-js-7dd0bc881cd5
