Pose estimation and matching with the TensorFlow Lite PoseNet model

Beginner’s guide to using TensorFlow Lite models with Python and solving the pose matching task

Ivan Kunyankin
7 min read · May 28, 2020


In this post we’ll briefly go over running a TensorFlow Lite model with Python, parsing its output, and solving the pose matching task, commenting on every meaningful piece of the code. It should be helpful for those who are getting familiar with TensorFlow and for those tackling the pose matching task.

TensorFlow provides the community with pre-trained models for a variety of tasks. These models let you iterate through the research process without training a model from scratch. But actually running such a model can be a little confusing.

Photo of me in Lisbon, Portugal by Ana_Strem

This post will cover the following topics:

  1. Running a TensorFlow Lite model with Python
  2. Parsing PoseNet’s output
  3. Pose matching

We’ll go over each topic and discuss the main parts. You can view and run the whole code in this Google Colab notebook.

Running a TensorFlow Lite model with Python

Using small TensorFlow Lite models makes sense not only on smartphones but also when you want to run your model on a portable device such as a Raspberry Pi or the Google Coral Dev Board. This is where Python can be helpful.

First of all, as TensorFlow’s documentation states (I’ll be referencing it a lot in this post), to run the model we need an Interpreter. You can either install and import the full TensorFlow package or just the TensorFlow Lite Interpreter (which consumes less memory).

I assume that you already have TensorFlow installed (version 2.2.0 at the time of writing) and the model downloaded. Now we’ll initialise the Interpreter with the path to our model and allocate tensors for it (this operation allocates memory in a specific way to ensure minimal load, initialisation, and execution latency).
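As a rough sketch of that choice, the snippet below tries the standalone `tflite_runtime` package first and falls back to the full TensorFlow package. The helper name `load_interpreter_class` is my own invention for illustration; only the two import paths come from the libraries themselves.

```python
def load_interpreter_class():
    """Return an Interpreter class, preferring the lightweight runtime.

    Hypothetical helper: returns None when neither package is installed.
    """
    try:
        # Standalone package that ships only the TensorFlow Lite interpreter
        from tflite_runtime.interpreter import Interpreter
        return Interpreter
    except ImportError:
        pass
    try:
        # The full TensorFlow package exposes the interpreter as tf.lite.Interpreter
        import tensorflow as tf
        return tf.lite.Interpreter
    except (ImportError, AttributeError):
        return None


Interpreter = load_interpreter_class()
if Interpreter is None:
    print("Install either tflite-runtime or tensorflow to run the model")
```

Either way the resulting class is used the same way, so the rest of the code does not need to know which package provided it.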

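import tensorflow as tf

path = "posenet_mobilenet_v1_100_257x257_multi_kpt_stripped.tflite"
interpreter = tf.lite.Interpreter(model_path=path)
# Allocate memory for the model's tensors before any other call
interpreter.allocate_tensors()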

Once our Interpreter is ready, we can extract information about the model’s expected input shape as well as information about the output, so we know which tensors to address later on. You can print these lists to see what other useful information they provide.

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
height = input_details[0]['shape'][1]
width = input_details[0]['shape'][2]

Now it’s time to resize our input images to the size of the model’s input. You can find the code in the same notebook.
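If you just want to see what the resizing step has to produce, here is a minimal NumPy-only sketch. It uses plain nearest-neighbour sampling as a stand-in for a proper image library, so it is an illustration rather than the code from the notebook. The function names are hypothetical, and the leading batch dimension assumes the model takes input of shape (1, height, width, 3).

```python
import numpy as np


def resize_nearest(image, height, width):
    """Nearest-neighbour resize of an HxWxC image using NumPy indexing only."""
    src_h, src_w = image.shape[:2]
    # For every output row/column, pick the source row/column it falls on
    rows = np.arange(height) * src_h // height
    cols = np.arange(width) * src_w // width
    return image[rows[:, None], cols]


def to_model_input(image, height, width):
    """Resize the image and add the batch dimension (assumed input layout)."""
    resized = resize_nearest(image, height, width)
    return np.expand_dims(resized, axis=0)


# A dummy 480x640 RGB frame stands in for a real photo
frame = np.zeros((480, 640, 3), dtype=np.uint8)
model_input = to_model_input(frame, 257, 257)
print(model_input.shape)
```

In practice an image library with proper interpolation will give smoother results than nearest-neighbour sampling.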

What’s interesting is that before running a TensorFlow Lite model we need to check the type of its input values, that is, whether it expects integer or float values. Some models are quantised in order to reduce inference time and model size, which means their weights and activations are converted to 8-bit integers, for example.

The size reduction happens because an 8-bit integer takes a quarter of the space of a 32-bit float. The difference is noticeable when the model has several million parameters. Inference time is reduced because the operations performed on activations and weights become simpler: on most hardware, integer arithmetic is cheaper than floating-point arithmetic. Moreover, some accelerators (like the Coral Edge TPU) require the model’s inputs to be integers as well.

Nevertheless, this optimisation technique has trade-offs. Lowering the precision of parameter values usually reduces the model’s accuracy.

In our case, the model expects incoming images to have float values, which we also normalise to the range from -1 to 1 by subtracting the default mean (127.5) and dividing by the default standard deviation (127.5).
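To get a feel for both the saving and the precision loss, here is a small sketch of 8-bit affine quantisation, where a real value is approximated as `scale * (q - zero_point)`. The `scale` and `zero_point` values and the evenly spaced "weights" are made up for illustration; a real converter derives them from the model.

```python
import numpy as np


def quantize(values, scale, zero_point):
    """Map float values to uint8 using real = scale * (q - zero_point)."""
    q = np.round(values / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)


def dequantize(q, scale, zero_point):
    """Recover approximate float values from their uint8 representation."""
    return (q.astype(np.float32) - zero_point) * scale


# Illustrative "weights" spread over [-1, 1]
weights = np.linspace(-1.0, 1.0, 1000, dtype=np.float32)
scale = 2.0 / 255.0   # assumed: the 256 levels cover a range of width 2
zero_point = 128      # assumed: the integer that represents 0.0

q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)

size_ratio = weights.nbytes / q_weights.nbytes        # float32 vs uint8 storage
max_error = float(np.max(np.abs(weights - restored)))  # rounding error introduced
print(size_ratio, max_error)
```

The storage shrinks by a factor of four, while every value picks up a rounding error of up to roughly half a quantisation step. That error is the source of the accuracy trade-off.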

import numpy as np

float_model = input_details[0]['dtype'] == np.float32
if float_model:
    # Map pixel values from [0, 255] to [-1, 1]
    template_input = (np.float32(template_input) - 127.5) / 127.5
    target_input = (np.float32(target_input) - 127.5) / 127.5
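A quick sanity check of that arithmetic on a few pixel values shows where the ends of the 0 to 255 range land. The `normalise` wrapper is just a hypothetical name around the same expression.

```python
import numpy as np


def normalise(image):
    """Map uint8 pixel values from [0, 255] to float32 values in [-1, 1]."""
    return (np.float32(image) - 127.5) / 127.5


pixels = np.array([0, 127, 128, 255], dtype=np.uint8)
normalised = normalise(pixels)
print(normalised)  # the darkest pixel maps to -1.0 and the brightest to 1.0
```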

Now all that’s left for us to do is the following:

# Set the value of the input tensor
interpreter.set_tensor(input_details[0]['index'], template_input)
# Run the calculations
interpreter.invoke()
# Extract output data from the interpreter
template_output_data = interpreter.get_tensor(output_details[0]['index'])
template_offset_data = interpreter.get_tensor(output_details[1]['index'])
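Since we have to do the same thing for the target image, it can be convenient to wrap these steps in a function. Below is a minimal sketch. The name `run_posenet` is hypothetical, and it assumes the first output holds the heatmaps, the second holds the offsets, and both carry a leading batch dimension of 1 that we drop.

```python
import numpy as np


def run_posenet(interpreter, image_batch):
    """Run one forward pass and return (heatmaps, offsets) without the batch axis.

    Assumes `interpreter` already had allocate_tensors() called on it.
    """
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    # Set the value of the input tensor
    interpreter.set_tensor(input_details[0]['index'], image_batch)
    # Run the calculations
    interpreter.invoke()
    # Extract the output data and drop the leading batch dimension
    heatmaps = np.squeeze(interpreter.get_tensor(output_details[0]['index']), axis=0)
    offsets = np.squeeze(interpreter.get_tensor(output_details[1]['index']), axis=0)
    return heatmaps, offsets
```

With this in place, the template and the target images go through exactly the same code path.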

Basically, these are the primary actions we need to perform to run this TensorFlow Lite model. In the next section we’ll discuss how to visualise the result.

Parsing PoseNet’s output

Those arrays we just obtained don’t tell us much about the pose of the person in the image. In this section we’ll discuss how to process the model’s output to actually visualise the pose.

The output consists of 2 parts:

  1. Heatmaps (9,9,17): raw scores (the sigmoid function has not been applied) that correspond to the probability of each of the 17 keypoints appearing in a particular cell of the 9 by 9 grid laid over the image. They are used to locate the approximate position of the joints.
  2. Offset vectors (9,9,34): these are used for a more exact calculation of each keypoint’s position. The first 17 channels of the third dimension hold the offsets along the vertical (y) axis and the remaining 17 hold the offsets along the horizontal (x) axis.

Probabilities of appearance of the nose across different parts of the image (source photo by Ana_Strem)

As you can see, the image above shows a 9 by 9 grid of values; it is one of the 17 heatmaps. The maximum value is in the cell right under the face. Obviously, there is no nose there, but as I mentioned earlier, heatmaps give us only the approximate positions of the joints. After finding the index of the maximum value, we scale it up to input-image coordinates using the output stride, which follows from the size of the input tensor: for a 257 by 257 input and a 9 by 9 grid, neighbouring cells are 32 pixels apart. After that we adjust the position with the offset vectors.
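To illustrate just the coarse step on its own, here is a small sketch for a single heatmap. It turns the raw scores into probabilities with a sigmoid, picks the most probable cell, and scales its index by the output stride. The heatmap values are synthetic and `coarse_position` is a hypothetical helper name.

```python
import numpy as np


def coarse_position(heatmap, input_size=257):
    """Return (y, x, probability) of the most likely grid cell for one keypoint."""
    grid = heatmap.shape[0]
    # Distance in input pixels between neighbouring cells: (257 - 1) / (9 - 1) = 32
    output_stride = (input_size - 1) // (grid - 1)
    # The model outputs raw scores, so apply the sigmoid to get probabilities
    probabilities = 1.0 / (1.0 + np.exp(-heatmap))
    row, col = np.unravel_index(np.argmax(probabilities), probabilities.shape)
    return int(row) * output_stride, int(col) * output_stride, float(probabilities[row, col])


# Synthetic heatmap: low scores everywhere except one cell
nose_heatmap = np.full((9, 9), -6.0, dtype=np.float32)
nose_heatmap[2, 4] = 2.0

y, x, probability = coarse_position(nose_heatmap)
print(y, x)  # 64 128
```

Because the grid only has a resolution of 32 pixels, the coarse position alone can miss the true keypoint by a noticeable margin, which is exactly what the offset vectors are there to correct.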

Here’s the pseudocode (for better understanding of the concept) for parsing the output:

def parse_output(heatmap_data, offset_data, threshold):
    for every keypoint in heatmap_data:
        1. find indices of max values in the 9x9 grid
        2. calculate position of the keypoint in the image
        3. adjust the position with offset_data
        4. get the maximum probability

        if max probability > threshold:
            if the position lies inside the shape of resized image:
                set the flag for visualisation to True

We added the flag to each keypoint so that we can filter out those the model is not sure of and those it predicts to be outside of the image. You can find the actual code here.
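For readers who prefer something runnable, here is one possible way to turn that pseudocode into Python. Treat it as a sketch under a few assumptions of mine rather than the notebook’s code: keypoints are stored as (y, x, flag), the first 17 offset channels are taken as vertical offsets and the last 17 as horizontal ones, the stride is derived from a 257-pixel input, and a sigmoid is applied to the raw score so that `threshold` lives in the 0 to 1 range.

```python
import numpy as np


def parse_output(heatmap_data, offset_data, threshold, input_size=257):
    """Return an array of shape (num_keypoints, 3) holding (y, x, visible_flag)."""
    num_keypoints = heatmap_data.shape[-1]
    grid_h, grid_w = heatmap_data.shape[:2]
    stride_y = (input_size - 1) / (grid_h - 1)
    stride_x = (input_size - 1) / (grid_w - 1)
    keypoints = np.zeros((num_keypoints, 3), dtype=np.int32)

    for i in range(num_keypoints):
        scores = heatmap_data[..., i]
        # 1. indices of the maximum value in the grid
        row, col = np.unravel_index(np.argmax(scores), scores.shape)
        # 2-3. coarse position in the image, refined by the offsets
        y = row * stride_y + offset_data[row, col, i]
        x = col * stride_x + offset_data[row, col, i + num_keypoints]
        # 4. probability of the best cell (sigmoid of the raw score)
        probability = 1.0 / (1.0 + np.exp(-float(scores[row, col])))
        inside = bool(0 <= y < input_size and 0 <= x < input_size)
        visible = bool(probability > threshold) and inside
        keypoints[i] = (int(y), int(x), int(visible))
    return keypoints
```

A call such as `parse_output(heatmaps, offsets, 0.5)` then yields one row per keypoint, and only the rows whose flag is 1 need to be drawn.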

After that, we can draw every keypoint on the image to visualise the result. Here are the images containing the template and the target poses that we will be matching:

The yellow points are the predicted coordinates of each keypoint (source photo by Ana_Strem)

By the way, the easiest way to draw the points on a source image that has a higher resolution is to calculate the resize ratio for each axis separately and then multiply the coordinates of the points by the corresponding ratio.
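As a small illustration of that scaling, the sketch below assumes keypoints are stored as (y, x) pairs in model-input coordinates. The name `scale_keypoints` and its argument layout are hypothetical.

```python
import numpy as np


def scale_keypoints(keypoints, model_size, source_shape):
    """Scale (y, x) keypoints from model-input size to source-image size."""
    ratio_y = source_shape[0] / model_size[0]  # vertical resize ratio
    ratio_x = source_shape[1] / model_size[1]  # horizontal resize ratio
    scaled = np.array(keypoints, dtype=np.float64)
    scaled[:, 0] *= ratio_y
    scaled[:, 1] *= ratio_x
    return np.round(scaled).astype(int)


# A point found at (128, 64) in the 257x257 input, drawn on a 514x1028 photo
print(scale_keypoints([[128, 64]], (257, 257), (514, 1028)))  # [[256 256]]
```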

Pose matching

If your task includes pose matching, you need to figure out a way to compare the positions of different body parts and their relationships with each other in the image.

The idea behind my approach is twofold. First, compare the angle between each body part and the x-axis in both images. Second, calculate the proportions between body parts in each image and compare them. The proportions cover the cases where, for example, a hand moves along the camera’s viewing direction, so that the angle doesn’t change but the apparent length does.

Let’s first calculate these angle and size values for each body part.

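import math
import numpy as np

def angle_length(p1, p2):
    # Points are (row, column); the vertical difference is negated so that "up" is positive
    angle = math.atan2(- int(p2[0]) + int(p1[0]), int(p2[1]) - int(p1[1])) * 180.0 / np.pi
    length = math.hypot(int(p2[1]) - int(p1[1]), - int(p2[0]) + int(p1[0]))
    return round(angle), round(length)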

Don’t worry about the signs in the formula. OpenCV places the origin in the upper-left corner, with the y-axis pointing down, so the vertical difference is negated. This way the angles behave as if the origin were in the lower-left corner, which makes choosing the acceptable difference between poses more intuitive.
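A couple of worked values may help here. The snippet restates the function (using `math.pi` so that it needs no extra imports) and evaluates it on made-up points given as (row, column), which is how the formula indexes them.

```python
import math


def angle_length(p1, p2):
    # Points are (row, column); negating the row difference makes "up" positive
    angle = math.atan2(-int(p2[0]) + int(p1[0]), int(p2[1]) - int(p1[1])) * 180.0 / math.pi
    length = math.hypot(int(p2[1]) - int(p1[1]), -int(p2[0]) + int(p1[0]))
    return round(angle), round(length)


shoulder = (100, 50)

# The second point is 50 px higher in the image and 50 px to the right
up_right = angle_length(shoulder, (50, 100))
print(up_right)  # (45, 71)

# The second point is 50 px directly below the first one
straight_down = angle_length(shoulder, (150, 50))
print(straight_down)  # (-90, 50)
```

So a limb pointing up and to the right gets a positive angle, just as it would on a regular plot, even though its row index decreases.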

Here again is pseudocode, this time for the matching function. The exact code can be found in the very same notebook.

def matching(template_kp, target_kp, angle_deviation=20, size_deviation=1):
    1. set anchor sizes for proportions calculations - distance between shoulders

    For each body part that we calculated angle and size for:
        1. Calculate difference between angles
        2. Calculate ratio between the part and the anchor for the template pose (proportion)
        3. Calculate ratio between the part and the anchor for the target pose (proportion)

        if difference between angles > angle_deviation threshold:
            the body part is deviated
        elif difference between proportions > size_deviation threshold:
            the body part is deviated
    return the list of deviated body parts
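Here is how that pseudocode might look as runnable Python. This is a sketch under several assumptions of mine, not the notebook’s code: the `BODY_PARTS` table covers only the arms, keypoints are (row, column) pairs indexed in PoseNet’s keypoint order (5 and 6 are the shoulders), angles are left unrounded, and the angle difference wraps around at 360 degrees.

```python
import math

# Hypothetical subset of body parts as pairs of PoseNet keypoint indices
BODY_PARTS = {
    "left_upper_arm": (5, 7),
    "left_forearm": (7, 9),
    "right_upper_arm": (6, 8),
    "right_forearm": (8, 10),
}
ANCHOR = (5, 6)  # distance between the shoulders


def angle_length(p1, p2):
    """Angle (degrees, 'up' is positive) and length of the segment p1 -> p2."""
    angle = math.degrees(math.atan2(p1[0] - p2[0], p2[1] - p1[1]))
    length = math.hypot(p2[1] - p1[1], p1[0] - p2[0])
    return angle, length


def angle_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(a - b) % 360
    return min(diff, 360 - diff)


def matching(template_kp, target_kp, angle_deviation=20, size_deviation=1):
    """Return the names of body parts whose angle or proportion deviates."""
    _, template_anchor = angle_length(template_kp[ANCHOR[0]], template_kp[ANCHOR[1]])
    _, target_anchor = angle_length(target_kp[ANCHOR[0]], target_kp[ANCHOR[1]])
    # Guard against a degenerate anchor (both shoulders at the same point)
    template_anchor = max(template_anchor, 1e-6)
    target_anchor = max(target_anchor, 1e-6)

    deviated = []
    for name, (a, b) in BODY_PARTS.items():
        template_angle, template_len = angle_length(template_kp[a], template_kp[b])
        target_angle, target_len = angle_length(target_kp[a], target_kp[b])
        template_prop = template_len / template_anchor
        target_prop = target_len / target_anchor
        if angle_difference(template_angle, target_angle) > angle_deviation:
            deviated.append(name)
        elif abs(template_prop - target_prop) > size_deviation:
            deviated.append(name)
    return deviated


# Made-up poses: both arms hang straight down in the template
template = [(0, 0)] * 17
template[5], template[6] = (100, 100), (100, 200)    # shoulders
template[7], template[8] = (150, 100), (150, 200)    # elbows
template[9], template[10] = (200, 100), (200, 200)   # wrists

# In the target, the left forearm points sideways instead of down
target = list(template)
target[9] = (150, 50)

deviated = matching(template, target)
print(deviated)  # ['left_forearm']
```

Dividing by the shoulder distance is what makes the proportion check independent of how large the person appears in each image.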

That’s pretty much it. The main advantage of this approach is that the acceptable pose deviation can be customised. Using the angle and size characteristics of each body part, we could also calculate a scalar value representing the matching measure. Anyway, now we can draw the deviated body parts in a different color or do whatever else we need to do.

Yellow lines mark the body parts whose positions match, and red lines mark those that don’t (source photo by Ana_Strem)

Now that we’ve discussed how to estimate and match poses, we can scale the algorithm to work with videos and streams, and add the ability to match poses when not all the keypoints are found in the image.

I hope this little guide will be useful for someone. Please let me know if you have any questions. You can also reach out to me via LinkedIn.



Ivan Kunyankin

Sr. Data Scientist at Devexperts. NLP engineer
