PoseNet: Your Gateway to Gesture Detection
Have you ever wondered how we’ve all been giving feedback over the years?
We usually write a note and put it in a drop box, call up a feedback number, or email in our suggestions.
If you had a chance to attend our Globant UI Engineering Studio’s annual flagship event, ‘UI Next’, you’d have seen an application we made to help our guests give feedback using gestures.
Let’s look at a short video to get a glimpse.
There’s another app along similar lines we made, called Squats Calculator. It counts the squats a person performs in front of their laptop.
What has made this possible is a library called PoseNet, built on top of the TensorFlow platform.
To give you a brief description, TensorFlow is an open-source machine learning platform that provides an entire ecosystem of tools for building ML applications.
PoseNet provides us with the pre-trained models necessary to detect user gestures. These models run in our browser, and this is what differentiates PoseNet from API-dependent libraries: we never have to send our private data to a backend server. That protection of privacy is what gives PoseNet its edge.
Moreover, anyone with a desktop or phone equipped with a decent webcam can experience this technology right within their web browser. We don’t need huge server-like resources to get these gestures recognized; the bare minimum of a normal machine will do.
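As a minimal sketch, loading the model and running it against a webcam feed looks roughly like this (API names as documented for the @tensorflow-models/posenet package; the ‘webcam’ video element and its getUserMedia wiring are assumed to already exist):

```js
import * as posenet from '@tensorflow-models/posenet';

async function setup() {
  // Downloads the pre-trained weights once; inference then runs
  // entirely in the browser, so no camera frames leave the machine.
  const net = await posenet.load();

  // Assumed: a <video id="webcam"> element already wired to getUserMedia.
  const video = document.getElementById('webcam');

  const pose = await net.estimateSinglePose(video, { flipHorizontal: true });
  console.log(pose.score, pose.keypoints);
}

setup();
```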
What is it that PoseNet really does for us?
As its name suggests, it estimates poses for us.
Pose estimation refers to computer vision techniques that detect human figures in the images and video we feed in via our webcam, so that it can determine where our elbow, wrist or any other body joint shows up in an image.
The image below demonstrates how PoseNet estimates the key-points or the pose of a user in real time.
PoseNet gives us a total of 17 pose key-points to make use of, right from our eyes and ears down to our knees and ankles.
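For reference, these are the 17 parts as they are named in PoseNet’s output:

```js
// The 17 key-points PoseNet reports, using its own part names.
const PARTS = [
  'nose', 'leftEye', 'rightEye', 'leftEar', 'rightEar',
  'leftShoulder', 'rightShoulder', 'leftElbow', 'rightElbow',
  'leftWrist', 'rightWrist', 'leftHip', 'rightHip',
  'leftKnee', 'rightKnee', 'leftAnkle', 'rightAnkle',
];
```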
What if the image we supply to PoseNet is not clear enough?
PoseNet gives a confidence score that tells us how certain it is about the person it has recognized in the image, or about a particular key-point within that image.
As shown in the above image, Person 1 has a confidence score of 0.8, whereas the less visually clear Person 2 has a relatively lower confidence score of 0.7. This is represented as a JSON response.
The ‘score’ denotes the confidence score of that person or of a particular key-point. The ‘x’ & ‘y’ denote the coordinates of that particular key-point within the image.
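A trimmed single-person response has roughly this shape (the numbers here are made up for illustration; the structure matches PoseNet’s documented output):

```js
{
  "score": 0.8,                // overall confidence for this person
  "keypoints": [
    {
      "part": "nose",
      "score": 0.99,           // confidence for this key-point alone
      "position": { "x": 253.4, "y": 76.9 }  // pixel coordinates in the image
    }
    // ... 16 more key-points
  ]
}
```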
If you have an idea in mind and would like to proceed to code, you can clone the code from here and then go to the camera.js file.
In this file, you can see the snippet of code where the actual response manipulation happens. This is the place where you have to put your creativity to use.
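In spirit, that per-frame loop looks something like the sketch below (modelled on the public PoseNet camera demo; the names and the 0.5 threshold are illustrative, not the exact code from the repository):

```js
// Illustrative per-frame detection loop; doSomethingWith() stands in
// for whatever visualization or gesture logic you decide to build.
function detectPoseInRealTime(video, net) {
  async function poseDetectionFrame() {
    // Estimate a single pose for the current video frame.
    const pose = await net.estimateSinglePose(video, { flipHorizontal: true });

    // Keep only the key-points we are reasonably confident about.
    const confident = pose.keypoints.filter((k) => k.score > 0.5);
    doSomethingWith(confident);

    // Schedule the next frame.
    requestAnimationFrame(poseDetectionFrame);
  }
  poseDetectionFrame();
}
```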
What we receive from PoseNet is a raw piece of JSON, but how we visualize these 17 key-points and confidence scores, and what we as developers make of them, is up to us.
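For instance, a crude ‘hand raised’ check, the kind of rule a gesture-based feedback app could start from, might look like this (the threshold and function name are our illustration, not the actual logic of the apps shown above):

```js
// Illustrative: treat a wrist held above the nose as a raised hand.
// Note that y grows downward in image coordinates.
function isHandRaised(pose, minScore = 0.5) {
  const byPart = Object.fromEntries(pose.keypoints.map((k) => [k.part, k]));
  const nose = byPart['nose'];
  const wrist = byPart['rightWrist'];
  if (!nose || !wrist || nose.score < minScore || wrist.score < minScore) {
    return false; // not confident enough to decide either way
  }
  return wrist.position.y < nose.position.y;
}
```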
Let’s build something from this that would help people in their day-to-day lives!