An approach towards low-cost computing on the edge for vision-based AI applications
Pose estimation is a computer vision technique for detecting key parts of a human body in an image or video. It gives the pixel locations of the eyes, elbows, arms, legs, and so on for one or more people in an image; in other words, the algorithm returns the locations of the body’s “joints”. Pose is a broader subject, but here we focus only on human body pose estimation. No algorithm is perfect, and all of them depend heavily on their training data.
How is it useful?
Human pose detection on the edge can be used to read body language and body movement in real time, at the same location as the person or people being observed. This enables numerous applications in domains such as Security, Retail, Healthcare, Geriatric care, Fitness and Sports. Coupled with Augmented/Mixed Reality, we can transpose a human into a virtual world, opening up new opportunities and experiences in Fashion retail, Entertainment, Advertising and Gaming. Combined with gesture recognition, it also lets you interact with that virtual world.
What is Movidius NCS?
If you have not heard of Intel’s Neural Compute Stick, it is a small device that plugs into a USB port and runs deep neural networks. Think of it as a USB graphics card that is optimised to run certain deep learning frameworks and models. Because it is a USB device, it can be used with an edge computing device such as a Raspberry Pi. It is low-power and comparatively small, which makes it a very good choice for running machine learning models on the edge. If you are looking for something more embedded, you can look at Intel’s VPUs.
OpenVINO provided OpenPose Model
OpenVINO provides a set of pre-trained models which can be run on Movidius NCS without having to go through the conversion process. One of the pre-trained models is human-pose-estimation. It is a multi-person model, based on MobileNet V1 and trained using the Caffe framework.
This model is a larger architecture based on OpenPose. Its complexity is 15 GFLOPs, with 42.8% average precision on the COCO dataset. This high complexity is a bottleneck that makes the model unusable on the edge for real-time detection: in our benchmarks it ran at 2 FPS on Movidius NCS 1. Its accuracy, however, was higher than PoseNet’s.
Tensorflow JS Posenet Model
In brief, PoseNet is based on MobileNet V1 and is trained to detect single-person or multi-person poses. It is optimised to run on TensorFlow.js, which means it is light enough to run in a web browser.
Here is an overview of what we are going to do:
- Convert Tensorflow JS model to a normal Tensorflow model
- Install OpenVINO
- Convert Tensorflow model to OpenVINO supported format
- Run the model on Movidius NCS
Convert tfjs to Tensorflow
You can get a .pb file in one of three ways:
- Download the files generated by us: click here to download
- Convert it yourself using tfjs-converter
- Use this repo, which downloads and converts the tfjs models for you
The simplest way is to download the files we have provided. That way you don’t have to install anything extra on your computer or worry about the conversion process.
As you will notice, there are 3 important files:
These files correspond to the different versions of MobileNet on which the pose estimator has been trained. Put simply, 050 is the fastest but the least accurate, 075 is more accurate but slower than 050, and 100 is the slowest but the most accurate of the three.
Which one should you choose? Keep reading: we will soon evaluate which model gives the best trade-off between accuracy and speed!
To be able to run the model on Movidius NCS, we are going to use Intel’s distribution of the OpenVINO toolkit. OpenVINO can be installed on Linux, Windows and Raspbian. You can follow the official instructions to install the toolkit. We installed the toolkit on Ubuntu 16.04 to convert the model, and used Raspbian to run it.
Install the OpenVINO toolkit on your Linux machine. Keep in mind that you cannot convert a TensorFlow model to the OpenVINO format on a Raspberry Pi, so this installation is a must (alternatively, install it on Windows).
Install the OpenVINO toolkit on Raspbian. The Raspbian installation contains only the inference engine, which means you cannot convert your TensorFlow (or Caffe, MXNet) models to the Intermediate Representation supported by OpenVINO there; you can only run inference on models that have already been converted.
Next, we are going to:
- Convert tensorflow model to Intermediate Representation on a Linux machine
- Run inference on Raspberry Pi
Convert Tensorflow Model to OpenVINO Intermediate Representation
The Intermediate Representation (IR) of a model is the file format recognised by the OpenVINO toolkit and optimised for running on edge computing devices such as Movidius NCS.
Run the following command in your terminal:
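This will give you model-mobilenet_v1_075.mapping and model-mobilenet_v1_075.xml, together with the weights file model-mobilenet_v1_075.bin that Model Optimizer writes alongside them. The .xml file describes the network and the .bin file holds its weights; both are necessary to run inference on Movidius NCS.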
You can point --input_model at the other versions of PoseNet (050 and 100) to get their Intermediate Representations.
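As a rough sketch of how that substitution could be scripted, the Python snippet below builds one Model Optimizer command line per PoseNet version. It is an illustration, not the exact command from this article: the helper name build_mo_command is hypothetical, the .pb file names are assumed to mirror the IR file names mentioned above, mo_tf.py is assumed to be reachable at the path you pass in, and FP16 is assumed as the data type because the Myriad VPU runs FP16 networks.

```python
# Hypothetical helper: builds (but does not run) a Model Optimizer command.
# Assumptions, not taken from the article:
#   - frozen graphs are named model-mobilenet_v1_<version>.pb
#   - mo_script points at OpenVINO's mo_tf.py
#   - FP16 output is wanted for the Myriad VPU
SUPPORTED_VERSIONS = ("050", "075", "100")


def build_mo_command(version, mo_script="mo_tf.py", output_dir="."):
    """Return the argument list for converting one PoseNet version to IR."""
    if version not in SUPPORTED_VERSIONS:
        raise ValueError("unknown PoseNet version: " + version)
    return [
        "python3", mo_script,
        "--input_model", "model-mobilenet_v1_" + version + ".pb",
        "--data_type", "FP16",
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    # Print one command per version so each can be reviewed before running.
    for version in SUPPORTED_VERSIONS:
        print(" ".join(build_mo_command(version)))
```

Depending on how the frozen graph was exported, Model Optimizer may also need an explicit --input_shape; that is left out here because the right value depends on your model.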
Transfer the generated files to your Raspberry Pi and continue to the next step!
Running Inference on Raspberry Pi
Assuming you have installed the OpenVINO toolkit on your Raspberry Pi and transferred the converted model files, it is time to clone the repository.
The repository contains code to run benchmarks on Movidius. To keep things simple and the benchmarks clean, the code does not perform any image post-processing. You can write an OpenCV layer to render the key points on top of your input image.
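If you do want key points, the raw network output has to be decoded first. Below is a minimal single-person sketch in plain Python. It assumes the TensorFlow.js PoseNet output convention: a heatmap with one channel per key point, an offsets tensor whose first half holds y offsets and second half x offsets, and an output stride of 16. None of these are confirmed by the benchmark repository, and the layout you get back from OpenVINO may need transposing first. The function name decode_single_pose is hypothetical.

```python
# Assumed layout (TensorFlow.js PoseNet convention, an assumption here):
#   heatmaps[y][x][k]                 score of key point k at grid cell (y, x)
#   offsets[y][x][k]                  y offset of key point k, in pixels
#   offsets[y][x][k + num_keypoints]  x offset of key point k, in pixels
#   output_stride                     input pixels per grid cell (16 assumed)
def decode_single_pose(heatmaps, offsets, output_stride=16):
    """Return one {"x", "y", "score"} dict per key point channel."""
    height = len(heatmaps)
    width = len(heatmaps[0])
    num_keypoints = len(heatmaps[0][0])
    keypoints = []
    for k in range(num_keypoints):
        # Pick the grid cell with the highest score for this key point.
        best_y, best_x = max(
            ((y, x) for y in range(height) for x in range(width)),
            key=lambda cell: heatmaps[cell[0]][cell[1]][k],
        )
        # Refine the coarse grid position with the predicted offset vector.
        pixel_y = best_y * output_stride + offsets[best_y][best_x][k]
        pixel_x = best_x * output_stride + offsets[best_y][best_x][k + num_keypoints]
        keypoints.append({
            "x": pixel_x,
            "y": pixel_y,
            "score": heatmaps[best_y][best_x][k],
        })
    return keypoints
```

The returned scores are whatever the heatmap contains; if the network emits raw logits you may want to pass them through a sigmoid before thresholding. The resulting x and y values are in input-image pixels and can be handed to any drawing routine.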
Make sure your Movidius NCS is attached to the Raspberry Pi. Download an image of a person from the Internet and save it. Let’s call the downloaded image’s location $IMAGE_PATH. Next, move your model-mobilenet_v1_075.xml and model-mobilenet_v1_075.mapping files, together with the matching model-mobilenet_v1_075.bin weights file, to the repository’s root.
Execute the following command in your terminal to run inference on Raspberry Pi:
The smallest model performs the fastest, with 42 frames per second! Check out the videos to understand how accurate each of them is:
We recommend the 075 version, because 30 FPS is smooth enough for the human eye to perceive as real time, and the accuracy is acceptable for many use cases. However, you might want to consider another version depending on your use case.
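Before settling on a version, you may want to measure FPS on your own hardware. A frames-per-second figure is just the number of inferences divided by the wall-clock time they took. The sketch below shows one way to measure it in Python; it is not the benchmark code from the repository. The name measure_fps and the warm-up step are our own assumptions, and infer_fn stands in for whatever callable runs a single inference on your device.

```python
import time


def measure_fps(infer_fn, num_runs=100, warmup_runs=5):
    """Call infer_fn repeatedly and return the average frames per second."""
    # Warm-up calls are excluded so one-off start-up costs do not skew the result.
    for _ in range(warmup_runs):
        infer_fn()
    start = time.perf_counter()
    for _ in range(num_runs):
        infer_fn()
    elapsed = time.perf_counter() - start
    return num_runs / elapsed


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a call that runs one inference.
    fps = measure_fps(lambda: time.sleep(0.01), num_runs=20)
    print("{:.1f} FPS".format(fps))
```

Timing only the inference call, as here, matches the idea of benchmarking without post-processing; decoding and drawing key points will lower the end-to-end frame rate.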
~ This article is brought to you by the good hearted hackers from Oviyum. If you would like to consult with us mail our CEO, email@example.com. Discover more about what we can do for you on our website https://oviyum.com.