POSE ESTIMATION-Based Action Recognition for Help Situation Identification

Pose estimation is the localization of human joints, also called key points (elbows, knees, hips, etc.), in images or videos. Identifying a specific pose within the space of all articulated poses is also termed pose estimation.

Pose estimation involves identifying various actions from videos and images in which different actions are performed. Small and hidden joints, occlusions, clothing, dense articulation and background conditions make pose estimation a difficult problem. Our model is designed and trained to find people stuck in difficult situations such as earthquakes and floods, and to help rescue them in less time by recognizing their actions captured via drone surveillance. The actions include waving a hand towards the drone camera, lying on the ground, and waving while lying down.

Multi-person pose estimation is more difficult than the single-person case because the location and the number of people in an image are unknown. There are two approaches to multi-person pose estimation:

  • The first approach is to detect every person and then allocate all key points for each person. This is called the top-down method.

  • The second approach is to detect all body parts in the image and then group the parts belonging to each person. This is called the bottom-up method.

How pose estimation works on the edge

If you want pose estimation to work in real time, without an internet connection, you have to run your pose estimation model on a device such as a mobile phone.

In those cases, you will need to choose a specific model to make sure that everything runs smoothly on these low-power devices. Here are some tips to ensure that your models are ready for edge deployment:

· Use a MobileNet-based architecture for your encoder. This architecture uses layer types such as depthwise separable convolutions, which require fewer parameters and less computation while providing solid accuracy.

· Add a width multiplier to your model so you can adjust the number of parameters in your network to meet your computation and memory constraints. The number of filters in a convolution layer, for example, greatly impacts the overall size of your model. Many papers and open-source implementations will treat this number as a fixed constant, but most of these models were never intended for mobile. Adding a parameter that multiplies the base number of filters by a constant fraction allows you to modulate the model architecture to fit the constraints of your device. For some tasks, you can create much, much smaller networks that perform just as well as large ones.

· Shrink models with quantization, but beware of accuracy drops. Quantizing model weights can save a bunch of space, often reducing the size of a model by a factor of 4 or more. However, accuracy will suffer. Make sure you test quantized models rigorously to determine if they meet your needs.

· Input and output sizes can be smaller than you think! If you’re designing an app, you might assume that your pose estimation model will work on full-resolution video frames, but in most cases devices will not have the processing power to handle this. It’s common to train pose estimation models on small frames, e.g., 224x224 pixels.
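As a concrete sketch of the first two tips, here is a MobileNet-style encoder built from depthwise separable convolutions with a width multiplier `alpha` that scales every layer's filter count. The architecture and layer sizes are illustrative only, not our deployed model:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """The MobileNet building block: depthwise conv + pointwise conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: a 1x1 convolution that mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def make_encoder(alpha=1.0, base_filters=(32, 64, 128)):
    """Width multiplier alpha shrinks every layer to fit a device budget."""
    layers, in_ch = [], 3
    for base in base_filters:
        out_ch = max(8, int(base * alpha))  # the width multiplier at work
        layers += [DepthwiseSeparableConv(in_ch, out_ch, stride=2),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    return nn.Sequential(*layers)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

full, slim = make_encoder(alpha=1.0), make_encoder(alpha=0.5)
```

Setting `alpha=0.5` roughly quarters the parameter count of the convolutions, since both input and output channels of most layers are halved.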
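For the quantization tip, here is a sketch using PyTorch dynamic quantization on a small, hypothetical keypoint-regression head. Dynamic quantization covers nn.Linear layers out of the box; convolutional backbones instead need static quantization with a calibration step:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical fully connected head mapping a 512-d feature vector to
# 17 keypoints x 2 coordinates; the layer sizes are illustrative only.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 34))

# Dynamic quantization: weights stored as int8, activations quantized
# on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

def size_on_disk(m):
    """Serialize a model and report its size in bytes."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        torch.save(m.state_dict(), f.name)
    size = os.path.getsize(f.name)
    os.unlink(f.name)
    return size

print(size_on_disk(model) / size_on_disk(quantized))  # roughly 4x
```

The roughly 4x saving comes from storing weights as int8 instead of float32; as noted above, always re-test accuracy after quantizing.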
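And for the input-size tip, downsampling a full-HD frame to 224x224 before inference is a one-liner; the exact preprocessing (normalization, aspect-ratio handling) will depend on the model:

```python
import torch
import torch.nn.functional as F

# A dummy full-HD frame in NCHW layout (batch, channels, height, width).
frame = torch.rand(1, 3, 1080, 1920)

# Downsample to the 224x224 input the model was trained on.
small = F.interpolate(frame, size=(224, 224), mode="bilinear",
                      align_corners=False)
print(small.shape)  # torch.Size([1, 3, 224, 224])
```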

As of now we are working on two models: one is based on a PyTorch implementation of the HRNet model, and the second is the OpenPose model.

OpenPose Model:

· Bottom-up approach.

· Body part detection followed by parsing to extract poses.

· Running time is largely independent of the number of people in the image.


· First, the image is passed through a baseline network (a VGG-19 model) to extract feature maps.

· Then, the feature maps are processed by a multi-stage CNN to generate: 1) a set of Part Confidence Maps and 2) a set of Part Affinity Fields (PAFs).

· Part Confidence Maps: a set of 2D confidence maps S for body part locations, with one map per joint type.

· Part Affinity Fields (PAFs): a set of 2D vector fields L which encode the degree of association between parts.

· Finally, the Confidence Maps and Part Affinity Fields are processed by a greedy algorithm to obtain the poses for each person in the image.
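To illustrate the first parsing step above, here is a minimal sketch of reducing each joint's confidence map to a candidate location. Real OpenPose applies non-maximum suppression so each map can yield several peaks (one per person); a single argmax, as here, only covers the one-person case:

```python
import numpy as np

def peak_locations(conf_maps):
    """conf_maps: array of shape (num_joints, H, W) -> list of (x, y, score)."""
    peaks = []
    for smap in conf_maps:
        # Location of the strongest response in this joint's map.
        y, x = np.unravel_index(np.argmax(smap), smap.shape)
        peaks.append((int(x), int(y), float(smap[y, x])))
    return peaks

# Toy example: two 5x5 confidence maps with one clear peak each.
maps = np.zeros((2, 5, 5))
maps[0, 1, 3] = 0.9   # joint 0: peak at x=3, y=1
maps[1, 4, 2] = 0.8   # joint 1: peak at x=2, y=4
print(peak_locations(maps))  # [(3, 1, 0.9), (2, 4, 0.8)]
```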
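And a sketch of how a PAF scores a candidate limb: sample points along the segment between two joint candidates and average the dot product between the field vectors and the limb's unit direction. The function name and sampling count are our own illustration, not from the paper:

```python
import numpy as np

def paf_score(paf_x, paf_y, p1, p2, n_samples=10):
    """Score the limb from p1 to p2 (both (x, y)) against a 2D vector field."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d = p2 - p1
    norm = np.linalg.norm(d)
    if norm == 0:
        return 0.0
    u = d / norm  # unit direction of the candidate limb
    scores = []
    for t in np.linspace(0.0, 1.0, n_samples):
        # Sample the field at a point along the segment.
        x, y = (p1 + t * d).round().astype(int)
        scores.append(paf_x[y, x] * u[0] + paf_y[y, x] * u[1])
    return float(np.mean(scores))

# Toy field that points uniformly to the right: a horizontal limb
# aligned with the field gets the maximum score of 1.0.
H, W = 5, 10
paf_x, paf_y = np.ones((H, W)), np.zeros((H, W))
print(paf_score(paf_x, paf_y, (1, 2), (8, 2)))  # 1.0
```

The greedy parser then keeps the highest-scoring part pairings to assemble each person's skeleton.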

Here are some results of OpenPose:

HRNet Model:

  • Top-down approach
  • State-of-the-art model
  • Uses high-to-low resolution subnetworks connected in parallel, maintaining a high-resolution representation through the entire process.

Dataset: We have used two standard datasets and one self-created dataset in our project:

  1. COCO
  2. MPII
  3. Our own dataset of various actions such as single-hand waving, waving both hands, a person lying on the ground, people fighting, etc.

Creating our own dataset gives the added advantage that the images contain backgrounds from our own country, which might yield better results when the model is deployed there.


Here are some results of the HRNet model:

The testing accuracy of our trained model is 87.6%.

Lying-while-waving pose:

Fighting pose:

Hand shaking pose: