Human Pose Estimation and Human Action Recognition: Experimenting for public good

Delon
AI Practice GovTech
7 min read · Feb 13, 2020

If you have played console games with Microsoft Kinect before, you might have wondered how the Kinect is able to recognise your actions so accurately.

The answer lies primarily in the software. The Microsoft Kinect software is able to locate your skeletal joints in 3D space using its RGB camera and infra-red depth sensors. By monitoring the movement of your skeletal joints, the Kinect is then able to accurately interpret your actions for gameplay. Beyond video games, the ability to locate a person’s skeletal joints in 3D space and recognise the movements of these points brings about many other useful applications, such as posture correction, human action recognition, sign language translation and many more.

So, this got us thinking. Over the past few months, we at the Data Science and Artificial Intelligence Division (DSAID) in GovTech have been exploring how to computationally recognise basic human actions by analysing the locations and movements of human skeletal joints using the simplest hardware possible (i.e. an RGB camera). If we are able to recognise basic human motions or actions using AI, we will then be able to build novel systems that address a whole new range of use cases.

Human Pose Estimation

The problem that we are trying to address here is called human pose estimation. It is an active research field amongst the computer vision and the open source communities.

In simple terms, a human pose estimation model takes in an image or video and estimates the positions of a person’s skeletal joints in either 2D or 3D space. Luckily for us, there are resources available today that explain the concept of human pose estimation in a simple and concise manner. We did a fair amount of research and self-learning to pick up the necessary knowledge on this topic, and then began to build our own custom pose estimation solution.

As with most computer vision problems today, the state-of-the-art approach to pose estimation is to use a deep learning network called a Convolutional Neural Network (CNN). A CNN model is the backbone of any AI-enabled video analytics solution, and it needs to be trained on hundreds of thousands of annotated images before it can be of any use. There are many open source annotated datasets that support the development of CNN models. One example is the COCO 2019 keypoints challenge dataset, a popular dataset used by developers and computer vision enthusiasts to develop their own custom pose estimation models.

Image Source: COCO Keypoint Detection Task dataset
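To give a sense of what such annotated data looks like, here is a simplified sketch of a COCO-style keypoint annotation for a single person. Real annotations carry more fields (image IDs, bounding boxes, segmentation), and the coordinate values below are made up purely for illustration.

```python
# A simplified sketch of a single-person COCO keypoint annotation.
# Real annotations include additional fields; only the keypoint-related
# parts are shown here, and the coordinates are illustrative.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

example_annotation = {
    # num_keypoints counts the joints actually labelled (v > 0) for this person
    "num_keypoints": 3,
    # Flattened [x1, y1, v1, x2, y2, v2, ...], with visibility flags:
    # 0 = not labelled, 1 = labelled but occluded, 2 = labelled and visible.
    "keypoints": [231, 94, 2, 238, 89, 2, 224, 89, 2] + [0, 0, 0] * 14,
}
```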

Experimenting with Human Action Recognition

With some basic know-how, we dipped our toes in the water and adopted a human pose estimation model to implement a human action recognition solution of our own. The solution we built takes in a video feed, which is processed in three sequential steps, as illustrated in the following diagram.

How we see the pose estimation and human action recognition process

Step 1: Video Frame to 2D Human Keypoint

First, we used a pose estimation model to extract 2D skeletal joints (also known as human keypoints) from video frame sequences. We chose the High-Resolution Network (HRNet) 2D pose estimation model as the core model, as it achieved the best performance on the COCO 2019 Keypoint Detection Task dataset. Check out Synced’s article for a clear and concise explanation of HRNet.

Estimating 2D Human Keypoints
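As a rough illustration of this step, the sketch below reads frames from a video with OpenCV and hands each one to a 2D pose estimation model. The `model.predict` call is a hypothetical stand-in for whichever HRNet implementation is used; only the OpenCV frame-reading calls are real APIs.

```python
# A minimal sketch of Step 1: extracting 2D human keypoints per video frame.
# `model.predict` is a hypothetical stand-in for the HRNet pose estimator.
import cv2

def extract_2d_keypoints(video_path, model):
    """Return a list of per-frame keypoint arrays (17 COCO joints, x/y each)."""
    capture = cv2.VideoCapture(video_path)
    keypoints_per_frame = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        # Hypothetical call: maps a BGR frame to an array of shape
        # (num_people, 17, 2) in image coordinates.
        keypoints_per_frame.append(model.predict(frame))
    capture.release()
    return keypoints_per_frame
```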

Step 2: 2D Human Keypoint to 3D Human Keypoint

Next, we performed a 2D-to-3D human keypoints mapping using a method proposed by Pavllo et al. A 3D representation of human keypoints is more accurate than a 2D representation, as the former better represents a person’s body structure in real-world conditions. Analysing the motion of the 3D human keypoints therefore gives a more accurate action recognition prediction. The following diagram shows the result of converting 2D human keypoints to 3D human keypoints.

Estimating 3D Human Keypoints from 2D Human Keypoints
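Here is a minimal sketch of this lifting step, assuming a temporal model in the spirit of Pavllo et al.’s VideoPose3D that consumes a window of consecutive 2D keypoint frames and predicts the 3D joints of the centre frame. The `lifting_model` call is a hypothetical stand-in, and the window size is illustrative.

```python
# A minimal sketch of Step 2: lifting 2D keypoint sequences to 3D.
# `lifting_model` is a hypothetical stand-in for a temporal lifting network.
import numpy as np

def lift_to_3d(keypoints_2d, lifting_model, window=27):
    """keypoints_2d: array of shape (num_frames, 17, 2) for one person."""
    half = window // 2
    # Pad the sequence at both ends so every frame has a full temporal window.
    padded = np.concatenate(
        [np.repeat(keypoints_2d[:1], half, axis=0),
         keypoints_2d,
         np.repeat(keypoints_2d[-1:], half, axis=0)], axis=0)
    keypoints_3d = []
    for i in range(len(keypoints_2d)):
        clip = padded[i:i + window]               # (window, 17, 2)
        keypoints_3d.append(lifting_model(clip))  # hypothetical: returns (17, 3)
    return np.stack(keypoints_3d)                 # (num_frames, 17, 3)
```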

Step 3: 3D Human Pose Estimation to Human Actions

Finally, we applied simple heuristics to identify salient actions by observing the motion of the keypoints. For example, we know which keypoint represents the human head, so we can detect if a person is falling down by observing the movement of the head keypoint’s location along the y-axis.

Recognising human actions from 3D human keypoints
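As an illustration of such a heuristic, the sketch below flags a potential fall when the head keypoint’s height drops sharply within a short window of frames. The joint index, threshold and window size are illustrative assumptions, not the values used in our system.

```python
# A simplified fall heuristic on 3D keypoints. Assumes COCO-style joint
# ordering (index 0 = nose, used as a head proxy) and a coordinate system
# where the y-axis points upwards, so a fall appears as a sharp drop in y.
import numpy as np

HEAD_INDEX = 0  # nose keypoint, used as a proxy for the head

def detect_fall_3d(keypoints_3d, drop_threshold=0.6, window=15):
    """keypoints_3d: (num_frames, 17, 3) array for one tracked person."""
    head_height = keypoints_3d[:, HEAD_INDEX, 1]
    for i in range(len(head_height) - window):
        # Flag a fall if the head drops by more than `drop_threshold`
        # (in the model's units, e.g. metres) within `window` frames.
        if head_height[i] - head_height[i + window] > drop_threshold:
            return True
    return False
```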

Demonstration in our Laboratory

We have a mobile CCTV camera in our office, and we thought it would be a good idea to use it to test out our solution. We set the camera up and connected it to our human action recognition solution.

Our mobile CCTV camera in our office 😃

Walking

Someone walking along the corridor

Riding

Someone was riding a kick scooter along the corridor

Falling

Someone is so excited to come to work

Smoking

Someone pretended to smoke along the corridor …

Applications

The action recognition solution performed well in our lab testing. Inspired by the promising results, we thought of applying it to two of our existing use cases, one involving fall detection and the other involving smoking detection. Both use case scenarios require the action recognition solution to respond in near real time and to run on a resource-constrained edge device. We tried running the full action recognition solution, which includes the compute-intensive HRNet and the 2D-to-3D mapping model (Pavllo et al.), on a few edge-computing platforms from NVIDIA, INTEL and XILINX, and we will share our experience in another post in the near future. We found that the resource requirements of the original action recognition solution were too demanding for these edge-computing platforms, which typically come with 4GB of GPU vRAM. To circumvent this issue, we used just the 2D keypoints and did away with the 2D-to-3D mapping. Our testing showed that using only 2D keypoints still produced accurate action recognition predictions.

Our test bench consisting of [TOP] Intel Mustang-F100 (inside TANK AIoT Developer Kit), [BOTTOM-LEFT] Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit and [BOTTOM-RIGHT] Nvidia Jetson TX2 (inside BOXER-8110AI) edge devices

Fall Detection System

One of the most widely used applications of human pose estimation is fall detection. A simple Google search on “fall detection with pose estimation” will yield thousands of results. However, we observed that most fall detection implementations cater to indoor environments, and we were not sure if such implementations would work in an outdoor environment. So we thought of putting our action recognition model to the test.

Fall detection system

The action recognition solution with the HRNet base model was used to implement the fall detection application. We used the same process as PINTO0309, tracking the location of the person’s head in a 2D coordinate space. Tracking the head’s location in 2D coordinate space allows us to estimate both its acceleration and its trajectory, and we can use this information to infer potential falls. We achieved success in our first trial, and our efforts are publicly featured here.
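A minimal sketch of this head-tracking idea is shown below, assuming image coordinates where y increases downwards and a known frame rate. The speed and acceleration thresholds are illustrative placeholders rather than the values used in the actual system.

```python
# A minimal sketch of fall detection from 2D head tracking.
# Assumes image coordinates (y grows downwards) and illustrative thresholds.
import numpy as np

def head_velocity_and_acceleration(head_positions, fps=25.0):
    """head_positions: (num_frames, 2) array of the head keypoint's (x, y)."""
    dt = 1.0 / fps
    velocity = np.gradient(head_positions, dt, axis=0)   # pixels per second
    acceleration = np.gradient(velocity, dt, axis=0)     # pixels per second^2
    return velocity, acceleration

def looks_like_fall(head_positions, fps=25.0,
                    min_downward_speed=400.0, min_downward_accel=800.0):
    velocity, acceleration = head_velocity_and_acceleration(head_positions, fps)
    # In image coordinates, a fall shows up as a sustained positive (downward)
    # y velocity together with a sharp downward acceleration.
    return bool(np.any((velocity[:, 1] > min_downward_speed) &
                       (acceleration[:, 1] > min_downward_accel)))
```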

Smoking Detection System

Smoking detection process

We also used the action recognition solution to build a smoking detection application. In our solution design, we cascaded the base HRNet pose estimation model with an image classifier model. We located a person’s face region using the HRNet pose estimation model, then cropped the face images and passed them into a trained image classifier model. The image classifier model then analyses the images to decipher whether the person is smoking or not. For this image classifier model, we used ResNet50.
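A minimal sketch of this cascade is shown below, assuming COCO keypoint indices (0 nose, 1/2 eyes, 3/4 ears) for locating the face and a ResNet50 fine-tuned elsewhere for the smoking / not-smoking decision. The `smoking_classifier` call is a hypothetical stand-in for that fine-tuned model.

```python
# A minimal sketch of the pose-then-classify cascade for smoking detection.
# `smoking_classifier` is a hypothetical stand-in for a fine-tuned ResNet50.
import numpy as np

FACE_KEYPOINTS = [0, 1, 2, 3, 4]  # nose, eyes and ears in COCO ordering

def crop_face(frame, keypoints_2d, margin=0.5):
    """Crop a square face region around the facial keypoints of one person."""
    face_points = keypoints_2d[FACE_KEYPOINTS]      # (5, 2) in pixel coordinates
    x_min, y_min = face_points.min(axis=0)
    x_max, y_max = face_points.max(axis=0)
    size = max(x_max - x_min, y_max - y_min) * (1 + margin)
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    x0, y0 = int(max(cx - size / 2, 0)), int(max(cy - size / 2, 0))
    x1, y1 = int(cx + size / 2), int(cy + size / 2)
    return frame[y0:y1, x0:x1]

def is_smoking(frame, keypoints_2d, smoking_classifier):
    face_crop = crop_face(frame, keypoints_2d)
    # Hypothetical call: the fine-tuned ResNet50 wrapper returns the
    # probability that the cropped face shows a person smoking.
    return smoking_classifier(face_crop) > 0.5
```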

Our prototype was built successfully and was featured in public events such as Smart Nation and U 2019.

Our demonstration of our 2D pose estimation system running on a simple RGB webcam in real time during Smart Nation & U in December 2019

While the solution is working well now, we consider it computationally intensive, as it requires running two CNN models. We are exploring viable new approaches that can determine if an individual is smoking using the pose estimation method alone.

What’s next …

At DSAID, we continuously keep abreast of new technical developments in the computer vision space and find ways to improve our existing work. In this case, we are looking to improve the accuracy of our fall detection and smoking detection solutions without increasing their computational resource requirements. In addition, we are also looking at how to apply this process to sports and workplace safety use cases. Be it a technical trial or solution development, we always start with the end in mind. That end is for us to understand and adopt the newest computer vision techniques, and to develop innovative and niche products that can help public agencies serve Singaporeans better.

By the way, we are hiring!

If you like what you read and want to contribute to our effort, please get in touch with us! We are looking for

  • AI Engineer (Video Analytics)
  • Product Manager

Get in touch with the team by sending your CV over to

recruit@dsaid.gov.sg
