BlazePose: A 3D Pose Estimation Model

David Cochard
Published in axinc-ai
4 min read · Jun 30, 2021

This is an introduction to "BlazePose", a machine learning model that can be used with the ailia SDK. You can easily use this model to create AI applications with the ailia SDK, as well as many other ready-to-use ailia MODELS.

Overview

BlazePose (Full Body) is a pose detection model developed by Google that computes the (x,y,z) coordinates of 33 skeleton keypoints. It can be used, for example, in fitness applications.

Source: https://pixabay.com/ja/photos/%E5%A5%B3%E3%81%AE%E5%AD%90-%E7%BE%8E%E3%81%97%E3%81%84-%E8%8B%A5%E3%81%84-%E3%83%9B%E3%83%AF%E3%82%A4%E3%83%88-5204299/

BlazePose input and output

BlazePose consists of two machine learning models: a Detector and an Estimator. The Detector crops the human region from the input image, while the Estimator takes a 256x256 image of the detected person as input and outputs its keypoints.
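The two-stage flow can be sketched as below. This is an illustrative outline, not the actual ailia SDK API: `detector` and `estimator` stand in for the two models, and the crop/resize helper is a minimal nearest-neighbour version of what a real pipeline would do (which would also pad to preserve aspect ratio).

```python
import numpy as np

def crop_and_resize(frame, box, size=256):
    """Crop the detected person region and resize it for the Estimator.
    Minimal nearest-neighbour sketch, no aspect-ratio padding."""
    x, y, w, h = (int(v) for v in box)
    crop = frame[y:y + h, x:x + w]
    # Nearest-neighbour resize to (size, size) without external dependencies.
    ys = (np.arange(size) * crop.shape[0] // size).clip(0, crop.shape[0] - 1)
    xs = (np.arange(size) * crop.shape[1] // size).clip(0, crop.shape[1] - 1)
    return crop[ys][:, xs]

def estimate_pose(frame, detector, estimator):
    """Two-stage pipeline: the Detector finds the person,
    the Estimator finds the keypoints on the cropped region."""
    box = detector(frame)               # (x, y, w, h) of the person region
    if box is None:
        return None                     # no person in the frame
    crop = crop_and_resize(frame, box)  # 256x256 Estimator input
    return estimator(crop)              # (33, 5): x, y, z, visibility, presence
```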

BlazePose outputs 33 keypoints according to the following ordering convention. This is more than the commonly used 17 keypoints of the COCO dataset.

BlazePose keypoints (Source: https://developers.google.com/ml-kit/vision/pose-detection)
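For reference, the landmark ordering as documented on the MediaPipe/ML Kit page linked above can be written out as a Python list (taken from that documentation, not from the model itself):

```python
# The 33 BlazePose landmark names, indexed in model output order,
# per the MediaPipe / ML Kit pose detection documentation.
BLAZEPOSE_KEYPOINTS = [
    "nose", "left_eye_inner", "left_eye", "left_eye_outer",
    "right_eye_inner", "right_eye", "right_eye_outer",
    "left_ear", "right_ear", "mouth_left", "mouth_right",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_pinky", "right_pinky",
    "left_index", "right_index", "left_thumb", "right_thumb",
    "left_hip", "right_hip", "left_knee", "right_knee",
    "left_ankle", "right_ankle", "left_heel", "right_heel",
    "left_foot_index", "right_foot_index",
]
```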

Architecture

The Detector is a Single-Shot Detector (SSD)-based architecture. Given a (1,224,224,3) input image, it outputs bounding boxes (1,2254,12) and confidence scores (1,2254,1). The 12 elements of each bounding box are of the form (x,y,w,h,kp1x,kp1y,…,kp4x,kp4y), where kp1x to kp4y are four additional keypoints. Each of the 2254 candidates has its own anchor, whose scale and offset must be applied to decode the raw outputs.
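A sketch of that decoding step is shown below. The layout follows the (x,y,w,h,kp1x,kp1y,…) format above, but the scale constant (taken from the 224x224 input) and the exact formulas are assumptions, not the verbatim MediaPipe implementation:

```python
import numpy as np

def decode_boxes(raw, anchors, scale=224.0):
    """Decode raw SSD outputs into normalized-coordinate boxes.
    raw: (N, 12) network output; anchors: (N, 4) as (cx, cy, w, h)."""
    cx = raw[:, 0] / scale * anchors[:, 2] + anchors[:, 0]
    cy = raw[:, 1] / scale * anchors[:, 3] + anchors[:, 1]
    w  = raw[:, 2] / scale * anchors[:, 2]
    h  = raw[:, 3] / scale * anchors[:, 3]
    # The four extra keypoints (kp1..kp4) decode like the box center.
    kps = raw[:, 4:].reshape(-1, 4, 2)
    kx = kps[..., 0] / scale * anchors[:, 2:3] + anchors[:, 0:1]
    ky = kps[..., 1] / scale * anchors[:, 3:4] + anchors[:, 1:2]
    return np.stack([cx, cy, w, h], axis=1), np.stack([kx, ky], axis=-1)

def score_to_prob(logits):
    """Raw confidence scores are logits; a sigmoid maps them to [0, 1]."""
    return 1.0 / (1.0 + np.exp(-logits))
```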

There are two ways to use the Detector. In box mode, the bounding box is determined directly from its position (x,y) and size (w,h). In alignment mode, the scale and rotation angle are derived from (kp1x,kp1y) and (kp2x,kp2y), so a rotated bounding box can be predicted.
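Alignment mode can be sketched as follows. The convention that kp1 is the box center and kp2 points toward the head, the upright target angle, and the scale factor of 2 are assumptions borrowed from MediaPipe's usual alignment scheme, not specifics stated in this article:

```python
import math

def rotation_from_keypoints(kp1, kp2, target_angle=math.pi / 2):
    """Derive a crop rotation and scale from two detector keypoints.
    kp1: box center (e.g. mid-hip); kp2: point toward the head.
    Image coordinates with y pointing down are assumed."""
    dx, dy = kp2[0] - kp1[0], kp2[1] - kp1[1]
    angle = target_angle - math.atan2(-dy, dx)  # rotation that makes the person upright
    scale = 2.0 * math.hypot(dx, dy)            # box size from the keypoint distance
    return angle, scale
```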

Source: https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html

The Estimator uses heatmaps during training, but at inference time it regresses the keypoint coordinates directly, without heatmaps, which is faster.

Tracking network architecture: regression with heatmap supervision (Source: https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html)

The first output of the Estimator is a (1,195) landmark tensor; the second output is a (1,1) confidence flag. The first 165 landmark values encode (x,y,z,visibility,presence) for each of the 33 keypoints.

The z-values are relative to the person's hips: a keypoint lies between the hips and the camera when its z-value is negative, and behind the hips when it is positive.

Visibility and presence are stored as unbounded logits and are converted to probabilities by applying a sigmoid function. Visibility is the probability that a keypoint is in the frame and not occluded by another object; presence is the probability that the keypoint is in the frame at all.
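Putting the last few paragraphs together, parsing the Estimator output might look like this (a minimal sketch assuming the (x,y,z,visibility,presence) packing described above; only the first 165 values are used):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def parse_landmarks(raw):
    """Split the (1, 195) Estimator output into per-keypoint fields.
    The first 33 * 5 = 165 values are (x, y, z, visibility, presence)
    per keypoint; visibility/presence logits are squashed to probabilities."""
    kp = raw.reshape(-1)[:33 * 5].reshape(33, 5)
    xyz = kp[:, :3]                 # x, y in the crop; z relative to the hips
    visibility = sigmoid(kp[:, 3])  # P(in frame and not occluded)
    presence = sigmoid(kp[:, 4])    # P(in frame)
    return xyz, visibility, presence
```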

Usage

Use the following command to run BlazePose (Full Body) with ailia SDK.

$ python3 blazepose-fullbody.py -v 0

Here is a result on a sample video. The size of the circles at keypoints indicates the z-value.
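One way to render that depth cue is to map z to a radius, with keypoints closer to the camera than the hips (negative z) drawn larger. The constants here are arbitrary display choices, not part of the model:

```python
def circle_radius(z, base=6.0, gain=10.0):
    """Map a keypoint's z-value to a circle radius in pixels.
    Negative z (closer to the camera than the hips) gives a larger circle;
    base and gain are illustrative visualization constants."""
    return max(1.0, base - gain * z)
```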

BlazePose (Upper Body) can also be used to estimate only the upper body. Initially, MediaPipe released only the upper-body model, with the full-body model following later. The two models have different specifications; for example, the Detector input resolution is 128x128 for the upper-body model.

$ python3 blazepose.py -v 0

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.
