BlazePose : A 3D Pose Estimation Model

This is an introduction to BlazePose, a machine learning model that can be used with ailia SDK. You can easily use this model, along with many other ready-to-use ailia MODELS, to create AI applications with ailia SDK.


BlazePose (Full Body) is a pose detection model developed by Google that computes the (x,y,z) coordinates of 33 skeleton keypoints. It can be used, for example, in fitness applications.


BlazePose input and output

BlazePose consists of two machine learning models: a Detector and an Estimator. The Detector cuts out the human region from the input image, while the Estimator takes a 256x256 resolution image of the detected person as input and outputs the keypoints.
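The hand-off between the two models can be sketched as a crop followed by a resize to the Estimator's 256x256 input. This is only a minimal illustration in pure Python: the function name `crop_and_resize`, the pixel-based (x, y, w, h) box layout with a top-left origin, and the nearest-neighbor sampling are assumptions made for the example, not the actual preprocessing used by BlazePose or ailia SDK.

```python
def crop_and_resize(image, box, out_size=256):
    """Crop `box` from `image` and resize it to out_size x out_size.

    image: list of rows, each row a list of pixels (any pixel type).
    box:   (x, y, w, h) in pixels, top-left origin (assumed layout).
    Uses nearest-neighbor sampling; coordinates are clamped to the image.
    """
    x0, y0, w, h = box
    height, width = len(image), len(image[0])
    out = []
    for j in range(out_size):
        # Sample at the center of each output pixel.
        src_y = int(y0 + (j + 0.5) * h / out_size)
        src_y = min(height - 1, max(0, src_y))
        row = []
        for i in range(out_size):
            src_x = int(x0 + (i + 0.5) * w / out_size)
            src_x = min(width - 1, max(0, src_x))
            row.append(image[src_y][src_x])
        out.append(row)
    return out
```

With the default `out_size`, the result always has 256 rows of 256 pixels, regardless of the size of the detected region.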

BlazePose outputs the 33 keypoints according to the following ordering convention. This is more than the 17 keypoints commonly used with the COCO dataset.

BlazePose keypoints (Source:


The Detector is a Single-Shot Detector (SSD) based architecture. Given an input image (1,224,224,3), it outputs bounding boxes (1,2254,12) and confidence scores (1,2254,1). The 12 elements of each bounding box are of the form (x,y,w,h,kp1x,kp1y,…,kp4x,kp4y), where kp1x to kp4y are additional keypoints. Each of the 2254 elements has its own anchor, and the anchor scale and offset need to be applied to obtain the final coordinates.
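The anchor step can be sketched as follows. This is a minimal illustration, not the actual ailia SDK or MediaPipe code: it assumes the batch dimension has already been dropped, that each anchor is stored as a normalized (center x, center y, width, height) tuple, that the raw box values are expressed in pixels of the 224x224 input, and that the raw confidence is a logit converted with a sigmoid. The function name and the 0.5 threshold are hypothetical.

```python
import math

INPUT_SIZE = 224  # detector input resolution, as stated in the text


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_detections(raw_boxes, raw_scores, anchors, score_threshold=0.5):
    """Decode SSD-style outputs into normalized boxes and keypoints.

    raw_boxes:  one 12-element row per anchor (x, y, w, h, kp1x, ..., kp4y).
    raw_scores: one raw confidence value per anchor (assumed to be a logit).
    anchors:    one (cx, cy, w, h) tuple per anchor (assumed layout).
    """
    detections = []
    for box, logit, (ax, ay, aw, ah) in zip(raw_boxes, raw_scores, anchors):
        score = sigmoid(logit)
        if score < score_threshold:
            continue
        # Scale the raw offsets by the anchor size, then shift by its center.
        cx = box[0] / INPUT_SIZE * aw + ax
        cy = box[1] / INPUT_SIZE * ah + ay
        w = box[2] / INPUT_SIZE * aw
        h = box[3] / INPUT_SIZE * ah
        keypoints = [
            (box[4 + 2 * i] / INPUT_SIZE * aw + ax,
             box[5 + 2 * i] / INPUT_SIZE * ah + ay)
            for i in range(4)
        ]
        detections.append(
            {"score": score, "box": (cx, cy, w, h), "keypoints": keypoints}
        )
    return detections
```

In practice the surviving detections would usually also go through non-maximum suppression, which is omitted here.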

There are two ways to use the Detector. In box mode, the bounding box is determined from its position (x,y) and size (w,h). In alignment mode, the scale and angle are determined from (kp1x,kp1y) and (kp2x,kp2y), so that a rotated bounding box can be predicted.
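Alignment mode can be sketched as deriving a rotated square region from the two keypoints. The details below are assumptions made for illustration only: kp1 is treated as the center of the region, the distance from kp1 to kp2 as its radius, and the rotation is chosen so that the kp1 to kp2 direction ends up pointing upward (a target angle of 90 degrees). The function name, `target_angle_deg`, and `scale_factor` are hypothetical.

```python
import math


def alignment_roi(kp1, kp2, target_angle_deg=90.0, scale_factor=1.0):
    """Compute a rotated square region from two keypoints.

    kp1, kp2: (x, y) tuples in image coordinates (y axis pointing down).
    Returns the center, side length, and rotation in radians.
    """
    x1, y1 = kp1
    x2, y2 = kp2
    dx, dy = x2 - x1, y2 - y1
    # Treat the kp1-kp2 distance as a radius, so the side is twice that.
    size = 2.0 * math.hypot(dx, dy) * scale_factor
    # The image y axis points down, so negate dy to get a conventional angle.
    angle = math.radians(target_angle_deg) - math.atan2(-dy, dx)
    # Normalize the rotation to the range [-pi, pi).
    angle = (angle + math.pi) % (2.0 * math.pi) - math.pi
    return {"center": (x1, y1), "size": size, "rotation": angle}
```

With these conventions, a person standing upright (kp2 directly above kp1) yields a rotation of zero, so the crop is not rotated at all.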


The Estimator uses heatmaps during training, but at inference time it computes the keypoints directly, without heatmaps, for faster inference.

Tracking network architecture: regression with heatmap supervision (Source:

The first output of the Estimator is the landmarks (1,195), and the second output is the flags (1,1). Of the landmark values, 165 elements hold (x,y,z,visibility,presence) for each of the 33 keypoints (33 × 5 = 165).
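Splitting the landmark vector into per-keypoint tuples can be sketched as below. This is an illustrative example that assumes the 165 keypoint values come first in the 195-element vector and are laid out keypoint by keypoint; that ordering and the function name are assumptions, not something the text above states.

```python
NUM_KEYPOINTS = 33
VALUES_PER_KEYPOINT = 5  # x, y, z, visibility, presence


def split_landmarks(flat):
    """Group a flat landmark vector into 33 (x, y, z, visibility, presence) tuples.

    flat: sequence of at least 165 numbers (batch dimension already removed).
    Any values after the first 165 are ignored in this sketch.
    """
    needed = NUM_KEYPOINTS * VALUES_PER_KEYPOINT  # 165
    if len(flat) < needed:
        raise ValueError(f"expected at least {needed} values, got {len(flat)}")
    return [
        tuple(flat[i:i + VALUES_PER_KEYPOINT])
        for i in range(0, needed, VALUES_PER_KEYPOINT)
    ]
```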

The z-values are based on the person’s hips, with keypoints being between the hips and the camera when the value is negative, and behind the hips when the value is positive.

The visibility and presence values are stored in the range [min_float,max_float] and are converted to probabilities by applying a sigmoid function. Visibility is the probability that a keypoint exists in the frame and is not occluded by other objects. Presence is the probability that a keypoint exists in the frame.
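The conversion can be sketched with a numerically stable sigmoid, which avoids overflow when the raw values are very large or very small. The helper name `keypoint_confidence` is hypothetical and only serves this example.

```python
import math


def sigmoid(x):
    """Logistic function, written so math.exp never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def keypoint_confidence(raw_visibility, raw_presence):
    """Convert the raw visibility and presence values to probabilities in [0, 1]."""
    return sigmoid(raw_visibility), sigmoid(raw_presence)
```

A raw value of 0 maps to a probability of 0.5, large positive values approach 1, and large negative values approach 0.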


Use the following command to run BlazePose (Full Body) with ailia SDK.

$ python3 -v 0

Here is a result on a sample video. The size of the circles at keypoints indicates the z-value.

BlazePose (Upper Body) can also be used to estimate only the upper body. MediaPipe initially released only the upper body model, and later added the full body model. The specifications of the two models differ; for example, the detector resolution is 128x128 for the upper body model.

$ python3 -v 0

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.


