MOVING FROM 2D INFERENCE TO 3D ESTIMATION OF HUMAN POSTURE

Nguyen D. Le
Sifu.art
Sep 27, 2022

Just by glancing at a photo, the human brain can internally model the 3D posture of a ballet dancer — and from that model, it can judge the professional level of the person performing that posture.

Do you think this is professional or amateur?

It all boils down to our past experience and the key points of the setting surrounding the subject in the photo. The same process was successfully brought to deep learning thanks to a specific type of neural network, the Stacked Hourglass Network, which can infer the location of human joints in 3D space from a single photo.

See the shape of the network? It looks just like many hourglasses stacked together
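To give a feel for the architecture, here is a simplified sketch of one hourglass module in PyTorch. This is only an illustration of the idea (the layer sizes, joint count, and stacking depth are placeholders, not the exact network we use): each module downsamples the feature map to a bottleneck, upsamples it back, and merges a skip branch at every resolution; stacking several of them gives the full network.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    # One hourglass: pool down to a bottleneck, come back up, merge skip branches.
    def __init__(self, depth, channels):
        super().__init__()
        self.skip = nn.Conv2d(channels, channels, 3, padding=1)   # full-resolution branch
        self.down = nn.Conv2d(channels, channels, 3, padding=1)   # processed after pooling
        # recurse until the bottleneck (the "waist" of the hourglass)
        self.inner = Hourglass(depth - 1, channels) if depth > 1 else nn.Conv2d(channels, channels, 3, padding=1)
        self.up = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        skip = self.skip(x)                              # keep a high-resolution copy
        y = F.max_pool2d(x, 2)                           # go down one level
        y = self.inner(self.down(y))                     # nested hourglass
        y = F.interpolate(self.up(y), scale_factor=2)    # come back up
        return y + skip                                  # merge the two branches

# Stack two hourglasses and predict heatmaps for a placeholder number of joints.
stacked = nn.Sequential(Hourglass(4, 64), Hourglass(4, 64))
heatmaps = nn.Conv2d(64, 16, 1)(stacked(torch.randn(1, 64, 256, 256)))
print(heatmaps.shape)  # torch.Size([1, 16, 256, 256])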

Many outstanding works have been built on human pose datasets such as Human3.6M and MPII. Deep learning is all about data, and Carnegie Mellon University released a breakthrough dataset, Panoptic, which was captured holistically as you can see in the photo below.

Why is the dataset called Panoptic? You can guess …

Version 1.0 of our server trained the network to judge human posture based on only ONE photo taken from the phone’s camera.

Our Version 1.0 Server

The model runs great if all the key joints are visible from the phone camera’s point of view. In a production environment, however, this is rarely the case: some joints are always blocked from view.
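You can see the occlusion problem for yourself with any off-the-shelf single-image pose detector. The sketch below uses MediaPipe Pose purely as an illustration (it is not our server model, and the input file name is made up): each detected landmark carries a visibility score, and joints hidden from the camera tend to score low.

import cv2
import mediapipe as mp

image = cv2.imread("dancer.jpg")                       # hypothetical input photo
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    results = pose.process(rgb)

# Flag joints the detector could not see well.
if results.pose_landmarks:
    for i, lm in enumerate(results.pose_landmarks.landmark):
        if lm.visibility < 0.5:
            print(f"joint {i} is likely occluded (visibility={lm.visibility:.2f})")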

To achieve the same result as a multi-camera setup, we ask the user to stand still while a friend holds the phone, camera facing the user, and walks at least 270 degrees around them, making sure that no body-part information is missing from the input.

Multi View Pose Estimation
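The reason extra viewpoints help is plain geometry: once the same joint is detected in two calibrated views, its 3D position can be triangulated. Here is a minimal sketch with OpenCV; the projection matrices and 2D detections are made-up numbers, just to show the mechanics.

import numpy as np
import cv2

# Hypothetical 3x4 projection matrices of two calibrated cameras.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                   # camera 1 at the origin
P2 = np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])   # camera 2 shifted sideways

# The same joint (say, the left wrist) detected in each image.
pt1 = np.array([[0.35], [0.20]])
pt2 = np.array([[0.10], [0.20]])

# Triangulate: the result is homogeneous, so divide by the last coordinate.
X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)
X = (X_h[:3] / X_h[3]).ravel()
print("estimated 3D joint position:", X)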

Set Up the Training

We spent the last week setting up the deep learning machine. AI is a trade; there are many small steps involved, so I will list the basic ones here.

Get some GPU power (the more the better)

As a start-up, we always need to be over-cautious when dealing with GPU costs. I used to pay more than 5,000 USD a month out of my own pocket just because of our recklessness. I am based out of the Toronto Public Library Entrepreneur Lab, which doesn’t give us cloud credits, so I decided to ask for help from these (virtual) incubators:

  • Google
  • Digital Ocean
  • Amazon
  • Nvidia
  • Pioneer

I finally got accepted by AWS for a grant of cloud credits, just enough for the short term. It’s time to sweat.

Download Data

Sounds easy, eh? It’s not! The Panoptic dataset was streamed directly from 480 VGA cameras, 31 HD cameras, and 10 Kinect II sensors. There is NO way to download it with a single wget or curl command.

I usually spin up an Amazon EC2 instance with a 50GB boot partition, which is enough for most tasks. This time luck was not with me: I had to resize the SSD four times before the downloaded dataset would fit, and I ended up with a 500GB root partition just for the dataset. I have to mention that this only covers the HD streams from 5 HD cameras, across 13 (out of 65) sequences.

Just in case you need to hot resize your root partition, follow these steps:

  1. Go to the AWS console and resize the root volume (from 50GB to 500GB)
  2. lsblk (to see which partition needs to be resized; in my case it was /dev/xvda1)
  3. growpart /dev/xvda 1 (/dev/xvda is the volume, 1 is the partition number)
  4. resize2fs /dev/xvda1

This is my script to download the part of the Panoptic dataset we are interested in:

#!/bin/bash
getData='./scripts/getData.sh'
$getData 160422_ultimatum1 0 5
$getData 160224_haggling1 0 5
$getData 160226_haggling1 0 5
$getData 161202_haggling1 0 5
$getData 160906_ian1 0 5
$getData 160906_ian2 0 5
$getData 160906_ian3 0 5
$getData 160906_band1 0 5
$getData 160906_band2 0 5
$getData 160906_band3 0 5
$getData 160906_pizza1 0 5
$getData 160422_haggling1 0 5
$getData 160906_ian5 0 5
$getData 160906_band4 0 5
I left it overnight and woke up to see everything was nicely done. Afterward, I ran this script to extract the HD videos to .jpg files:

#!/bin/bash
extractAll='./scripts/extractAll.sh'
$extractAll 160422_ultimatum1 && \
$extractAll 160224_haggling1 && \
$extractAll 160226_haggling1 && \
$extractAll 161202_haggling1 && \
$extractAll 160906_ian1 && \
$extractAll 160906_ian2 && \
$extractAll 160906_ian3 && \
$extractAll 160906_band1 && \
$extractAll 160906_band2 && \
$extractAll 160906_band3 && \
$extractAll 160906_pizza1 && \
$extractAll 160422_haggling1 && \
$extractAll 160906_ian5 && \
$extractAll 160906_band4
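For reference, the extraction step boils down to pulling individual frames out of each HD video. A rough Python equivalent with OpenCV is below; the paths are hypothetical, and the real work is done by scripts/extractAll.sh above.

import os
import cv2

video_path = "160422_ultimatum1/hdVideos/hd_00_00.mp4"   # hypothetical path to one HD stream
out_dir = "160422_ultimatum1/hdImgs/00_00"
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
frame_idx = 0
while True:
    ok, frame = cap.read()        # read one frame at a time
    if not ok:
        break
    cv2.imwrite(os.path.join(out_dir, f"{frame_idx:08d}.jpg"), frame)
    frame_idx += 1
cap.release()
print(f"wrote {frame_idx} frames to {out_dir}")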

Dry run to make sure the dataset is good

I used VoxelPose, a proven model from the Microsoft Research team, to make sure that everything works with the 500GB of data I had just downloaded. Here are the steps:

  1. Clone the repo from GitHub:

git clone https://github.com/microsoft/voxelpose-pytorch.git

  2. Install Conda and Python 3.7:

conda create -n python37 python=3.7

  3. Install the requirements. Don’t run “pip install -r requirements.txt”; it will mess things up. Follow these steps instead:

conda activate python37

conda install -y pytorch=1.4.0

conda install -y numpy scipy matplotlib tensorboardX tqdm opencv

pip install json_tricks opencv_python prettytable easydict PyYAML

  4. Run the default training:

python run/train_3d.py --cfg configs/panoptic/resnet50/prn64_cpn80x80x20_960x512_cam5.yaml
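Before committing to a long run, it doesn’t hurt to check that PyTorch actually sees the GPU and that the config file parses. A small sanity check, assuming the environment set up in the steps above:

import torch
import yaml

# Make sure the GPU is visible before starting a multi-hour training run.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

# Make sure the training config parses.
with open("configs/panoptic/resnet50/prn64_cpn80x80x20_960x512_cam5.yaml") as f:
    cfg = yaml.safe_load(f)
print("Top-level config keys:", sorted(cfg))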

While training, open nvtop; if it looks like this, you’re good to go!

Next up: build my own model and make it work with the application input (a short video clip).
