Human pose estimation via deep neural networks

Shreyaadaga
4 min read · May 15, 2024

In the realms of artificial intelligence and computer vision, human pose estimation emerges as a particularly captivating branch of exploration. It’s a field that not only unlocks insights into our physical capabilities but also offers a glimpse into our behaviors and interactions with the world around us. At its essence, human pose estimation endeavors to unravel the spatial arrangement of the human body from images or videos, discerning key points such as joints, limbs, and body parts. These insights furnish invaluable information about posture, movement, and gestures, unlocking doors to a myriad of applications across diverse domains.

Various human poses

Deep learning (DL) offers a transformative solution to the challenges faced by traditional methods in human pose estimation. Unlike traditional approaches reliant on handcrafted features and predefined models, DL algorithms can automatically learn complex patterns and relationships directly from raw data. By leveraging large-scale annotated datasets, DL models, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), excel in capturing intricate spatial configurations and articulations of human body parts.

At its core, DeepPose adopts a novel approach by treating human pose estimation as a regression task. Unlike traditional methods that rely on handcrafted features or predefined models, DeepPose harnesses the immense capabilities of Deep Neural Networks (DNNs) to autonomously learn and deduce the spatial arrangement of the human body directly from raw image data. This fundamental shift in perspective empowers DeepPose to capture the subtle intricacies and complex relationships between body joints and their spatial coordinates, facilitating a more nuanced and adaptable form of pose estimation.
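Concretely, the regression target for each image is a vector of 2k joint coordinates, which DeepPose normalizes with respect to the person's bounding box so the network predicts values on a comparable scale regardless of the subject's size or position in the frame. A minimal sketch of that normalization (the function names are illustrative, not from the paper):

```python
def normalize_joint(joint, box_center, box_size):
    """Normalize an (x, y) joint w.r.t. a bounding box:
    translate by the box center, then scale by the box size."""
    return ((joint[0] - box_center[0]) / box_size[0],
            (joint[1] - box_center[1]) / box_size[1])

def denormalize_joint(njoint, box_center, box_size):
    """Map a normalized prediction back to image coordinates."""
    return (njoint[0] * box_size[0] + box_center[0],
            njoint[1] * box_size[1] + box_center[1])

# Example: a joint at (150, 90) inside a 220x220 box centered at (110, 110)
nj = normalize_joint((150.0, 90.0), (110.0, 110.0), (220.0, 220.0))
```

The network is trained on these normalized targets, and its outputs are denormalized at inference time to recover pixel coordinates.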

The various datasets that can be used for training include Frames Labeled In Cinema (FLIC), with 10 upper-body joints labeled, and the Leeds Sports Pose (LSP) dataset, with 14 full-body joints labeled. The basic architecture of a deep neural network for human pose estimation can be understood as below:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

k = 14  # number of keypoints to regress (e.g. 14 full-body joints in LSP)

# Define the DeepPose-inspired model
model = Sequential([
    # Convolutional layers ('same' padding and an initial stride of 4
    # reproduce the 55x55x96 / 27x27x256 / 13x13x384 feature-map sizes)
    Conv2D(96, kernel_size=(11, 11), strides=(4, 4), padding='same',
           activation='relu', input_shape=(220, 220, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(256, kernel_size=(5, 5), padding='same', activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(384, kernel_size=(3, 3), padding='same', activation='relu'),
    Conv2D(384, kernel_size=(3, 3), padding='same', activation='relu'),
    Conv2D(256, kernel_size=(3, 3), padding='same', activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),

    # Flatten the output of the convolutional layers
    Flatten(),

    # Fully connected layers
    Dense(4096, activation='relu'),
    Dense(4096, activation='relu'),

    # Output layer with 2k joint coordinates (an x and a y per keypoint)
    Dense(2 * k, activation='linear')
])

# Compile the model; mean absolute error is a more meaningful metric
# than accuracy for a coordinate-regression task
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

# Model summary
model.summary()

The initial layer accepts an image of predetermined dimensions, with a size equivalent to the total number of pixels multiplied by three colour channels. Conversely, the final layer produces the desired regression output, consisting of 2k joint coordinates in this scenario.

At the heart of DeepPose lies its carefully designed architecture, crafted to optimise performance and accuracy. The network architecture, denoted as

C(55 × 55 × 96) − LRN − P − C(27 × 27 × 256) − LRN − P − C(13 × 13 × 384) − C(13 × 13 × 384) − C(13 × 13 × 256) − P − F(4096) − F(4096),

comprises various layers tailored for specific tasks. Convolutional layers (C) extract features from the input image, local response normalization (LRN) layers enhance feature discrimination, pooling layers (P) downsample feature maps to reduce computational complexity, and fully connected layers (F) perform high-level feature integration.

The initial two convolutional layers utilize filter sizes of 11 × 11 and 5 × 5, respectively, while the subsequent layers employ a filter size of 3 × 3. This progressive reduction in filter size allows the network to extract increasingly intricate features from the input image, facilitating more accurate and robust pose estimation. DeepPose’s architecture places pooling layers after the first, second, and fifth convolutional layers. Pooling layers serve to enhance performance by reducing the resolution of feature maps while preserving essential spatial information. The network’s input is an image sized at 220 × 220, and the first convolutional layer applies its 11 × 11 filters with a stride of 4.
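The feature-map sizes quoted in the architecture string can be verified with a little arithmetic. The sketch below (assuming TensorFlow-style 'same' padding on each convolution, so only the stride shrinks the map) reproduces the 55, 27, and 13 dimensions:

```python
def conv_out(size, kernel, stride=1, padding='same'):
    """Spatial output size of a conv layer (TensorFlow-style padding)."""
    if padding == 'same':
        return -(-size // stride)          # ceil(size / stride)
    return (size - kernel) // stride + 1   # 'valid' padding

s = conv_out(220, 11, stride=4)  # conv1: 11x11, stride 4 -> 55
s = s // 2                       # 2x2 max-pool           -> 27
s = conv_out(s, 5)               # conv2: 5x5             -> 27
s = s // 2                       # 2x2 max-pool           -> 13
s = conv_out(s, 3)               # conv3-5: 3x3 convs     -> 13
s = s // 2                       # final 2x2 max-pool     -> 6
```

The final 6 × 6 × 256 feature map flattens to 9216 values, which feed the first F(4096) fully connected layer.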

During training, the DNN learns to iteratively adjust its internal parameters (weights and biases) through a process known as backpropagation, minimizing the difference between the predicted joint coordinates and the ground truth coordinates provided in the training data. This optimization process allows the DNN to gradually improve its ability to accurately predict joint coordinates from input images.
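The optimization loop described above can be illustrated at a toy scale. The sketch below stands a single weight in for the whole network and shows gradient descent shrinking the squared error between a predicted and a ground-truth coordinate (the specific numbers are made up for illustration):

```python
# Toy stand-in for the DNN: one weight w predicts a coordinate as w * x.
x, target = 2.0, 10.0   # input feature and ground-truth coordinate (made up)
w, lr = 1.0, 0.05       # initial weight and learning rate

for step in range(20):
    pred = w * x
    loss = (pred - target) ** 2      # squared-error loss on the coordinate
    grad = 2 * (pred - target) * x   # dLoss/dw via the chain rule
    w -= lr * grad                   # gradient-descent parameter update
```

In the real network the same chain-rule computation is carried backward through every layer (backpropagation), updating millions of weights per step instead of one.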

Deep neural network architecture

Deep neural networks have revolutionized human pose detection, yet they are not without limitations. The complexity of these models demands significant computational resources, which can be a barrier to real-time processing or deployment on less powerful devices. Their performance hinges on large volumes of high-quality training data, and any lack thereof can lead to inadequate generalization. There’s also a risk of overfitting the training data, which can diminish the model’s ability to predict new poses accurately.

DNNs can struggle with occlusions and the inherent variability in human shapes and postures. Moreover, the “black box” nature of these networks poses challenges for interpretability, which is crucial for applications that require a clear understanding of decision-making processes. Additionally, the cost associated with the necessary equipment and the inconvenience of sensor attachment for data collection can be prohibitive, restricting these advanced technologies to well-funded research institutions and making them impractical for continuous monitoring in everyday settings.
