Deep learning for pose estimation of objects

Jacco Zwanepol
Published in Mainblades
5 min read · Jun 1, 2018

Artificial Intelligence (AI), more specifically Machine Learning (ML), has made massive leaps in recent years. A big catalyst for this is ever-increasing computing power and the expanding availability of large amounts of training data. As a result, the technologies we interact with on a day-to-day basis increasingly make use of these advances. Blurring the line between what we consider reality and science fiction, technologies such as the Google Translate app already translate text by simply pointing a phone’s camera at any sign or menu and overlaying the translation directly over the original text.

On-the-go translation with Google Translate

The beauty of Machine Learning is that it enables computers to learn on their own. Feeding the learning algorithm sufficient amounts of data enables it to identify patterns, build models, and make predictions without requiring predetermined rules.

Convolutional Neural Networks (CNNs) are a big part of what makes machine learning so powerful. CNNs are designed specifically to take images as input and perform tasks like classification and object detection. At Mainblades Inspections (MBI) we used this technology to explore its potential for combined object detection, classification and 3D pose estimation.

The CNN we designed has the following structure, making use of convolutional layers, max pooling layers and a pass-through layer.

YOLOv2 architecture

The convolution layer performs a simple filtering operation on the input image. In the YOLO architecture, 3x3 kernels (features) are used to convolve the image. These kernels are applied to the image in a sliding-window pattern, multiplying the kernel weights with the pixels in each window and summing these products into a single convolved value.

Convolution operation

The result is a feature map that specifies how well the kernel matches the different windows in the image. In each convolutional layer multiple different kernels are applied to the image, resulting in a range of feature maps as output. The 1x1 kernels are then used to decrease the number of feature maps by combining them and keeping only the most relevant convolved features. This keeps the network efficient and allows for a deeper network architecture.

Example of how two different types of features are applied to the input image and what the two resulting feature maps look like.
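
To make the convolution operation more concrete, below is a minimal NumPy sketch of a single 3x3 kernel sliding over a toy grayscale image; the image and kernel values are made up purely for illustration. In the real network many such kernels are applied in every layer (and across all input channels), and the 1x1 kernels then combine the resulting stack of feature maps into a smaller one.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

# A toy 6x6 "image" with a dark left half and a bright right half,
# and a 3x3 kernel that responds to that vertical edge.
image = np.array([[0, 0, 0, 1, 1, 1]] * 6, dtype=float)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

print(convolve2d(image, kernel))  # 4x4 feature map, largest values at the edge
```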

Both an activation function and batch normalization are applied to the output of each convolution operation. The activation function determines how strongly each kernel responds to specific regions of the image (windows), specifying its importance in solving the objective function. The batch normalization step normalizes the output of each convolutional layer over the mini-batch; because the mini-batch statistics vary, this also introduces a small amount of noise that acts as a regularizer and helps prevent overfitting of the network. This ensures that the network is capable of making accurate predictions for data outside of the training examples.
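
As a rough sketch of what one such layer looks like in code (here in PyTorch, purely for illustration; the channel sizes are assumptions), a convolution is typically followed by batch normalization and a leaky ReLU activation:

```python
import torch
import torch.nn as nn

# One YOLOv2-style convolutional block: 3x3 convolution, batch normalization,
# then a leaky ReLU activation. The channel sizes are illustrative.
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),   # normalizes each feature map over the mini-batch
    nn.LeakyReLU(0.1),    # keeps positive responses, strongly damps negative ones
)

x = torch.randn(1, 3, 416, 416)  # a dummy 416x416 RGB input
print(conv_block(x).shape)       # torch.Size([1, 32, 416, 416])
```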

Each convolutional layer generates a series of feature maps, and the number of feature maps increases as the layers go deeper. The problem with this is that the computational complexity also increases with the number of feature maps. To solve this problem, max pooling layers are introduced. These layers reduce the spatial dimensions by half. The process is very simple: a sliding window of size 2x2 and stride 2 is used to generate a new feature map that only keeps the maximum value of each 2x2 window, essentially keeping only the most important information.

Max pooling operation
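
In code, 2x2 max pooling with stride 2 can be sketched as follows (a minimal NumPy version, assuming the input height and width are even):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep only the maximum of every non-overlapping 2x2 window."""
    h, w = feature_map.shape
    pooled = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            pooled[i // 2, j // 2] = feature_map[i:i + 2, j:j + 2].max()
    return pooled

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]], dtype=float)

print(max_pool_2x2(fm))
# [[4. 2.]
#  [2. 8.]]
```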

There are a total of 5 max pooling layers, reducing the 416x416 input image to a 13x13 output. However, this reduction in resolution makes it increasingly difficult to accurately detect smaller objects. To solve this problem a pass-through layer is introduced. In the pass-through layer, higher-resolution feature maps are reorganized so that they can be concatenated onto lower-resolution feature maps. Essentially, a 26x26 feature map gets restructured into four 13x13 feature maps, which are then concatenated onto the output of a 13x13 layer.

Illustration of the restructuring of the feature map in the pass-through layer
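
The restructuring amounts to a space-to-depth operation: every 2x2 block of the higher-resolution map is spread over four lower-resolution maps, which can then be stacked onto the deeper layer's output along the channel dimension. A rough NumPy sketch (the channel counts are assumptions for illustration):

```python
import numpy as np

def passthrough_reorg(feature_maps):
    """Turn (C, H, W) feature maps into (4*C, H/2, W/2) by taking every second
    pixel, so spatial detail is moved into extra channels instead of being lost."""
    return np.concatenate([
        feature_maps[:, 0::2, 0::2],   # top-left pixel of every 2x2 block
        feature_maps[:, 0::2, 1::2],   # top-right
        feature_maps[:, 1::2, 0::2],   # bottom-left
        feature_maps[:, 1::2, 1::2],   # bottom-right
    ], axis=0)

high_res = np.random.rand(64, 26, 26)    # feature maps from an earlier layer
low_res = np.random.rand(1024, 13, 13)   # feature maps from the deepest layer

reorganized = passthrough_reorg(high_res)                   # (256, 13, 13)
combined = np.concatenate([low_res, reorganized], axis=0)   # (1280, 13, 13)
print(combined.shape)
```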

The outcome of all these convolutional layers, max pooling layers and the pass-through layer is a 13 x 13 x n(P+V+C) tensor, where the 13x13 grid represents the different regions of the image and the depth, n(P+V+C), represents the predictions made in each cell. Each cell makes n predictions, each consisting of 11 class probabilities (P), 3 pose parameters (V) and an objectness score or confidence (C).

Output of YOLO_pose Network
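
As an illustration of how this output could be read out in code (the ordering of the values within a prediction is an assumption here, not necessarily the layout of the actual implementation):

```python
import numpy as np

P, V, C = 11, 3, 1   # class probabilities, pose parameters, confidence
n = 5                # predictions per grid cell (illustrative value)

# Dummy network output: a 13x13 grid with n*(P+V+C) values per cell.
output = np.random.rand(13, 13, n * (P + V + C))

# Read out one prediction from one grid cell.
row, col, k = 6, 6, 0                             # cell (6, 6), prediction 0
pred = output[row, col].reshape(n, P + V + C)[k]  # one (P+V+C)-vector
class_probs = pred[:P]                            # 11 class probabilities
pose = pred[P:P + V]                              # azimuth, elevation, distance
confidence = pred[P + V]                          # objectness score
print(class_probs.argmax(), pose, confidence)
```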

Because the network outputs 13 x 13 x n predictions, a post-processing step is required. In this post-processing step the correct predictions are filtered out using non-maximum suppression and confidence score normalization. The end result is a state-of-the-art network capable of object detection, classification and pose estimation.

The basic principle of how the YOLO_pose framework analyses images
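
The non-maximum suppression part of that post-processing can be sketched as follows (a generic version with an assumed IoU threshold, not the exact filtering used in our pipeline): predictions are sorted by confidence, and any prediction that overlaps a higher-confidence one too much is discarded.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap them too much."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.7, 0.8])
print(non_max_suppression(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too much
```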

Achieving these results required some additional work on the network: on its own, it did not perform better than current state-of-the-art techniques. We implemented and tested different learning techniques to determine the effect each had on the learning process. The experiments show that the training procedure can have a significant effect on the overall accuracy of the system, particularly data clustering, data selection and normalization of the confidence (the latter is not a training technique but a post-processing step used to filter out the correct predictions). Most of these experiments show the importance of data, and of how it is applied to the learning process.

Performed training experiments in sequential order

The result is a network that outperforms current state-of-the-art techniques regarding object detection, classification and pose estimation.

Average Viewpoint Precision for azimuth angle for the current state-of-the-art

Most of these techniques only look at the azimuth angle, whereas our network estimates the full 3D pose (azimuth, elevation and distance of an object).

Our technique can therefore not be directly compared to the others, as elevation and distance are new variables included in our estimation problem. In the end, an accuracy of 40.4% AVP was achieved.

Average Viewpoint Precision for azimuth, elevation and distance
