Paper Review: Cylindrical Convolutional Networks for Joint Object Detection and Viewpoint Estimation

yw_nam · Published in Analytics Vidhya · Aug 3, 2020

All figures and tables are from the paper unless otherwise noted (anything from another paper or website is marked).

Content

  1. Abstract
  2. Method
  3. Result and Experiments
  4. Conclusion
  5. My Opinion

1. Abstract

Fig 1. Illustration of cylindrical convolutional networks

This paper was accepted at CVPR 2020.

The authors note that many CNN models handle geometric deformation through techniques such as data augmentation, large model capacity, and spatial transformations. They also argue that viewpoint changes occur in 3D space, while most CNN models address them only in 2D image space. Therefore, such models have a limited ability to handle large geometric transformations (e.g., object scale, viewpoint, and part deformations).

According to the paper, jointly estimating viewpoint alongside object detection with CNNs has recently attracted interest as a way to address viewpoint variation. One line of work first represents a 3D object with a set of 2D rendered images, extracts features from each image at a different viewpoint, and then aggregates them for object category classification. However, this is clearly impractical: in real problems, 2D renderings of the same object from multiple viewpoints are usually not available.

Therefore, the authors propose cylindrical convolutional networks (CCNs). The main ideas of the paper are as follows.

  • The key idea is to extract the view-specific feature conditioned on the object viewpoint (i.e., azimuth) that encodes structural information at each viewpoint as in 3D object recognition methods.
  • They present a new, differentiable argmax operator called the sinusoidal soft-argmax, which handles the sinusoidal (periodic) properties of the viewpoint to predict continuous values from the discretized viewpoint bins.

The authors argue that this improves both object detection and viewpoint estimation.

As of this writing (2020-08-04), the official code for this model has not been released.

2. Method

2–1. Problem statement and Motivation

Fig 2. Intuition of cylindrical convolutional networks

First, the purpose of this model is to predict object categories and viewpoints for a given photo, that is, in 2D image space. N_c and N_v denote the number of object classes and the number of viewpoint bins, respectively; the values of c and v range over the classes in the dataset and the chosen viewpoint discretization.

As shown in Fig. 2, in case (a) the category classifier and viewpoint classifier operate independently, so they cannot improve each other's performance. Case (b) requires multi-view images, which is a problem in practical applications. In case (c), the model estimates the object category likelihood by extracting view-specific features through a cylindrical convolutional kernel at each viewpoint, and selects the viewpoint kernel that maximizes the object category likelihood.

Note.

Since viewpoint is a continuous variable, you may ask, "isn't this a regression problem?" However, the authors argue, citing prior work, that a regression approach cannot well represent the ambiguities that exist between different viewpoints of objects with symmetries or near-symmetries.
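A one-line illustration of that ambiguity argument (my own example, not from the paper): if a near-symmetric object looks equally plausible at two opposite viewpoints, an L2 regressor's best guess is their mean, which matches neither.

```python
# A near-symmetric car could face 0 deg or 180 deg with equal probability.
# An L2-trained regressor minimizes expected error by predicting the mean:
pred = (0.0 + 180.0) / 2  # 90.0 deg -- wrong for both plausible viewpoints
```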

2–2. Cylindrical Convolutional Networks

Fig 3. Key idea of cylindrical convolutional networks
Eq 1. Calculate F_v

The authors say that view-specific features benefit from the structural similarity between nearby viewpoints: since the view-specific kernels are sampled from the same cylindrical kernel with overlapping windows, adjacent viewpoint bins share most of their weights, as in the sketch below.
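To make Eq. 1 concrete, here is a minimal PyTorch sketch of how I understand the sampling: one shared weight tensor is laid out around the azimuth axis, and the kernel for each viewpoint bin is a circular (wrap-around) window over it. The class name, shapes, and initialization are my own assumptions, since the authors' code is unreleased.

```python
import torch
import torch.nn.functional as F

class CylindricalConv(torch.nn.Module):
    """Sketch of view-specific kernels sampled from one cylindrical kernel."""

    def __init__(self, in_ch, out_ch, n_views=24, k=3):
        super().__init__()
        self.n_views, self.k = n_views, k
        # One column of weights per azimuth bin; windows over columns wrap around.
        self.cyl_weight = torch.nn.Parameter(
            torch.randn(out_ch, in_ch, k, n_views) * 0.01)

    def forward(self, x):
        # x: (B, in_ch, k, k) RoI feature, assumed pooled to the kernel size.
        feats = []
        for v in range(self.n_views):
            # Circular sliding window of width k centered on azimuth bin v.
            cols = [(v + d) % self.n_views
                    for d in range(-(self.k // 2), self.k // 2 + 1)]
            w_v = self.cyl_weight[..., cols]    # (out_ch, in_ch, k, k)
            feats.append(F.conv2d(x, w_v))      # (B, out_ch, 1, 1)
        return torch.stack(feats, 1).flatten(2) # (B, n_views, out_ch)
```

The v-th slice of the output plays the role of the view-specific feature F_v; because windows for bins v and v+1 overlap in k-1 columns, nearby viewpoints naturally share structure.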

2–3. Joint Category and Viewpoint Estimation

Category Classification.

Eq 2. Category classification for each class c

The authors note that Eq. 2 computes the classification score S_c for each class c; because the viewpoint term v appears in it, the gradient from S_c also corrects the viewpoint probabilities.
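A minimal sketch of how I read Eq. 2 (I am not certain of its exact form): the joint (viewpoint, category) likelihood is aggregated over the viewpoint bins to produce the class score, which is why gradients from S_c flow back into the viewpoint dimension. Shapes and names are assumptions.

```python
import torch

# p_vc: joint (viewpoint, category) likelihood per RoI -- dummy values here.
p_vc = torch.softmax(torch.randn(8, 24 * 12), dim=1).view(8, 24, 12)  # (B, N_v, N_c)
s_c = p_vc.sum(dim=1)  # (B, N_c): summing over v ties the class score to viewpoint
```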

Viewpoint Estimation

Next, for viewpoint estimation, one can select the view-specific feature that performs best at recognizing the object category. The authors argue that regression through P_(v,c) can be enabled via the sinusoidal soft-argmax, because viewpoints have a periodic characteristic. Therefore, the object viewpoint estimate for each class c is calculated as follows.

Eq 3. viewpoint classification for each class c

where sin(i_v) and cos(i_v) are obtained by applying the sinusoidal functions to each viewpoint bin center i_v (i.e., 0°, 15°, …, 345° for N_v = 24).
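A minimal sketch of the sinusoidal soft-argmax as I understand Eq. 3: each bin center is mapped through sin/cos, the probabilities form a soft expectation in that 2D space, and atan2 recovers a continuous angle that respects the 0°/360° wrap-around. The function name and shapes are my assumptions.

```python
import math
import torch

def sinusoidal_soft_argmax(p_v):
    """Continuous viewpoint from discretized bin probabilities (Eq. 3 as I read it).

    p_v: (B, N_v) softmax probabilities over viewpoint bins.
    """
    n_views = p_v.shape[-1]
    centers = torch.arange(n_views) * (2 * math.pi / n_views)  # 0, 15, ... deg for N_v = 24
    s = (p_v * torch.sin(centers)).sum(dim=-1)  # expected sin(i_v)
    c = (p_v * torch.cos(centers)).sum(dim=-1)  # expected cos(i_v)
    return torch.atan2(s, c)  # differentiable and periodic, unlike a hard argmax
```

Unlike a plain soft-argmax over bin indices, this stays correct near the wrap-around: probability mass split between 350° and 10° yields an angle near 0°, not 180°.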

Bounding Box Regression

The authors apply additional convolutional layers for bounding box regression with weights W_reg to produce N_v × N_c × 4 bounding box offsets, where t_(v,c) = f(F_v; W_reg). Each set of 4 values encodes the bounding-box transformation parameters from the initial location for one of the N_v × N_c (viewpoint, category) pairs. This leads to using a different set of boxes for each category and viewpoint bin.
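A small sketch of that offset layout: the chosen (viewpoint, class) pair indexes into the N_v × N_c sets of 4 transformation parameters. The indices and shapes here are illustrative assumptions.

```python
import torch

n_roi, n_v, n_c = 8, 24, 12
t = torch.randn(n_roi, n_v * n_c * 4).view(n_roi, n_v, n_c, 4)  # t_(v,c) per RoI

v_star = torch.randint(n_v, (n_roi,))  # selected viewpoint bin per RoI (dummy)
c_star = torch.randint(n_c, (n_roi,))  # predicted class per RoI (dummy)
idx = torch.arange(n_roi)
box_offsets = t[idx, v_star, c_star]   # (n_roi, 4): one box set per (v, c) pair
```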

Loss Functions

The full loss function is calculated as follows:

Eq 4. Full loss function.

where '^' (hat) denotes ground truth. The Iverson bracket indicator [·] evaluates to 1 if the condition is true and 0 otherwise. For background (c_hat = 0), there is no ground-truth bounding box or viewpoint, so L_reg and L_view are ignored. For datasets without viewpoint annotations (θ_hat = ∅), L_view is ignored and the viewpoint estimation task is trained in an unsupervised manner; otherwise, it is trained in a supervised manner. They use cross-entropy for L_cls, and smooth L1 for both L_reg and L_view.
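A hedged sketch of Eq. 4 with the gating described above: cross-entropy for classification, smooth L1 for the box and viewpoint terms, background RoIs contributing only L_cls, and L_view dropped when the dataset has no viewpoint labels. The function and argument names are my own.

```python
import torch
import torch.nn.functional as F

def ccn_loss(s_c, theta, t_box, c_gt, box_gt, theta_gt=None):
    """L = L_cls + [c_gt > 0] * L_reg + [c_gt > 0][theta_gt given] * L_view."""
    loss = F.cross_entropy(s_c, c_gt)                      # L_cls over all RoIs
    fg = c_gt > 0                                          # Iverson bracket [c_gt > 0]
    if fg.any():
        loss = loss + F.smooth_l1_loss(t_box[fg], box_gt[fg])        # L_reg, foreground only
        if theta_gt is not None:                                     # None: no viewpoint labels
            loss = loss + F.smooth_l1_loss(theta[fg], theta_gt[fg])  # L_view
    return loss
```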

3. Result and Experiments

Data.

The authors use the Pascal 3D+ and KITTI datasets. For a detailed description of each, please refer to the paper.

Experiments

Table 1. Joint object category and viewpoint estimation performance

Table 1 shows performance on the Pascal 3D+ dataset. As you can see, N_v = 24 (with CCN) gives the best performance.

Fig 5. Visualization of learned deep feature through Grad-CAM

Fig. 5 shows, from top to bottom, the input, the attention map without CCN, and the attention map with CCN. As you can see, with CCN the attention map covers the object well.

Table 2. Comparison of object detection on Pascal 3D+ dataset

Table 2 compares performance with other models. Except for some categories, the authors' model performs best.

Fig 6. Qualitative examples of joint object detection and viewpoint estimation on Pascal3D+ dataset
Fig 7. Qualitative examples of joint object detection and viewpoint estimation on KITTI dataset

Figs. 6 and 7 show results of joint object detection and viewpoint estimation; green boxes are predictions and black boxes are ground truth.

Conclusion

The authors propose a model that performs viewpoint estimation and object detection simultaneously through cylindrical convolutional networks (CCNs). The key ideas, as the authors state them, are as follows:

  • The key idea is to exploit view-specific convolutional kernels, sampled from a cylindrical convolutional kernel in a sliding-window fashion, to predict an object category likelihood at each viewpoint.
  • With this likelihood, they simultaneously estimate the object category and viewpoint using the proposed sinusoidal soft-argmax module.

My opinions

I think viewpoint estimation is a very attractive topic because it can be applied to several domains (e.g., data augmentation, performance improvement). However, I doubt this model would be able to distinguish the following pictures.

Fig 8. Two doll pictures with different viewpoint

The two pictures in Fig. 8 have almost the same horizontal viewpoint (azimuth) but clearly different vertical viewpoints (elevation). If my understanding of the paper is correct, the model would give the same output for these photos even though they have different vertical viewpoints. Still, it looks like very impressive work.
