[Paper review] D4LCN: Learning Depth-Guided Convolutions for Monocular 3D Object Detection

Damin Lee
6 min read · Oct 11, 2021


Paper : https://arxiv.org/abs/1912.04799

Introduction

  • Pseudo-LiDAR based 3D object detection has the drawback that detection performance degrades when the pseudo-LiDAR depth result is inaccurate.
  • In contrast, D4LCN still produces decent 3D object detection results even when the estimated depth is inaccurate. (Figure 1)
  • “Image” based vs “LiDAR” based 3D object detection
    Images contain no depth information, so near and far objects cannot be distinguished, and object scale varies with distance. In addition, 2D convolution computes over object and background without separating them.
    With LiDAR, semantic segmentation information is lost, and since depth quality depends on the capture conditions, 3D object detection performance is also affected by the depth result. Moreover, LiDAR is expensive and produces sparse depth.
  • Our D4LCN
    Instead of learning a single global kernel over the whole image, it learns from the depth map using the local information of each pixel and channel.
    Exemplar kernel (sample-wise) : learn specific scene geometry for each image
    Local convolution (point-wise) : distinguish object and background regions for each pixel
    Depth-wise convolution (depth-wise) : learn different channel filters in a convolutional layer with different purposes and to reduce computational complexity
    Adaptive dilation rate : learn different receptive fields for different filters to account for objects with diverse scales
  • Our contributions
    1) D4LCN : a novel 3D object detector that uses a depth map together with a monocular image
    2) A single-stage 3D object detector that narrows the performance gap between 2D convolution and point cloud-based methods for 3D representation
    3) D4LCN achieves state of the art among monocular 3D detection methods and ranks 1st on the KITTI benchmark.

Methodology

Backbone

This model has two independent backbones.

a. Feature extraction network
- input type : RGB image
- backbone : ResNet-50 without its final FC and pooling layers; block4 uses stride 1 and all of its conv layers are replaced with dilated conv layers.

b. Filter generation network
- input type : estimated depth map
- backbone : the first three blocks of ResNet-50
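
As a rough sketch of how the two backbones could be wired up (assuming PyTorch/torchvision; the module slicing, input resolution, and 3-channel depth replication are my own illustrative choices, not the authors' code):

```python
import torch
import torch.nn as nn
import torchvision

# Feature extraction network: ResNet-50 trunk where block4 (layer4) keeps
# stride 1 and uses dilated convolutions, so the output stride stays at 16.
resnet_rgb = torchvision.models.resnet50(
    replace_stride_with_dilation=[False, False, True])
feature_net = nn.Sequential(*list(resnet_rgb.children())[:-2])   # drop avgpool + fc

# Filter generation network: only the first three blocks of ResNet-50,
# fed with the estimated depth map (replicated to 3 channels here).
resnet_depth = torchvision.models.resnet50()
filter_net = nn.Sequential(*list(resnet_depth.children())[:-3])  # up to layer3

rgb = torch.randn(1, 3, 384, 1280)                      # KITTI-like resolution
depth = torch.randn(1, 1, 384, 1280).repeat(1, 3, 1, 1)

img_feat = feature_net(rgb)      # (1, 2048, 24, 80) thanks to the dilated layer4
depth_feat = filter_net(depth)   # (1, 1024, 24, 80), same spatial size
```

Both branches end at the same spatial resolution, which is what allows the depth features to act as per-pixel filters on the image features in the next module.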

Depth-Guided Filtering Module

Since the kernels applied to the feature extraction network are generated from the depth map, they encode local structural meaning such as depth information.

  • Depth-wise local convolution(DLCN)
    Unlike a depth-wise convolution (DCN), which captures global semantics, a depth-wise local convolution (DLCN) is designed to learn local semantics.
    The input positions are shifted, as in the equation below.
    (k : kernel size, n : the index of the feature map layer)
  • Shift-pooling operator
    To exchange information between channels at each point of the DLCN, point-wise shift pooling is applied and then an element-wise mean is taken.
    For example, if nf is 3, the three shifted point-wise layers [1,2,…,n], [2,3,…,n,1], [3,4,…,n,1,2] are averaged element-wise (see the sketch after this list).
    A key advantage of this technique is that its computational cost is lower than that of a conventional group convolution layer: it requires no additional parameters while still allowing information to be exchanged across channels.
  • Guided filtering with adaptive dilated function
    Since the filters in the DLCN are generated independently, the dilation rate can be chosen to match the receptive-field scale required by each filter.
    Because a different kernel is applied to each pixel and the dilation rate changes per channel (the adaptive dilation function), the full RGB image can be exploited and the scale-sensitivity problem of 2D convolution is resolved.
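
A minimal sketch of the shift-pooling operator described above (assuming PyTorch and an (N, C, H, W) layout; nf = 3 follows the example in the text, the rest is illustrative):

```python
import torch

def shift_pooling(filters: torch.Tensor, nf: int = 3) -> torch.Tensor:
    """Average nf channel-shifted copies of the generated filters.

    For nf = 3 this averages the channel orderings
    [1, 2, ..., n], [2, 3, ..., n, 1] and [3, 4, ..., n, 1, 2],
    so neighbouring channels exchange information without adding
    any learnable parameters (cheaper than a group convolution).
    """
    shifted = [torch.roll(filters, shifts=-i, dims=1) for i in range(nf)]
    return torch.stack(shifted, dim=0).mean(dim=0)

# toy usage: filters generated from the depth branch, shape (N, C, H, W)
f = torch.randn(2, 8, 16, 16)
print(shift_pooling(f, nf=3).shape)   # torch.Size([2, 8, 16, 16])
```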

2D-3D Detection Head

  • Formulation
    A. GT
    - 2D bounding box : [x, y, w, h] — (x, y) is the center of the 2D box
    - 3D center : [x, y, z]
    - 3D shape : [w, h, l] — width, height and length
    - Allocentric pose : α ∈ [-π, π]
    B. Outputs
    - The number of parameters per anchor : 35 + n꜀
    (n꜀ : the number of classes)
    - The size of the output : h₄ * w₄ * nₐ * (35 + n꜀)
    (nₐ : the number of anchors)
    Each anchor contains 35 + n꜀ parameters.
  • 2D-3D Anchor
    Designed in the 2D space.
    Defined using the parameters of both the 2D and 3D spaces.
  • For 2D Anchors
    12 different scales ranging from 30 to 400 pixels (30 * 1.265^exp, exp ∈ [0, 11])
    aspect ratios : [0.5, 1.0, 1.5]
    → 36 anchors in total (see the sketch at the end of this section)
  • For 3D Anchors
    Project all GT 3D boxes onto the 2D space
    For each box, calculate the IoU with the 2D anchors and assign the corresponding 3D box to the anchors that have an IoU ≥ 0.5
  • Data Transformation
    anchor-based transformation of the 2D-3D box
Data transformation using anchor parameters
  • Losses
    Contains classification loss, 2D regression loss, 3D regression loss and 2D-3D corner loss
    Focal loss is used to balance the samples
Overall loss of D4LCN
Classification loss, which uses the standard cross-entropy loss
2D and 3D regression losses, which use the SmoothL1 loss
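
As a quick numerical check of the 2D anchor design described above (plain Python; the width/height parameterization is just one possible convention, not necessarily the paper's):

```python
# 12 scales from 30 to roughly 400 pixels, 3 aspect ratios -> 36 anchor templates
scales = [30 * 1.265 ** e for e in range(12)]   # exp in [0, 11]
aspect_ratios = [0.5, 1.0, 1.5]                 # assumed to be height / width

anchors = [(s, s * r) for s in scales for r in aspect_ratios]   # (width, height)

print(len(anchors))                              # 36
print(round(scales[0]), round(scales[-1]))       # 30 398  (roughly 30 to 400 px)
```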

Experiments

Dataset and Setting

  • KITTI 3D object detection dataset
    Consists of 7481 training images and 7518 test images
    Total of 80,256 2D-3D labeled objects with three object classes (Car, Pedestrian, Cyclist)
    Each 3D GT is assigned to one of three difficulty classes (easy, moderate, hard)
    Includes three tasks : 2D detection, 3D detection and Bird’s eye view
  • Evaluation Metrics
    Precision-recall curves are used for evaluation (IoU ≥ 0.7)
    AP|ᵣ₁₁ : 11-point Interpolated Average Precision metric
    AP|ᵣ₄₀ : 40 recall positions-based metric
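
A small sketch of how the two metrics differ (interpolated precision sampled at 11 vs. 40 recall positions; single-class and simplified, not the official KITTI evaluation code, which also skips recall = 0 for AP|ᵣ₄₀):

```python
import numpy as np

def interpolated_ap(recall: np.ndarray, precision: np.ndarray, num_points: int) -> float:
    """Average the interpolated precision max(p(r') for r' >= r) over sampled recalls."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, num_points):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / num_points

# toy precision-recall curve
rec = np.linspace(0.05, 0.95, 19)
prec = np.clip(1.0 - 0.8 * rec, 0.0, 1.0)

print(interpolated_ap(rec, prec, num_points=11))   # AP|R11-style
print(interpolated_ap(rec, prec, num_points=40))   # AP|R40-style
```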

Comparative Results

Comparison with the top 14 monocular methods on the leaderboard, among which our method ranks first.

  1. Outperforms the second-best competitor “AM3D” by a large margin
  2. Most competitors utilize a detector pretrained on COCO/KITTI or multi-stage training to obtain better 2D detection and stable 3D results,
    while our model is trained end-to-end from an ImageNet pre-trained model

Evaluation of Depth Maps
Four different depth estimation (DE) methods are used and applied to 3D detection.
The three supervised methods (PSMNet, DispNet, DORN) are better than the unsupervised method (MonoDepth).
Among the supervised methods, the stereo-based methods (PSMNet, DispNet) are better than the monocular-based DORN.

  1. The accuracy is higher with a better depth map
    - a better depth map provides better scene geometry and local structure
  2. As the quality of the depth map increases, the growth of detection accuracy becomes slower
  3. Even with depth maps obtained by unsupervised learning, our method achieves SOTA results
    - i.e., our method relies less on the quality of the depth maps

Evaluation of Convolutional Approaches

Ablation results show the effectiveness of the depth-guided filtering module for 3D object detection.

Multi-Class 3D Detection

Since a person is a non-rigid body, it is hard to accurately estimate its depth information (Pedestrian, Cyclist).
All pseudo-LiDAR based methods fail to detect these two categories, whereas D4LCN still achieves satisfactory performance.

The paper also shows the activation maps corresponding to different filters of D4LCN.
Different filters on the same layer of our model use different sizes of receptive fields to handle objects of different scales.

Conclusion

The kernels of D4LCN are generated dynamically, conditioned on the depth map.
As a result, it
1. addresses the problem of the scale sensitivity and meaningless local structure of 2D convolutions
2. benefits from the high-level semantic information of RGB images

#deeplearning
