Review — Pelee: A Real-Time Object Detection System on Mobile Devices
Pelee = Proposed PeleeNet as Backbone + Modified SSD as Object Detection Network (Image Classification & Object Detection)
In this story, Pelee: A Real-Time Object Detection System on Mobile Devices, (Pelee & PeleeNet), by University of Western Ontario, is reviewed. In this paper:
- PeleeNet is proposed, which uses conventional convolution only; it is a variant of the DenseNet architecture for mobile devices.
- Pelee, using PeleeNet as backbone and a modified SSD as object detection network, is a fast object detector. (This is similar to the YOLO object detection networks, which use DarkNet as backbone.)
This is a paper in 2018 NeurIPS with over 200 citations. (Sik-Ho Tsang @ Medium)
1. PeleeNet: Modified DenseNet
- PeleeNet modifies DenseNet in several places.
1.1. Two-Way Dense Layer
- Motivated by GoogLeNet, a 2-way dense layer is used to get different scales of receptive fields.
- One way of the layer uses a single 3×3 convolution.
- The other way of the layer uses two stacked 3×3 convolutions to learn visual patterns for large objects.
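Why the two ways see different scales can be checked with a small receptive-field calculation. This is a minimal sketch assuming stride-1, undilated convolutions; the helper name `receptive_field` is made up for illustration and is not from the paper.

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1, undilated convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 conv widens the field by k - 1 pixels
    return rf

small_way = receptive_field([3])     # one 3x3 convolution
large_way = receptive_field([3, 3])  # two stacked 3x3 convolutions
print(small_way, large_way)  # 3 5
```

So the second way covers a 5×5 window of its input while still using only 3×3 kernels.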
1.2. Stem Block
- Motivated by Inception-v4 and DSOD, a cost-efficient stem block is designed before the first dense layer.
- This stem block can effectively improve the feature expression ability without adding much computational cost.
1.3. Dynamic Number of Channels in Bottleneck Layer
- In DenseNet, for the first several dense layers, the number of bottleneck channels is much larger than the number of input channels, which means that for these layers the bottleneck layer increases the computational cost instead of reducing it.
- To maintain the consistency of the architecture, the bottleneck layer is still added to all dense layers, but its number of channels is dynamically adjusted according to the input shape, so that it does not exceed the number of input channels.
- Compared to the original DenseNet structure, the experiments show that this method can save up to 28.5% of the computational cost with a small impact on accuracy.
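The effect can be seen by counting multiply-accumulates (MACs) per output pixel. The following is a rough sketch, not the authors' implementation: it ignores the two-way structure, and the capping rule `min(4 * k, c_in)` is an assumed simplification of "does not exceed the input channels". The function names and the growth rate `k = 32` are illustrative.

```python
def dense_layer_macs(c_in, growth_rate, bottleneck=None):
    """MACs per output pixel for one dense layer.

    bottleneck=None: a single 3x3 conv from c_in to growth_rate channels.
    Otherwise: a 1x1 conv to `bottleneck` channels, then a 3x3 conv.
    """
    if bottleneck is None:
        return 9 * c_in * growth_rate
    return c_in * bottleneck + 9 * bottleneck * growth_rate

def dynamic_bottleneck(c_in, growth_rate, width=4):
    """Assumed rule: cap the bottleneck width at the number of input channels."""
    return min(width * growth_rate, c_in)

k = 32  # illustrative growth rate
for c_in in (32, 64, 128, 256):
    plain = dense_layer_macs(c_in, k)
    fixed = dense_layer_macs(c_in, k, bottleneck=4 * k)
    dynamic = dense_layer_macs(c_in, k, bottleneck=dynamic_bottleneck(c_in, k))
    print(c_in, plain, fixed, dynamic)
```

With 32 input channels, a fixed 128-channel bottleneck costs 40960 MACs against 9216 for a plain 3×3 convolution, while the capped bottleneck costs 10240. Once the input has grown to 128 channels or more, the fixed and dynamic variants coincide.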
1.4. Transition Layer without Compression
- The authors' experiments show that the compression factor proposed by DenseNet hurts the feature expression.
- The number of output channels is always kept the same as the number of input channels in transition layers.
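As a small illustration, DenseNet's transition layer outputs ⌊θ·m⌋ channels for m input channels and compression factor θ; keeping the channel count unchanged corresponds to θ = 1. The function name and the input width of 256 below are hypothetical.

```python
import math

def transition_out_channels(c_in, compression=1.0):
    """Output channels of a transition layer: floor(compression * c_in)."""
    return math.floor(compression * c_in)

c_in = 256  # hypothetical input width
print(transition_out_channels(c_in, 0.5))  # 128, DenseNet-BC style compression
print(transition_out_channels(c_in))       # 256, no compression as in PeleeNet
```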
1.5. Composite Function
- Post-activation (Conv-BN-ReLU) is used instead of the pre-activation used in DenseNet.
- In this case, BN can be merged with the convolution layer at the inference stage, which greatly speeds up inference.
- To compensate for the negative impact on accuracy caused by this change, a shallow and wide network structure is used, and a 1×1 convolution layer is added after the last dense block to get stronger representational abilities.
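Merging BN into the preceding convolution is a standard inference-time rewrite. The sketch below shows it for a single output channel of a 1×1 convolution at one pixel, in plain Python; all numeric values and function names are made up for illustration and are not taken from the paper.

```python
import math

def conv1x1(x, weights, bias):
    """One output channel of a 1x1 convolution at a single pixel."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed running statistics."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return conv weights and bias that reproduce Conv followed by BN."""
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in weights], (bias - mean) * scale + beta

# Hypothetical values for one output channel
x = [0.5, -1.0, 2.0]
w, b = [0.2, 0.4, -0.1], 0.05
bn = dict(gamma=1.5, beta=-0.3, mean=0.1, var=0.8)

reference = batch_norm(conv1x1(x, w, b), **bn)
w_folded, b_folded = fold_bn(w, b, **bn)
folded = conv1x1(x, w_folded, b_folded)
assert math.isclose(reference, folded, rel_tol=1e-9)
```

The folding works because BN directly follows the convolution in Conv-BN-ReLU. In the pre-activation order (BN-ReLU-Conv), the ReLU sits between BN and the next convolution, so the two linear operations cannot be combined this way.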
2. Pelee: Modified SSD
- Pelee modifies SSD in several places.
2.1. Feature Map Selection
- A carefully selected set of 5 scale feature maps (19×19, 10×10, 5×5, 3×3, and 1×1) is used.
- To reduce computational cost, the 38×38 feature map is not used.
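A quick count of spatial locations shows how much prediction work the 38×38 map would add. This is only an arithmetic sketch; the number of default boxes per location is not given in this review, so it is left out.

```python
def num_locations(sizes):
    """Total number of spatial positions across square feature maps."""
    return sum(s * s for s in sizes)

pelee_maps = [19, 10, 5, 3, 1]
with_38 = [38] + pelee_maps

print(num_locations(pelee_maps))  # 496
print(num_locations(with_38))     # 1940
```

The 38×38 map alone would account for 1444 of 1940 locations, roughly three quarters of all positions at which the prediction layers would have to run.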
2.2. Residual Prediction Block
- For each feature map used for detection, a residual block (ResBlock) is built before conducting prediction.
2.3. Small Convolutional Kernel for Prediction
- The residual prediction block makes it possible to apply 1×1 convolutional kernels to predict category scores and box offsets.
- The experiments show that the accuracy of the model using 1×1 kernels is almost the same as that of the model using 3×3 kernels.
- However, 1×1 kernels reduce the computational cost by 21.5%.
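For a single prediction layer, the saving from the smaller kernel follows directly from the MAC count of a convolution. The sketch below assumes stride 1 with the output the same size as the input; the channel counts are hypothetical. Note that the 21.5% quoted above is for the whole model, whereas the ratio computed here is for one predictor layer in isolation.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """MACs of a stride-1 convolution whose output is height x width."""
    return height * width * kernel * kernel * c_in * c_out

# Hypothetical predictor on the 19x19 feature map
h = w = 19
c_in, c_out = 256, 126

m3 = conv_macs(h, w, c_in, c_out, kernel=3)
m1 = conv_macs(h, w, c_in, c_out, kernel=1)
print(m3 // m1)  # 9
```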
3. PeleeNet: Network Architecture
- The entire network consists of a stem block and four stages of feature extractor.
- Except for the last stage, the last layer in each stage is an average pooling layer with stride 2.
- The number of layers in the first two stages is specifically controlled to an acceptable range.
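How the channel count grows across the four stages can be traced with a few lines. The configuration used here (32 stem output channels, growth rate 32, and 3, 4, 8, 6 dense layers per stage) is assumed from the paper's architecture table and is not stated in this review, so treat the numbers as assumptions; the function name is hypothetical.

```python
def stage_output_channels(stem_channels, layers_per_stage, growth_rate):
    """Channels after each stage when transition layers do not compress."""
    channels = stem_channels
    outputs = []
    for n_layers in layers_per_stage:
        # each dense layer concatenates growth_rate new channels to its input
        channels += n_layers * growth_rate
        # the transition layer keeps the channel count unchanged
        outputs.append(channels)
    return outputs

# Assumed PeleeNet configuration (see the note above)
print(stage_output_channels(32, [3, 4, 8, 6], 32))  # [128, 256, 512, 704]
```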
4. PeleeNet: Ablation Study
- The ablation study uses a customized Stanford Dogs dataset: a subset of ILSVRC 2012 built from the ImageNet classes used in Stanford Dogs. All images are of dog breeds. Both training data and validation data are copied exactly from the ILSVRC 2012 dataset.
- Number of categories: 120
- Number of training images: 150,466
- Number of validation images: 6,000
- A DenseNet-like network called DenseNet-41 is used as the baseline. There are two differences between this model and the original DenseNet:
- The first one is the parameters of the first conv layer. There are 24 channels in the first conv layer instead of 64, and the kernel size is changed from 7×7 to 3×3.
- The second one is that the number of layers in each dense block is adjusted to meet the computational budget.
- After combining all the design choices, PeleeNet achieves 79.25% accuracy on Stanford Dogs, which is 4.23% higher than DenseNet-41 at less computational cost.
5. Results on Image Classification
5.1. ImageNet
- PeleeNet achieves higher accuracy than MobileNetV1 and ShuffleNet V1 at no more than 66% of their model size and at lower computational cost.
- The model size of PeleeNet is only 1/49 of VGG16.
5.2. Speed on NVIDIA TX2
- PeleeNet is much faster than MobileNetV1 and MobileNetV2 on TX2.
- With half-precision floating point (FP16), PeleeNet runs 1.8 times faster than in FP32 mode.
- In contrast, networks built with depthwise separable convolution hardly benefit from the TX2 half-precision (FP16) inference engine. MobileNetV1 and MobileNetV2 run at almost the same speed in FP16 mode as in FP32 mode.
5.3. Speed on iPhone 8
- Similarly, PeleeNet obtains higher accuracy with a smaller model size.
6. Results on Object Detection
6.1. Effects of Various Design Choices
- The model with the residual prediction block is 2.2% more accurate than the model without it.
- The accuracy of the model using 1×1 kernels for prediction is almost the same as that of the model using 3×3 kernels. However, 1×1 kernels reduce the computational cost by 21.5% and the model size by 33.9%.
6.2. PASCAL VOC 2007
- The accuracy of Pelee is higher than that of TinyYOLOv2 by 13.8% and higher than that of SSD+MobileNetV1 by 2.9%.
- It is even higher than that of YOLOv2-288, at only 14.5% of the computational cost of YOLOv2-288.
- Pelee achieves 76.4% mAP when the model trained on COCO trainval35k is fine-tuned on the 07+12 dataset.
6.3. Speed on Real Devices
- Although the residual prediction block used in Pelee increases the computational cost, Pelee still runs faster than SSD+MobileNetV1 on iPhone and on TX2 in FP32 mode.
- Also, Pelee has a greater speed advantage over SSD+MobileNetV1 and SSDLite+MobileNetV2 in FP16 mode.
- Pelee can run at 23.6 FPS on iPhone 8 and 125 FPS on NVIDIA TX2 with high accuracy.
6.4. COCO
- Pelee is not only more accurate than SSD+MobileNetV1, but also more accurate than YOLOv2 in both mAP@[0.5:0.95] and mAP@0.75.
- Meanwhile, Pelee is 3.7 times faster and 11.3 times smaller in model size than YOLOv2.
Reference
[2018 NeurIPS] [Pelee & PeleeNet]
Pelee: A Real-Time Object Detection System on Mobile Devices