GluonCV 0.5: 15 New Models

Published in Apache MXNet · Sep 16, 2019

Author: Yi Zhu, Applied Scientist at Amazon

Many thanks to all of our GluonCV users who have given valuable feedback over the last couple of months. We’ve listened, and are happy to announce some new features in the latest version of GluonCV: Version 0.5.

  • Video Human Action Recognition Datasets and Models
  • MobileNetV3 Models for Classification
  • More Quantized Models for Segmentation
  • AlphaPose for Pose Estimation
  • VPLR for Semantic Segmentation

With these 15 additional models, the GluonCV Model Zoo now offers over 190 pre-trained models across a wide variety of tasks. You’ll find models for Image Classification, Object Detection, Semantic Segmentation, Instance Segmentation, Pose Estimation and Action Recognition.

Video Human Action Recognition

Action recognition is a fundamental task in video analysis. Its objective is to determine the actions being performed by people in a given video. GluonCV now provides the functionality to perform video action recognition. You’ll find examples of all the components used in complete applications, including model definitions, training scripts, and loss and metric functions. We also include pre-trained models and tutorials for bootstrapping your applications. Check out the sample video below, which is labelled with example predictions.

The following table summarizes our pre-trained video action recognition models, which achieve state-of-the-art performance on the UCF101 and Kinetics400 datasets. More state-of-the-art models (I3D, SlowFast, etc.) are coming in the next release, so please stay tuned for future updates.
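To get a feel for the API, here is a minimal sketch of running a pre-trained action recognition model on a single frame. It assumes 'vgg16_ucf101' is one of the model zoo names and fakes the input frame; in a real application you would decode and preprocess frames from your own video:

```python
import mxnet as mx
import numpy as np
from gluoncv import model_zoo

# Load a pre-trained single-frame action recognition network
# ('vgg16_ucf101' is assumed here; substitute any name from the table above).
net = model_zoo.get_model('vgg16_ucf101', nclass=101, pretrained=True)

# Stand-in for a decoded video frame: in practice, resize and center-crop a
# real RGB frame to 224x224 before normalizing with ImageNet statistics.
frame = np.random.uniform(0, 1, size=(224, 224, 3))
frame = (frame - np.array([0.485, 0.456, 0.406])) / np.array([0.229, 0.224, 0.225])
x = mx.nd.array(frame.transpose((2, 0, 1))).expand_dims(axis=0)  # 1x3x224x224

pred = net(x)
print('predicted class id:', int(pred.argmax(axis=1).asscalar()))
```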

MobileNetV3 Models for Classification

MobileNetV3 is a new efficient neural network architecture tuned for mobile CPUs. It is designed through hardware-aware network architecture search (complemented by the NetAdapt algorithm) and then improved through novel architecture advances. It is 2x faster than its predecessor, MobileNetV2, while being more accurate. We include both MobileNetV3-Large and MobileNetV3-Small in this release, targeting high- and low-resource use cases respectively. As far as we are aware, ours is the first open-source implementation that reproduces the accuracy reported in the original paper.
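Trying the new models is a one-liner via the model zoo. A minimal sketch, assuming the models are registered as 'mobilenetv3_large' and 'mobilenetv3_small' and that cat.jpg is a local image of your own:

```python
import mxnet as mx
from gluoncv import model_zoo
from gluoncv.data.transforms.presets.imagenet import transform_eval

# Pick the large or the small variant depending on your resource budget.
net = model_zoo.get_model('mobilenetv3_large', pretrained=True)

img = mx.image.imread('cat.jpg')  # hypothetical local image
x = transform_eval(img)           # resize, center-crop and normalize to 1x3x224x224
pred = net(x)
top = int(pred.topk(k=1)[0].asscalar())
print('predicted class:', net.classes[top])
```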

Quantized Models for Segmentation

We continue to collaborate closely with Intel on adding more INT8 models to GluonCV. Powered by Intel Deep Learning Boost (VNNI), INT8-quantized models in GluonCV can achieve a significant speedup over their 32-bit floating-point counterparts. The following performance results were benchmarked on an AWS EC2 c5.12xlarge instance with 24 physical cores. Note that you will need the latest nightly build of MXNet to use these new features.

The following table summarizes the new quantized models for semantic segmentation in this release. For segmentation models, the accuracy metric is pixel accuracy (PixAcc). Usage of INT8-quantized models is identical to that of standard GluonCV models: simply append the suffix _int8 to a model’s name and your need for speed is satisfied!
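For example, here is a sketch of loading an FP32 segmentation model next to its INT8 counterpart ('fcn_resnet101_voc' is assumed here as one of the quantized names; substitute any model from the table above):

```python
import mxnet as mx
from gluoncv import model_zoo

# The FP32 model and its INT8 counterpart differ only by the '_int8' suffix.
fp32_net = model_zoo.get_model('fcn_resnet101_voc', pretrained=True)
int8_net = model_zoo.get_model('fcn_resnet101_voc_int8', pretrained=True)

x = mx.nd.random.uniform(shape=(1, 3, 480, 480))
fp32_out = fp32_net(x)  # the call signature is identical for both networks
int8_out = int8_net(x)
```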

We also deliver a calibration tool that lets users quantize their models to INT8 on their own datasets. Currently, the calibration tool only supports hybridized Gluon models; users can quantize their own hybridized Gluon models via the quantize_net API.
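A minimal calibration sketch, assuming an FCN model and using random placeholder batches where you would plug in your own dataset:

```python
import mxnet as mx
from mxnet.contrib.quantization import quantize_net  # requires a recent MXNet nightly
from gluoncv import model_zoo

# The calibration tool works on hybridized Gluon models.
net = model_zoo.get_model('fcn_resnet101_voc', pretrained=True)
net.hybridize(static_alloc=True, static_shape=True)

# A few representative batches from your target dataset drive calibration;
# random data is used here only as a placeholder.
dataset = mx.gluon.data.ArrayDataset(mx.nd.random.uniform(shape=(8, 3, 480, 480)))
calib_data = mx.gluon.data.DataLoader(dataset, batch_size=1)

quantized_net = quantize_net(net, calib_data=calib_data,
                             calib_mode='naive', ctx=mx.cpu())
```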

AlphaPose for Pose Estimation

AlphaPose is an accurate multi-person pose estimator developed at SJTU. It is the first real-time open-source system to achieve 70+ mAP (72.3 mAP) on the COCO dataset and 80+ mAP (82.1 mAP) on the MPII dataset. We reproduced AlphaPose and include it in this release.
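AlphaPose is a top-down estimator, so it consumes person boxes from a detector. A minimal sketch of the pipeline, assuming the model zoo names 'yolo3_mobilenet1.0_coco' and 'alpha_pose_resnet101_v1b_coco' and a hypothetical local image people.jpg:

```python
from gluoncv import model_zoo, data
from gluoncv.data.transforms.pose import (detector_to_alpha_pose,
                                          heatmap_to_coord_alpha_pose)

# A person detector feeds boxes to the AlphaPose estimator.
detector = model_zoo.get_model('yolo3_mobilenet1.0_coco', pretrained=True)
detector.reset_class(['person'], reuse_weights=['person'])  # keep only persons
pose_net = model_zoo.get_model('alpha_pose_resnet101_v1b_coco', pretrained=True)

x, img = data.transforms.presets.yolo.load_test('people.jpg', short=512)
class_ids, scores, boxes = detector(x)

# Crop the detected people, run the pose network, and map heatmap peaks
# back to image coordinates.
pose_input, upscale_bbox = detector_to_alpha_pose(img, class_ids, scores, boxes)
heatmaps = pose_net(pose_input)
coords, confidence = heatmap_to_coord_alpha_pose(heatmaps, upscale_bbox)
```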

VPLR for Semantic Segmentation

We ported Improving Semantic Segmentation via Video Propagation and Label Relaxation (VPLR) to GluonCV. This method achieves state-of-the-art performance on three driving semantic segmentation benchmarks (Cityscapes, CamVid and KITTI).

In addition, the model generalizes well because it is trained with more video frames and a boundary relaxation technique. Below are two examples using the pre-trained model to evaluate Google Street View images. Note that the model is trained on images captured in German cities, while the Google Street View images were taken in California, United States. We also compare against the widely adopted PSPNet algorithm, and our results are much more robust. Overall, VPLR shows higher segmentation accuracy and crisper boundaries.
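Here is a minimal sketch of running the pre-trained model on your own street scene, assuming the VPLR Cityscapes model ships under the name 'deeplab_v3b_plus_wideresnet_citys' and street_view.jpg is a local image:

```python
import mxnet as mx
from gluoncv import model_zoo
from gluoncv.data.transforms.presets.segmentation import test_transform
from gluoncv.utils.viz import get_color_pallete

net = model_zoo.get_model('deeplab_v3b_plus_wideresnet_citys', pretrained=True)

img = mx.image.imread('street_view.jpg')  # hypothetical street scene
x = test_transform(img, ctx=mx.cpu())     # normalize into a 1x3xHxW batch
output = net.predict(x)
mask = mx.nd.squeeze(mx.nd.argmax(output, 1)).asnumpy()

# Render the class mask with the Cityscapes palette and save it.
get_color_pallete(mask, 'citys').save('street_view_mask.png')
```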

Other Bug Fixes and Improvements

In addition to new features, there have been a number of bug fixes and other improvements included in the release. Some of the key highlights include:

  • RCNN added automatic mixed precision (AMP) and Horovod integration, yielding close to a 4x improvement in training throughput on 8 V100 GPUs (see the sketch after this list).
  • RCNN added support for multiple images per device.
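For readers wiring AMP into their own Gluon training loops, here is a minimal sketch of the MXNet AMP API on a toy network (this is generic MXNet usage, not the RCNN training script itself; a GPU is required):

```python
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.contrib import amp

amp.init()  # must be called before the network and trainer are created

# A toy stand-in for the RCNN network, just to show the training-loop wiring.
net = gluon.nn.Dense(10)
net.initialize(ctx=mx.gpu(0))
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
amp.init_trainer(trainer)  # attach dynamic loss scaling to the trainer

x = mx.nd.random.uniform(shape=(8, 32), ctx=mx.gpu(0))
y = mx.nd.zeros((8,), ctx=mx.gpu(0))
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

with autograd.record():
    loss = loss_fn(net(x), y)
    with amp.scale_loss(loss, trainer) as scaled_loss:
        autograd.backward(scaled_loss)
trainer.step(x.shape[0])
```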

Acknowledgements

We sincerely thank the following contributors:
@xinyu-intel @hetong007 @zhreshold @bryanyzhu @Jerryzcn @zhanghang1989 @Laurawly @mli @eric-haibin-lin @astonzhang @lgov @zx-code123 @Kh4L @wuxun-zhang @mightydeveloper @cygerts @feynmanliang @szha @zhouhang95 @yd8534976 @wkcn @whitesockcat @vfdev-5 @mrbulb @miraclewkf @hlnull @fourtunechen @douglas125 @algoboy101 @Wondersui @TakeshiKishita @SayHiRay @Jeff-sjtu @HaydenFaulkner @juliusshufan @ifeherva

Links

Please Like/Star/Fork/Comment/Contribute if you like GluonCV!

References

[1] Limin Wang, Yuanjun Xiong, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, European Conference on Computer Vision (ECCV), 2016.
[2] Intel Deep Learning Boost. https://www.intel.ai/intel-deep-learning-boost
[3] Andrew Howard, Mark Sandler, et al. Searching for MobileNetV3, International Conference on Computer Vision (ICCV), 2019.
[4] Hao-Shu Fang, Shuqin Xie, et al. RMPE: Regional Multi-person Pose Estimation, International Conference on Computer Vision (ICCV), 2017.
[5] Yi Zhu, Karan Sapra, et al. Improving Semantic Segmentation via Video Propagation and Label Relaxation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
