GluonCV 0.7: ResNeSt, Next Generation Backbone

Jerry Zhang
Published in Apache MXNet · May 14, 2020

Authors: Jerry Zhang, Thomas Brady

Since the introduction of AlexNet, the 2012 ImageNet challenge champion, neural networks trained for image classification have been used as backbones for other tasks such as object detection, semantic and instance segmentation, and pose estimation. We typically call such a network a backbone, as it is shared across a variety of tasks. ResNet, introduced in 2015, has through its many variants been the reigning backbone for a while now. In fact, the majority of research on tasks downstream of image classification still uses ResNet, despite the great strides made by the computer vision research community. ResNet’s continued efficacy is often attributed to its modularity and ease of use for transfer learning. With modularity and transfer learning in mind, the GluonCV team is pleased to introduce a new backbone network, ResNeSt, in this release (GluonCV 0.7).

ResNet vs SE-Net vs ResNeSt

ResNeSt retains the inherent modularity and transfer-learning capability of ResNet while boosting accuracy on a variety of vision tasks, including image classification, object detection, and semantic segmentation. Besides ResNet, our work also takes inspiration from the channel attention of SE-Net, a network introduced in 2017. When comparing ResNeSt to EfficientNet, the previous state-of-the-art backbone network, which was partially designed using neural architecture search, we found that ResNeSt improves both speed and accuracy on a GPU. For example, our ResNeSt-269 achieves slightly higher accuracy than EfficientNet-B7 while lowering latency by around 30%. In addition, we found it easy to adapt ResNeSt to downstream tasks like object detection and semantic segmentation. By simply swapping ResNet for ResNeSt, without tuning hyper-parameters, we improve Faster R-CNN COCO mAP by approximately 4% and DeepLabV3 ADE20K mIoU by around 3%, taking the previous best performance of a ResNet-based model as our baseline.
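The key architectural idea behind ResNeSt is the split-attention block: feature-map groups are split along the channel dimension, and a softmax attention over the splits decides how to fuse them back together. The following is a simplified conceptual sketch in NumPy, not GluonCV's actual implementation (the real block uses grouped convolutions plus two dense layers with batch norm and ReLU; the single weight matrix `w` here stands in for that attention branch):

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along a given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def split_attention(splits, w):
    """Conceptual split-attention fusion.

    splits: (radix, batch, C, H, W) feature-map splits.
    w:      (radix * C, C) stand-in weights for the attention branch
            (hypothetical simplification of the real two-layer branch).
    Returns the attention-weighted sum of the splits, shape (batch, C, H, W).
    """
    radix, batch, channels = splits.shape[0], splits.shape[1], splits.shape[2]
    # Global average pooling over the element-wise sum of the splits.
    gap = splits.sum(axis=0).mean(axis=(2, 3))            # (batch, C)
    # Per-split, per-channel attention logits.
    logits = gap @ w.T                                    # (batch, radix * C)
    # Softmax across the radix (split) dimension.
    attn = softmax(logits.reshape(batch, radix, channels), axis=1)
    attn = attn.transpose(1, 0, 2)[..., None, None]       # (radix, batch, C, 1, 1)
    return (attn * splits).sum(axis=0)
```

With radix 1 the softmax degenerates to all-ones weights and the block passes its input through unchanged, which is why the design composes cleanly with plain ResNet-style stages.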

Image Classification

Backbone networks are usually pre-trained on the ImageNet-1K dataset, with their weights then used for various downstream tasks. Accurate image classification is therefore of great importance to high-level computer vision. The GluonCV 0.7 release includes four new ResNeSt backbones of differing complexity, accompanied by the training code we used, so you can reproduce our results. In previous releases, our best result came from the SENet-154 model, which achieved a top-1 accuracy of 81.26% on ImageNet. All of our new models except ResNeSt-50 achieve higher accuracy than SENet-154, our most accurate model from the previous release. The following are the detailed results:
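Top-1 accuracy, the metric quoted above, is simply the fraction of validation images whose highest-scoring class matches the ground-truth label. A minimal plain-Python sketch of the computation:

```python
def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class index
    matches the ground-truth label.

    logits: list of per-sample score lists (one score per class).
    labels: list of ground-truth class indices.
    """
    correct = sum(
        max(range(len(scores)), key=scores.__getitem__) == label
        for scores, label in zip(logits, labels)
    )
    return correct / len(labels)
```

For example, `top1_accuracy([[0.1, 0.9], [0.8, 0.2]], [1, 0])` returns `1.0`, since both argmax predictions match their labels.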

Average Latency vs. Top-1 Accuracy on ImageNet

In addition, we benchmarked our ResNeSt models against EfficientNet using a single V100 GPU with a batch size of 16. As shown in the graph below, ResNeSt outperforms EfficientNet, with higher accuracy and lower latency.
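If you want to reproduce this kind of comparison on your own hardware, the timing loop is the same regardless of framework. A generic sketch (the warm-up count and run count are arbitrary choices here, not the exact settings we used):

```python
import time

def benchmark_latency(model, batch, n_warmup=5, n_runs=20):
    """Average per-batch latency (seconds) of a callable `model`.

    Warm-up iterations are run first and excluded, so one-time costs
    (memory allocation, JIT compilation, cache warming) do not skew
    the measured average.
    """
    for _ in range(n_warmup):
        model(batch)
    start = time.perf_counter()
    for _ in range(n_runs):
        model(batch)
    return (time.perf_counter() - start) / n_runs
```

Note that when benchmarking on a GPU you must also synchronize the device before stopping the timer (e.g. by blocking on the output), since GPU execution is asynchronous.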

Object Detection

ResNeSt achieves great results on image classification, but how does it perform on other downstream tasks? To demonstrate that ResNeSt can improve them, we swapped the original ResNet for ResNeSt in Faster R-CNN, which improved mean Average Precision (mAP) by 3%, as reported in our paper. In GluonCV 0.7, we also include a new bag of tricks for our Faster R-CNN models, such as synchronized batch normalization, random scale augmentation, and a deeper box head (4 convolutions + 1 dense layer). With these improvements, we raise the mAP to 42.7, higher than our previous result with ResNet-101. This is slightly higher than what we report in the paper, as we train for 26 epochs (a 2x learning-rate schedule) as opposed to the 13 epochs in the paper.

Object Detection with Faster R-CNN

Semantic Segmentation

We also provide two new semantic segmentation models in this release. In our research, we swapped DeepLabV3’s ResNet backbone for ResNeSt and obtained a 2.8% gain in mean intersection over union (mIoU) and a 1% gain in pixel accuracy, reaching a state-of-the-art result on the ADE20K dataset. By simply dropping in ResNeSt as the backbone, our models outperformed numerous models designed specifically for semantic segmentation, such as ACNet and HRNet, demonstrating ResNeSt’s ability to generalize to different tasks.
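Both metrics quoted above derive from a per-class confusion matrix over all pixels. A minimal NumPy sketch of how mIoU and pixel accuracy are computed (a generic illustration, not GluonCV's internal metric code):

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """num_classes x num_classes matrix; rows = ground truth, cols = prediction."""
    idx = target.astype(int) * num_classes + pred.astype(int)
    return np.bincount(idx.ravel(), minlength=num_classes ** 2).reshape(
        num_classes, num_classes)

def miou_and_pixel_acc(pred, target, num_classes):
    """Mean intersection-over-union and overall pixel accuracy
    for integer label maps `pred` and `target` of equal shape."""
    cm = confusion_matrix(pred, target, num_classes)
    tp = np.diag(cm).astype(float)
    # union = predicted pixels + ground-truth pixels - intersection, per class
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1)
    pixel_acc = tp.sum() / cm.sum()
    return iou.mean(), pixel_acc
```

Pixel accuracy counts every correctly labeled pixel equally, so it is dominated by large regions; mIoU averages over classes, which is why it is the headline number for ADE20K.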

Semantic Segmentation with DeepLabV3

Summary

GluonCV 0.7 brings you the latest image-classification backbone, one that significantly improves downstream tasks. The new models introduced in this release improve upon our existing model zoo, providing you with more potent computer vision models. With GluonCV 0.7, you can now use our state-of-the-art ResNeSt in your research or production. For more detail, you can also check out our paper.

Acknowledgement

We sincerely thank the following contributors:
@zhreshold, @adursun, @KuangHaofei, @bryanyzhu, @FrankYoungchen, @ElectronicElephant, @lgov, @astonzhang, @ruslo, @mjamroz, @LauLauThom, @karan6181, @turiphro, @chinakook, @zhanghang1989, @Jerryzcn

Links

Please Like/Star/Fork/Comment/Contribute if you like GluonCV!
GluonCV Website
GluonCV Github
