GluonCV — Deep Learning Toolkit for Computer Vision

Published in Apache MXNet · May 16, 2018

Author: Mu Li, Principal Scientist at Amazon
Translated from: https://zh.mxnet.io/blog/gluon-cv

Origin

Someone once asked me what the hardest part of developing MXNet was. Without hesitation, I would say it is replicating the experimental results from papers. Here are three examples:

  • Lin Min (who proposed Network in Network) discovered in 2016 that a model trained by MXNet on ImageNet was 1% less accurate than the same model trained with Torch. To debug the issue, he even developed a plugin to run Torch code directly inside MXNet so he could compare intermediate results. He finally traced the root cause to pre-processing: the default JPEG export quality was set to 85%. Raising it to 95% recovered the lost 1% of accuracy (a small illustration of this pitfall follows the list).
  • After the Inception V3 paper was published, Bing Xu (one of the authors of the GAN paper) took weeks to re-implement the architecture, because Google did not publish their code and some details in the paper were unclear. Fortunately, you can always reach out to the original authors, but it can take a lot of back and forth before you reach parity with the published results.
  • One of my doctoral advisors at CMU (he reported to Jeff Dean at Google) was involved in migrating code to a new version of TensorFlow’s API and noticed a drop in a model’s accuracy. It took months to find the root cause: the data augmentation techniques in the two versions were slightly different.
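
To make the JPEG pitfall concrete, here is a minimal sketch (the file name is hypothetical) showing that re-encoding the same image at quality 85 versus 95 produces measurably different pixel data, which is enough to shift a trained model’s accuracy:

# Minimal sketch of the JPEG-quality pitfall; 'photo.png' is a hypothetical input.
from PIL import Image
import numpy as np

img = Image.open('photo.png').convert('RGB')
img.save('q85.jpg', quality=85)  # the lower default quality from the anecdote
img.save('q95.jpg', quality=95)  # the setting that recovered the lost accuracy

a = np.asarray(Image.open('q85.jpg'), dtype=np.int16)
b = np.asarray(Image.open('q95.jpg'), dtype=np.int16)
print('mean absolute pixel difference:', np.abs(a - b).mean())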

The heroes of the three examples above are top-level researchers in deep learning, yet even they can sink a lot of precious time into subtle experimental details. A model usually has tens to hundreds of layers and can take several hours to train. On top of this, model initialization and the order in which the data is read are usually randomized. All of this makes debugging and reproducing experiments difficult.
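
One small step toward reproducibility is pinning the random seeds that drive initialization and data shuffling. A minimal sketch with MXNet (the seed value is arbitrary):

import random
import numpy as np
import mxnet as mx

# Fix the seeds that affect weight initialization, dropout, and data shuffling.
random.seed(42)
np.random.seed(42)
mx.random.seed(42)

This removes one source of run-to-run variation, though, as the examples above show, plenty of subtler details remain.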

Fortunately, in recent years, thanks to the open source community, you can find publicly available implementations of most major papers on GitHub. But this does not solve every problem:

  1. The quality of 3rd party implementations of papers varies a lot. Some implementations might not come close to the papers’ claimed results.
  2. There are a lot of details that might vary slightly from one implementation to another, such as the input data format, the deep learning framework, and the coding style.
  3. Personally-maintained projects are often focused on a single task, e.g. applying only to one particular data set, whereas users care about ease of deployment or training on a different data set. It takes real time to repurpose such a model for your own use.
  4. Finally, the maintainer may stop supporting the project. For example, I wrote several projects during my Ph.D., but as my work and life focus shifted I no longer had the energy to answer every user’s questions. If users hit a bug and cannot reach the maintainer quickly, they can easily be stuck for a long time.

After seeing these pain points, several of us working in computer vision, Zhi Zhang (@zhreshold), Hang Zhang (@zhanghang1989), Tong He (@hetong007), and Eric Xie (@piiswrong), scratched our heads and said: let’s create a toolkit to try to solve these problems.

Who’s the toolkit for?

We want a toolkit that serves not only experienced users (those with a few years of computer vision experience) but also newcomers to the field (those with a few months of experience). This cohort includes:

  1. Engineers who want to quickly apply visual technologies to products
  2. Researchers wishing to propose new algorithms and need a baseline to compare their work

Of course, if you are just starting to learn, please refer to “MXNet: The Straight Dope”. And if you are interested in applications outside of computer vision, look out for the other toolkits we will be releasing soon, e.g. our toolkit for Natural Language Processing.

So what’s included?

Based on user feedback, this toolkit provides the following features:

  1. Reimplementation of important papers in recent years
  2. Detailed documentation and thoroughly explained examples
  3. Pre-trained models that can be used directly
  4. Performance metrics, to help choose between different models
  5. Consistent interface to reduce the barrier to entry when switching from one model to another (see the sketch after this list)
  6. Regular re-training and continuous integration to ensure code correctness
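
As a hedged illustration of points 3 and 5, the same model_zoo.get_model call retrieves pre-trained models across tasks; the exact model names below are assumptions based on the GluonCV model zoo:

from gluoncv import model_zoo

# One entry point across tasks; the names are illustrative model-zoo identifiers.
cls_net = model_zoo.get_model('resnet50_v1', pretrained=True)              # image classification
det_net = model_zoo.get_model('ssd_512_resnet50_v1_voc', pretrained=True)  # object detection
seg_net = model_zoo.get_model('fcn_resnet50_voc', pretrained=True)         # semantic segmentation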

What does it look like?

The following code downloads a pre-trained SSD model, performs object detection on the image ‘street.jpg’, and displays the results. (A detailed explanation of the code can be found here.)

from gluoncv import model_zoo, data, utils
from matplotlib import pyplot as plt

# Download (on first use) and load an SSD model pre-trained on Pascal VOC.
net = model_zoo.get_model('ssd_512_resnet50_v1_voc', pretrained=True)

# load_test resizes the image for the network and returns both the batched
# input tensor and the original image for plotting.
x, img = data.transforms.presets.ssd.load_test('street.jpg', short=512)

# The network returns class indices, confidence scores, and bounding boxes.
class_IDs, scores, bounding_boxs = net(x)

# Draw the detections over the original image.
utils.viz.plot_bbox(img, bounding_boxs[0], scores[0], class_IDs[0],
                    class_names=net.classes)
plt.show()

How can I get started?

GluonCV is hosted at gluon-cv.mxnet.io. So far we have released the first preview version, which includes three models, all of which achieve the same results as the original papers:

  • Image Recognition: training ResNet on ImageNet (see the inference sketch after this list)
  • Object Detection: training SSD on Pascal VOC
  • Semantic Segmentation: training FCN on Pascal VOC
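
As a hedged sketch of putting one of these pre-trained models to work, the snippet below classifies an image with a model-zoo ResNet; the preset transform name is an assumption based on GluonCV’s ImageNet presets:

import mxnet as mx
from gluoncv import model_zoo, data

# Load a ResNet pre-trained on ImageNet from the model zoo.
net = model_zoo.get_model('resnet50_v1', pretrained=True)

# transform_eval (assumed preset) resizes, crops, normalizes, and adds a
# batch dimension so the image matches what the network expects.
x = data.transforms.presets.imagenet.transform_eval(mx.image.imread('street.jpg'))

pred = net(x)  # shape (1, 1000): one score per ImageNet class
idx = int(pred.argmax(axis=1).asscalar())  # index of the highest-scoring class
print('predicted class index:', idx)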

Naturally, we will continue to add new models in future versions. If you are interested in which models are coming, please join the discussion on the GitHub repo or on the forum.
