GluonCV 0.6: Embrace Video Understanding

Author: Yi Zhu, Applied Scientist at Amazon
Published in Apache MXNet · Feb 12, 2020

Video understanding has been a trending research topic because analyzing dynamic videos can help us develop better computer vision algorithms that will lead to stronger AI. However, many obstacles hinder the progress of video research, such as enormous datasets, long experiment cycles, a lack of reproducible codebases, a lack of tutorials and difficulties with edge-device deployment.

This new release addresses the above limitations, and we are happy to announce that GluonCV now fully supports state-of-the-art video classification algorithms and major datasets. Together with the release, we also provide a new fast video reader, distributed training support, extensive tutorials and reproducible benchmarks. Using GluonCV, you can easily learn, develop and deploy video understanding models without worrying too much about engineering details.

More Pre-Trained Models and Datasets

Recently, FAIR open-sourced their PySlowFast codebase for video understanding. Compared to PySlowFast, GluonCV provides support for more models and datasets. For example, this release covers state-of-the-art algorithms, such as TSN, C3D, I3D, P3D, R2+1D, Non-local and SlowFast, and supports 4 widely adopted datasets: UCF101, HMDB51, Kinetics400 and Something-Something-V2. Support for more models (e.g., TVN, TSM) and datasets (e.g., AVA, HACS, MiT) is on the way.

Here is a table listing our pre-trained models on the Kinetics400 dataset.

As you can see, this release has good coverage of combinations of various datasets and model families. You can choose any model to fit your use case with just one line of code: net = get_model(model_name). The get_model function is defined in GluonCV. For more information about model accuracy and speed, interested readers can refer to the GluonCV model zoo.

Fast Video Reader: Decord

With the increasing amount of videos in each dataset, pre-processing and loading the data becomes a complicated and tricky process. Take the Kinetics400 dataset as an example: it is a widely adopted benchmark in the video understanding domain with about 300K videos. If all the videos in Kinetics400 were decoded into frames, the total number of frames would be 100 times larger than the total number of images in the ImageNet dataset. The Kinetics400 dataset needs 450GB of disk space to store all its videos and requires 6.8TB when decoded into frames. Such a huge amount of data makes I/O the tightest bottleneck during training, wasting GPU resources and lengthening experiment cycles.

Here, we introduce a new video reader, Decord. You can use it to load videos directly from disk. The usage of Decord is quite simple: reading frames from a video is similar to NumPy indexing, so the learning cost is nearly zero.

Our new video reader is at least two times faster than OpenCV VideoCapture and PyAV VideoContainer. For random seeks in particular, Decord is eight times faster. Using Decord will significantly speed up your experiments.

Easy-to-Use Customized APIs

GluonCV provides two customized APIs, a customized dataloader and customized models, to help users achieve the most with the least code.

Firstly, we introduce VideoClsCustom, a customized dataloader class suitable for most video classification tasks. No matter where your data is stored or in what format, you only need to prepare a text file as below to start training.
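For example, the text file could look like this (the paths, lengths and labels below are purely illustrative):

```
/home/user/videos/abseiling_0001.mp4 250 0
/home/user/videos/basketball_0042.mp4 120 1
/home/user/frames/yoga_0003 180 2
```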

There should be three items per line: the path to the video, the length of the video (in frames) and the category of the video. If you have already decoded your videos into frames, don't worry: just replace the video path with the frame folder path. Let's show an example of how to read a 64-frame video clip, skipping every other frame, at size 224x224 from a training sample.

There are many other parameters you can adjust, such as temporal jittering, which video reader to use, etc. Our customized dataloader class VideoClsCustom is able to satisfy the training needs of most datasets and models by adjusting these parameters. Actually, our UCF101, HMDB51, Something-Something and Kinetics400 dataloaders all inherit from this class without writing new code. One interface for all.

We also provide several customized models for users to quickly get started, such as slowfast_4x16_resnet50_custom. For example, if you want to train a model for video anomaly detection, which is a binary classification problem, you can simply build the model as follows.

With these two powerful customized APIs, users can start training and testing on their own datasets in a few lines of code. To get you started, we provide a tutorial covering our fast video reader and the customized APIs.

Extensive tutorials

Video understanding has progressed rapidly over the last five years. However, stable open-source toolkits, such as MMAction, PyVideoResearch, VMZ and PySlowFast, have only recently been released. Most of these toolkits assume users already have knowledge of video understanding, and only provide training commands and a model zoo. Newcomers to the field have to read a dozen publications before they can understand and use such a toolkit.

GluonCV provides extensive tutorials in Jupyter notebooks because we believe learning by doing is the best approach. Users can learn directly on their local machine: how to use pre-trained video classification models, how to train a state-of-the-art model, how to extract features, how to fine-tune on their own dataset, etc. GluonCV supports Windows, Linux and Mac.

One thing to emphasize is our good support for distributed training. We have an easy-to-follow, step-by-step guide showing users how to set up a cluster, how to prepare the data and how to kick-start the training. The scalability of our distributed training is promising: without bells and whistles, we can speed up training by 1.6x using two machines, 3.2x using four machines and 6x using eight machines. Distributed training can significantly shorten your experiment cycle, which is crucial for both academic research and industrial deployment.

Quantized Models for Fast Deployment

We continue to collaborate closely with Intel on adding more INT8 models to GluonCV. Powered by Intel Deep Learning Boost (VNNI), INT8-quantized models in GluonCV can achieve a significant speedup (about 5x) over their 32-bit floating-point counterparts. The following performance results were benchmarked on an AWS EC2 c5.12xlarge instance with 24 physical cores. Note that you will need the latest nightly build of MXNet to use these new features.

Usage of INT8-quantized models is identical to that of standard GluonCV models: simply add the suffix _int8 to a model's name and your need for speed is satisfied!

We also provide a calibration tool for users to quantize their models to INT8 on their own datasets. Currently, the calibration tool only supports hybridized Gluon models; users can quantize their own hybridized Gluon models with the quantize_net API.

Summary

GluonCV now has broad coverage of models and datasets in the video understanding domain. Our goal is to help engineers, researchers and students quickly prototype products and research ideas. We are working on faster model training, neural architecture search support and more video applications, and we will actively maintain and update GluonCV.

Please Like/Star/Fork/Comment/Contribute if you like GluonCV!
