Which one is the best algorithm for video action recognition?

Yi Zhu · Published in Apache MXNet · Jan 6, 2021 · 3 min read

Author: Yi Zhu, Applied Scientist at Amazon

A chronological overview of recent representative work in video action recognition

Over time, computer vision researchers have shifted their focus from images to videos, from 2D to 3D, and from supervised to unsupervised learning. Video understanding has become a hot topic, and human action recognition, a fundamental task within it, attracts a great deal of attention. As the timeline above shows, more and more algorithms for video action recognition are proposed every year. So how can newcomers to the field find the right model for their use case?

To address this question, we present the GluonCV 0.9.0 release, which provides not only a large model zoo for video action recognition (46 pretrained models, in both PyTorch and Apache MXNet), but also step-by-step tutorials (feature extraction, model finetuning, FLOPS computation), a 30-page survey paper covering 200+ recent publications, and video lectures on YouTube. If you want to get started with computer vision on video, don't hesitate to try it out; we believe it will help you pick up new skills and figure out the right model for your scenario.
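For a quick taste of the model zoo, the snippet below loads a pretrained model and runs it on a dummy clip. This is a minimal sketch using the MXNet flavor of GluonCV; the model name and input shape follow the model zoo's conventions, but check the tutorials for the exact preprocessing your chosen model expects.

```python
import mxnet as mx
from gluoncv.model_zoo import get_model

# Load a pretrained I3D model from the GluonCV model zoo.
# See the model zoo page for the full list of 46 pretrained models.
net = get_model('i3d_resnet50_v1_kinetics400', pretrained=True)

# I3D expects a clip of stacked frames: (batch, channels, frames, height, width).
# A real pipeline would decode and normalize 32 frames from a video here.
clip = mx.nd.random.uniform(shape=(1, 3, 32, 224, 224))

pred = net(clip)
top5 = pred.topk(k=5)  # indices of the 5 highest-scoring Kinetics400 classes
print('Top-5 class indices:', top5)
```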

Our survey paper and CVPR 2020 tutorial video lectures.

Video researchers have long found it difficult to compare results because of variations between datasets and differences in evaluation protocols. In response, we reproduced a large number of popular algorithms using the same dataset and the same data augmentation steps (see the README for instructions to reproduce the results). Throughout this project we made several interesting observations. First, although 3D CNNs achieve higher accuracy than 2D CNNs, they also have higher latency, which makes them less suitable for edge deployment or real-time inference. Second, pretraining a model on a large-scale dataset is usually more effective than improving the model itself. For example, the CSN model (ICCV 2019) pretrained on a large-scale dataset easily outperforms more recent methods by a large margin. Hence, for real-world applications, it may be more cost effective to collect and clean your data than to chase the latest SOTA model.

Benchmark results on the Kinetics400 dataset. Reported times do not include I/O cost.
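The latency gap is easy to verify yourself. Below is a minimal timing sketch, assuming the same MXNet model zoo names as above ('resnet50_v1b_kinetics400' as a 2D CNN and 'i3d_resnet50_v1_kinetics400' as its 3D counterpart); like the table, it measures pure forward-pass time without I/O.

```python
import time
import mxnet as mx
from gluoncv.model_zoo import get_model

def measure_latency(net, shape, n_runs=20):
    """Average forward-pass time in seconds (no I/O)."""
    x = mx.nd.random.uniform(shape=shape)
    net(x).wait_to_read()          # warm-up; MXNet ops are asynchronous
    start = time.time()
    for _ in range(n_runs):
        net(x).wait_to_read()      # block until the output is actually computed
    return (time.time() - start) / n_runs

# 2D CNN: one RGB frame per forward pass.
net_2d = get_model('resnet50_v1b_kinetics400', pretrained=True)
# 3D CNN: a clip of 32 stacked frames per forward pass.
net_3d = get_model('i3d_resnet50_v1_kinetics400', pretrained=True)

print('2D latency: %.3fs' % measure_latency(net_2d, (1, 3, 224, 224)))
print('3D latency: %.3fs' % measure_latency(net_3d, (1, 3, 32, 224, 224)))
```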

Furthermore, training a SOTA video action recognition model consumes a lot of compute: even a high-end machine with 8 V100 GPUs needs more than a week to reach decent accuracy. To help you iterate faster, we support PyTorch's DistributedDataParallel (DDP) and multi-grid training. As the bar plot below shows, under the same setting of 8 GPUs, a baseline using DataParallel (DP) needs 250 hours to finish 100-epoch training of an I3D model, while GluonCV finishes in 41 hours, 6 times faster with no loss in accuracy. With 4 machines (32 V100s in total), training completes in 10 hours, achieving near-linear scaling. For comparison, the mmaction2 toolbox required 148 hours to train the same model.

Training time comparison (a standard I3D model with ResNet50 backbone).
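If you want to wire up DDP yourself rather than use the provided training scripts, the core pattern is only a few lines. The sketch below is a generic PyTorch DDP skeleton, not GluonCV's actual training script: the tiny linear model and random dataset are placeholders for an action recognition network and a video dataset. Launch it with torchrun, one process per GPU (e.g. `torchrun --nproc_per_node=8 train_ddp.py`).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun sets LOCAL_RANK for each process.
    local_rank = int(os.environ['LOCAL_RANK'])
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(local_rank)

    # Placeholder model: swap in an action recognition network (e.g. I3D).
    model = torch.nn.Linear(2048, 400).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # DistributedSampler shards the dataset so each GPU sees a unique slice.
    dataset = torch.utils.data.TensorDataset(
        torch.randn(1024, 2048), torch.randint(0, 400, (1024,)))
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)   # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            criterion(model(x), y).backward()  # DDP averages gradients across GPUs
            optimizer.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    main()
```

Unlike DP, which replicates the model from a single process every iteration, DDP runs one process per GPU and only synchronizes gradients, which is where most of the speedup in the bar plot comes from.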

Summary

In summary, GluonCV 0.9.0 provides a complete package for learning video action recognition: a survey paper, video lectures, Jupyter demos, a model zoo, and a deployment tutorial. In future releases we will add more PyTorch models, covering object tracking, multi-modality video modeling, self-supervised representation learning, and more. Welcome aboard! Feel free to raise issues and contribute back!

Acknowledgement

Special thanks to @Arthurlxy @ECHO960 @zhreshold @yinweisu for their support in this release. Thanks to @coocoo90 for contributing the CSN and R2+1D models. And thanks to other contributors for the bug fixes and improvements. Please Like/Star/Fork/Comment/Contribute if you like GluonCV!
