Which one is the best algorithm for video action recognition?
Author: Yi Zhu, Applied Scientist at Amazon
Over time, computer vision researchers have shifted their focus from images to videos, from 2D to 3D, and from supervised to unsupervised learning. One of these trends, video understanding, has become a hot topic, and video human action recognition, a basic task within video understanding, attracts a lot of attention. As shown in the timeline above, more and more algorithms for video action recognition are proposed each year. So how do newcomers to this field know which model is right for their use case?
To address this question, the GluonCV 0.9.0 release provides not only a large model zoo for video action recognition (46 pretrained models, in both PyTorch and Apache MXNet), but also step-by-step tutorials (feature extraction, model finetuning, FLOPS computation), a 30-page survey paper covering 200+ recent papers, and video lectures on YouTube. If you want to get started with computer vision for video, don’t hesitate to try it out; we believe it will help you build new skills and find the right model for your scenario.
Video researchers have long found it difficult to compare results due to variance between datasets and differing evaluation methods. In response, we reproduced a large number of popular algorithms using the same dataset and the same data augmentation steps (see the readme for steps to reproduce the results). Throughout this project we made several interesting observations. First, although 3D CNNs achieve higher accuracy than 2D CNNs, they also have higher latency, which makes them less suitable for edge devices or real-time inference. Second, pre-training a model on a large-scale dataset is usually more effective than improving the model itself. For example, the CSN model (ICCV 2019) pretrained on a large-scale dataset easily outperforms more recent methods by a large margin. Hence, for real-world applications, it may be more cost-effective to collect and clean your data than to pursue the latest SOTA model.
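To see why 3D CNNs cost more, a quick back-of-the-envelope FLOP count helps. The sketch below uses a hypothetical layer shape (64→64 channels, 3×3 kernel, 56×56 feature map, 16-frame clip); only the standard multiply-accumulate formulas are assumed, not any GluonCV internals.

```python
def conv2d_flops(c_in, c_out, k, h, w):
    """Multiply-accumulates for one 2D conv layer (stride 1, 'same' padding)."""
    return c_in * c_out * k * k * h * w

def conv3d_flops(c_in, c_out, k, t, h, w):
    """Multiply-accumulates for one 3D conv layer: the kernel and the output
    both gain a temporal dimension, so the cost grows by a factor of k * t."""
    return c_in * c_out * k * k * k * t * h * w

# Hypothetical layer: 64 -> 64 channels, 3x3 kernel, 56x56 feature map,
# and a 16-frame clip for the 3D case.
flops_2d = conv2d_flops(64, 64, 3, 56, 56)        # cost per single frame
flops_3d = conv3d_flops(64, 64, 3, 16, 56, 56)    # cost per 16-frame clip

# Even per frame processed, the 3x3x3 kernel is 3x as expensive as a 3x3 one.
print(flops_3d / (flops_2d * 16))  # -> 3.0
```

This is why 3D models pay a latency premium even when the per-frame workload is held equal, before counting the extra memory traffic of video clips.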
Furthermore, training a SOTA video action recognition model consumes a lot of compute: even a high-end machine with 8 V100 GPUs takes more than a week to reach decent accuracy. To help you iterate faster, we provide support for PyTorch's DistributedDataParallel (DDP) and multi-grid training. As shown in the bar plot below, under the same 8-GPU setting, the DataParallel (DP) baseline needs 250 hours to finish 100-epoch training of an I3D model, while GluonCV finishes within 41 hours, about 6 times faster without performance degradation. With 4 machines and a total of 32 V100s, training completes in 10 hours, achieving near-linear scalability. For comparison, the mmaction2 toolbox required 148 hours to train the same model.
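The quoted numbers can be sanity-checked with a quick calculation. The training times come from the paragraph above; the speedup and scaling-efficiency formulas are the standard ones, nothing GluonCV-specific.

```python
# Training times reported above for 100 epochs of I3D.
dp_8gpu_hours = 250    # DataParallel baseline, 8 GPUs
ddp_8gpu_hours = 41    # DDP + multi-grid training, 8 GPUs
ddp_32gpu_hours = 10   # 4 machines x 8 V100s = 32 GPUs

# Speedup over the DP baseline on identical hardware.
speedup_vs_dp = dp_8gpu_hours / ddp_8gpu_hours
print(round(speedup_vs_dp, 1))  # -> 6.1, i.e. roughly 6x faster

# Scaling efficiency going from 8 to 32 GPUs: the ideal speedup is 4x.
scaling_eff = (ddp_8gpu_hours / ddp_32gpu_hours) / (32 / 8)
print(round(scaling_eff, 2))    # -> 1.02, i.e. near-linear scaling
```

An efficiency close to 1.0 is what "near-linear scalability" means: quadrupling the GPU count divides the wall-clock time by almost exactly four.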
Summary
In summary, GluonCV 0.9.0 provides a complete package for learning video action recognition: a survey paper, video lectures, Jupyter demos, the model zoo, and a deployment tutorial. In future releases, we will add more PyTorch models, including object tracking, multi-modality video modeling, self-supervised representation learning, and more. Welcome aboard with GluonCV: feel free to raise issues, and contribute back!
Acknowledgement
Special thanks to @Arthurlxy @ECHO960 @zhreshold @yinweisu for their support in this release. Thanks to @coocoo90 for contributing the CSN and R2+1D models, and to all the other contributors for bug fixes and improvements. Please Like/Star/Fork/Comment/Contribute if you like GluonCV!