Recognizing human actions in videos

How we placed third in the 2017 ActivityNet challenge.

We recently placed third in the trimmed action recognition category of the ActivityNet challenge, held as a workshop at CVPR 2017. The dataset for this category, Kinetics, was released by Google DeepMind. The task is to recognize activities in trimmed video sequences where each video contains a single activity (label) with a duration of not more than 10 seconds. There are ~300,000 videos in total across 400 action classes that are curated from YouTube.

The paper that DeepMind published as part of the Kinetics dataset release reports a baseline accuracy of 61.0% (top-1) and 81.3% (top-5), which the top entries in the challenge surpassed by a significant margin.

ActivityNet 2017 results from the CVPR workshop presentation. The above leaderboard is for Task 2: Trimmed Action Recognition, a.k.a. the Kinetics challenge.

As evidenced by the leaderboard, we obtained the best top-5 accuracy among the top 3 submissions, which is a 15.7% improvement over the baseline in absolute terms. Overall, we placed third on the basis of the average of the top-1 and top-5 scores.
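To make the ranking criterion concrete, here is a minimal sketch of top-k accuracy and the averaged score, written in plain Python rather than PyTorch. The function names and the list-of-lists input format are assumptions made for illustration; this is not the official evaluation code.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


def averaged_score(scores, labels):
    """Average of top-1 and top-5 accuracy (hypothetical helper name)."""
    return 0.5 * (top_k_accuracy(scores, labels, 1) + top_k_accuracy(scores, labels, 5))
```

Here `scores` holds one list of per-class scores per video and `labels` holds the matching ground-truth class indices.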

In this post, we will detail the approach for our submission. In essence, our approach exploits both the audio and visual streams that are present in the videos. We trained deep neural networks that captured different statistical modalities present in the data and ensembled the obtained predictions to get the final result.

We used PyTorch for all our submissions during the challenge. It is a great deep learning library for fast prototyping and is more straightforward than static graph libraries that don’t allow you to define, change and execute nodes as you go.

We preprocessed the videos and split them into frames at 8 FPS. These frames were then bundled into a data format that we dubbed GulpIO. We developed GulpIO in parallel to the challenge to enable lightning-fast video data transfer between disk and GPU/CPU memory. We plan to open-source GulpIO soon, and more details will follow.

2D-CNN models

Starting from basic instincts, we extracted ResNet-101’s 2048-dimensional features after the last pooling layer for each frame in the video. We pooled the obtained features by averaging them over all frames and trained a multilayer perceptron (MLP) on top of the pooled representation.
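The pooling step can be sketched as follows. This is a plain-Python illustration with a hypothetical function name; in practice the same operation would be a single mean over the frame axis of a tensor.

```python
def mean_pool(frame_features):
    """Average a list of per-frame feature vectors into one video-level vector."""
    num_frames = len(frame_features)
    # zip(*...) groups the values of each feature dimension across all frames.
    return [sum(dim_values) / num_frames for dim_values in zip(*frame_features)]
```

For a video with N frames, the input is N vectors of 2048 values each and the output is a single 2048-dimensional vector, which is what the MLP consumes.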

Resfeat-1: We experimented with different non-linearities and obtained:

  • PReLU: 64% (top-1),
  • Maxout units: 65% (top-1)

Resfeat-2: In addition to the above, we clustered the obtained features into 25 groups using RSOM [3] and trained an MLP on top of them to obtain:

  • PReLU: 66.2% (top-1),
  • Maxout units: 67.8% (top-1)

This simple approach yielded results that were superior to the baseline. This finding sheds light on the importance of the non-motion specific context in the videos of this dataset, as opposed to instant temporal changes. For example, it is oftentimes sufficient to see waves and a surfboard to conclude that someone is “surfing”.

3D-CNN models

We also investigated 3D models, which process video segments by convolving over the temporal dimension as well as the spatial dimensions. This allows the network to learn temporal regularities in the videos.

Resnet3D: Inspired by [2], we inflated Resnet-50 layers in the time domain to obtain an ImageNet initialization for the 3D CNN model, and obtained an accuracy of 64.30% (top-1) and 85.58% (top-5).
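As a rough illustration of inflation, the sketch below follows the scheme described in [2]: a 2D kernel is repeated along a new time axis and rescaled by the number of repetitions. It operates on a single kernel stored as nested lists, and the function name is hypothetical; the exact variant used for our Resnet3D model is not spelled out here.

```python
def inflate_kernel(kernel_2d, time_steps):
    """Inflate one 2D kernel (rows x cols) into a 3D kernel (time x rows x cols).

    The 2D weights are copied along the new time axis and divided by
    `time_steps`, so a clip made of identical frames yields the same
    response as the original 2D filter on a single frame.
    """
    return [
        [[weight / time_steps for weight in row] for row in kernel_2d]
        for _ in range(time_steps)
    ]
```

Summing the inflated kernel over the time axis recovers the original 2D weights, which is why the pretrained ImageNet filters remain a sensible starting point.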

Optical-flow: We used OpenCV to compute dense optical flow and converted the 2-channel flow vectors (u, v) into magnitude and direction before storing them as RGB images for the sake of compression. Using the above Resnet3D architecture on these frames, we obtained an accuracy of 42.65% (top-1) and 68.09% (top-5). These results did not match those of state-of-the-art optical flow models, but we could not investigate this thread further due to time constraints during the competition. Our takeaway is that finding the right optical flow pipeline is as important as finding an optimal network architecture.
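The conversion from a flow vector to magnitude and direction can be sketched as below. This plain-Python version handles one vector at a time and uses a hypothetical function name; how the two values were mapped onto RGB channels is not described above, so that step is left out.

```python
import math


def flow_to_polar(u, v):
    """Convert one flow vector (u, v) to (magnitude, direction in degrees in [0, 360))."""
    magnitude = math.hypot(u, v)
    # atan2 returns an angle in (-pi, pi]; the modulo maps it to [0, 360).
    direction = math.degrees(math.atan2(v, u)) % 360.0
    return magnitude, direction
```

Applied per pixel, this turns a 2-channel flow field into two bounded-range maps that are easier to quantize into image channels than raw signed displacements.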


We used the DeepSpeech-2 architecture [1] to construct an audio model, averaged the RNN hidden-state outputs over time, and passed the result through a fully connected (FC) layer for classification. It solved most of the audio-specific classes with high confidence. Overall, we obtained a validation accuracy of 17.86% (top-1) and 34.39% (top-5).


During the challenge, we introduced a novel 3D CNN architecture that employs separable convolution filters for the spatial and time domains. In addition, we added random dilation in the temporal dimension as a regularizer to make the model more robust to information contained at different time scales. This model yielded a validation top-1 accuracy of 70% and was our best single model. More details can be found in an accompanying post here.

Entropy-based hard-mining

We applied hard-mining to BesNet training. We first waited for the model to reach decent results on the validation set. We then picked hard instances, defined as those with high entropy over the predicted class confidences, that is, instances that the model did not classify confidently. We trained the network for an additional epoch on these hard instances and repeated this routine a couple of times. This increased top-1 accuracy from 70% to 72% for BesNet. Although we observed over-fitting with regular hard-mining, the entropy-based approach improved our results without such side effects.
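The selection step can be sketched as follows. This is a minimal plain-Python illustration: the function names are hypothetical, and picking a fixed number of highest-entropy samples (rather than, say, thresholding the entropy) is an assumption.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_hard_instances(predictions, num_hard):
    """Return the indices of the `num_hard` samples with the highest entropy."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:num_hard]
```

A near-uniform prediction has high entropy and is selected, while a confident prediction, whether right or wrong, has low entropy and is skipped. This differs from loss-based hard-mining, which would also pick confidently misclassified samples.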


All the described models were ensembled in the end:

1. Maxing out
We max-pooled the predictions obtained from the above models and obtained a 77% (top-1) score. Besides this, we obtained 71% (top-1) with average-pooling and 72% (top-1) with majority voting.
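The three combination rules can be sketched for a single video as follows. This is a plain-Python illustration with hypothetical function names; `model_probs` is assumed to hold one class-confidence vector per model.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda c: values[c])


def max_pool(model_probs):
    """Per class, keep the highest confidence assigned by any model."""
    return [max(class_scores) for class_scores in zip(*model_probs)]


def average_pool(model_probs):
    """Per class, average the confidences over all models."""
    return [sum(class_scores) / len(model_probs) for class_scores in zip(*model_probs)]


def majority_vote(model_probs):
    """Class predicted by the most models (ties go to the class seen first)."""
    votes = Counter(argmax(probs) for probs in model_probs)
    return votes.most_common(1)[0][0]
```

Max-pooling lets a single very confident model decide the outcome, whereas averaging and voting require agreement between models, so the three rules can produce different predictions for the same video.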

2. Stacking
We employed the following procedure:

  • Train models with normal train and validation split.
  • Get class confidence values for validation set from each model.
  • Divide validation set into 2 as val-1 and val-2.
  • Train a meta-model on val-1 with first-level class confidences.
  • Check validation loss on val-2.
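The data preparation for this procedure can be sketched as below. The function names, the random shuffle, and the equal-sized halves are assumptions for illustration; the meta-model itself is not specified above and is therefore left out.

```python
import random


def build_meta_features(per_model_confidences):
    """Concatenate each model's class confidences into one vector per sample.

    per_model_confidences[m][i] is model m's confidence vector for sample i.
    """
    return [
        [conf for model_conf in sample for conf in model_conf]
        for sample in zip(*per_model_confidences)
    ]


def split_validation(num_samples, seed=0):
    """Shuffle the sample indices and split them into two halves (val-1, val-2)."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    half = num_samples // 2
    return indices[:half], indices[half:]
```

The meta-model would then be fit on the meta-features of the val-1 indices and evaluated on those of val-2, so that it is never scored on samples it was trained on.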

Note: All accuracies reported above are evaluated on the validation split of the data.

Schematic Diagram

Our final submission


At TwentyBN, we are committed to the goal of solving video understanding and building AI systems that enable a human-like visual understanding of the world. To catalyze the development of machines that can perceive the world like humans, we recently released two large-scale video datasets (256,591 labeled videos) to teach machines visual common sense. The datasets were published alongside a paper that will appear at ICCV 2017.

The Kinetics dataset, one of the largest activity-recognition datasets, was sourced and filtered from YouTube videos. Like other datasets that use videos from the web, Kinetics falls short of representing the simplest physical object interactions that will be needed for modeling visual common sense. Nonetheless, we believe that our ability to obtain competitive results on this dataset, with limited time and resources at hand, sends a strong signal of our commitment to dramatically improving state-of-the-art video understanding systems.

By Eren Golge, Raghav Goyal and Valentin Haenel


[1] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In ICML, 2016.

[2] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. arXiv preprint arXiv:1705.07750, 2017.

[3] E. Golge and P. Duygulu. Conceptmap: Mining noisy web data for concept learning. In European Conference on Computer Vision, pages 439–455. Springer, Cham, 2014.