Visual explanation for video recognition
Understanding what neural networks see when classifying videos
This post describes how temporally-sensitive saliency maps can be obtained for deep neural networks designed for video recognition. Previous work [2, 3, 4] has shown that saliency maps help visualize why a model produced a given prediction, can uncover artifacts in the data, and can point towards better model architectures.
Task: Recognizing human actions in the videos of our recently released dataset requires a fine-grained understanding of concepts such as three-dimensional geometry, material properties, object permanence, affordance and gravity [1]. The dataset, dubbed “Something-Something”, consists of ~100,000 videos across 174 categories covering concepts such as dropping, picking and pushing.
A few examples from the dataset:
Visualization Technique used: Grad-CAM
Grad-CAM, or Gradient-weighted Class Activation Mapping, proposed by Selvaraju et al. [4], allows us to obtain a localization map for any target class. It involves:
- Calculating the gradients of a class logit with respect to the activation maps of the final convolutional layer.
- Taking a weighted average of these activation maps, using the gradients as weights.
- Applying ReLU to keep only the regions that correlate positively with the chosen class.
- Projecting the result back to the input space in the form of heatmaps (coarse localization maps).
Please refer to [4] for more details.
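The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration and not the code used for this post: the function name `grad_cam_3d` is hypothetical, the activations and gradients are assumed to have already been extracted from the network, and pooling the gradients over time as well as space to get one weight per channel is an assumption (pooling over space only, per frame, would be another reasonable choice).

```python
import numpy as np

def grad_cam_3d(activations, gradients):
    """Grad-CAM heatmaps for one clip.

    activations, gradients: arrays of shape (T, C, H, W), i.e. the final
    convolutional layer's output and the gradient of the chosen class
    logit with respect to that output.
    Returns an array of shape (T, H, W) scaled to [0, 1].
    """
    # One weight per channel: average the gradients over time and space
    # (assumption: time is pooled together with the spatial axes).
    weights = gradients.mean(axis=(0, 2, 3))                     # (C,)
    # Weighted combination of the activation maps over the channel axis.
    cam = np.tensordot(activations, weights, axes=([1], [0]))    # (T, H, W)
    # ReLU keeps only the regions that push the class score up.
    cam = np.maximum(cam, 0.0)
    # Scale to [0, 1] so the maps can be rendered as heatmaps.
    peak = cam.max()
    if peak > 0:
        cam = cam / peak
    return cam

# Random stand-ins with the 16x2048x7x7 shape of the final layer.
rng = np.random.default_rng(0)
acts = rng.standard_normal((16, 2048, 7, 7))
grads = rng.standard_normal((16, 2048, 7, 7))
heatmaps = grad_cam_3d(acts, grads)
print(heatmaps.shape)  # (16, 7, 7)
```

The result is one coarse 7×7 map per time step, which still has to be upsampled to the input resolution before it can be overlaid on the frames.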
For videos, a natural choice is to treat a video as a sequence of image frames and extend the 2D-CNN filters into the time domain to obtain a 3D-CNN, an approach that has proved useful for video recognition tasks [5, 6]. We inflated ImageNet pre-trained ResNet-50 filters in the time domain, along the lines of what Carreira and Zisserman [6] did for Inception-v1, and trained the resulting model on our dataset, using the 40-class subset described in [1].
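To make the inflation step concrete, here is a small sketch of the scheme described in [6]: a 2D filter is repeated along a new time axis and divided by the temporal kernel size, so that a clip of identical frames produces the same response as the original 2D filter on a single frame. Whether our model used exactly this rescaling is an assumption, and `inflate_filter` is a hypothetical helper name.

```python
import numpy as np

def inflate_filter(w2d, time_kernel=3):
    """Inflate 2D conv weights (out, in, kh, kw) to 3D weights
    (out, in, time_kernel, kh, kw)."""
    # Repeat the 2D filter along a new temporal axis ...
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], time_kernel, axis=2)
    # ... and rescale so that the temporal slices sum back to the 2D filter.
    return w3d / time_kernel

# A toy 2D weight tensor: 4 output channels, 3 input channels, 3x3 kernel.
w2d = np.arange(4 * 3 * 3 * 3, dtype=float).reshape(4, 3, 3, 3)
w3d = inflate_filter(w2d, time_kernel=3)
print(w3d.shape)  # (4, 3, 3, 3, 3)
# Summing over the time axis recovers the original 2D weights.
print(np.allclose(w3d.sum(axis=2), w2d))  # True
```

The rescaling matters because it lets the inflated network start from the behavior of the ImageNet pre-trained model instead of from activations that are `time_kernel` times too large.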
The final convolutional layer’s activations have dimensions 16×2048×7×7 for an input of dimensions 16×3×224×224, following the convention [number of frames × number of channels × width × height]. We chose a uniform kernel size of 3 in the time domain, with padding and stride of 1. As a result, the activation maps have the same temporal length as the input, but they are not independent in time: each temporal slice also draws on neighboring frames.
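The temporal bookkeeping can be checked with the standard convolution output-length formula. The sketch below is only arithmetic; the helper names and the `num_layers` parameter are hypothetical and do not refer to the actual depth of our network.

```python
def conv_output_length(n, kernel, padding, stride):
    """Output length of a 1D convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def temporal_receptive_field(num_layers, kernel=3):
    """Number of input frames seen by one output time step after stacking
    `num_layers` stride-1 temporal convolutions (ignoring clip borders)."""
    return 1 + num_layers * (kernel - 1)

# Kernel 3, padding 1, stride 1 keeps all 16 time steps.
print(conv_output_length(16, kernel=3, padding=1, stride=1))  # 16
# Two stacked layers already mix information from 5 neighboring frames.
print(temporal_receptive_field(2))  # 5
```

This is why each of the 16 temporal slices lines up with one input frame while still carrying information from the frames around it.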
The 40-class subset contains 53,267 samples in total, split in an 8:1:1 ratio [1]. The test-set accuracy of the above architecture is 51.1%, about 15 percentage points higher than the 36.2% reported in our paper [1].
Temporal localization maps
Using the model trained above, we took some random samples and visualized them using Grad-CAM [4, 7]. The data is sampled at 4 fps. With a clip size of 16 frames (see above), each clip covers at most 4 seconds of activity.
The examples below show the original video next to a version with the heat map overlaid (red marks the most intense regions). The true label and the top-2 predictions are shown beside each example.
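For readers who want to reproduce this kind of overlay, here is a rough sketch of how one coarse 7×7 map could be blended onto a 224×224 frame. The nearest-neighbor upsampling, the plain red tint and the `alpha` parameter are all assumptions made for illustration, not a description of the rendering used for the videos in this post. The frame is stored channel-last here, unlike the channel-first convention used above.

```python
import numpy as np

def overlay_heatmap(frame, cam, alpha=0.5):
    """Blend a coarse heatmap onto a frame.

    frame: (H, W, 3) array of floats in [0, 1].
    cam:   (h, w) array of floats in [0, 1], with H and W divisible by h and w.
    """
    fh = frame.shape[0] // cam.shape[0]
    fw = frame.shape[1] // cam.shape[1]
    # Nearest-neighbor upsampling: every coarse cell becomes an fh x fw block.
    cam_up = np.repeat(np.repeat(cam, fh, axis=0), fw, axis=1)   # (H, W)
    # Pure red image used as the tint color.
    red = np.zeros_like(frame)
    red[..., 0] = 1.0
    # Per-pixel blending weight: stronger heat means more red.
    w = (alpha * cam_up)[..., np.newaxis]
    return (1.0 - w) * frame + w * red

# Uniform gray frame with a single hot cell in the middle of the map.
frame = np.full((224, 224, 3), 0.5)
cam = np.zeros((7, 7))
cam[3, 3] = 1.0
out = overlay_heatmap(frame, cam)
print(out.shape)      # (224, 224, 3)
print(out[0, 0])      # [0.5 0.5 0.5]
print(out[100, 100])  # [0.75 0.25 0.25]
```

Applying this to each of the 16 frames with its matching temporal slice gives a heat map that moves with the clip.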
A few positive ones:
1. Putting [something]: 0.84
2. Dropping [something]: 0.10
1. Tearing [something]: 0.99
2. Stacking [number of] [something]: 0.00
1. Uncovering [something]: 0.99
2. Opening [something]: 0.00
1. Closing [something]: 0.96
2. Opening [something]: 0.02
1. Pushing [something] so that it slightly moves: 0.43
2. Pretending to take [something] from [somewhere]: 0.20
1. Approaching [something] with your camera: 0.26
2. Dropping [something]: 0.15
1. Picking [something] up: 0.99
2. Putting [something]: 0.00
A few medium ones:
1. Turning the camera downwards while filming [something]: 0.67
2. Picking [something] up: 0.10
1. Turning the camera left while filming [something]: 0.21
2. Turning the camera right while filming [something]: 0.21
1. Dropping [something]: 0.97
2. Throwing [something] against [something]: 0.01
1. Pushing [something] with [something]: 0.34
2. Picking [something] up: 0.25
A few negative ones:
1. Turning [something] upside down: 0.67
2. Turning the camera left while filming [something]: 0.07
1. Dropping [something]: 0.50
2. Picking [something] up: 0.11
1. Pushing [something] with [something]: 0.19
2. Picking [something] up: 0.15
On close inspection, the examples above suggest that, in most cases, the model has learned to follow the object of interest over time. We will follow up on this work in the future.
At TwentyBN, with the help of our proprietary data collection platform, we are collecting hundreds of thousands of videos describing fine-grained concepts in the world, with the aim of enabling a human-like visual understanding of the world. Recently, we released two large-scale video datasets (256,591 labeled videos), and we believe our efforts in this direction will help the community take on further challenges.
[1] Goyal et al. “The ‘something something’ video database for learning and evaluating visual common sense.” arXiv preprint arXiv:1706.04261 (2017). In ICCV 2017. [To appear]
[2] Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” European Conference on Computer Vision. Springer, Cham, 2014.
[3] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. “Learning Deep Features for Discriminative Localization.” In CVPR, 2016.
[4] Selvaraju, Ramprasaath R., et al. “Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization.” arXiv preprint arXiv:1610.02391 (2016). In ICCV 2017. [To appear]
[5] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. “Learning Spatiotemporal Features with 3D Convolutional Networks.” In ICCV, 2015.
[6] Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” arXiv preprint arXiv:1705.07750 (2017).