Review — DevNet: Deep Event Network for Multimedia Event Detection and Evidence Recounting (Video Classification)

AlexNet-Like Network with the Spatial Pyramid Pooling Layer from SPPNet for Video Classification

Sik-Ho Tsang
Nerd For Tech
6 min read · Jun 13, 2021


Given a video for testing, DevNet not only provides an event label but also spatial-temporal key evidences.

In this story, DevNet: Deep Event Network for Multimedia Event Detection and Evidence Recounting (DevNet), by Tsinghua University, Hong Kong University of Science and Technology, University of Technology Sydney, and Carnegie Mellon University, is reviewed.

An event is a semantic abstraction of video sequences of higher level than a concept and often consists of multiple concepts.

A long unconstrained video may contain a lot of irrelevant information and even the same event label may contain large intra-class variations.

In this paper:

  • Deep Event Network (DevNet) is designed to simultaneously detect pre-defined events and provide key spatial-temporal evidences.
  • A spatial-temporal saliency map can be generated to localize the key evidence.
  • This is the first paper to use CNN for the above task.

This is a paper in 2015 CVPR with over 280 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. DevNet: Network Architecture
  2. Gradient-based Spatial-temporal Saliency Map
  3. Experimental Results

1. DevNet: Network Architecture

DevNet: Network Architecture

1.1. Pretraining

  • The CNN is similar to AlexNet, but contains 9 convolutional layers and 3 fully-connected layers (a sketch follows this list).
  • Between these two parts, a spatial pyramid pooling layer, originated in SPPNet, is adopted.
  • The weight layer configuration is: conv64-conv192-conv384-conv384-conv384-conv384-conv384-conv384-conv384-full4096-full4096-full1000.
  • The first two fully-connected layers are each followed by a dropout layer with a dropout rate of 0.5.
  • After ImageNet pretraining, on ILSVRC2014 validation set, the network achieves the top-1/top-5 classification error of 29.7%/10.5%.
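To make the pretraining configuration concrete, below is a minimal PyTorch sketch of such a backbone. The layer widths follow the conv64-conv192-conv384×7-full4096-full4096-full1000 configuration above, but the kernel sizes, strides and pooling positions are assumptions for illustration (they are not specified here), and the SPP layer is approximated with adaptive max-pooling over the pyramid levels {1, 2, 4}.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPLayer(nn.Module):
    """Spatial pyramid pooling (SPPNet-style): max-pool the last conv feature
    map at several pyramid levels and concatenate, so the output length is
    fixed regardless of the input frame size."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                                   # x: (N, C, H, W)
        pooled = [F.adaptive_max_pool2d(x, l).flatten(1) for l in self.levels]
        return torch.cat(pooled, dim=1)                     # (N, C * sum(l*l))

class DevNetBackbone(nn.Module):
    """AlexNet-like backbone with the layer widths quoted above; the kernel
    sizes, strides and pooling positions below are assumptions."""
    def __init__(self, num_classes=1000, levels=(1, 2, 4)):
        super().__init__()
        widths = [64, 192] + [384] * 7                      # conv64-conv192-conv384 x 7
        layers, in_ch = [], 3
        for i, out_ch in enumerate(widths):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            if i in (0, 1, 4):                              # assumed pooling positions
                layers.append(nn.MaxPool2d(kernel_size=2))
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.spp = SPPLayer(levels)
        feat_dim = widths[-1] * sum(l * l for l in levels)  # 384 * 21 = 8064
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes))                   # full4096-full4096-full1000

    def forward(self, x):                                   # x: (N, 3, H, W) key frames
        return self.classifier(self.spp(self.features(x)))
```

Thanks to the SPP layer, key frames of different resolutions can be fed to the same network without resizing them to a fixed shape.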

1.2. DevNet: Fine-tuning

  • Then, the softmax classifier and the last fully connected layer of the pre-trained network are removed.
  • To aggregate image-level features into the video-level representation, cross-image max-pooling is applied to fuse the outputs of the second fully-connected layer from all the key frames within the same video (a small sketch follows this list): f_i = max_t s_i^t,
  • where s_i^t is the ith dimension of the feature vector of key frame t, and f_i is the ith dimension of the video-level feature vector f.
  • The softmax loss is replaced with a more appropriate c-way independent logistic regression, which produces a detection score for each event class.
  • This video-level representation after cross-image max-pooling is used as the feature for the event detection task.
  • In brief, support vector machines (SVMs) and kernel ridge regression (KR) with chi^2 kernel are used as the event classifier.
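As a small illustration of the cross-image max-pooling step, here is a PyTorch sketch; the 8 key frames and 4096-d fc2 features are placeholder numbers, and the downstream SVM/KR classifiers are not included.

```python
import torch

def cross_image_max_pool(frame_features):
    """Fuse per-key-frame fc2 features into one video-level vector by taking
    the element-wise maximum over key frames: f_i = max_t s_i^t.
    frame_features: (T, D) tensor, one feature vector per key frame."""
    return frame_features.max(dim=0).values        # (D,) video-level feature f

# Toy usage: 8 key frames, 4096-d fc2 features (placeholder numbers).
video_feature = cross_image_max_pool(torch.randn(8, 4096))
print(video_feature.shape)                         # torch.Size([4096])
```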

2. Gradient-based Spatial-temporal Saliency Map

  • The main idea of the event recounting approach is that, given a trained DevNet and an event class of interest, a backward pass traces the detection score back to the original input video, so that we can find how each pixel affects the final detection score for the specified event class.
  • However, the class score Sc(V) is a highly nonlinear function of V, so Sc(V) is approximated by a first-order Taylor expansion around V0: Sc(V) ≈ wc · V + b,
  • where V denotes the vectorized form of the video and Sc(V) is the detection score for event class c.
  • The weight vector wc is the derivative of Sc(V) with respect to V at the point V0: wc = ∂Sc/∂V evaluated at V = V0.
  • Given a video that belongs to event class c with k key frames of size p×q, the spatial and temporal key evidences are computed.
  • The saliency score of each pixel in each key frame can be computed as M(i, j, k) = |wc(h(i, j, k))|,
  • where h(i, j, k) is the index of the element of wc corresponding to the image pixel in the ith row and jth column of the kth key frame.
  • Thus, for each event class, a single class-specific saliency score can be derived for each pixel in the video.
  • After obtaining the spatial-temporal saliency map, the saliency scores of all the pixels within a key frame are averaged to obtain a key-frame-level saliency score.
  • Then, the key-frame-level saliency scores are ranked to find the most informative key frames (a sketch of the whole procedure follows this list).
  • For the top ranked key frames, the saliency scores are used as guidance and the graph-cut algorithm is applied to segment the spatial salient regions.
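A compact PyTorch sketch of this gradient-based recounting step is given below. The model interface (a function mapping a stack of key frames to per-event detection scores) is an assumption for illustration, not the paper's exact implementation, and the final graph-cut segmentation of the spatial regions is omitted.

```python
import torch

def spatial_temporal_saliency(model, key_frames, event_idx):
    """Gradient-based saliency sketch: back-propagate the detection score of
    one event class to the input key frames and take per-pixel gradient
    magnitudes (max over colour channels).
    model: assumed to map a (T, 3, H, W) stack of key frames to a vector of
           per-event detection scores.
    key_frames: (T, 3, H, W) tensor of key frames of one video."""
    key_frames = key_frames.detach().clone().requires_grad_(True)
    score = model(key_frames)[event_idx]             # Sc(V): score for event c
    score.backward()                                  # wc = dSc/dV at the input
    saliency = key_frames.grad.abs().max(dim=1).values   # (T, H, W) pixel saliency
    frame_scores = saliency.flatten(1).mean(dim=1)        # key-frame level scores
    ranked = frame_scores.argsort(descending=True)        # most informative first
    return saliency, ranked
```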

3. Experimental Results

3.1. Dataset

  • NIST TRECVID 2014 Multimedia Event Detection dataset is used.
  • This dataset contains unconstrained web videos with large variation in length, quality and resolution.
  • It also comes with ground-truth video-level annotations for 20 event categories.
  • Following the 100EX evaluation procedure, 3 different partitions are used for evaluation:
  1. Background, which contains about 5000 background videos not belonging to any of the target events.
  2. 100EX, which contains 100 positive videos for each event, is used as the training set.
  3. MEDTest, which contains 23,954 videos, is used as the test set.

3.2. Event Detection Results

  • Minimal Normalized Detection Cost (MinNDC) and Average Precision (AP) are used as metrics for each event.
  • In brief, MinNDC assigns different cost values to missed detections and false alarms.
  • A lower MinNDC or a higher AP and mAP value indicates better detection performance.
Event detection results comparing with improved dense trajectory Fisher vector (IDTFV).
  • The proposed CNN-based DevNet achieves a 5.86% improvement in mean Average Precision (mAP), averaged over all events, compared with the state-of-the-art IDTFV shallow features (the AP/mAP computation is sketched below).
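For reference, per-event AP and the mAP over events can be computed as in the generic scikit-learn sketch below; this is not the official TRECVID evaluation code, and MinNDC is omitted because it depends on TRECVID-specific cost parameters.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_event_ap_and_map(scores, labels):
    """scores, labels: dicts mapping event name -> arrays over test videos
    (detection scores and 0/1 ground-truth labels for that event)."""
    aps = {event: average_precision_score(labels[event], scores[event])
           for event in scores}
    return aps, float(np.mean(list(aps.values())))   # per-event APs and mAP
```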

3.3. Evidence Recounting Results

Comparison in terms of evidence quality against recounting percentage.
  • Two criteria were used.
  1. Evidence quality measures how well the localized key evidences can convince the judge that a specific event occurs in the video.
  2. Recounting percentage measures how compact the video snippets are compared to the whole video.
  • A few volunteers were asked to serve as evaluators. The evaluators were first shown 1, 5, 10, 25, 50, 75 and 100 percent of the test videos separately. They then voted on whether the key frames shown could convince them that the video is a positive exemplar.
  • As in the figure, DevNet can reduce the recounting percentage by 15% to 25% to get the same evidence quality as the baseline method. This validates that DevNet provides reasonably good evidences for users to rapidly and accurately grasp the basic ideas of the video events.
Event recounting results comparing with the baseline approach. T means temporal key evidences and S means spatial key evidences.
  • The above table summarizes the evaluators’ preferences.
  • DevNet is better for most of the events.
Event recounting results generated by DevNet. From left to right are top one temporal key evidence, spatial saliency map, and spatial key evidence.
  • Some visual results are also shown above.
