[Paper] Deep Video: Large-scale Video Classification with Convolutional Neural Networks (Video Classification)

Outperforms Hand-Crafted Feature Approaches

Published in The Startup, Nov 1, 2020

In this story, Large-scale Video Classification with Convolutional Neural Networks (Deep Video), by Google Research and Stanford University, is presented. In this paper:

A spatio-temporal CNN is designed.
Several strategies for fusing information across multiple frames, together with a multiresolution input strategy for speeding up the network, are evaluated for video classification.
This is likely one of the first papers to apply CNNs to large-scale video classification.

This is a 2014 CVPR paper with over 4900 citations. (Sik-Ho Tsang @ Medium)

Outline

Fusion Strategies
Input resolution strategies
Experimental Results

1. Fusion Strategies

**Fusion Strategies (Red, green and blue boxes indicate convolutional, normalization and pooling layers respectively.)**

1.1. Single-frame

An AlexNet-like architecture is used.
C(96, 11, 3)-N-P-C(256, 5, 1)-N-P-C(384, 3, 1)-C(384, 3, 1)-C(256, 3, 1)-P-FC(4096)-FC(4096), where C(d, f, s) indicates a convolutional layer with d filters of spatial size f×f, applied to the input with stride s.
FC(n) is a fully connected layer with n nodes.
All pooling layers P pool spatially in non-overlapping 2×2 regions.
All normalization layers N are local response normalization (LRN) layers, as used in AlexNet.
The final layer is connected to a softmax classifier with dense connections.
(If interested, please feel free to read AlexNet.)
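To make the shorthand concrete, here is a minimal sketch that parses the layer string into a list of layer descriptions and counts the convolutional weights. The helper names `parse_arch` and `conv_weight_counts` are made up for illustration, and the weight count assumes that every filter sees all channels of the previous layer and that biases are ignored, which the summary above does not state.

```python
import re

# Layer string from the text, written without spaces.
ARCH = ("C(96,11,3)-N-P-C(256,5,1)-N-P-C(384,3,1)-"
        "C(384,3,1)-C(256,3,1)-P-FC(4096)-FC(4096)")

def parse_arch(spec):
    """Turn the C(d,f,s)-N-P-FC(n) shorthand into a list of layer dicts."""
    layers = []
    for token in spec.split("-"):
        token = token.strip()
        conv = re.fullmatch(r"C\((\d+),\s*(\d+),\s*(\d+)\)", token)
        fc = re.fullmatch(r"FC\((\d+)\)", token)
        if conv:
            d, f, s = map(int, conv.groups())
            layers.append({"type": "conv", "filters": d, "size": f, "stride": s})
        elif fc:
            layers.append({"type": "fc", "nodes": int(fc.group(1))})
        elif token == "N":
            layers.append({"type": "norm"})
        elif token == "P":
            # Non-overlapping 2x2 pooling, as described in the text.
            layers.append({"type": "pool", "size": 2, "stride": 2})
        else:
            raise ValueError(f"unknown layer token: {token!r}")
    return layers

def conv_weight_counts(layers, in_channels=3):
    """Weights per conv layer (assumption: full connectivity, no biases)."""
    counts = []
    for layer in layers:
        if layer["type"] == "conv":
            counts.append(layer["filters"] * layer["size"] ** 2 * in_channels)
            in_channels = layer["filters"]
    return counts

layers = parse_arch(ARCH)
print([layer["type"] for layer in layers])
print(conv_weight_counts(layers))
```

Under these assumptions, the first layer alone holds 96 × 11 × 11 × 3 = 34,848 weights, which is the number that the fusion variants below change.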

1.2. Early Fusion

This is implemented by modifying the filters on the first convolutional layer in the single-frame model by extending them to be of size 11×11×3×T pixels, where T is some temporal extent (T = 10, or approximately a third of a second).
The early and direct connectivity to pixel data allows the network to precisely detect local motion direction and speed.
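As a rough illustration of what extending the first-layer filters means, the sketch below compares the filter shape and weight count of the single-frame model (T = 1) with the Early Fusion model (T = 10). The function names are hypothetical, and the count of 96 filters is taken from the C(96, 11, 3) layer of the single-frame model.

```python
def first_layer_filter_shape(T=1, size=11, channels=3):
    """Shape of one first-layer filter: size x size x channels x T."""
    return (size, size, channels, T)

def weight_count(shape, num_filters=96):
    """Total weights in the layer (biases ignored for simplicity)."""
    total = num_filters
    for dim in shape:
        total *= dim
    return total

single = first_layer_filter_shape(T=1)   # single-frame model
early = first_layer_filter_shape(T=10)   # Early Fusion model

print(single, weight_count(single))
print(early, weight_count(early))
```

Only the first layer grows (by a factor of T); the rest of the network keeps the single-frame layout.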

1.3. Late Fusion

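Two separate single-frame networks with shared parameters are applied to two frames that are 15 frames apart.
The two streams are merged in the first fully connected layer.
Neither tower alone can detect motion, but the first fully connected layer can compute global motion characteristics by comparing the outputs of both towers.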

1.4. Slow Fusion

The Slow Fusion model is a balanced mix between the two approaches that slowly fuses temporal information throughout the network such that higher layers get access to progressively more global information in both spatial and temporal dimensions.
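In the first convolutional layer, every filter has temporal extent T = 4 and is applied to an input clip of 10 frames through valid convolution with stride 2, producing 4 responses in time.
The second and third layers repeat this process with temporal extent T = 2 and stride 2, so the third convolutional layer has access to information from all 10 input frames.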
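The "4 responses in time" follow from the usual output-length formula for a valid (unpadded) convolution. The sketch below applies it layer by layer; the (extent, stride) pairs for the second and third layers, (2, 2) each, follow my reading of the original paper and should be treated as an assumption here. The function names are illustrative.

```python
def temporal_output_length(num_frames, extent, stride):
    """Number of responses in time from a valid temporal convolution."""
    if num_frames < extent:
        raise ValueError("clip is shorter than the filter's temporal extent")
    return (num_frames - extent) // stride + 1

def slow_fusion_lengths(num_frames=10, layers=((4, 2), (2, 2), (2, 2))):
    """Temporal length after each fusing layer, starting from the input clip."""
    lengths = [num_frames]
    for extent, stride in layers:
        lengths.append(temporal_output_length(lengths[-1], extent, stride))
    return lengths

print(slow_fusion_lengths())
```

With these settings the temporal dimension shrinks from 10 to 4 to 2 to 1, so temporal information is fully merged only after the third convolutional layer.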

2. Input resolution strategies

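One approach to speeding up the networks is to reduce the number of layers and the number of neurons in each layer, but the authors found that this consistently lowers performance.
Instead of shrinking the network, the authors experimented with training on lower-resolution images. This improved the running time, but the high-frequency detail in the images proved critical to achieving good accuracy.
As a compromise, a multiresolution architecture with two streams is used.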

2.1. Fovea Stream

The fovea stream processes a high-resolution center crop.
The fovea stream receives the center 89×89 region at the original resolution.

2.2. Context Stream

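The context stream models the whole frame at low resolution.
The context stream receives the frames downsampled to half the original spatial resolution (from 178×178 to 89×89 pixels).
Both streams are processed by the same network as the full-frame models, but starting from 89×89 inputs. The activations of both streams are concatenated and fed into the first fully connected layer.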
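The two inputs can be sketched as follows for a single-channel 178×178 frame stored as a list of rows. The 178×178 input size is what the half-resolution figure of 89×89 implies; averaging 2×2 blocks is just one possible downsampling method and is an assumption, as are the helper names.

```python
def center_crop(frame, size):
    """Return the central size x size window of a 2-D frame (list of rows)."""
    h, w = len(frame), len(frame[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]

def downsample_half(frame):
    """Halve the resolution by averaging each 2x2 block (assumed method)."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [
        [
            (frame[2 * i][2 * j] + frame[2 * i][2 * j + 1]
             + frame[2 * i + 1][2 * j] + frame[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]

# Toy frame whose pixel value encodes its position.
frame = [[float(r * 178 + c) for c in range(178)] for r in range(178)]

fovea = center_crop(frame, 89)     # full detail, center only
context = downsample_half(frame)   # whole frame, half resolution

print(len(fovea), len(fovea[0]))
print(len(context), len(context[0]))
```

Both streams end up with 89×89 inputs, so each processes a quarter of the pixels of the full frame, which is where the speedup comes from.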

3. Experimental Results

All images are preprocessed by first cropping to the center region, resizing to 200×200 pixels, randomly sampling a 170×170 region, and finally flipping the image horizontally with 50% probability.
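A minimal pure-Python sketch of this preprocessing pipeline is shown below. It makes two assumptions that the text does not specify: "center region" is taken to mean the largest centered square, and the resize uses nearest-neighbor sampling. All function names are illustrative.

```python
import random

def center_square(img):
    """Crop the largest centered square (assumed meaning of 'center region')."""
    h, w = len(img), len(img[0])
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return [row[left:left + side] for row in img[top:top + side]]

def resize_nearest(img, size):
    """Resize to size x size with nearest-neighbor sampling (assumed method)."""
    h, w = len(img), len(img[0])
    return [[img[i * h // size][j * w // size] for j in range(size)]
            for i in range(size)]

def random_crop(img, size, rng):
    """Sample a size x size window at a uniformly random position."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)    # randint is inclusive on both ends
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]

def random_flip(img, rng, p=0.5):
    """Flip horizontally with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in img]
    return img

def preprocess(img, rng):
    img = center_square(img)
    img = resize_nearest(img, 200)
    img = random_crop(img, 170, rng)
    return random_flip(img, rng)

rng = random.Random(0)
# Toy 240x320 frame whose pixels record their original (row, col) position.
frame = [[(r, c) for c in range(320)] for r in range(240)]
out = preprocess(frame, rng)
print(len(out), len(out[0]))
```

The random crop and flip act as data augmentation during training; at test time a deterministic crop would normally be used instead.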

3.1. Sports-1M

**Results on the 200,000 videos of the Sports-1M test set.**

The Sports-1M dataset consists of 1 million YouTube videos annotated with 487 classes.
The dataset is split by assigning 70% of the videos to the training set, 10% to a validation set and 20% to a test set.
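A 70/10/20 split like this can be sketched in a few lines. The text does not say how videos were assigned to each set, so the seeded random shuffle and the function name `split_dataset` are assumptions.

```python
import random

def split_dataset(video_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle IDs and cut them into train / validation / test lists."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(round(fractions[0] * len(ids)))
    n_val = int(round(fractions[1] * len(ids)))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]   # the remainder goes to the test set
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

Applied to 1 million videos, the same proportions give the 200,000 test videos mentioned in the results caption.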
Deep Video approaches consistently and significantly outperform the feature-based baseline.
The single-frame model already displays strong performance.
The foveated (multiresolution) architectures are 2 to 4× faster in practice.
During training, throughput rises from 6 to 21 clips per second (3.5×) for the single-frame model and from 5 to 10 clips per second (2×) for the Slow Fusion model.
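The quoted speedup factors are simply ratios of training throughput, which can be checked quickly (the function name is illustrative):

```python
def speedup(clips_per_sec_before, clips_per_sec_after):
    """Ratio of training throughput after vs. before the change."""
    return clips_per_sec_after / clips_per_sec_before

single_frame = speedup(6, 21)
slow_fusion = speedup(5, 10)
print(single_frame, slow_fusion)
```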
The Slow Fusion network is chosen as a representative motion-aware network because it performs best.

**Filters learned on first layer of a multiresolution network. Left: context stream, Right: fovea stream.**

Interestingly, the context stream learns more color features, while the high-resolution fovea stream learns high-frequency grayscale filters.

**Some Examples of Predictions on Sports-1M test data**

3.2. UCF-101

**Results on UCF-101 for various Transfer Learning approaches using the Slow Fusion network**

The dataset consists of 13,320 videos belonging to 101 categories that are separated into 5 broad groups: Human-Object interaction (Applying eye makeup, brushing teeth, hammering, etc.), Body-Motion (Baby crawling, push ups, blowing candles, etc.), Human-Human interaction (Head massage, salsa spin, haircut, etc.), Playing Instruments (flute, guitar, piano, etc.) and Sports.
Fine-tune top layer: Train a classifier on the last 4096-dimensional layer, with dropout regularization.
Fine-tune top 3 layers: Train the top 3 layers.
Fine-tune all layers: Train all layers.
Train from scratch: Train all layers from scratch using UCF-101.
The best performance is obtained by taking a balanced approach and retraining the top few layers of the network.
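The four settings differ only in which layers start from the Sports-1M weights and which layers are updated on UCF-101. The sketch below encodes that as a small configuration table. The layer names are hypothetical placeholders, and reading "top 3 layers" as the two fully connected layers plus the classifier is my interpretation.

```python
# Hypothetical layer names, ordered from input to output.
LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5",
          "fc6", "fc7", "classifier"]

STRATEGIES = {
    "fine_tune_top_layer":    {"pretrained": True,  "trainable": LAYERS[-1:]},
    "fine_tune_top_3_layers": {"pretrained": True,  "trainable": LAYERS[-3:]},
    "fine_tune_all_layers":   {"pretrained": True,  "trainable": LAYERS[:]},
    "train_from_scratch":     {"pretrained": False, "trainable": LAYERS[:]},
}

def frozen_layers(strategy):
    """Layers whose weights stay fixed under the given strategy."""
    trainable = set(STRATEGIES[strategy]["trainable"])
    return [name for name in LAYERS if name not in trainable]

for name, cfg in STRATEGIES.items():
    print(name, "| pretrained:", cfg["pretrained"],
          "| frozen:", frozen_layers(name))
```

In a deep learning framework, the frozen list would typically translate into disabling gradient updates for those layers.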

**Mean Average Precision of the Slow Fusion network on UCF-101 classes broken down by category groups**

The performance is broken down into 5 broad groups of classes.
The average precision of every class and the mean average precision over classes in each group are computed.
The gain in performance when moving from retraining only the top layer to retraining the top 3 layers is almost entirely due to improvements on non-Sports categories.
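For readers unfamiliar with the metric, here is a minimal sketch of average precision (AP) per class and its mean over classes. It uses one common definition of AP (the mean of the precision values at the rank of each positive example); the paper's exact evaluation protocol may differ, and the toy scores are made up.

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_class):
    """Mean of the per-class APs; per_class maps name -> (scores, labels)."""
    aps = [average_precision(s, l) for s, l in per_class.values()]
    return sum(aps) / len(aps)

# Toy group of two classes with made-up classifier scores.
toy_group = {
    "class_a": ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]),
    "class_b": ([0.7, 0.6, 0.2], [1, 1, 0]),
}
ap_a = average_precision(*toy_group["class_a"])
ap_b = average_precision(*toy_group["class_b"])
group_map = mean_average_precision(toy_group)
print(ap_a, ap_b, group_map)
```

Computing the mean separately for each of the 5 groups gives the per-group breakdown shown in the table.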

Reference

[2014 CVPR] [Deep Video]
Large-scale Video Classification with Convolutional Neural Networks