Action Recognition Paper Note: Convolutional Two-Stream Network Fusion for Video Action Recognition


We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings:

  1. that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters
  2. that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy
  3. that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance.


The two-stream architecture is not able to exploit two very important cues for action recognition in video:

  1. recognizing what is moving where, i.e. registering appearance recognition (spatial cue) with optical flow recognition (temporal cue)
  2. how these cues evolve over time

Our objective in this paper is to rectify this by developing an architecture that is able to fuse spatial and temporal cues at several levels of granularity in feature abstraction, and with spatial as well as temporal integration.


Spatial fusion

Our intention is to fuse the two networks such that channel responses at the same pixel position are put in correspondence.

Sum fusion, Max fusion, Concatenation fusion, Conv fusion, Bilinear fusion
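
The five spatial fusion variants can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the (H, W, D) layout, the function names, and the filter shapes are assumptions made for this note. Both inputs are assumed to have the same spatial size and channel count, so that responses at the same pixel position line up.

```python
import numpy as np

# Assumed layout for this sketch: (H, W, D) = height, width, channels.
# xa is the spatial-stream feature map, xb the temporal-stream feature map.

def sum_fusion(xa, xb):
    """Add the two maps at the same pixel position and channel."""
    return xa + xb

def max_fusion(xa, xb):
    """Take the element-wise maximum of the two maps."""
    return np.maximum(xa, xb)

def concat_fusion(xa, xb):
    """Stack the two maps along the channel axis, giving 2D channels."""
    return np.concatenate([xa, xb], axis=-1)

def conv_fusion(xa, xb, f, b):
    """Concatenate, then apply a 1x1 convolution.

    f has shape (2D, D_out) and b has shape (D_out,); a 1x1 convolution
    is a per-pixel matrix product over the channel axis.
    """
    return concat_fusion(xa, xb) @ f + b

def bilinear_fusion(xa, xb):
    """Outer product of the two channel vectors at each pixel,
    summed over all pixel positions, flattened to D * D values."""
    return np.einsum("hwi,hwj->ij", xa, xb).reshape(-1)

# Usage with random feature maps (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
xa = rng.standard_normal((7, 7, 4))
xb = rng.standard_normal((7, 7, 4))

# With two stacked identity matrices as filters and zero bias,
# conv fusion reduces to sum fusion.
f_sum = np.vstack([np.eye(4), np.eye(4)])
fused = conv_fusion(xa, xb, f_sum, np.zeros(4))
```

Sum and max fusion keep D channels, concatenation doubles them, conv fusion lets the filters decide how channels from the two streams are weighted and combined, and bilinear fusion discards spatial location by summing over all pixels.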

Temporal fusion

3D Pooling, 3D Conv + Pooling
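
A rough sketch of the two temporal fusion options, again in NumPy and under assumptions of this note: feature maps from T time steps are stacked into a (T, H, W, D) array, pooling windows do not overlap, and the convolution uses no padding. The names `max_pool_3d` and `conv_3d_valid` are hypothetical helpers, not functions from the paper or any library.

```python
import numpy as np

def max_pool_3d(x, pool=(2, 2, 2)):
    """Non-overlapping 3D max pooling over (time, height, width).

    x has shape (T, H, W, D); each channel is pooled independently.
    Assumes T, H and W are divisible by the pooling sizes.
    """
    T, H, W, D = x.shape
    t, h, w = pool
    assert T % t == 0 and H % h == 0 and W % w == 0
    blocks = x.reshape(T // t, t, H // h, h, W // w, w, D)
    return blocks.max(axis=(1, 3, 5))

def conv_3d_valid(x, f, b):
    """Naive 3D convolution (no padding, stride 1).

    x: (T, H, W, D), f: (t, h, w, D, D_out), b: (D_out,).
    Written with explicit loops for clarity, not speed.
    """
    T, H, W, D = x.shape
    t, h, w, d_in, d_out = f.shape
    assert d_in == D
    out = np.empty((T - t + 1, H - h + 1, W - w + 1, d_out))
    for i in range(T - t + 1):
        for j in range(H - h + 1):
            for k in range(W - w + 1):
                patch = x[i:i + t, j:j + h, k:k + w, :]
                # Contract all four patch axes with the filter bank.
                out[i, j, k] = np.tensordot(patch, f, axes=4) + b
    return out

def conv_then_pool_3d(x, f, b, pool):
    """3D Conv + Pooling: filter the stacked maps, then pool."""
    return max_pool_3d(conv_3d_valid(x, f, b), pool)
```

Plain 3D pooling only aggregates responses over a spatiotemporal neighbourhood, while the conv variant can first learn how features at nearby time steps and positions should be weighted before they are pooled.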

Proposed architecture

We fuse the two networks at the last convolutional layer (after ReLU) into the spatial stream, converting it into a spatiotemporal stream by using 3D Conv fusion followed by 3D pooling. Moreover, we do not truncate the temporal stream and also perform 3D pooling in the temporal network.
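
The data flow of this architecture might be wired up roughly as below. This is a simplified sketch, not the authors' code: the function names, the (T, H, W, D) layout, the zero padding in time, and the fusion filter with a 1x1 spatial extent are all assumptions chosen to keep the example short.

```python
import numpy as np

def pool3d(x, pool):
    """Non-overlapping 3D max pooling on a (T, H, W, D) array."""
    T, H, W, D = x.shape
    t, h, w = pool
    return x.reshape(T // t, t, H // h, h, W // w, w, D).max(axis=(1, 3, 5))

def two_stream_fusion(spatial, temporal, f, b, pool=(2, 2, 2)):
    """Fuse post-ReLU conv features of both streams.

    spatial, temporal: (T, H, W, D) feature maps from the last conv layer.
    f: (t, 2D, D_out) fusion filters spanning t time steps (t odd) and a
       single pixel spatially (a simplification made for this sketch).
    b: (D_out,) bias.
    Returns the pooled spatiotemporal stream and the pooled temporal stream.
    """
    t = f.shape[0]
    T = spatial.shape[0]
    # Put channels of both streams in correspondence at each pixel.
    stacked = np.concatenate([spatial, temporal], axis=-1)
    # Zero-pad in time so the fused output keeps T time steps.
    padded = np.pad(stacked, ((t // 2, t // 2), (0, 0), (0, 0), (0, 0)))
    fused = np.stack([
        np.einsum("thwc,tcd->hwd", padded[i:i + t], f) + b
        for i in range(T)
    ])
    # The spatial stream becomes the spatiotemporal stream; the temporal
    # stream is kept (not truncated) and pooled in 3D as well.
    return pool3d(fused, pool), pool3d(temporal, pool)

# Usage with arbitrary sizes: 4 time steps, 2x2 maps, 3 channels.
spatial = np.ones((4, 2, 2, 3))
temporal = 2.0 * np.ones((4, 2, 2, 3))
f = np.zeros((3, 6, 3))
f[1] = np.vstack([np.eye(3), np.eye(3)])  # centre tap sums both streams
st_out, t_out = two_stream_fusion(spatial, temporal, f, np.zeros(3))
```

Both outputs would then feed their own later layers, and, as noted in the findings above, the class predictions of the two resulting streams can additionally be combined.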