Action Recognition Paper note: TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition
Abstract
Despite the success of two-stream deep Convolutional Neural Networks, methods extending the basic two-stream ConvNet have not systematically explored possible network architectures to further exploit spatiotemporal dynamics within video sequences. Further, such networks often use different baseline two-stream networks.
In this work, we first demonstrate a strong baseline two-stream ConvNet using ResNet-101. We use this baseline to thoroughly examine the use of both RNNs and Temporal-ConvNets for extracting spatiotemporal information.
We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets, applied to spatiotemporal feature matrices, can exploit spatiotemporal dynamics to improve overall performance.
Introduction
Each prior work on two-stream ConvNets uses a different network for its baseline two-stream approach, and performance varies with the training and testing procedure as well as the optical flow method used.
In this paper, we would like to answer the question: given the spatial and motion feature representations over time, what is the best way to exploit the temporal information?
Our contributions:
- Temporal Segment LSTM (TS-LSTM): we revisit the use of LSTMs to fuse high-level spatial and temporal features to learn hidden features across time.
- Temporal-ConvNet: we propose to use stacked temporal convolution kernels to explore information at multiple scales.
Approach
We specifically focus on two models that can be used to process temporal data: Temporal Segment LSTMs (TS-LSTM), which leverage recurrent networks, and Temporal-ConvNets, which apply convolution over temporally constructed feature matrices.
Two-stream ConvNets
In our framework, the two-stream ResNets serve as high-dimensional feature extractors. The input feature vector x for our proposed Temporal Segment LSTM and Temporal-ConvNet is the concatenation of the spatial-stream and temporal-stream feature vectors, so its dimension is the sum of the two streams' feature dimensions.
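To make the concatenation concrete, here is a minimal NumPy sketch. The sizes are assumptions for illustration only: 2048 is the pooled feature size of a standard ResNet-101, and 25 time steps is a hypothetical sampling choice; neither number is stated in this note. Random arrays stand in for the real stream outputs.

```python
import numpy as np

# Assumed sizes, not taken from the note: 2048-d features per stream,
# 25 sampled time steps per video.
FEAT_DIM = 2048
NUM_STEPS = 25


def build_feature_matrix(spatial_feats, temporal_feats):
    """Concatenate per-time-step features of both streams along the feature axis."""
    if spatial_feats.shape[0] != temporal_feats.shape[0]:
        raise ValueError("both streams must cover the same number of time steps")
    return np.concatenate([spatial_feats, temporal_feats], axis=1)


rng = np.random.default_rng(0)
spatial = rng.standard_normal((NUM_STEPS, FEAT_DIM))   # stand-in for spatial-stream features
temporal = rng.standard_normal((NUM_STEPS, FEAT_DIM))  # stand-in for temporal-stream features

x = build_feature_matrix(spatial, temporal)
print(x.shape)  # one row per time step, one column per concatenated feature
```

Each row of `x` is the feature vector for one time step; the rows together form the feature matrix that the temporal models consume.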
Spatial stream:
Using a single RGB image for the spatial stream has been shown to achieve fairly good performance.
Temporal stream:
Stacking 10 optical flow images for the temporal stream is considered standard for two-stream ConvNets.
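A rough sketch of how such a stack might be assembled is shown below. It assumes each flow image has a horizontal and a vertical component, so 10 flow images give 20 input channels; the interleaved channel order, the helper name `stack_flow`, and the toy 16x16 resolution are all illustrative assumptions rather than details from the paper.

```python
import numpy as np


def stack_flow(flow_x, flow_y, start, length=10):
    """Interleave `length` consecutive x/y flow fields into a (2 * length, H, W) array."""
    if start < 0 or start + length > len(flow_x):
        raise ValueError("not enough flow frames for the requested stack")
    channels = []
    for t in range(start, start + length):
        channels.append(flow_x[t])  # horizontal displacement at time t
        channels.append(flow_y[t])  # vertical displacement at time t
    return np.stack(channels, axis=0)


# Toy data: 30 flow frames at 16x16 resolution (hypothetical sizes).
flow_x = np.zeros((30, 16, 16), dtype=np.float32)
flow_y = np.ones((30, 16, 16), dtype=np.float32)

stacked = stack_flow(flow_x, flow_y, start=5)
print(stacked.shape)
```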
Temporal Segment LSTM (TS-LSTM)
We adapt temporal segments for use with RNNs and provide segmental consensus via temporal pooling and LSTM cells.
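One possible reading of this pipeline, sketched in NumPy: split the feature matrix into segments along time, pool within each segment, then run an LSTM over the pooled segment features. The number of segments, max pooling, the hidden size, and the random untrained weights are all assumptions for illustration; the paper's actual configuration may differ.

```python
import numpy as np


def segment_pool(x, num_segments, mode="max"):
    """Split a (T, D) feature matrix into segments along time and pool each one."""
    if num_segments < 1 or num_segments > x.shape[0]:
        raise ValueError("num_segments must be between 1 and the number of time steps")
    segments = np.array_split(x, num_segments, axis=0)
    if mode == "max":
        pooled = [seg.max(axis=0) for seg in segments]
    else:
        pooled = [seg.mean(axis=0) for seg in segments]
    return np.stack(pooled)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_forward(inputs, W, U, b):
    """Run a single-layer LSTM over (S, D) inputs and return the (S, H) hidden states.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias.
    Gate order within the stacked weights: input, forget, candidate, output.
    """
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outputs = []
    for x_t in inputs:
        gates = W @ x_t + U @ h + b
        i, f, g, o = np.split(gates, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g       # update the cell state
        h = o * np.tanh(c)      # expose the gated cell state
        outputs.append(h)
    return np.stack(outputs)


# Toy sizes (hypothetical): 25 time steps, 8-d features, 3 segments, 4 hidden units.
rng = np.random.default_rng(0)
T, D, S, H = 25, 8, 3, 4
features = rng.standard_normal((T, D))
W = rng.standard_normal((4 * H, D)) * 0.1
U = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)

pooled = segment_pool(features, S)              # (S, D): one vector per segment
hidden_states = lstm_forward(pooled, W, U, b)   # (S, H): one hidden state per segment
print(pooled.shape, hidden_states.shape)
```

In a trained model the hidden states would feed a classifier; here they only show the shapes flowing through the two stages.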
Temporal-ConvNet
We apply a ConvNet architecture to the feature matrices x. The overall architecture of the Temporal-ConvNet is composed of multiple Temporal-ConvNet layers (TCLs).
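As a rough illustration of convolving along the time axis at more than one scale, the sketch below applies temporal kernels of different lengths to a feature matrix and concatenates the results. The function names, the fixed averaging kernels, the zero padding, and the ReLU are assumptions chosen to keep the example small; a real TCL would use learned kernels and may be structured differently.

```python
import numpy as np


def temporal_conv_same(x, kernel):
    """Slide a 1D kernel along the time axis of a (T, D) matrix with zero padding.

    The same kernel is applied to every feature column; the output stays (T, D).
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("use an odd kernel length so padding is symmetric")
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([kernel @ padded[t:t + k] for t in range(x.shape[0])])


def multi_scale_layer(x, kernels):
    """Apply each temporal kernel, concatenate along the feature axis, then apply ReLU."""
    branches = [temporal_conv_same(x, kernel) for kernel in kernels]
    return np.maximum(np.concatenate(branches, axis=1), 0.0)


# Toy feature matrix: 6 time steps, 2 features (hypothetical sizes).
x = np.arange(12.0).reshape(6, 2)
kernels = [np.ones(3) / 3.0, np.ones(5) / 5.0]  # averaging kernels at two temporal scales

layer1 = multi_scale_layer(x, kernels)       # (6, 4): two branches of 2 features each
layer2 = multi_scale_layer(layer1, kernels)  # stacking a second layer doubles the width again
print(layer1.shape, layer2.shape)
```

Shorter kernels respond to brief changes, while longer kernels summarize a wider temporal window; stacking layers widens the effective window further.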

