Action Recognition Paper Note: TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

darrenyaoyao
Jul 24, 2017 · 2 min read

Abstract

Despite the success of two-stream deep Convolutional Neural Networks, methods extending the basic two-stream ConvNet have not systematically explored possible network architectures to further exploit spatiotemporal dynamics within video sequences. Further, such networks often use different baseline two-stream networks.

In this work, we first demonstrate a strong baseline two-stream ConvNet using ResNet-101. We use this baseline to thoroughly examine the use of both RNNs and Temporal-ConvNets for extracting spatiotemporal information.

We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets, applied to spatiotemporal feature matrices, can exploit spatiotemporal dynamics to improve overall performance.

Introduction

Each prior work on two-stream ConvNets uses a different network for its baseline two-stream approach, and performance varies with the training and testing procedure as well as the optical flow method used.

In this paper, we would like to answer the question: given the spatial and motion feature representations over time, what is the best way to exploit the temporal information?

Our contributions:

  1. Temporal Segment LSTM (TS-LSTM): we revisit the use of LSTMs to fuse high-level spatial and temporal features to learn hidden features across time.
  2. Temporal-ConvNet: we propose to use stacked temporal convolution kernels to explore information at multiple scales.

Approach

We specifically focus on two models that can be used to process temporal data: Temporal Segment LSTMs (TS-LSTM), which leverage recurrent networks, and Temporal-ConvNets, which apply convolution over temporally constructed feature matrices.

Two-stream ConvNets

In our framework, the two-stream ResNets serve as high-dimensional feature extractors. The input feature vector x for our proposed Temporal Segment LSTM and Temporal-ConvNet is the concatenation of the spatial-stream and temporal-stream feature vectors.
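A minimal sketch of how the input x could be assembled, assuming each stream has already produced one feature vector per time step. The function name `concat_stream_features` and the 3-dimensional toy vectors are hypothetical; real ResNet-101 features are far larger.

```python
def concat_stream_features(spatial_feats, temporal_feats):
    """Concatenate the spatial and temporal feature vectors at each time step.

    Both arguments are lists of equal length, one feature vector per time step.
    Returns a feature matrix with one concatenated row per time step.
    """
    if len(spatial_feats) != len(temporal_feats):
        raise ValueError("both streams must cover the same number of time steps")
    return [list(s) + list(t) for s, t in zip(spatial_feats, temporal_feats)]


# Toy values standing in for the outputs of the two ResNet streams.
spatial = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
temporal = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

x = concat_stream_features(spatial, temporal)
print(len(x), len(x[0]))  # time steps, concatenated feature dimension
```

Each row of x is then one time step of the sequence that the temporal models consume.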

Spatial stream:
Using a single RGB image for the spatial stream has been shown to achieve fairly good performance.

Temporal stream:
Stacking 10 optical flow images for the temporal stream is considered standard for two-stream ConvNets.
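The stacking step can be sketched as follows, assuming each optical flow image is stored as a pair of horizontal and vertical components, so that 10 flow images become 20 input channels. The helper `stack_flow` and the tiny 2x2 placeholder maps are hypothetical.

```python
def stack_flow(flow_pairs, start, stack_len=10):
    """Stack `stack_len` consecutive flow images into one list of channels.

    `flow_pairs` is a list of (flow_x, flow_y) tuples, one per frame pair.
    The result interleaves the components: x_1, y_1, x_2, y_2, ...
    """
    window = flow_pairs[start:start + stack_len]
    if start < 0 or len(window) != stack_len:
        raise ValueError("not enough flow images to fill the stack")
    channels = []
    for flow_x, flow_y in window:
        channels.append(flow_x)
        channels.append(flow_y)
    return channels


# Placeholder 2x2 flow maps; a real pipeline would use full-resolution flow.
flow_pairs = [([[i, i], [i, i]], [[-i, -i], [-i, -i]]) for i in range(25)]

stacked = stack_flow(flow_pairs, start=5)
print(len(stacked))  # number of input channels for the temporal stream
```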

Temporal Segment LSTM (TS-LSTM)

We adapt temporal segments for use with RNNs and provide segmental consensus via temporal pooling and LSTM cells.
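A rough sketch of the segment-and-pool step only, assuming the feature matrix is split into equal-sized consecutive segments and that mean pooling is used within each segment. The pooling choice, the segment count, and the function names are assumptions for illustration. The LSTM that would consume the pooled sequence is omitted.

```python
def split_into_segments(features, num_segments):
    """Split a list of per-time-step feature vectors into consecutive segments."""
    total = len(features)
    if not 1 <= num_segments <= total:
        raise ValueError("num_segments must be between 1 and the sequence length")
    bounds = [i * total // num_segments for i in range(num_segments + 1)]
    return [features[bounds[i]:bounds[i + 1]] for i in range(num_segments)]


def mean_pool(segment):
    """Average the feature vectors of one segment, dimension by dimension."""
    dim = len(segment[0])
    return [sum(vec[d] for vec in segment) / len(segment) for d in range(dim)]


def segment_features(features, num_segments):
    """Return one pooled feature vector per temporal segment."""
    return [mean_pool(seg) for seg in split_into_segments(features, num_segments)]


# Six time steps of 2-dimensional toy features, pooled into three segments.
features = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0], [8.0, 8.0], [10.0, 10.0]]
pooled = segment_features(features, num_segments=3)
print(pooled)
```

The pooled vectors form a much shorter sequence than the raw frames, which is what an LSTM would then process step by step.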

Temporal-ConvNet

We adapt the ConvNet architecture on feature matrices x. The overall architecture of the Temporal-ConvNet is composed of multiple Temporal-ConvNet layers (TCLs).
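To illustrate convolving along the time axis at more than one scale, here is a simplified sketch. It assumes fixed, hand-picked kernels applied independently to each feature column; an actual Temporal-ConvNet layer would learn its weights, and the kernel sizes and function names below are hypothetical.

```python
def temporal_conv(x, kernel):
    """Valid 1D cross-correlation along the time axis of feature matrix `x`.

    `x` has one row per time step; the same kernel is applied to every
    feature column independently.
    """
    steps, width = len(x), len(kernel)
    if width > steps:
        raise ValueError("kernel is longer than the sequence")
    dim = len(x[0])
    return [
        [sum(kernel[j] * x[t + j][d] for j in range(width)) for d in range(dim)]
        for t in range(steps - width + 1)
    ]


def multi_scale_temporal_conv(x, kernels):
    """Apply several temporal kernels of different lengths to the same input."""
    return [temporal_conv(x, kernel) for kernel in kernels]


# One feature column that grows linearly over five time steps.
x = [[1.0], [2.0], [3.0], [4.0], [5.0]]

# Difference kernels spanning two and three time steps.
short_scale, long_scale = multi_scale_temporal_conv(x, [[-1, 1], [-1, 0, 1]])
print(short_scale)
print(long_scale)
```

The shorter kernel responds to change between adjacent time steps, while the longer one responds to change over a wider window, which is the intuition behind combining several temporal scales.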
