Review — I3D: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (Video Classification)

Using Inflated GoogLeNet/Inception-V1 for Two-Stream 3D ConvNets, Outperforms Deep Video, Two-Stream ConvNet, TSN, C3D, etc.

Sik-Ho Tsang
Nerd For Tech


Are these actors about to kiss each other, or have they just done so?

In this story, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D), by DeepMind and the University of Oxford, is reviewed. In this paper:

  • Two-Stream Inflated 3D ConvNet (I3D) is designed based on 2D ConvNet inflation: Filters and pooling kernels are expanded into 3D.
  • Seamless spatio-temporal features are learnt while leveraging successful ImageNet architecture designs and even their parameters.

This is a paper in 2017 CVPR with over 3000 citations. (Sik-Ho Tsang @ Medium)


  1. Prior Network Architectures
  2. Techniques for Inflated 3D ConvNet (I3D)
  3. Proposed Two-Stream Inflated 3D ConvNets (I3D)
  4. The Kinetics Human Action Video Dataset
  5. Experimental Results

1. Prior Network Architectures

Prior Network Architectures

1.1. Naïve Approach

  • The naïve approach reuses image classification ConvNets for video: they extract features independently from each frame, and their predictions are then pooled across the whole video.
  • This is in the spirit of bag-of-words image modeling approaches. While convenient in practice, it entirely ignores temporal structure (e.g. models cannot distinguish opening from closing a door). That is the problem in the first figure at the top of the story.
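
As a minimal sketch of why frame-wise pooling loses temporal order, the snippet below averages per-frame class scores for a clip and for the same clip played backwards. The scores are random placeholders and `video_prediction` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def video_prediction(frame_scores):
    """Average per-frame class scores into one video-level prediction.

    frame_scores: array of shape (num_frames, num_classes).
    """
    return np.asarray(frame_scores, dtype=float).mean(axis=0)

# Hypothetical per-frame scores for 4 frames and 2 classes.
rng = np.random.default_rng(0)
frame_scores = rng.random((4, 2))

forward = video_prediction(frame_scores)
backward = video_prediction(frame_scores[::-1])  # same frames, reversed in time

# Mean pooling is permutation-invariant, so both orderings give the same output.
print(np.allclose(forward, backward))  # True
```

Because the pooled prediction is identical for both orderings, a model of this kind has no way to tell "opening" from "closing".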

1.2. (a) ConvNet+LSTM

  • To deal with the above problem, an LSTM layer with batch normalization is added after the last average pooling layer of Inception-V1, with 512 hidden units.
  • A fully connected layer is added on top for the classifier.

1.3. (b) 3D-ConvNet

  • In this paper, a small variation of C3D is implemented, which has 8 convolutional layers, 5 pooling layers and 2 fully connected layers at the top.
  • The inputs to the model are short 16-frame clips with 112 × 112-pixel crops as in the original implementation.
  • Unlike [29], batch normalization is used after all convolutional and fully connected layers.

1.4. (c) (d) Two-Stream Network Using 2D-ConvNet/3D ConvNet

  • Two-Stream ConvNet [25] models short temporal snapshots of videos by averaging the predictions from a single RGB frame and a stack of 10 externally computed optical flow frames, after passing them through two replicas of an ImageNet pre-trained ConvNet.
  • A recent extension [8] fuses the spatial and flow streams after the last network convolutional layer.
  • In this paper, the authors approximate both using Inception-V1 as the backbone.
  • The inputs to the network are 5 consecutive RGB frames sampled 10 frames apart, as well as the corresponding optical flow snippets.
  • The spatial and motion features before the last average pooling layer of Inception-V1 (5 × 7 × 7 feature grids, corresponding to time, x and y dimensions) are passed through a 3×3×3 3D convolutional layer with 512 output channels, followed by a 3 × 3 × 3 3D max-pooling layer and through a final fully connected layer.

2. Techniques for Inflated 3D ConvNet (I3D)

Proposed Inflated 3D ConvNet (I3D)
  • Converting a 2D ConvNet into its 3D counterpart requires a few techniques and raises some issues that need to be addressed.

2.1. Inflating 2D ConvNets into 3D

  • Starting from a 2D architecture, all the filters and pooling kernels are inflated with an additional temporal dimension. Filters are typically square, and they are simply made cubic: N × N filters become N × N × N.

2.2. Bootstrapping 3D filters from 2D Filters

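  • An image can be converted into a (boring) video by copying it repeatedly into a video sequence. The 3D model can then be implicitly pre-trained on ImageNet if its pooled activations on such a boring video are the same as those of the 2D model on the original single image.
  • This can be achieved, thanks to linearity, by repeating the weights of the 2D filters N times along the time dimension, and rescaling them by dividing by N.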
  • By doing so, the outputs of pointwise non-linearity layers and average and max-pooling layers are the same as for the 2D case.
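
The inflation and bootstrapping steps above can be checked numerically. The following is a small NumPy sketch, not the authors' implementation: `inflate_filter`, `correlate2d_valid`, and `correlate3d_valid` are hypothetical helpers, a single-channel filter is assumed, and no padding is used.

```python
import numpy as np

def inflate_filter(w2d, n_time):
    """Inflate a 2D filter (H, W) to 3D (T, H, W): repeat along time, divide by T."""
    return np.repeat(w2d[None, :, :], n_time, axis=0) / n_time

def correlate2d_valid(image, w2d):
    """Plain 'valid' 2D cross-correlation, as computed by a ConvNet layer."""
    kh, kw = w2d.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * w2d)
    return out

def correlate3d_valid(video, w3d):
    """Plain 'valid' 3D cross-correlation over (time, height, width)."""
    kt, kh, kw = w3d.shape
    out_t = video.shape[0] - kt + 1
    out_h = video.shape[1] - kh + 1
    out_w = video.shape[2] - kw + 1
    out = np.empty((out_t, out_h, out_w))
    for t in range(out_t):
        for i in range(out_h):
            for j in range(out_w):
                out[t, i, j] = np.sum(video[t:t + kt, i:i + kh, j:j + kw] * w3d)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))   # placeholder image
w2d = rng.standard_normal((3, 3))     # placeholder "pre-trained" 2D filter

boring_video = np.repeat(image[None], 5, axis=0)  # 5 identical frames
w3d = inflate_filter(w2d, n_time=3)               # 3 x 3 x 3 filter

out2d = correlate2d_valid(image, w2d)
out3d = correlate3d_valid(boring_video, w3d)      # shape (3, 6, 6)

# Every time step of the 3D response equals the 2D response.
print(all(np.allclose(out3d[t], out2d) for t in range(out3d.shape[0])))  # True
```

Since the 3D response is constant in time and equal to the 2D response, pointwise non-linearities and pooling layers applied afterwards also reproduce the 2D outputs.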

2.3. Pacing receptive field growth in space, time and network depth

  • A symmetric receptive field is however not necessarily optimal when also considering time — this should depend on frame rate and image dimensions.
  • If it grows too quickly in time relative to space, it may conflate edges from different objects breaking early feature detection, while if it grows too slowly, it may not capture scene dynamics well.
  • For the model in this paper, the input videos were processed at 25 frames per second; it is found that it is helpful to not perform temporal pooling in the first two max-pooling layers.
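
To get a feel for how skipping early temporal pooling slows receptive field growth in time, here is a rough receptive-field calculator. The layer stack is a hypothetical conv / max-pool / conv / max-pool stem chosen for illustration; its kernel sizes and strides are assumptions, not values taken from this review.

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride) layers along one axis."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # striding spaces later layers further apart
    return rf

# Hypothetical stem: conv, max-pool, conv, max-pool, given as (kernel, stride) per axis.
space = [(7, 2), (3, 2), (3, 1), (3, 2)]
time_symmetric = [(7, 2), (3, 2), (3, 1), (3, 2)]      # pool in time exactly as in space
time_no_early_pool = [(7, 2), (1, 1), (3, 1), (1, 1)]  # first two max-pools skip time

print(receptive_field(space))               # 27 pixels
print(receptive_field(time_symmetric))      # 27 frames
print(receptive_field(time_no_early_pool))  # 11 frames
```

Under these assumed settings, a symmetric design would already span 27 frames (just over a second at 25 frames per second) after the stem, while dropping temporal pooling in the first two max-pooling layers keeps the temporal extent at 11 frames.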

3. Proposed Two-Stream Inflated 3D ConvNets (I3D)

The Inflated Inception-V1 architecture (left) and its detailed inception submodule (right).
  • The above shows the inflated Inception-V1 and its corresponding Inception module.
  • It is found that a two-stream configuration is valuable, with one I3D network trained on RGB inputs and another on flow inputs, which carry optimized, smooth flow information.
  • The two networks are trained separately, and their predictions are averaged at test time.
Number of parameters and temporal input sizes of the models
  • Finally, all the above models are tested.
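
The test-time averaging of the two streams can be sketched as follows. This is only an illustration: the logits are made-up numbers, and averaging softmax probabilities (rather than logits) is an assumption, since the review only says that predictions are averaged.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def two_stream_predict(rgb_logits, flow_logits):
    """Average the class probabilities of the RGB and flow streams."""
    probs = 0.5 * (softmax(rgb_logits) + softmax(flow_logits))
    return probs, int(probs.argmax(axis=-1))

# Hypothetical logits for one clip and three classes.
rgb_logits = np.array([2.0, 1.0, 0.1])   # RGB stream alone prefers class 0
flow_logits = np.array([0.5, 2.5, 0.3])  # flow stream alone prefers class 1

probs, label = two_stream_predict(rgb_logits, flow_logits)
print(label)  # 1
```

In this toy case the confident flow stream overrides the RGB stream, which is the kind of complementary behaviour a two-stream setup relies on.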

4. The Kinetics Human Action Video Dataset

4.1. Kinetics

  • In this dataset, the list of action classes covers: Person Actions (singular), e.g. drawing, drinking, laughing, punching; Person-Person Actions, e.g. hugging, kissing, shaking hands; and, Person-Object Actions, e.g. opening presents, mowing lawn, washing dishes.
  • The dataset has 400 human action classes, with 400 or more clips for each class, each from a unique video. The clips last around 10s, and there are no untrimmed videos.
  • The test set consists of 100 clips for each class. (More details are in [16] for this dataset.)

4.2. MiniKinetics

  • In this paper, a smaller dataset than the full Kinetics, called miniKinetics, is also used.
  • This is an early version of the dataset having only 213 classes with a total of 120k clips across three splits, one for training with 150–1000 clips per class, one for validation with 25 clips per class and one for testing with 75 clips per class.
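
A quick sanity check of the split sizes quoted above, assuming the 120k total is a rounded figure:

```python
# Figures quoted above for miniKinetics (the 120k total is a rounded number).
num_classes = 213
total_clips = 120_000
val_per_class, test_per_class = 25, 75

val_clips = num_classes * val_per_class    # validation clips
test_clips = num_classes * test_per_class  # test clips
train_clips = total_clips - val_clips - test_clips
avg_train_per_class = train_clips / num_classes

print(val_clips, test_clips, train_clips, round(avg_train_per_class))
# 5325 15975 98700 463
```

The resulting average of roughly 463 training clips per class sits inside the stated 150–1000 range.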
  • MiniKinetics enabled faster experimentation, and was available before the full Kinetics dataset.

5. Experimental Results

5.1. Different Model Architectures

Architecture comparison: (left) training and testing on split 1 of UCF-101; (middle) training and testing on split 1 of HMDB-51; (right) training and testing on miniKinetics.
  • The results show that the proposed I3D models perform best on all datasets, with either RGB, flow, or RGB+flow modalities.

5.2. Effects of Pretraining Using MiniKinetics

Performance on the UCF-101 and HMDB-51 test sets (splits 1 of both) for architectures pre-trained on miniKinetics
  • Original: train on UCF-101 / HMDB-51
  • Fixed: features from miniKinetics, with the last layer trained on UCF-101 / HMDB-51.
  • Full-FT: miniKinetics pre-training with end-to-end fine-tuning on UCF-101 / HMDB-51
  • Δ shows the difference in misclassification as percentage between Original and the best of Full-FT and Fixed.
  • The clear outcome is that all architectures benefit from pre-training either by Fixed or Full-FT.

5.3. SOTA Comparison

Comparison with state-of-the-art on the UCF-101 and HMDB-51 datasets, averaged over three splits.
  • Either of the proposed RGB-I3D or Flow-I3D models alone, when pre-trained on Kinetics, outperforms all previously published results by any model or model combination, such as Deep Video (Deep Networks), Two-Stream ConvNet, TSN, and C3D.
  • The proposed combined two-stream architecture widens the advantage over previous models considerably, bringing overall performance to 97.9% on UCF-101 and 80.2% on HMDB-51, which correspond to 57% and 33% misclassification reductions, respectively, compared to the best previous model.
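
A misclassification reduction is usually read as the relative drop in error rate, not the difference in accuracy. The helper below is a hypothetical sketch of that reading, and the accuracies in the example are invented for illustration rather than taken from the paper's tables.

```python
def misclassification_reduction(prev_acc, new_acc):
    """Relative drop in error rate (in %) when accuracy moves from prev_acc to new_acc.

    Both accuracies are given in percent.
    """
    prev_err = 100.0 - prev_acc
    new_err = 100.0 - new_acc
    return 100.0 * (prev_err - new_err) / prev_err

# Hypothetical example: 90% -> 95% accuracy halves the error (10% -> 5%).
print(misclassification_reduction(90.0, 95.0))  # 50.0
```

This is why a gain of a few accuracy points near the top of the scale can translate into a large percentage reduction in misclassification.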


