Transformers for Vision/RepNet

Maharshi Yeluri · Published in The Startup · Jul 23, 2020


Class Agnostic Video Repetition Counting (Repeating Net😵)

Neural networks have proven to be powerful and became an industry-standard tool this decade. Transfer learning is widely adopted in vision, which is a step towards generalization. The bar for neural networks will only go up in the coming decade, and more complex cognitive tasks will be ready to challenge them. This paper is one such task: a team from Google AI and DeepMind tries to solve the problem of counting the repetitions in a cycle (heartbeats, planetary rotations). The following GIF depicts the task more clearly.

Looking at the above results, the applications of this approach seem endless.

RepNet architecture:

The following image is a detailed illustration of the RepNet architecture.

This diagram is very detailed and self-explanatory. Let’s break the architecture down into its four components:

  1. Backbone Encoder
  2. Temporal Self Similarity
  3. Intermediate Multi-head attention
  4. Periodicity estimation

Backbone Encoder

Assume we are given a video V as a sequence of N frames. Each frame is passed through a ResNet-50 to extract 2D convolutional features, and the model processes only 64 frames of a video at once (we will get back to this). The input frames are [112 × 112 × 3] in size and the output 2D feature map is of size [7 × 7 × 1024], which can be treated as 1024 spatial features of size [7 × 7] for each input frame. These convolutional features are then passed through a layer of 3D convolutions to add local temporal information to the per-frame features: 512 filters of size [3 × 3 × 3] with ReLU activation and a dilation rate of 3, where the dilation controls how far apart the frames combined by the 3D convolution are. Finally, the dimensionality of the extracted spatio-temporal features is reduced with global 2D max-pooling over the spatial dimensions to produce one embedding vector of length 512 per frame.
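To make the shapes concrete, here is a minimal PyTorch sketch of such a backbone encoder (not the authors’ implementation; the point at which ResNet-50 is truncated and the use of the dilation only along time are assumptions based on the description above):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class BackboneEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        base = resnet50(weights=None)
        # Keep ResNet-50 up to its third stage so a 112x112 frame maps to a
        # 7x7x1024 feature map, as described above (assumed truncation point).
        self.frame_encoder = nn.Sequential(
            base.conv1, base.bn1, base.relu, base.maxpool,
            base.layer1, base.layer2, base.layer3,
        )
        # 512 filters of size 3x3x3; dilation 3 along time mixes in local
        # temporal context (temporal-only dilation is an assumption here).
        self.temporal_conv = nn.Conv3d(1024, 512, kernel_size=3,
                                       padding=(3, 1, 1), dilation=(3, 1, 1))

    def forward(self, frames):                          # frames: [N, 3, 112, 112]
        feats = self.frame_encoder(frames)              # [N, 1024, 7, 7]
        feats = feats.permute(1, 0, 2, 3).unsqueeze(0)  # [1, 1024, N, 7, 7]
        feats = torch.relu(self.temporal_conv(feats))   # [1, 512, N, 7, 7]
        feats = feats.amax(dim=(3, 4))                  # global 2D max-pool over space
        return feats.squeeze(0).transpose(0, 1)         # [N, 512] per-frame embeddings

embeddings = BackboneEncoder()(torch.randn(64, 3, 112, 112))
print(embeddings.shape)   # torch.Size([64, 512])
```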

Temporal Self Similarity

The temporal self-similarity matrix (TSM) is a simple yet powerful aspect of this architecture: a matrix of similarities between each frame and every other frame. The similarity function is the negative of the squared Euclidean distance, f(a, b) = −||a − b||², followed by a row-wise softmax operation (the minus sign matters: it makes similar frames, i.e. small distances, receive the highest softmax weight).
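As a quick sketch (assuming the 512-dimensional per-frame embeddings from the encoder above), the TSM computation boils down to a few lines:

```python
import torch
import torch.nn.functional as F

def temporal_self_similarity(embeddings):
    # embeddings: [N, 512] per-frame embeddings from the backbone encoder
    sq_dists = torch.cdist(embeddings, embeddings) ** 2  # pairwise ||a - b||^2, [N, N]
    return F.softmax(-sq_dists, dim=1)                   # row-wise softmax -> [N, N] TSM

tsm = temporal_self_similarity(torch.randn(64, 512))
print(tsm.shape)   # torch.Size([64, 64]); each row sums to 1
```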

As the TSM has only one channel (64 × 64), it acts as an information bottleneck in the middle of the network. TSMs also make the model temporally interpretable, which brings further insight into the predictions made by the model. The following are some cool visualizations of TSMs in 3 different cases.

In the TSMs above, the diagonal values are always high because every frame is maximally similar to itself, whereas the off-diagonal patterns look quite different across the cases. Let’s look at what’s happening in each case more closely.

case i: the similarity matrix looks consistent because the periodic motion is constant throughout the frames

case ii: in the first quarter of the matrix the separation between the yellow and blue bands keeps shrinking as the period of the bouncing ball decreases, and the matrix eventually turns fully yellow once the ball barely bounces and all frames become similar.

case iii: periodic motion preceded and succeeded by no motion (waiting to mix concrete, mixing concrete, stopped mixing)

Intermediate Multi-head attention

A fairly expressive model is needed to predict the period and periodicity from such diverse self-similarity matrices. The TSM is first passed through 2D convolutional filters of size 3 × 3, followed by a transformer layer that uses multi-headed attention with trainable positional embeddings in the form of a length-64 variable learned during training. The transformer uses 4 heads with 512 dimensions, each head being 128 dimensions in size. This allows the network to learn higher-order correlations from the TSM, such as how much each frame contributes to the periodicity and how fast or slowly the period is changing over the frames.
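A rough PyTorch sketch of this part follows; the number of 2D conv filters (32) and the way each frame’s row of the processed TSM is projected to 512 dimensions are assumptions for illustration, not the authors’ exact design:

```python
import torch
import torch.nn as nn

class PeriodPredictor(nn.Module):
    def __init__(self, num_frames=64, dim=512, conv_channels=32):
        super().__init__()
        self.conv = nn.Conv2d(1, conv_channels, kernel_size=3, padding=1)     # 3x3 filters on the TSM
        self.project = nn.Linear(conv_channels * num_frames, dim)             # per-frame row -> 512 dims
        self.pos_embed = nn.Parameter(torch.zeros(num_frames, 1, dim))        # trainable positional embedding
        self.transformer = nn.TransformerEncoderLayer(d_model=dim, nhead=4)   # 4 heads x 128 dims each

    def forward(self, tsm):                          # tsm: [N, N], one channel
        x = torch.relu(self.conv(tsm[None, None]))   # [1, C, N, N]
        x = x.permute(2, 0, 1, 3).flatten(2)         # [N, 1, C*N]: one row per frame
        x = self.project(x) + self.pos_embed         # [N, 1, 512]
        return self.transformer(x).squeeze(1)        # [N, 512] per-frame features

features = PeriodPredictor()(torch.rand(64, 64))
print(features.shape)   # torch.Size([64, 512])
```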

Periodicity estimation

The final and most important piece of the puzzle is the prediction; we focus on predicting two things:

(i) Repetition counting: identifying the number of repetitions in the video. This is done by estimating per-frame period lengths and then converting them into a repetition count. The period length estimator outputs a per-frame period length, where the classes are the discrete period lengths L = {2, 3, …, 32} (why is there no 1 among the classes?), with L ≤ N/2 where N is the number of input frames. A multi-class classification objective (softmax cross-entropy) is used to optimize the model.

(ii) Periodicity detection: identifying whether the current frame is part of a repeating temporal pattern or not. This is approached as a per-frame binary classification problem, optimized with binary cross-entropy. A visual explanation of both tasks is in the diagram below. Periodicity detection lets us drop the repetition count for frames that are not part of the periodic motion (the first and last few frames in the diagram are not part of the periodic motion).

The transformer layer gives an output of size [64 × 512], i.e. one 512-dimensional vector per frame. Each per-frame vector is then passed into two different fully connected heads: the period length classifier, which outputs one score per candidate period length for each frame, and the periodicity classifier, which outputs a single value per frame ([64 × 1]).
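Under that reading, the two output heads are just small per-frame classifiers on top of the transformer features; the sketch below assumes 31 period-length classes for L = {2, …, 32}:

```python
import torch.nn as nn

class PredictionHeads(nn.Module):
    def __init__(self, dim=512, num_period_classes=31):
        super().__init__()
        self.period_length = nn.Linear(dim, num_period_classes)  # softmax cross-entropy over L = {2..32}
        self.periodicity = nn.Linear(dim, 1)                      # per-frame binary logit (BCE)

    def forward(self, features):        # features: [N, 512] from the transformer
        return self.period_length(features), self.periodicity(features)
```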

Inference

We sample consecutive non-overlapping windows of N frames and provide them as input to RepNet, which outputs a per-frame periodicity pᵢ and period length lᵢ. The per-frame count is defined as pᵢ/lᵢ, where pᵢ ∈ {0, 1}, and the overall repetition count is the sum of the per-frame counts, Σᵢ pᵢ/lᵢ for i = 1, …, N. For instance, assume all 64 frames are involved in a periodic motion, so pᵢ = [1]×64, and the period length is 2 throughout, lᵢ = [2]×64; then the number of repetitions is the sum of [1/2]×64, which equals 32. The same example also explains why class 1 is excluded from the period-length labels: in the extreme case where the period length is 1 throughout the frames, lᵢ = [1]×64, the number of repetitions in a 64-frame video would come out as 64😂, which doesn’t make sense. That’s the reason we don’t want a single frame to be counted as a repetition.
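In code, the counting rule is just a sum of per-frame counts; here is a small, plain-Python illustration of the worked example above:

```python
def repetition_count(periodicity, period_lengths):
    # periodicity: per-frame p_i in {0, 1}; period_lengths: per-frame l_i
    return sum(p / l for p, l in zip(periodicity, period_lengths))

# All 64 frames periodic with period length 2 -> 64 * (1/2) = 32 repetitions
print(repetition_count([1] * 64, [2] * 64))   # 32.0

# If period length 1 were allowed, the same video would "count" 64 repetitions,
# which is why class 1 is excluded from the labels.
print(repetition_count([1] * 64, [1] * 64))   # 64.0
```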

Multi-speed evaluation

As the model can only predict period lengths up to 32, we sample the input video at different frame rates to cover much longer period lengths (i.e. we play the video at 1×, 2×, 3×, and 4× speed) and choose the frame rate whose predicted period has the highest score. The following is an example of this.

The quasi-sinusoidal graph is a 1D PCA projection of the encoder features over time
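Conceptually, multi-speed evaluation boils down to a few lines; the sketch below assumes a hypothetical helper run_repnet(frames) that returns the predicted count together with the confidence score of the predicted period:

```python
def multi_speed_count(frames, run_repnet, strides=(1, 2, 3, 4)):
    # Run the model on the video sampled at several strides (playback speeds)
    # and keep the count from the stride whose predicted period scores highest.
    results = [run_repnet(frames[::stride]) for stride in strides]
    best_count, _best_score = max(results, key=lambda r: r[1])
    return best_count
```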

What about repetitions with a longer temporal context? Many repeating phenomena occur over a longer temporal scale (on the order of days or years). Even though the model has been trained on short videos (∼10 s), it can still work on videos of slow periodic events by automatically choosing a higher input frame stride, e.g. recovering the period length of a day from videos of the Earth captured by satellites.

One model, many domains and applications

A single model is capable of performing these tasks on videos from many diverse domains (animal movement, physics experiments, humans manipulating objects, people exercising, a child swinging) in a class-agnostic manner.

Synthetic data generation

The authors also used the training set of the Kinetics dataset without any labels: they sample a clip C of random length P frames from a video V and repeat it K times (where K > 1) to simulate videos with repetitions, randomly concatenating the reversed clip before repeating to simulate actions whose motion is reversed within the period (like jumping jacks). Then they pre-pend and append the repeating frames with non-repeating segments from V taken just before and after C, respectively. In addition, they perform camera motion augmentation, a kind of affine transformation of the frames, for more robust synthetic data generation.
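A rough sketch of the synthetic video construction follows; the period range, the number of repeats, and the reversal probability are illustrative assumptions, and the camera motion augmentation is omitted:

```python
import random

def make_synthetic_repetition_video(frames, min_period=2, max_period=32, max_reps=5):
    # frames: list of frames from an unlabeled Kinetics video V
    # (assumed to be longer than max_period)
    p = random.randint(min_period, max_period)     # random clip length P
    start = random.randint(0, len(frames) - p)
    clip = frames[start:start + p]                 # clip C
    if random.random() < 0.5:                      # simulate reversed-motion actions (e.g. jumping jacks)
        clip = clip + clip[::-1]
    k = random.randint(2, max_reps)                # repeat K > 1 times
    repeated = clip * k
    prefix = frames[:start]                        # non-repeating segment just before C
    suffix = frames[start + p:]                    # non-repeating segment just after C
    video = prefix + repeated + suffix
    return video, len(prefix), len(repeated)       # periodic segment location, usable as labels
```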

Conclusion

This model successfully detects periodicity and predicts counts over a diverse set of actors (objects, humans, animals, the Earth) and sensors (standard camera, ultrasound, laser microscope). The practical applications of this study are manifold.

References

The first author of this paper published a blog post with many examples and visualizations; I would highly encourage you to check out the following URL:

https://sites.google.com/view/repnet
