Anomaly Detection Workshop

Temporal Cycle-Consistency Learning

Reviewed by Ramya Balasubramaniam, Yenson Lau, and Alec Robinson

Yenson Lau
Aggregate Intellect

--

This paper review is work from the participants of the Anomaly Detection workshop organized by Aggregate Intellect, and covers some advanced concepts pursued by students during the workshop.

Paper referenced:

Temporal Cycle-Consistency Learning by Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

Introduction

Temporal cycle consistency (TCC) learning is a self-supervised method that aligns videos and general sequential data by learning an embedding to capture correspondences across videos of the same action.

Compared to existing methods, TCC learning does not require explicit alignment information between sequences (which are hard to acquire), and can handle significant variations within an action category. The learned embeddings also appear to be useful for fine-grained temporal understanding of videos and action parsing, suggesting that rich and useful representations can be learned simply by looking for correspondences in sequential data.

Features and Applications

Although cycle consistency is conventionally used to find spatial correspondences for image matching and co-segmentation tasks, this work switches the emphasis to temporal consistency and presents a differentiable cycle consistency loss for self-supervised learning.

This allows the TCC model to optimise over pairs of videos to encourage cycle-consistent embeddings for sequences of the same action. Compared to previous video alignment or action parsing techniques, this is a simpler approach that removes the need for manual labelling and other kinds of synchronisation information, which are usually difficult to acquire.

The embeddings obtained by this method are rich enough to be useful for a variety of tasks. One clear application that TCC allows is to synchronise the pace at which an action is played back across multiple videos. This alignment also enables cross-modal transfer; for example, the sound made when pouring a glass of water can be transferred from one video to another solely on the basis of its visual representation. Since the embedding is so effective at isolating the action from the training data, learned TCC representations also allow for fine-grained retrieval in videos, such as distinguishing frames just before and after a baseball pitch based solely on the positioning of the pitcher’s legs.

Finally, TCC learning can also be used to generate features for anomaly detection. When TCC is used to learn embeddings of video sequences of some typical behaviour (e.g. videos of bench presses), embeddings of new sequences will deviate from typical trajectories whenever anomalous activities occur.
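To make this concrete, here is a minimal sketch of how such anomaly scores could be computed once frame embeddings are available. The arrays below are random stand-ins for learned TCC embeddings, and the 95th-percentile threshold is purely illustrative.

```python
import numpy as np

def anomaly_scores(train_emb, new_emb):
    """Distance from each new-frame embedding to its nearest training embedding."""
    # Pairwise squared Euclidean distances, shape (num_new, num_train).
    d2 = ((new_emb[:, None, :] - train_emb[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1))

# Toy usage: random vectors stand in for embeddings of typical behaviour
# (e.g. bench presses) and of a new sequence to be scored.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(500, 128))
new_emb = rng.normal(size=(100, 128))

scores = anomaly_scores(train_emb, new_emb)
# Flag frames whose nearest-neighbour distance is unusually large (illustrative threshold).
anomalous_frames = np.where(scores > np.percentile(scores, 95))[0]
```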

Figure 7 from paper. When an activity deviates from the typical activity seen in the training set, the distance to the nearest training point in the embedding space increases.

Cycle Consistent Representation Learning

Suppose we are given two video sequences S={s₁, s₂, ⋯ ,s_​N} and
T={t₁​, t₂​, ⋯ ,t_M​} with lengths N and M, respectively. Their embeddings are computed as U={u₁​, u₂​, ⋯, u_N​} and V={v₁​, v₂​, ⋯ ,v_M​} s.t. uᵢ​=ϕ(sᵢ​; θ) and
vᵢ=ϕ(tᵢ; θ), where ϕ is the neural network encoder parameterised by θ. The goal is to learn an embedding ϕ that maximises the number of cycle consistent points for any pair of sequences S, T from the data:

Cycle consistency. Given a point uᵢ ∈ U, let vⱼ = arg min{‖uᵢ − v‖ : v ∈ V} be its nearest neighbour in V, and let uₖ = arg min{‖vⱼ − u‖ : u ∈ U} be the nearest neighbour of vⱼ back in U. We call uₖ the cycle-back point of uᵢ, and say that uᵢ is cycle consistent iff uᵢ = uₖ.

Figure 2 from paper. Illustration of cycle consistency across frames in embedding space, between two corresponding videos capturing the same activity.

In other words, uᵢ​ is cycle consistent if taking the nearest neighbour to uᵢ​ from V, then finding the nearest neighbour back to U, returns the same point uᵢ​.

By maximising the cycle consistency of sequences describing a specific action, our embedding space captures the common structure across the videos in our dataset (i.e. the action itself) despite the many confounding factors that differ between videos, such as camera angle and location.
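The hard (non-differentiable) version of this check is easy to write down. The sketch below uses nearest neighbours under Euclidean distance on hypothetical embedding arrays U and V and reports the fraction of cycle-consistent points.

```python
import numpy as np

def cycle_consistent_mask(U, V):
    """For each u_i in U: find its nearest neighbour v_j in V, then the nearest
    neighbour u_k of v_j back in U; u_i is cycle consistent iff k == i."""
    d_uv = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)     # (N, M) squared distances
    j = d_uv.argmin(axis=1)                                   # index of v_j for each u_i
    d_vu = ((V[j][:, None, :] - U[None, :, :]) ** 2).sum(-1)  # (N, N)
    k = d_vu.argmin(axis=1)                                   # cycle-back point u_k
    return k == np.arange(len(U))

# Toy usage with random embeddings of two sequences of lengths N=40 and M=55.
rng = np.random.default_rng(0)
U, V = rng.normal(size=(40, 128)), rng.normal(size=(55, 128))
print(cycle_consistent_mask(U, V).mean())   # fraction of cycle-consistent points
```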

Differentiable Cycle-back

To perform optimisation, this work presents a differentiable cycle-back procedure, which starts from uᵢ by computing the softmax values

αᵢⱼ = exp(−‖uᵢ − vⱼ‖²) / Σₖ exp(−‖uᵢ − vₖ‖²), for j = 1, ⋯, M,

so that αᵢ ≐ (αᵢ₁, ⋯, αᵢ_M) is a similarity distribution on V. Correspondingly, the soft nearest neighbour of uᵢ in V is given by

ṽᵢ = Σⱼ αᵢⱼ vⱼ.

We cycle back to U by computing

βᵢₖ = exp(−‖ṽᵢ − uₖ‖²) / Σₘ exp(−‖ṽᵢ − uₘ‖²), for k = 1, ⋯, N,

so that βᵢ ≐ (βᵢ₁, ⋯, βᵢ_N) is a similarity distribution on U.

Then ϕ is updated (through the embedded points U and V) to minimise two loss functions:

First, the cycle-back classification (CBC) loss is given by the cross entropy between βᵢ and the ground-truth distribution δᵢ, where the correct entry i has value 1 and all other entries are zero:

L_cbc(i) = −Σₖ δᵢₖ log βᵢₖ = −log βᵢᵢ.

The CBC loss measures how well βᵢ identifies the correct cycle-back point, namely uᵢ itself.

Next, the cycle-back regression (CBR) loss encourages βᵢ to concentrate around the i-th entry by penalising both its mean location μᵢ and its variance σᵢ²:

L_cbr(i) = |i − μᵢ|² / σᵢ² + λ log σᵢ.

Here,

μᵢ = Σₖ k · βᵢₖ  and  σᵢ² = Σₖ βᵢₖ (k − μᵢ)²,

and λ is a trade-off parameter between the location and variance penalties.
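As a rough PyTorch sketch (not the authors’ implementation), the differentiable cycle-back and both losses for a single frame i might look as follows; in practice the computation is vectorised over all frames, and the value of λ and the small stabiliser inside the log are illustrative.

```python
import torch

def cycle_back_losses(U, V, i, lam=0.001):
    """Soft cycle-back from frame i of U through V and back, returning the
    cycle-back classification (CBC) and regression (CBR) losses."""
    # Soft nearest neighbour of u_i in V: softmax over negative squared distances.
    alpha = torch.softmax(-((U[i] - V) ** 2).sum(-1), dim=0)    # (M,)
    v_tilde = (alpha[:, None] * V).sum(0)                       # soft neighbour in span(V)
    # Cycle back to U: similarity distribution beta over the frames of U.
    beta = torch.softmax(-((v_tilde - U) ** 2).sum(-1), dim=0)  # (N,)

    # CBC: cross entropy against the one-hot target at index i.
    cbc = -torch.log(beta[i] + 1e-12)

    # CBR: penalise the mean location and variance of beta around index i.
    idx = torch.arange(len(U), dtype=U.dtype)
    mu = (beta * idx).sum()
    var = (beta * (idx - mu) ** 2).sum()
    cbr = (i - mu) ** 2 / var + lam * torch.log(var.sqrt())
    return cbc, cbr

# Toy usage: in the real model U and V come from the encoder phi; here they are random.
U = torch.randn(40, 128, requires_grad=True)
V = torch.randn(55, 128, requires_grad=True)
cbc, cbr = cycle_back_losses(U, V, i=10)
(cbc + cbr).backward()   # gradients flow back through U and V, and hence to phi
```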

Implementation Details

Training procedure. Sequence pairs are randomly sampled from the dataset. For each pair, a random frame i is picked and a stochastic gradient step is taken on a linear combination of the cycle-back losses above; frames continue to be sampled in this way until convergence.

Encoding network. All frames in a given video sequence are resized to 224×224. Image features are then extracted from each frame using either the Conv4c layer of an ImageNet-pretrained ResNet-50, or a smaller model trained from scratch, such as VGG-M. The resulting convolutional features have size 14×14×c, where c is 1024 for ResNet-50 and 512 for VGG-M.
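For the pretrained option, per-frame features of this shape can be pulled from torchvision’s ResNet-50 with a forward hook. The sketch below taps the third bottleneck block of layer3, which is an assumption about which layer corresponds to Conv4c, and the weights enum requires a reasonably recent torchvision.

```python
import torch
from torchvision import models

# ImageNet-pretrained ResNet-50; capture intermediate features with a forward hook.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
features = {}
# Assumption: layer3[2] stands in for the "Conv4c" features described above.
resnet.layer3[2].register_forward_hook(lambda mod, inp, out: features.update(feat=out))

frames = torch.randn(8, 3, 224, 224)   # a batch of resized video frames
with torch.no_grad():
    resnet(frames)
print(features["feat"].shape)          # torch.Size([8, 1024, 14, 14])
```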

The image features of each frame are stacked with those of k−1 other context frames, and 3D convolutions are applied to aggregate temporal information. This is followed by 3D max pooling, two fully-connected layers with 512 units each, and a linear projection layer to produce the final 128-dimensional embedding; details are shown in the table below.
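The embedding head could be sketched roughly as follows; the number of 3D convolutions, their kernel sizes, and the context length k used in the toy call are assumptions rather than the exact configuration from Table 1.

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Rough sketch of the embedder applied on top of stacked per-frame conv
    features of shape (c, k, 14, 14), with k the number of context frames."""
    def __init__(self, c=1024, emb_dim=128):
        super().__init__()
        # 3D convolutions aggregate information across the k context frames.
        self.conv3d = nn.Sequential(
            nn.Conv3d(c, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveMaxPool3d(1)      # global 3D max pooling
        self.fc = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.project = nn.Linear(512, emb_dim)   # final 128-d embedding

    def forward(self, feats):                    # feats: (batch, c, k, 14, 14)
        x = self.pool(self.conv3d(feats)).flatten(1)
        return self.project(self.fc(x))

# One embedding per stack of k=4 context-frame features (toy input).
head = EmbeddingHead()
print(head(torch.randn(2, 1024, 4, 14, 14)).shape)   # torch.Size([2, 128])
```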

Table 1 from paper. Architecture of the embedding network.

Evaluation

Three measures are used to evaluate the model’s fine-grained understanding of a given action. TCC networks are first trained and then frozen; SVM classification and linear regression are performed on the features from the networks, with no additional fine-tuning. Higher scores indicate better performance on each measure:

Phase classification accuracy: Whether the TCC features allow the correct phase to be identified by training an SVM classifier on phase labels of the training data.

Phase progression: How well the “progression” of a process or action is captured by the embedding. Progression is measured as the fraction of time-stamps passed at the current frame between key events. Linear regression is applied to predict progression values from TCC features, and performance is reported as the average R-squared measure,

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²,

where yᵢ is the ground-truth progress value, ȳ is the average of the yᵢ, and ŷᵢ is the prediction. The maximum value of this measure is 1.
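A minimal sketch of this evaluation, with random arrays standing in for frozen TCC features and ground-truth progress values (in practice the regressor is fit on the training split and scored on held-out videos):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def r_squared(y_true, y_pred):
    """R^2 = 1 - sum((y - y_hat)^2) / sum((y - y_bar)^2)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical frozen TCC features and per-frame progress values in [0, 1].
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128))
progress = rng.uniform(size=1000)

reg = LinearRegression().fit(features, progress)
print(r_squared(progress, reg.predict(features)))
```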

Kendall’s Tau: A statistical measure of how well-aligned two sequences are in time, which requires no additional labels for evaluation. For a pair of videos, each pair of embeddings (uᵢ, uⱼ) from the first video is used to retrieve the nearest embeddings (vₚ, v_q) in the second video. The quadruplet of frame indices (i, j, p, q) is said to be concordant if i < j and p < q, or i > j and p > q; otherwise it is discordant. Kendall’s Tau is then calculated over all embedding pairs from the first video,

τ = (no. of concordant pairs − no. of discordant pairs) / (n(n−1)/2).

Here n denotes the length of the first sequence, so the denominator is the total number of pairs in the first video. A value of 1 implies that the videos are perfectly aligned, while a value of −1 implies that they are aligned in reverse.

When τ is averaged over all pairs of videos in the validation set, it serves as a measure of how well TCC learning generalises to aligning new sequences. It assumes, however, that there are no repetitive frames in a video, and may not accurately reflect the desired performance when the videos involve slow or repetitive motion. Neither of these issues is present in the datasets used for this work.
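A straightforward (if unvectorised) sketch of this retrieval-based Kendall’s Tau for one pair of embedded videos:

```python
import numpy as np

def kendalls_tau(U, V):
    """Retrieve the nearest frame in V for every frame embedding in U, then
    count concordant vs. discordant index pairs as described above."""
    d = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    nn_idx = d.argmin(axis=1)          # retrieved frame index in the second video
    n = len(U)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if nn_idx[i] < nn_idx[j]:
                concordant += 1
            elif nn_idx[i] > nn_idx[j]:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Two toy "videos" tracing the same monotone trajectory at different lengths
# are perfectly aligned, so tau comes out as 1.0.
U = np.repeat(np.linspace(0, 1, 30)[:, None], 128, axis=1)
V = np.repeat(np.linspace(0, 1, 45)[:, None], 128, axis=1)
print(kendalls_tau(U, V))
```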

TCC was compared with existing self-supervised video representation learning methods:

  1. Shuffle and Learn (SaL): Triplets of frames are randomly shuffled and a small classifier is trained to predict if the frames were in order or shuffled. The labels for training this classifier are derived from the indices of the triplet sampled. This loss encourages representations that encode information about the order in which an action should realistically be performed.
  2. Time-Contrastive Networks (TCN): n frames are sampled from the sequence and used as anchors (in the sense of metric learning). For each anchor, positives are sampled within a fixed time window, resulting in n pairs of anchors and positives. The n-pairs loss considers all other possible pairs as negatives. This loss encourages representations to be disentangled in time while still adhering to metric constraints.
  3. Combined Losses: The cycle-consistency loss can also be combined with the SaL and TCN losses to obtain additional training methods. The authors learn embeddings using γ⋅TCC+(1−γ)⋅SaL and γ⋅TCC+(1−γ)⋅TCN, where γ is chosen by searching over the set {0.25, 0.5, 0.75}. The video encoder architecture remains the same.

Ablation of Different Cycle Consistency Losses

The phase classification, phase progression, and Kendall’s Tau metrics were measured on the Pouring dataset using TCC features trained exclusively on one of the cycle-back classification (CBC), cycle-back regression (CBR), or mean squared error (MSE) losses. The MSE loss is equivalent to the CBR loss without variance penalisation, i.e. λ=0. Based on the results below, the CBR loss with λ=0.001 outperforms all other losses and is used for the remaining experiments.

Table 3 from paper.

Action Phase Classification

Self-supervised learning with trained vs. fine-tuned encoding networks. The authors compared phase classification results when TCC features were learned on either a) a smaller VGG-M encoder network trained from scratch, or b) a fine-tuned, pretrained ResNet-50 network.

Using TCC features for phase classification appears to be generally superior to a supervised classification approach, regardless of the choice of encoder network. This is expected since few labels are available in the training data. With the VGG encoder, TCC features provided the best phase classification results on every dataset. This might be attributed to the fact that TCC learns features across multiple videos during training, while the SaL and TCN losses operate on frames from a single video only.

Because labelled training data is scarce, features trained on the fine-tuned, pretrained ResNet-50 encoder generally achieve higher performance than those from the VGG-M encoder trained from scratch, and are used for all remaining experiments. In this setting, SaL, TCN, and TCC all yield competitive features for phase classification. On the Pouring dataset, TCC features performed best, whereas TCC + TCN performed best on the Penn Action dataset; the authors speculate that the combination of losses may have reduced overfitting in the latter case.

Table 5 from paper.

Self-supervised Few Shot Learning. In this setting, many training videos are available but per-frame labels are only available for a few of them — in this case, each labelled video contains hundreds of labelled frames. Self-supervised features are learned on the entire training set using a fine-tuned ResNet-50 encoder. These are compared against features extracted using supervised learning on videos for which labels are available. The goal is to see how phase classification accuracy increases with respect to the number of labelled videos. Based on the results on the Golf Swing and Tennis Serve videos from the Penn Action dataset, TCC + TCN features are able to extract enough information to outperform supervised learning approaches until roughly half the dataset is labelled. This suggests that there is a lot of untapped signal present in the raw videos which can be harvested using self-supervision techniques.

Phase Progression and Kendall’s Tau

The remaining tasks measure the effectiveness of the representations at a more fine-grained level than phase classification. In general, features obtained using TCC alone, or in combination with another self-supervised loss, lead to the best performance on both tasks. Moreover, significantly higher values of Kendall’s Tau are obtained with features learned using the combined TCC + TCN loss.

Table 6 from paper.

Conclusion

The TCC method is able to learn features useful for temporally fine-grained tasks. In multiple experiments, TCC, being a self-supervised method, enjoyed significant performance boosts when labelled data was scarce: with only one labelled video, TCC achieved performance similar to existing supervised models trained with roughly 50 labelled videos. Given its numerous possible applications, TCC can serve as a general-purpose temporal alignment method that works without any labels, benefiting any task that relies on alignment.
