Paper Review: Learning Temporal Video Procedure Segmentation from an Automatically Collected Large Dataset

Yunusemre Özköse
Multi-Modal Understanding
4 min read · Jul 4, 2022

Introduction

In a video, there might be several actions like cooking, walking, swimming, eating, etc. These actions often follow one another naturally, e.g. eating after cooking. In that case, the video contains several segments, for example: 1) preparing tomatoes, 2) cutting potatoes, 3) mixing them, 4) cooking, 5) eating. The video segmentation task is defined as segmenting these actions in the given video. The authors define two sub-tasks of video segmentation: Video Action Segmentation (VAS) and Video Procedure Segmentation (VPS).

The action segmentation task requires a predefined set of action names and separates a video into segments with predicted action labels. Procedure segmentation, on the other hand, doesn’t need labels and is only in charge of predicting segment boundaries. In this paper, the authors introduce a procedural video dataset and propose a procedure segmentation method.

TIPS Dataset

There are 4 key points of the proposed dataset:

  • Scale: It is the largest procedural video dataset to date.
  • Diversity: It covers multiple domains like cooking, gardening, etc.
  • Contiguity: YouCook2 segments may cover only parts of a video, with gaps between consecutive segments. The proposed dataset guarantees that segments directly follow one another.
  • Auto-generated: The dataset is collected automatically via a pipeline.

The fact that the dataset is generated automatically with a pipeline is very appealing: we can extend the dataset for a specific domain such as cooking. I know this example runs contrary to the second point, but if a researcher focuses on a single domain, being able to extend this dataset automatically is very useful.

Workflow of collecting data automatically:

  1. Collect instructional videos: search for video titles containing keywords like “How to” or “ways to”, which mark instructional content.
  2. Select well-organized videos: keep videos whose speech transcript explicitly mentions keywords like “step 1”, “step two”, and so on (a rough sketch of this filtering is shown after the list).
  3. Construct segment labels: after selecting videos with the first two steps, extract annotations from the metadata. YouTube videos already provide timestamps of segments, and the authors map steps to these segments. Hence, labels are extracted automatically.
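
Below is a minimal, hypothetical sketch of how such keyword filtering could look. The title keywords, the regular expression, and the helper names are my own assumptions, not the authors’ actual pipeline code.

```python
import re

# Assumed title keywords for selecting instructional videos (step 1).
TITLE_KEYWORDS = ("how to", "ways to")

# Matches explicit step markers such as "step 1" or "step two" (step 2).
STEP_PATTERN = re.compile(
    r"\bstep\s+(\d+|one|two|three|four|five|six|seven|eight|nine|ten)\b",
    re.IGNORECASE,
)

def is_instructional(title: str) -> bool:
    """Step 1: keep videos whose title contains an instructional keyword."""
    return any(keyword in title.lower() for keyword in TITLE_KEYWORDS)

def is_well_organized(transcript: str, min_steps: int = 2) -> bool:
    """Step 2: keep videos whose transcript explicitly enumerates steps."""
    return len(STEP_PATTERN.findall(transcript)) >= min_steps

# Toy usage with made-up examples.
if __name__ == "__main__":
    print(is_instructional("How to grow tomatoes at home"))           # True
    print(is_well_organized("step 1: dig a hole. step two: plant."))  # True
```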

The number of samples for the top 20 categories is shown below.

Multi-modal Transformer with Gaussian Boundary Detection (MT-GBD)

The proposed model, MT-GBD, aims to separate each procedural step in the given video. A pre-trained SlowFast ResNeXt-101 is used to extract visual features, where video ∈ R(Mxwxhxc) and V(f) = features ∈ R(Mxd'). The semantic visual feature is then obtained by linearly projecting V(f) to dimension d and adding a positional embedding: V = V(f)W(v) + E(p), where W(v) ∈ R(d'xd).

E(p) is a positional embedding added to the visual features to encode position information. Text/transcription features are then extracted with a BERT model and projected to the same feature dimension d.
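
As a rough illustration of this feature preparation step (not the authors’ code; the module and variable names are my own), a PyTorch-style sketch could look like this:

```python
import torch
import torch.nn as nn

class FeatureProjector(nn.Module):
    """Projects pre-extracted features to dimension d and adds positional embeddings.

    Hypothetical sketch: d_in would be the SlowFast/BERT feature size,
    max_len the maximum video (M) or transcript (N) length.
    """

    def __init__(self, d_in: int, d: int, max_len: int):
        super().__init__()
        self.proj = nn.Linear(d_in, d)           # W(v) or W(l)
        self.pos_emb = nn.Embedding(max_len, d)  # E(p)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, length, d_in) pre-extracted visual or textual features
        positions = torch.arange(feats.size(1), device=feats.device)
        return self.proj(feats) + self.pos_emb(positions)

# Toy usage: M=64 frames with assumed d'=2304 SlowFast features, projected to d=768.
video_proj = FeatureProjector(d_in=2304, d=768, max_len=512)
V = video_proj(torch.randn(2, 64, 2304))  # -> (2, 64, 768)
```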

In the end, V ∈ R(Mxd) and L ∈ R(Nxd), where M is the maximum video length, N is the maximum transcription length, and d is the feature dimension. Both visual and textual features are then passed through self-attention layers, and a multi-head attention module fuses the visual and textual streams. Convolution layers are then applied to the video outputs of the self-attention layers. Finally, these outputs are mapped to a final layer in R(Mx2), with two output types per frame: boundary and non-boundary. Gaussian boundary detection is used during training, and non-maximum suppression is applied at test time. The overall architecture is shown below.
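
To make the Gaussian boundary detection and NMS idea more concrete, here is a small, self-contained sketch. This is my own simplification under assumed sigma/threshold/window values, not the paper’s implementation.

```python
import numpy as np

def gaussian_boundary_targets(num_frames: int, boundaries: list[int], sigma: float = 2.0) -> np.ndarray:
    """Soft training targets: each ground-truth boundary spreads a Gaussian
    over neighboring frames instead of a single hard 0/1 label."""
    frames = np.arange(num_frames, dtype=np.float32)
    target = np.zeros(num_frames, dtype=np.float32)
    for b in boundaries:
        target = np.maximum(target, np.exp(-((frames - b) ** 2) / (2 * sigma ** 2)))
    return target

def nms_boundaries(scores: np.ndarray, threshold: float = 0.5, window: int = 5) -> list[int]:
    """Test time: keep frames whose boundary score is a local maximum
    within +/- window frames and above the threshold."""
    kept = []
    for t, s in enumerate(scores):
        lo, hi = max(0, t - window), min(len(scores), t + window + 1)
        if s >= threshold and s == scores[lo:hi].max():
            kept.append(t)
    return kept

# Toy usage with hypothetical numbers: a 100-frame video with boundaries at frames 20 and 60.
targets = gaussian_boundary_targets(100, [20, 60])
predicted = nms_boundaries(targets)  # -> [20, 60]
```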

Note

I couldn’t find the website of the dataset; the paper was published recently, and I think the authors will make it publicly available soon.

[1] Ji, Lei, et al. “Learning Temporal Video Procedure Segmentation from an Automatically Collected Large Dataset.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022.
