Paper Review: Towards Automatic Learning of Procedures from Web Instructional Videos

Yunusemre Özköse
Multi-Modal Understanding
3 min read · Jul 2, 2022

In this article, I will review Towards Automatic Learning of Procedures from Web Instructional Videos.

Zhou et al. [1] introduce YouCook2, a recipe dataset that pairs cooking videos with the corresponding recipe information. Since video data is more informative than image data, I think it is also more complicated to handle, and it should be exploited as much as possible.

Procedural videos consist of many steps that build toward a result. For example, a recipe video walks through the steps of making a dish. The authors call these steps segments. A segment may contain more than one action, but it should be describable with a single sentence.

Their main contributions are threefold: they focus on the procedure segmentation task, introduce a new dataset (YouCook2), and propose ProcNets, a model that proposes video segments from a given video. The dataset contains 2,000 videos covering 89 recipes, totaling 176 hours. The proposed model requires neither prior knowledge of the number of segments nor the availability of subtitles.

Procedure segmentation is very similar to event proposal, but event proposal aims to extract uncategorized temporal segments from a given video, so recall is the more important metric there. Procedure segmentation, in contrast, aims to identify segments that belong to a well-defined set of procedure steps.

Automatic procedure segmentation method

There are 3 main components of the proposed model:

  1. Context-Aware Video Embedding: Video frames are fed to a ResNet, which extracts an Lx512 feature matrix, where L is the number of sampled frames (set to 500). A bi-directional LSTM is then used to obtain context encodings. After that, the ResNet features and the forward and backward outputs of the LSTM are concatenated, and a linear layer reduces the dimension back to Lx512. The resulting Lx512 matrix is called the frame-wise context-aware features (see the embedding sketch after this list).
  2. Procedure Segment Proposal: They borrow the anchoring idea from Faster R-CNN. k anchors are applied to the context-aware features, so from the Lx512 features we obtain a kxLx3 tensor of segment proposals: each proposal has a proposal score plus center and length offsets. A sigmoid is applied to the proposal score and a tanh to the offsets. During training, binary cross-entropy is used on the proposal scores and an L1 loss on the offsets (see the proposal sketch below).
  3. Sequential Prediction: The proposal scores and offsets are flattened into a proposal vector, which is fed, together with the ResNet features, into another LSTM. Beam search is then applied to find the best sequence of proposed segments. Empirically, the best reported beam size is 1, which amounts to greedily selecting the best segment at each step :) (see the decoding sketch below).
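
To make step 1 concrete, here is a minimal PyTorch sketch of the context-aware embedding. The hidden size of the bi-LSTM is my assumption; only the 512-dimensional input/output sizes and L = 500 come from the paper.

```python
import torch
import torch.nn as nn

class ContextAwareEmbedding(nn.Module):
    """Sketch of the context-aware video embedding (hidden size assumed)."""

    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        # Bi-directional LSTM over the per-frame ResNet features.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # ResNet features (512) + forward/backward LSTM states (2 * hidden_dim),
        # projected back to 512 dims.
        self.proj = nn.Linear(feat_dim + 2 * hidden_dim, feat_dim)

    def forward(self, resnet_feats):             # (B, L, 512)
        context, _ = self.bilstm(resnet_feats)   # (B, L, 2 * hidden_dim)
        fused = torch.cat([resnet_feats, context], dim=-1)
        return self.proj(fused)                  # (B, L, 512)

frames = torch.randn(1, 500, 512)  # L = 500 sampled frames
print(ContextAwareEmbedding()(frames).shape)  # torch.Size([1, 500, 512])
```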
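For step 2, here is a sketch of the anchor-based proposal head. The number of anchors and the convolution kernel size are assumptions; the kxLx3 output shape and the sigmoid/tanh activations follow the description above.

```python
import torch
import torch.nn as nn

class SegmentProposal(nn.Module):
    """Sketch of the anchor-based proposal head (k and kernel size assumed)."""

    def __init__(self, feat_dim=512, num_anchors=16):
        super().__init__()
        self.k = num_anchors
        # Per frame and per anchor, predict a proposal score and
        # (center, length) offsets: k * 3 output channels.
        self.conv = nn.Conv1d(feat_dim, num_anchors * 3,
                              kernel_size=3, padding=1)

    def forward(self, feats):                          # (B, L, 512)
        out = self.conv(feats.transpose(1, 2))         # (B, k * 3, L)
        out = out.view(feats.size(0), self.k, 3, -1)   # (B, k, 3, L)
        scores = torch.sigmoid(out[:, :, 0])   # proposal scores in (0, 1)
        offsets = torch.tanh(out[:, :, 1:])    # center/length offsets in (-1, 1)
        return scores, offsets

feats = torch.randn(1, 500, 512)
scores, offsets = SegmentProposal()(feats)
print(scores.shape, offsets.shape)  # (1, 16, 500) (1, 16, 2, 500)
```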
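And for step 3, a toy sketch of why beam size 1 reduces to greedy decoding. The `step_scores` tensor is hypothetical (pre-computed per-step scores over candidate proposals plus an end token); in ProcNets these scores come from the sequential-prediction LSTM.

```python
import torch

def greedy_segment_decoding(step_scores):
    """Beam search with beam size 1: just take the argmax at every step.

    step_scores: hypothetical (T, N + 1) tensor of per-step scores over
    N candidate proposals plus an <end> token at index N.
    """
    selected = []
    end_token = step_scores.size(1) - 1
    for scores in step_scores:       # one decoding step at a time
        best = scores.argmax().item()
        if best == end_token:        # model decided the sequence is over
            break
        selected.append(best)
    return selected

# Toy example: 4 decoding steps, 5 candidate proposals + <end>.
scores = torch.log_softmax(torch.randn(4, 6), dim=-1)
print(greedy_segment_decoding(scores))
```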

Loss

The loss function has 3 parts. The first comes from the proposal network: positive samples are proposals with IoU > 0.8 against a ground-truth segment, negative samples are those with IoU < 0.2, and the proposal scores are trained with binary cross-entropy. The second is also from the proposal network: an L1 loss on the proposal offsets. The third is the negative log-likelihood of the LSTM outputs in the sequential prediction module.
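
Putting the three terms together, a minimal sketch of the training loss might look like the following. The shapes, the equal weighting of the terms, and the assumption that positives/negatives have already been selected by the IoU thresholds are all mine.

```python
import torch
import torch.nn.functional as F

def procnets_loss(scores, labels, offsets, target_offsets, log_probs, targets):
    """Sketch of the three-part loss (weights and shapes are assumptions)."""
    # 1) Binary cross-entropy on proposal scores (IoU > 0.8 positives,
    #    IoU < 0.2 negatives, assumed pre-selected).
    cls_loss = F.binary_cross_entropy(scores, labels)
    # 2) L1 regression loss on (center, length) offsets of positive proposals.
    reg_loss = F.l1_loss(offsets, target_offsets)
    # 3) Negative log-likelihood of the ground-truth segment sequence
    #    under the sequential-prediction LSTM.
    seq_loss = F.nll_loss(log_probs, targets)
    return cls_loss + reg_loss + seq_loss

# Toy tensors just to exercise the function.
scores = torch.rand(10); labels = (scores > 0.5).float()
offsets = torch.randn(4, 2); target_offsets = torch.randn(4, 2)
log_probs = torch.log_softmax(torch.randn(3, 6), dim=-1)
targets = torch.tensor([1, 4, 5])
print(procnets_loss(scores, labels, offsets, target_offsets, log_probs, targets))
```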

References

[1] Zhou, Luowei, Chenliang Xu, and Jason J. Corso. “Towards automatic learning of procedures from web instructional videos.” Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[2] http://youcook2.eecs.umich.edu
