Paper Review: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

Yunusemre Özköse
Multi-Modal Understanding
2 min read · Jul 2, 2022

In this article, I will review "Scaling Egocentric Vision: The EPIC-KITCHENS Dataset" [1].

Damen et al. [1] collect cooking videos recorded from a first-person (egocentric) view by 32 participants of 10 nationalities. The dataset contains 55 hours of video, 39.6K action segments, and 454.3K object bounding boxes. The egocentric view captures natural multi-tasking, which makes the dataset more challenging. Each video is annotated with object bounding boxes and action boundaries, and the footage was recorded with a head-mounted GoPro.

They point out that other egocentric datasets are scripted, meaning participants are told to follow a script of actions. Real life is more challenging: a cook may do multiple things at once (e.g., cooking while washing a few dishes) or change their mind in the middle of an action.

Dataset Statistics

As can be seen in the figure above, the annotations are not limited to cooking; the dataset also contains scenes of food preparation, washing, fetching items, etc. The figure also shows sequence durations. The authors asked participants to narrate (explain the actions in) their own videos as a first, coarse layer of annotation, since the best annotators for cooking videos are the cooks themselves.

The narrations are also transcribed; the authors collect these transcriptions via Amazon Mechanical Turk (AMT). The spoken language is not only English: 17 participants narrated in English, 7 in Italian, 6 in Spanish, 1 in Greek, and 1 in Chinese. Non-English narrations were also translated into English via AMT. There was no vocabulary restriction during narration.

Action Annotations

Annotators were also responsible for labelling the narrated nouns.

Verbs and Nouns

They use spaCy's English core web model for part-of-speech tagging to extract verbs and nouns from the narrations, then group the verbs into 125 classes and the nouns into 331 classes.
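Below is a minimal sketch of how such verb/noun extraction could look with spaCy. The model name ("en_core_web_sm") and the example narrations are assumptions for illustration; the paper does not detail the exact pipeline configuration.

```python
# Sketch: extracting verbs and nouns from open-vocabulary narrations with spaCy.
# Assumes the small English pipeline "en_core_web_sm"; the paper only states
# that an English core web model was used.
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical narrations in the style of the dataset (not taken from it).
narrations = [
    "put down the frying pan",
    "wash the cutting board",
    "open the fridge and take out the milk",
]

for text in narrations:
    doc = nlp(text)
    # Lemmatised verbs and nouns; lemmas make grouping into classes easier.
    verbs = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]
    nouns = [tok.lemma_ for tok in doc if tok.pos_ == "NOUN"]
    print(f"{text!r}: verbs={verbs}, nouns={nouns}")
```

In the paper, the extracted verbs and nouns are then manually grouped into the 125 verb and 331 noun classes mentioned above.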

[1] Damen, Dima, et al. “Scaling egocentric vision: The epic-kitchens dataset.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.

https://epic-kitchens.github.io/2022
