Ego4D: Training AI to Understand Human Activities

Roshni Ramnani · Published in Antaeus AR · Feb 15, 2024
Fig 1: Ego4D is a massive-scale egocentric video dataset of daily-life activity spanning 74 locations worldwide. See https://ego4d-data.org/fig1.html

Imagine having an AI system that can guide you during your first tennis lesson, give you some helpful tips while you’re cooking, or help you find your lost keys.

Introduction

If you share my sentiment, you were likely intrigued by the range of robot demonstrations we’ve witnessed this year, such as the Mobile ALOHA robot [1] or Eve by 1X Technologies [2]. These demonstrations showed how robots can perform everyday human activities: rearranging items, cooking, and so on.

But what if we flipped the narrative and explored how AI could understand, and perhaps learn from, human activities instead of solely being trained to perform them? Such an AI could be embedded in an android, or in a non-android form such as a handheld device equipped with a camera and appropriate sensors.

Hold on — we’re not quite there yet, but I had a thought: what if we explored existing research moving in this direction? One notable initiative is a comprehensive curated dataset known as Ego4D [3].

This dataset comprises over 3,000 hours of video of everyday human activities, recorded from a personal, egocentric (first-person) perspective, capturing how people actually go about their daily lives.

When it came to training large language models, the data already available on the web was sufficient to build highly capable foundation models able to perform a wide range of natural language processing (NLP) tasks.

Granted, data cleanup was necessary, along with significant innovation in architectures and training methods. However, when it comes to training multimodal models for the kind of human activity understanding we are referring to, the freely available data is simply not enough. For one, nearly all of the video data out there is shot from a third-person point of view.

Additionally, understanding the surroundings from a first-person view means recognizing objects, actions, and social behaviors the way a person in the scene would. Hence the need to create specialized datasets for this purpose.

The Ego4D dataset

The Ego4D dataset was released by the Ego4D Consortium, led by 13 universities in partnership with Meta. It consists of 3,670 hours of video collected by 931 unique participants at 74 locations across 9 different countries.
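To get a feel for the scale, here is a minimal sketch, in Python, of how one might summarize the top-level metadata that ships with the download. The file name (ego4d.json) and the field names (videos, duration_sec, video_uid) reflect my recollection of the release layout, so treat them as assumptions and check the official documentation for the exact schema.

```python
import json
from pathlib import Path

# Top-level metadata file included with the download.
# NOTE: the file name and the field names used below ("videos",
# "duration_sec", "video_uid") are assumptions based on my recollection
# of the release layout; verify against the official docs.
META_PATH = Path("~/ego4d_data/ego4d.json").expanduser()

with META_PATH.open() as f:
    meta = json.load(f)

videos = meta.get("videos", [])
total_hours = sum(v.get("duration_sec", 0) for v in videos) / 3600.0

print(f"Number of videos: {len(videos)}")
print(f"Total footage: {total_hours:,.1f} hours")

# Peek at one record to see which metadata fields are actually available.
if videos:
    print("Example video_uid:", videos[0].get("video_uid"))
    print("Available fields:", sorted(videos[0].keys()))
```

A quick pass like this is also a convenient sanity check that your download matches the headline numbers above.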

The vast majority of the footage is unscripted and “in the wild,” representing the natural interactions of the camera wearers as they go about daily activities at home, at work, at leisure, in social settings, and while commuting.

Based on self-identified characteristics, the camera wearers come from diverse backgrounds, occupations, genders, and ages. Additionally, the footage is accompanied by dense, timestamped natural-language narrations describing what the camera wearer is doing.
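To see what these narrations look like in practice, the sketch below prints the first few timestamped sentences for a single clip. It assumes the narration annotations have been downloaded as a JSON file keyed by video UID, with per-pass entries carrying timestamp_sec and narration_text fields; these names are my assumptions about the schema and should be verified against the official docs. The video UID used here is a hypothetical placeholder.

```python
import json
from pathlib import Path

# Assumed location and field names; verify against the official annotation schema.
NARRATION_PATH = Path("~/ego4d_data/v1/annotations/narration.json").expanduser()
VIDEO_UID = "example-video-uid"  # hypothetical placeholder, use a real UID from the metadata

with NARRATION_PATH.open() as f:
    narrations = json.load(f)

clip = narrations.get(VIDEO_UID, {})
# Each clip may carry one or more narration "passes" from different annotators.
for pass_name, pass_data in clip.items():
    entries = pass_data.get("narrations", []) if isinstance(pass_data, dict) else []
    print(f"{pass_name}: {len(entries)} narration sentences")
    for entry in entries[:5]:  # show the first few
        t = entry.get("timestamp_sec", 0.0)
        text = entry.get("narration_text", "").strip()
        print(f"  [{t:7.1f}s] {text}")
```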

The dataset essentially captures how people spend the bulk of their time: at home (e.g., cleaning, cooking, yardwork), at leisure (e.g., crafting, games, attending a party), in transit (e.g., biking, driving), running errands (e.g., shopping, walking the dog, getting the car fixed), and at work (e.g., talking with colleagues, making coffee).

The participants (whose PII was removed) were asked to wear the camera for extended periods (at least as long as the device’s battery life) so that activities would unfold naturally in a longer context. A typical raw video clip in the dataset lasts about 8 minutes.

Additionally, portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo video, and/or synchronized footage from multiple egocentric cameras capturing the same event.
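Since only portions of the dataset carry these extra modalities, a practical first step is to filter the metadata for clips that include the signals you need. The sketch below is purely illustrative: the boolean flags (has_gaze, has_3d_scan, is_stereo) are hypothetical names standing in for whatever indicators the actual metadata exposes.

```python
import json
from pathlib import Path

META_PATH = Path("~/ego4d_data/ego4d.json").expanduser()

with META_PATH.open() as f:
    videos = json.load(f).get("videos", [])

# Hypothetical modality flags used only to illustrate the filtering pattern;
# the real metadata exposes its own (differently named) indicators.
MODALITY_FLAGS = ["has_gaze", "has_3d_scan", "is_stereo"]

for flag in MODALITY_FLAGS:
    subset = [v.get("video_uid") for v in videos if v.get(flag)]
    print(f"{flag}: {len(subset)} clips")
```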

Alongside the dataset, there is a suite of five benchmark tasks (Episodic Memory, Hands and Objects, Audio-Visual Diarization, Social Interactions, and Forecasting) that cover the essential components of egocentric perception: indexing past experiences, analyzing present interactions, and anticipating future activities. Please see figure 2 for a quick overview.

Fig 2: Benchmark tasks: the first-person visual experience, from remembering the past, to analyzing the present, to anticipating the future

Final Thoughts

In this article, I have provided a quick introduction to the Ego4D dataset. A dataset like this has the potential to train robots to perform human-like tasks by observing humans (much the way we learn), or to power augmented reality applications that guide individuals through a series of tasks.

Meta has also released a follow-up dataset, Ego-Exo4D, which adds synchronized third-person (exocentric) views alongside the egocentric footage and covers aspects not addressed by Ego4D.

I intend to write two more articles: one discussing the multimodal data and benchmark tasks of Ego4D in more depth, and another covering Ego-Exo4D.

References

[1] Mobile ALOHA: https://mobile-aloha.github.io/

[2] 1X Technologies, Eve: https://www.1x.tech/discover/all-neural-networks-all-autonomous-all-1x-speed

[3] Ego4D: https://ego4d-data.org/docs/challenge/
