Stanford’s VideoAgent Achieves New SOTA in Long-Form Video Understanding via an Agent-Based System

Synced · Published in SyncedReview
3 min read · Mar 19, 2024

Understanding long-form videos presents a formidable challenge within the realm of computer vision. This undertaking requires a model adept at processing multi-modal data, managing extensive sequences, and effectively reasoning over these sequences.

In response to this challenge, in a new paper VideoAgent: Long-form Video Understanding with Large Language Model as Agent, a Stanford University research team introduces VideoAgent, an innovative approach that simulates human comprehension of long-form videos through an agent-based system, showcasing superior effectiveness and efficiency compared to current state-of-the-art methods. This underscores the potential of agent-based approaches in advancing long-form video understanding.

VideoAgent operates by employing a large language model (LLM) as a central agent to iteratively identify and compile crucial information to address a given question, while vision-language foundation models serve as tools to translate and retrieve visual information.

The process is formulated as a sequence of states, actions, and observations, with the LLM orchestrating this progression. Initially, the LLM acquaints itself with the video context by reviewing a set of uniformly sampled frames. During each iteration, it evaluates whether the existing information is adequate to answer the question; if not, it determines what additional information is necessary. It then utilizes Contrastive Language-Image Pre-training (CLIP) to retrieve new frames containing this information and a vision-language model (VLM) to caption these frames, updating the current state.
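To make the workflow concrete, the minimal Python sketch below walks through one possible version of this iterative loop. The callables llm_decide, clip_retrieve, and vlm_caption are hypothetical stand-ins for the LLM, the CLIP retriever, and the captioning tool; this is a reading of the process described above, not the authors' code.

```python
# Minimal sketch of an iterative LLM-agent loop for long-video QA.
# llm_decide, clip_retrieve, and vlm_caption are hypothetical callables the
# caller supplies (e.g. wrappers around an LLM, CLIP, and a captioner);
# they are placeholders, not VideoAgent's actual interfaces.

def uniform_sample(frames, k):
    """Pick k frames spread evenly across the video."""
    step = max(len(frames) // k, 1)
    return frames[::step][:k]

def answer_question(frames, question, llm_decide, clip_retrieve, vlm_caption,
                    max_rounds=5):
    # Initial state: captions of a handful of uniformly sampled frames.
    state = [vlm_caption(f) for f in uniform_sample(frames, k=5)]

    for _ in range(max_rounds):
        decision = llm_decide(question, state)     # reason over gathered evidence
        if decision["sufficient"]:                 # enough information to answer?
            return decision["answer"]
        # Otherwise fetch frames matching the missing-information query,
        # caption them, and fold the new observations back into the state.
        new_frames = clip_retrieve(frames, decision["query"], top_k=3)
        state.extend(vlm_caption(f) for f in new_frames)

    return llm_decide(question, state)["answer"]   # best effort after the budget
```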

This design accentuates the importance of reasoning capabilities and iterative processes over direct processing of lengthy visual inputs. The VLM and CLIP act as instrumental tools, enabling the LLM to possess visual understanding and long-context retrieval capabilities.
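For the retrieval side, the snippet below sketches how CLIP can rank video frames against a text query using the Hugging Face Transformers API; the checkpoint name and top_k value are illustrative assumptions rather than details taken from the paper.

```python
# Sketch of CLIP-based frame retrieval with Hugging Face Transformers.
# The checkpoint and top_k are illustrative choices, not the paper's setup.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_frames(frames, query, top_k=3):
    """Return the frames (PIL images) most similar to the text query."""
    inputs = processor(text=[query], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(-1)   # one similarity score per frame
    top = torch.topk(scores, k=min(top_k, len(frames)))
    return [frames[i] for i in top.indices.tolist()]
```

In practice, the frame embeddings can be precomputed once per video so that each retrieval round only needs to encode the text query.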

The efficacy of VideoAgent was assessed on two established long-form video understanding benchmarks, EgoSchema and NExT-QA. VideoAgent achieved 54.1% and 71.3% accuracy on these benchmarks, respectively, surpassing the concurrent state-of-the-art method LLoVi by 3.8% and 3.6%.

In summary, VideoAgent marks a significant advancement in long-form video understanding by embracing an agent-based system to emulate human cognitive processes and emphasizing the importance of reasoning over modeling long-context visual information. The researchers anticipate that their work not only establishes a new benchmark in long-form video understanding but also provides valuable insights for future research in this domain.

The paper VideoAgent: Long-form Video Understanding with Large Language Model as Agent is on arXiv.

Author: Hecate He | Editor: Chain Zhang

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.

AI Technology & Industry Review — syncedreview.com | Newsletter: http://bit.ly/2IYL6Y2 | Share My Research http://bit.ly/2TrUPMI | Twitter: @Synced_Global