Diving Deep Into Foundational Video Models
Zero-shot learning has been a long-standing goal of the AI community. A model that can adapt to new tasks with few or no examples is far more useful than one trained for a single use case on thousands of labeled samples. The zero-shot capabilities of Large Language Models (LLMs) propelled natural language processing from task-specific models to unified, generalist foundation models.
It is therefore natural to ask whether the same principle applies to today's generative video models. Could video models be on a trajectory toward general-purpose vision understanding, much as LLMs developed general-purpose language understanding?
Table Of Contents
- Introduction
- Google’s Veo 3: Foundation Model for Vision Tasks
- Some Interesting Results From Veo 3
- OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
- Conclusion
Introduction
Video models are moving quickly toward becoming unified, general-purpose foundation models for machine vision, just as large language models (LLMs) have become foundation models for natural language processing (NLP).
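To make the analogy concrete, zero-shot use of a video model looks a lot like prompting an LLM: the same pretrained model is steered toward different vision tasks purely through the prompt, with no task-specific training. The sketch below illustrates the idea; the `VideoModel` class, its `generate` method, and the prompts are hypothetical stand-ins for illustration, not a real API.

```python
# Conceptual sketch of zero-shot vision via a generative video model.
# `VideoModel` and its `generate` method are hypothetical stand-ins,
# not a real library API.

from dataclasses import dataclass


@dataclass
class VideoModel:
    """A pretrained text-and-image-to-video model (hypothetical)."""
    name: str

    def generate(self, prompt: str, image_path: str) -> str:
        # A real system would run inference and return a video;
        # here we just describe the requested output.
        return f"[{self.name}] video for {image_path!r} given prompt: {prompt!r}"


model = VideoModel(name="generic-video-model")

# One model, many vision tasks -- selected by the prompt alone,
# with zero task-specific training examples.
tasks = {
    "edge detection": "Trace all edges in the input frame in white on black.",
    "segmentation": "Highlight each distinct object in a different flat color.",
    "inpainting": "Fill in the masked region so the scene looks complete.",
}

for task, prompt in tasks.items():
    print(task, "->", model.generate(prompt, "frame.png"))
```

The point of the sketch is the shape of the interface: a single pretrained model plus a task-describing prompt replaces a separate trained model per task, which is exactly the shift LLMs brought to NLP.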

