JEPA is the Future of Video Understanding

Joe El Khoury - GenAI Engineer
6 min read · Apr 19, 2024

--

Yann LeCun, a prominent figure in artificial intelligence, believes that the future of AI lies not in generative models but in predictive architectures like V-JEPA (Video Joint-Embedding Predictive Architecture). As a visionary who has significantly contributed to the development of deep learning, LeCun argues that understanding and interpreting the complexity of the real world goes beyond what generative AI can achieve with its focus on creating content. Instead, V-JEPA aims to predict and analyze information from videos, closely mimicking the way humans perceive and learn from their environment. According to LeCun, this approach could lead to AI systems that not only comprehend the world but also reason about it with something akin to human common sense. In the following sections, we will explore what V-JEPA is and why LeCun champions this model as a cornerstone of future AI technologies.

This article is based on Meta’s paper “Revisiting Feature Prediction for Learning Visual Representations from Video.”

The paper and the GitHub repository are linked below.

  • Paper: https://arxiv.org/abs/2404.08471
  • GitHub: https://github.com/facebookresearch/jepa

Humans have an incredible ability to understand the world around them using the basic signals received from their eyes. In the field of machine learning, researchers aim to uncover the underlying principles and objectives that enable this learning process. The “Predictive feature principle” suggests that the ability to predict what will happen next based on current sensory information is crucial in human learning.

What is JEPA?

JEPA (Joint-Embedding Predictive Architecture) is a novel approach to learning visual representations, designed to predict hidden information within a video from the visible content. Unlike generative AI models, which focus on creating new content, JEPA aims to capture the essential features and relationships within the video data.

V-JEPA (Video JEPA) is an unsupervised feature prediction method that learns from a large dataset of videos without relying on pre-trained encoders, text, negative samples, manual annotations, or pixel-level reconstruction. By combining masked modeling with the JEPA objective, V-JEPA learns semantic representations of images and videos, achieving strong performance on a range of downstream tasks without updating the pre-trained weights (frozen evaluation).

JEPA vs. Generative AI

Generative AI models, such as GANs and VAEs, are trained to generate new content that resembles the training data. While generative AI has many applications, it does not focus on understanding the underlying structure and meaning of the input data. In contrast, JEPA learns meaningful representations by predicting hidden information, enabling it to perform well on tasks such as action recognition, object detection, and scene classification.

How V-JEPA Works

To understand how V-JEPA predicts the presence of an object (e.g., a ball) in a masked frame, let’s consider an example:

  • Masking: During training, random spatiotemporal regions of the video are masked.
  • Encoding visible regions: The visible regions are passed through the x-encoder, generating embeddings for each visible token.
  • Concatenating learnable mask tokens: The x-encoder output is concatenated with learnable mask tokens, which serve as placeholders for the masked regions.
  • Predicting masked regions: The concatenated sequence is passed through the predictor network, which outputs embeddings for each mask token.
  • Training objective: The predictor’s output is compared to the actual features of the masked region using a loss function, and the model is trained to minimize this loss.

The model’s ability to predict the ball is a result of learning the spatiotemporal context, using learnable mask tokens to identify the location of the masked regions, and employing a predictor network to fill in the information based on the visible regions.
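To make the steps above concrete, here is a minimal PyTorch sketch of one training step. It is not Meta’s implementation: the token count, dimensions, network depths, and the use of a frozen copy of the encoder as the prediction target are illustrative assumptions.

```python
# Minimal sketch of a V-JEPA-style masked feature-prediction step (illustrative only).
import copy
import torch
import torch.nn as nn

D = 256          # embedding dimension (assumed)
N_TOKENS = 128   # spatiotemporal tokens per clip after patchifying (assumed)

def make_transformer(depth, dim, heads=8):
    layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
    return nn.TransformerEncoder(layer, depth)

x_encoder = make_transformer(depth=6, dim=D)        # encodes visible tokens
predictor = make_transformer(depth=3, dim=D)        # shallower predictor network
target_encoder = copy.deepcopy(x_encoder).eval()    # frozen target features (assumption)
mask_token = nn.Parameter(torch.zeros(1, 1, D))     # learnable placeholder token
pos_embed = nn.Parameter(torch.randn(1, N_TOKENS, D) * 0.02)

def training_step(video_tokens, mask):
    """video_tokens: (B, N_TOKENS, D) patch embeddings; mask: (N_TOKENS,) bool, True = hidden."""
    B = video_tokens.size(0)
    tokens = video_tokens + pos_embed

    # 1) Encode only the visible tokens with the x-encoder.
    ctx = x_encoder(tokens[:, ~mask])

    # 2) Concatenate learnable mask tokens (plus positions) as placeholders for masked regions.
    queries = mask_token + pos_embed[:, mask].expand(B, -1, -1)
    seq = torch.cat([ctx, queries], dim=1)

    # 3) Predict embeddings for the masked locations.
    pred = predictor(seq)[:, ctx.size(1):]

    # 4) Regress predictions onto target features of the masked regions with an L1 loss.
    with torch.no_grad():
        target = target_encoder(tokens)[:, mask]
    return nn.functional.l1_loss(pred, target)

loss = training_step(torch.randn(2, N_TOKENS, D), torch.rand(N_TOKENS) > 0.5)
loss.backward()
```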

Applications of JEPA

JEPA has numerous potential applications, including:

  • Surveillance and security
  • Sports analysis
  • Autonomous vehicles
  • Human-computer interaction

Method

As mentioned above, V-JEPA operates on video clips by dividing them into spatiotemporal tokens and masking certain regions. The x-encoder processes the masked video sequence, and its output is concatenated with learnable mask tokens. The predictor network then outputs embeddings for each mask token, which are regressed to the prediction targets using an L1 loss.

The objective function is designed to ensure that representations computed from one part of the video can be predicted from representations computed from another part. The prediction task is based on masked modeling, using short-range and long-range masks to minimize information leakage.
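As an illustration of such spatiotemporal masks, the sketch below removes random spatial blocks across all frames of a clip, so the hidden content cannot be recovered from neighbouring frames. The block counts and sizes are assumptions for illustration, not the paper’s settings.

```python
# Illustrative spatiotemporal block masking (block counts/sizes are assumptions).
import torch

def block_mask(t, h, w, num_blocks, block_scale):
    """Return a (t, h, w) boolean mask; True marks masked (hidden) token positions."""
    mask = torch.zeros(t, h, w, dtype=torch.bool)
    for _ in range(num_blocks):
        bh = max(1, int(h * block_scale))
        bw = max(1, int(w * block_scale))
        top = torch.randint(0, h - bh + 1, (1,)).item()
        left = torch.randint(0, w - bw + 1, (1,)).item()
        mask[:, top:top + bh, left:left + bw] = True   # same block in every frame
    return mask

# "Short-range": several smaller blocks; "long-range": fewer but larger blocks (illustrative split).
short_range = block_mask(t=8, h=14, w=14, num_blocks=8, block_scale=0.3)
long_range = block_mask(t=8, h=14, w=14, num_blocks=2, block_scale=0.7)
print(short_range.float().mean().item(), long_range.float().mean().item())
```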

V-JEPA uses a standard ViT as the encoder and a narrow Transformer as the predictor. The model is pre-trained on the “VideoMix2M” dataset and evaluated on various video and image tasks.

Important Aspects in Learning Representations from Videos

Experiments show that:

Prediction in feature space consistently outperforms prediction in pixel space.
  • The average performance across tasks improves as the size of the pre-training dataset increases.
Adaptive pooling with cross-attention results in significant improvements on downstream supervised tasks compared to average pooling (see the sketch after this list).
  • The multi-block masking method shows the best performance among the compared masking methods.
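For the pooling result above, here is a minimal sketch of what cross-attention (“attentive”) pooling over frozen features could look like. The dimensions, the single learnable query, and the 400-way classification head are assumptions made for illustration.

```python
# Minimal attentive pooling probe over frozen backbone features (illustrative only).
import torch
import torch.nn as nn

class AttentivePooler(nn.Module):
    def __init__(self, dim, num_heads=8, num_classes=400):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))          # learnable query token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)                     # linear classifier

    def forward(self, frozen_features):                             # (B, N, D) from the frozen encoder
        q = self.query.expand(frozen_features.size(0), -1, -1)
        pooled, _ = self.attn(q, frozen_features, frozen_features)  # cross-attention pooling
        return self.head(pooled.squeeze(1))

pooler = AttentivePooler(dim=256)
logits = pooler(torch.randn(2, 128, 256))   # the backbone stays frozen; only the pooler is trained
print(logits.shape)                         # torch.Size([2, 400])
```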

Evaluating the Predictor

To better understand what the feature predictions capture, a separate decoder is trained to map V-JEPA’s predicted representations back into pixel space. The resulting visualizations show that the predictions are spatially and temporally consistent with the unmasked regions of the video and capture coherent motion over time.
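As a rough sketch of this visualization setup, one could train a small decoder on top of the frozen predictions with a pixel-level loss. The decoder architecture, patch size, and loss below are assumptions, not the paper’s exact recipe.

```python
# Illustrative decoder trained on frozen V-JEPA predictions (assumed shapes and MLP design).
import torch
import torch.nn as nn

D, PATCH_PIXELS = 256, 16 * 16 * 3   # embedding dim and flattened RGB patch size (assumed)

decoder = nn.Sequential(              # tiny MLP decoder, trained with a pixel loss
    nn.Linear(D, 2 * D), nn.GELU(), nn.Linear(2 * D, PATCH_PIXELS)
)

def decoder_step(predicted_embeddings, true_patches):
    """predicted_embeddings: (B, M, D) frozen predictions for masked tokens;
    true_patches: (B, M, PATCH_PIXELS) corresponding ground-truth pixels."""
    recon = decoder(predicted_embeddings.detach())     # gradients stop at the frozen model
    return nn.functional.mse_loss(recon, true_patches)

loss = decoder_step(torch.randn(2, 64, D), torch.rand(2, 64, PATCH_PIXELS))
loss.backward()
```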

Comparison with Prior Work

V-JEPA outperforms pixel reconstruction methods in terms of performance and learning speed. It shows consistent improvement over other video baselines on all tasks in fine-tuning-free evaluation and achieves significant performance improvements on tasks requiring action understanding. V-JEPA also demonstrates high label efficiency, with the performance gap between V-JEPA and other baselines increasing as the amount of labeled data decreases.

Conclusion

V-JEPA is a powerful method for learning visual representations from videos using feature prediction as the sole objective function. It outperforms pixel reconstruction-based methods and achieves state-of-the-art performance on multiple benchmarks. The multi-block masking strategy and high label efficiency demonstrate the versatility of the learned representations.

V-JEPA offers a promising alternative to generative AI models for learning meaningful representations from video data. By focusing on feature prediction, V-JEPA can learn efficient and effective representations that capture the essential structure and meaning of the input data, making it well-suited for a wide range of applications.

As the field of self-supervised learning continues to evolve, we can expect to see further developments and improvements in methods like V-JEPA, leading to even more powerful and versatile video understanding systems.

I’m Joe, and my ambition is to lead the way to industry 5.0 performance. I’m always interested in new opportunities, so don’t hesitate to contact me on my LinkedIn.
