How I-JEPA is Redefining the Ring of Self-Supervised Learning: A Knockout Approach

Joe El Khoury - GenAI Engineer
15 min read · Aug 5, 2023


Self-supervised learning, a rapidly evolving subfield of artificial intelligence, enables models to learn from unlabeled data. The Image-Based Joint Embedding Predictive Architecture (I-JEPA), as presented in “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture”, offers a pioneering, non-generative approach to self-supervised learning from images.

I-JEPA operates by predicting the representations of multiple target blocks within a single image using a single context block. The method involves sampling target blocks at a sufficiently large scale and using a spatially distributed, information-rich context block for prediction.

Thus, in the high-stakes world of image databases, an underdog named I-JEPA stepped into the ring. It was a rookie, ready to take on the big names. I-JEPA’s secret weapon? It didn’t rely on the usual manual data augmentation tricks. Instead, it used its own smarts and scalability to outperform its rivals. In the first round, it faced off against the titans of ImageNet-1k and won, proving that it could scale up with the best of them. Then, in a surprising twist, it showed its prowess in a semi-supervised test on ImageNet-1K with only 1% of labels, performing just as well as the seasoned pros. But I-JEPA was not done. It stepped into different arenas, from image classification tasks to local prediction tasks, demonstrating its versatility and performance each time. It was a crowd-pleaser, outperforming competitors even when blindfolded.

The secret to I-JEPA’s success? It’s efficient, scalable, and can learn quickly and effectively, which saves computational energy. Plus, it’s a firm believer that the bigger the crowd (the training dataset), the better the performance.

I-JEPA is more than just a contender; it’s a game-changer in the world of self-supervised learning, and it’s here to stay. So, sit back, grab your popcorn, and let’s dive into the thrilling journey of this remarkable underdog.

Self-supervised learning

Self-supervised learning is a type of autonomous learning that doesn’t require humans to label or classify data. Instead, these systems learn by identifying and using contextual information that’s naturally present in the data. The primary goal isn’t to model the data’s inherent structure, but to learn representations that support tasks such as classification without needing any pre-labeled data.

Suppose we have a large collection of unlabelled photographs, and we want to train a machine learning model to recognize the objects in these images. In a traditional supervised learning scenario, we would need to manually label each image with the objects it contains. This process can be time-consuming and often impractical when dealing with large datasets.

In a self-supervised learning scenario, instead of relying on human-made labels, the model would learn to recognize patterns and features in the images by itself. One common method used in self-supervised learning with images is to train the model to predict the color of a grayscale image, or to predict a missing part of the image.

In this way, the model uses the context naturally available in the data (the grayscale or partial image) to learn (predict the color or the missing part). Over time, the model would become more adept at recognizing patterns and features in the images, which can then be used for object recognition or other tasks.

The advantage of this approach is that it doesn’t require labeled data, and can therefore leverage large amounts of unlabeled data that would otherwise be difficult to use in a supervised learning scenario.

So, in our case if we train a model to predict the color of a grayscale image, the “label” or “correct answer” the model is trying to predict (the color version of the image) is derived directly from the data. We don’t need a human to tell us what the color version of the image is because we already have that information. In this sense, the model is supervising its own learning process by creating a learning task from the data itself.

Getting the data ready:

  • Start with a picture that’s in color.
  • Turn this color picture into a black and white (grayscale) version. This black and white picture is what we give to the model to work on.
  • Keep the original color picture safe as the “right answer” or label.

Teaching the model:

  • Give the black and white picture to the model.
  • The model will try its best to guess what the color picture should look like. At first, the model’s guesses probably won’t be very good, because it’s still learning.

Helping the model learn:

  • Check how close the model’s color picture guess is to the original color picture (the “right answer”).
  • Figure out how big the difference is between the model’s guess and the right answer. This difference is called the error or loss.
  • Use this error to teach the model how to do better next time, by slightly adjusting the inner workings (parameters) of the model, typically a Convolutional Neural Network (CNN) or a Vision Transformer (ViT), both commonly used for image-related tasks. This is done using gradient descent.

Keep practicing:

  • Keep giving the model lots of black and white pictures to practice on, and keep adjusting its inner workings based on its mistakes.
  • After a lot of practice, the model will get better and better at guessing the color picture from the black and white picture.

So, even though the model is technically supervising its own learning, we still refer to this as a form of supervised learning because the model is learning to map input data (the grayscale image) to a correct output (the color image), just like in traditional supervised learning. The key difference is where the “correct answers” come from: in traditional supervised learning, they are provided by humans, while in self-supervised learning, they are derived from the data itself.
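To make this recipe concrete, here is a minimal sketch of the colorization setup in PyTorch. The framework choice, the tiny ColorizerCNN, and the random-tensor “dataset” are my own illustrative placeholders, not anything prescribed by the I-JEPA paper.

```python
# Minimal colorization sketch (illustrative only): the label is the color image,
# the input is its grayscale version, so no human labeling is needed.
import torch
import torch.nn as nn

class ColorizerCNN(nn.Module):
    """Toy model: maps a 1-channel grayscale image to a 3-channel color guess."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, gray):
        return self.net(gray)

model = ColorizerCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def fake_color_batch(batch_size=8, size=64):
    """Stand-in for a real image loader: random RGB images in [0, 1]."""
    return torch.rand(batch_size, 3, size, size)

for step in range(100):
    color = fake_color_batch()              # the "right answer" comes from the data itself
    gray = color.mean(dim=1, keepdim=True)  # grayscale input (simple channel mean for brevity)
    guess = model(gray)                     # the model's guess at the color picture
    loss = loss_fn(guess, color)            # how far the guess is from the right answer
    optimizer.zero_grad()
    loss.backward()                         # gradient descent: adjust the parameters a little bit
    optimizer.step()
```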

Self-supervised vs. supervised learning

Self-supervised learning is a form of supervised learning in that its objective is to learn a function from pairs of inputs and labels.

Self-supervised vs. unsupervised learning

Self-supervised learning is like unsupervised learning because the system learns without explicitly provided labels. It differs from unsupervised learning in that the goal is not to learn the inherent structure of the data.

Self-Supervised vs. semi-supervised learning

Semi-supervised learning trains algorithms on a combination of labeled and unlabeled data, where small amounts of labeled data combined with large amounts of unlabeled data can speed up learning. Self-supervised learning is different, since the system learns entirely without explicitly provided labels.

Unsupervised = self-supervised?

“I now call it “self-supervised learning”, because “unsupervised” is both a loaded and confusing term. In self-supervised learning, the system learns to predict part of its input from other parts of its input. In other words a portion of the input is used as a supervisory signal to a predictor fed with the remaining portion of the input. Self-supervised learning uses way more supervisory signals than supervised learning, and enormously more than reinforcement learning. That’s why calling it “unsupervised” is totally misleading. That’s also why more knowledge about the structure of the world can be learned through self-supervised learning than from the other two paradigms: the data is unlimited, and the amount of feedback provided by each example is huge.”

Yann LeCun, April 30, 2019

The Science Underpinning I-JEPA

I-JEPA proposes a method that sidesteps the extensive prior knowledge baked into hand-crafted image transformations and raises the semantic level of the representations learned through self-supervision.

Self-supervised learning generally falls into two categories:

  • Invariance-based methods
  • Generative methods

Invariance-based methods aim to learn the same representation for the same content seen under different views, transformations, and noise. These methods focus on invariance, striving to make the model’s representations robust to particular transformations of the data.

On the other hand, generative techniques, such as masked-denoising approaches, make their predictions at the pixel or token level.

Despite the potential for these methods to generalize beyond imaging modalities, they often yield representations with a lower semantic level.
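To make the distinction tangible, here is a heavily simplified sketch of where each family’s supervision signal lives. The augment, encoder, and decoder names are placeholders I’m introducing for illustration; real methods of either family are considerably more elaborate.

```python
# Toy contrast between the two families (illustrative placeholders, not real methods).
import torch.nn.functional as F

def invariance_loss(encoder, augment, image):
    # Invariance-based: two hand-crafted views of the same image should map to
    # (nearly) the same representation.
    z1 = encoder(augment(image))
    z2 = encoder(augment(image))
    return 1 - F.cosine_similarity(z1, z2, dim=-1).mean()

def generative_loss(encoder, decoder, image, mask):
    # Generative / masked-denoising: hide part of the input and reconstruct the
    # hidden pixels themselves.
    recon = decoder(encoder(image * mask))
    return F.mse_loss(recon * (1 - mask), image * (1 - mask))
```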

Addressing these limitations, I-JEPA avoids hand-crafted image transformations entirely and instead raises the semantic level of its representations by making predictions in an abstract representation space, as described next.

Fundamental Principles of I-JEPA

I-JEPA operates on the premise of predicting missing information within an abstract representation space. The architecture, given a particular context block, forecasts the representations of several target blocks within the same image. In contrast to existing generative methods that predict in pixel/token space, I-JEPA utilizes abstract prediction targets, eliminating the need for superfluous pixel-level detail and enabling the model to learn more semantic features.

To bolster its predictive capabilities, I-JEPA employs a multi-block masking strategy, underlining the significance of predicting sufficiently large target blocks with the help of an informative, spatially distributed context block.
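As a rough illustration of this multi-block strategy, the sketch below samples four target blocks and one large context block over a 14 × 14 grid of patch indices, then removes any overlapping target patches from the context so the context never sees what it must predict. The scale and aspect-ratio ranges roughly follow the paper’s description, but treat the exact numbers here as assumptions of the sketch.

```python
# A rough sketch of I-JEPA's multi-block masking over a grid of image patches.
import math
import random

def sample_block(grid=14, scale=(0.15, 0.2), aspect=(0.75, 1.5)):
    """Return the set of patch indices covered by one randomly placed block."""
    area = random.uniform(*scale) * grid * grid
    ratio = random.uniform(*aspect)
    h = max(1, min(grid, round(math.sqrt(area * ratio))))
    w = max(1, min(grid, round(math.sqrt(area / ratio))))
    top = random.randint(0, grid - h)
    left = random.randint(0, grid - w)
    return {(r * grid + c) for r in range(top, top + h)
                           for c in range(left, left + w)}

def sample_masks(grid=14, num_targets=4):
    # Four reasonably large target blocks...
    targets = [sample_block(grid) for _ in range(num_targets)]
    # ...and one large context block, with any overlapping target patches removed,
    # so the context never "sees" what it has to predict. In practice the context
    # still covers most of the image.
    context = sample_block(grid, scale=(0.85, 1.0), aspect=(1.0, 1.0))
    context -= set().union(*targets)
    return context, targets

context, targets = sample_masks()
print(len(context), [len(t) for t in targets])
```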

Architecture Comparison

The field of visual representation learning has long been focused on predicting the value of missing or corrupted sensory inputs. Techniques such as denoising autoencoders, context encoders, and approaches that treat image colorization as a denoising task have been investigated. However, I-JEPA’s primary goal is to learn semantic representations that do not necessitate extensive fine-tuning in downstream tasks.

The framework of the Image-based Joint Embedding Predictive Architecture (I-JEPA) contains three blocks: the context block, the target block, and the predictor.

Context block: in I-JEPA’s boxing ring, a single ‘strategy’ block, the context block, is used to predict the characteristics of multiple target blocks within the same image. The context encoder, a Vision Transformer (ViT), concentrates on the visible context patches to create meaningful features.

The target blocks represent the features of image regions that are predicted from the single context block. These features are produced by the target encoder, whose weights are adjusted during each round as an exponential moving average of the context encoder’s weights, a method that’s like studying past fight tapes. To obtain these target blocks, the masking is applied to the output of the target encoder, rather than the input.

Moreover, I-JEPA’s predictor is a slimmed-down version of the Vision Transformer (ViT). It uses the output from the context encoder to forecast the features of a target block at a specific location, all under the guiding light of positional tokens. The loss is the average L2 distance (like the average ‘punch power’ in our boxing metaphor) between the predicted and the actual features of the target patches. Just like a boxer refining his strategy, the predictor and context encoder parameters are honed through gradient-based optimization, while the target encoder parameters learn from the wisdom of the context-encoder parameters using an approach similar to learning from past fights: the exponential moving average.
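Putting the three pieces together, here is a schematic PyTorch sketch of a single training step. The Encoder and Predictor classes are deliberately tiny stand-ins (the real model uses Vision Transformers), and the masking is reduced to gathering patch features by index; the point is simply to show that the prediction and the loss live in representation space, and that gradients flow only into the context encoder and the predictor. The exponential-moving-average update for the target encoder appears in a later snippet.

```python
# Schematic I-JEPA training step (stand-in modules, not the paper's architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

D, NUM_PATCHES = 64, 196                    # feature dim, 14 x 14 patch grid

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(768, D)       # stand-in for a ViT over patch embeddings
    def forward(self, patches):             # patches: (B, N, 768)
        return self.proj(patches)           # features: (B, N, D)

class Predictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.pos = nn.Embedding(NUM_PATCHES, D)   # positional tokens for target locations
        self.net = nn.Sequential(nn.Linear(2 * D, D), nn.GELU(), nn.Linear(D, D))
    def forward(self, context_feats, target_idx):
        # Pool the context into a summary for brevity; the real predictor attends
        # over the context tokens. Each target position gets its own position token.
        ctx = context_feats.mean(dim=1, keepdim=True).expand(-1, target_idx.shape[1], -1)
        pos = self.pos(target_idx)
        return self.net(torch.cat([ctx, pos], dim=-1))

context_encoder, target_encoder, predictor = Encoder(), Encoder(), Predictor()
target_encoder.load_state_dict(context_encoder.state_dict())    # start identical

patches = torch.randn(2, NUM_PATCHES, 768)                      # fake patch embeddings
context_idx = torch.arange(0, 150).expand(2, -1)                # visible context patches
target_idx = torch.arange(150, 170).expand(2, -1)               # patches to predict

ctx_in = patches.gather(1, context_idx.unsqueeze(-1).expand(-1, -1, 768))
ctx_feats = context_encoder(ctx_in)

with torch.no_grad():                                           # targets come from the EMA encoder
    all_tgt = target_encoder(patches)                           # mask applied to the *output*
    tgt_feats = all_tgt.gather(1, target_idx.unsqueeze(-1).expand(-1, -1, D))

pred = predictor(ctx_feats, target_idx)
loss = F.mse_loss(pred, tgt_feats)          # (squared) L2 distance in representation space
loss.backward()                             # gradients reach context encoder + predictor only
```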

I-JEPA’s Methodology

I-JEPA’s process is quite straightforward:

  1. It sizes up the image, dividing it into multiple blocks, and picks out four target blocks. This is like a boxer studying his opponent and identifying key areas to strike.
  2. Here’s where the Vision Transformer (ViT), its trusty trainer, comes in. The ViT acts as a context encoder, a target encoder, and a predictor. This is akin to a boxer’s trainer who maps out the fight plan, anticipates the opponent’s moves, and adjusts the strategy mid-fight. And while the ViT might remind some of the Masked AutoEncoder (MAE), it’s got its own unique spin — I-JEPA doesn’t just throw punches in the dark; it lands calculated blows in what we call the representation space.
  3. The predictor, kind of like a boxing coach whispering tactics into I-JEPA’s ear, takes the context encoder’s output and predicts the representation of the target block at a specific position, guided by the position token. It’s like a boxer predicting his opponent’s next move based on his current stance. This target representation matches the output of the target encoder, and its weights are constantly updated, dancing to the rhythm of the exponential moving average of the context encoder’s weights.

And that’s I-JEPA’s fight strategy — a straightforward, yet powerful approach to self-supervised learning. It’s clear that I-JEPA is no ordinary contender; it’s a heavyweight champ in the making.

In other words:

  1. Predict the representations of multiple (by default, four) target blocks in the same image using a given context block.
  2. Use a Vision Transformer (ViT) for the context encoder, the target encoder, and the predictor. Although the setup resembles the Masked AutoEncoder (MAE), I-JEPA is non-generative, and predictions are made in representation space.
  3. The predictor takes the output of the context encoder and predicts the representation of a target block at a given position, conditioned on the position token. The target representation corresponds to the output of the target encoder, whose weights are updated as an exponential moving average of the context encoder weights.
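The exponential moving average itself is a one-liner per parameter. A minimal sketch, assuming PyTorch modules for the two encoders; the momentum value here is illustrative, and in practice it is typically scheduled toward 1.0 over the course of training.

```python
# EMA update: the target encoder slowly tracks the context encoder.
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum=0.996):
    for tgt, ctx in zip(target_encoder.parameters(), context_encoder.parameters()):
        # target <- momentum * target + (1 - momentum) * context
        tgt.mul_(momentum).add_(ctx, alpha=1.0 - momentum)
```

In a full training loop this would be called right after each gradient step on the context encoder and predictor.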

Experimental Findings

Performance Check on ImageNet

I-JEPA was tested on ImageNet-1k, a large image database, using a ViT-H/16 model pretrained at 448 × 448 resolution. It did better than other methods that don’t use manual data augmentation, and larger I-JEPA models performed as well as methods that rely on data augmentation, showing that the approach scales well: larger I-JEPA models do not need hand-crafted view augmentations, and their performance matches that of view-invariance approaches.

Evaluation using ImageNet-1K 1%

In tests, the I-JEPA approach, when using the ViT-H/14 architecture, performed just as well as a ViT-L/16 pretrained with data2vec but required less computational power. Moreover, by increasing image resolution, I-JEPA surpassed other joint embedding methods that use manual data augmentation during pretraining. This indicates I-JEPA’s effectiveness. Also, in a semi-supervised test on ImageNet-1K, even with only 1% of the labels available, I-JEPA outperformed traditional methods that use manual data augmentation, showing that it benefits greatly from scale.

Thus, I-JEPA was as good as other methods, even though it required less computational power. It also performed better when the image resolution was increased.

Transfer learning

I-JEPA was tested on different image classification tasks. It did better than other methods (MAE and data2vec) that don’t use data augmentation and was almost as good as methods that do.

Furthermore, I-JEPA has successfully bridged the gap with top-performing approaches that use view invariance and data augmentation. Notably, when tested with CIFAR100 and Place205 using linear probes, I-JEPA performed better than DINO.
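For readers unfamiliar with the protocol, a linear probe simply freezes the pretrained encoder and trains a single linear classifier on top of its features. A minimal sketch, assuming the encoder returns a (batch, patches, dim) tensor of features and that a standard labeled loader is available; all names here are placeholders of my own.

```python
# Linear-probe sketch: frozen backbone, trainable linear classifier on top.
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, num_classes, feat_dim, epochs=10):
    encoder.eval()                                    # frozen backbone
    for p in encoder.parameters():
        p.requires_grad_(False)

    probe = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = encoder(images).mean(dim=1)   # average-pool patch features
            logits = probe(feats)
            loss = loss_fn(logits, labels)
            optimizer.zero_grad()
            loss.backward()                           # only the probe is updated
            optimizer.step()
    return probe
```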

Local Prediction task

I-JEPA did better than other methods (such as DINO and iBOT) on tasks such as counting objects and predicting depth in the Clevr dataset. This shows that it can capture low-level image features.

This demonstrates that I-JEPA is capable of identifying and using low-level image features during its initial learning phase, leading to impressive results on tasks that require detailed and high-density predictions.

Scalability

Model efficiency

I-JEPA performs better while using less computational power compared to previous methods, and it doesn’t depend on data augmentation. Specifically, it converges about five times faster than reconstruction-based techniques like MAE, even though it does more complex computations in the representation space. Additionally, I-JEPA operates significantly faster than methods based on view invariance, like iBOT, that rely on data augmentation. Notably, even the large I-JEPA model (ViT-H/14) requires less computation than the smaller iBOT model (ViT-S/16).

Scaling by data size

Increasing the size of the training dataset improved the performance of the model on different tasks, showing the benefits of using larger and more diverse datasets.

Scaling by model size

A larger model, specifically ViT-G/16, proved to be effective when pre-trained on IN22K. This larger model significantly improved performance in image classification tasks, such as Place205 and INat18, compared to smaller models like ViT-H/14. However, this size did not enhance performance for tasks that require low-level detail, possibly because the larger model uses bigger input patches, which might not be ideal for tasks that require detailed, local predictions.

Predictor Visualizations

Evaluate predictor learning effect

To test whether the predictor accurately captures the uncertainty of a target’s position, the predictor’s weights and the context encoder were frozen after pretraining, and a decoder based on the Representation Conditional Diffusion Model (RCDM) was trained to map the predicted representations back to pixels. Visualizations of the predictor’s output showed that it could accurately capture positional uncertainty and generate high-level object parts, such as the back of a bird or the top of a car, while tending to overlook precise low-level details and background information. In summary, the predictor does capture the uncertainty of the target position and can generate high-level object parts, but not low-level details or background information.

Ablations

Comparison of masking strategies

Multi-block masking was compared with other strategies: rasterized masking, which splits the image into four large quadrants and uses one quadrant to predict the others, as well as the traditional block and random masking used in reconstruction-based methods. After training a ViT-B/16 model for 300 epochs, the multi-block masking approach was found to be the most effective when evaluated on ImageNet-1K using just 1% of the available labels.

Prediction in representation space

When tested on 1% of ImageNet-1K data using linear probes, I-JEPA showed improved performance. This improvement is attributed to computing the loss in representation space rather than in pixel space, which lets the target encoder produce more abstract prediction targets. Predicting in pixel space instead was found to lower performance, emphasizing the significance of the target encoder during pretraining.
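The ablation boils down to where the regression target comes from. The toy snippet below simply contrasts the two choices, with placeholder tensors standing in for model outputs; I-JEPA uses the first, while MAE-style reconstruction methods use the second.

```python
# Toy contrast of the two loss targets (placeholder tensors, shapes only).
import torch
import torch.nn.functional as F

B, N, D, P = 2, 20, 64, 768                  # batch, target patches, feature dim, pixels per patch

pred_feats   = torch.randn(B, N, D)          # predictor output
target_feats = torch.randn(B, N, D)          # frozen target-encoder output
pred_pixels  = torch.randn(B, N, P)          # hypothetical pixel-space decoder output
pixels       = torch.randn(B, N, P)          # raw patch pixels

loss_repr  = F.mse_loss(pred_feats, target_feats)    # I-JEPA: loss in representation space
loss_pixel = F.mse_loss(pred_pixels, pixels)         # MAE-style: loss in pixel space
```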

Summary

I-JEPA’s performance was assessed in several experiments, including image classification tasks and local prediction tasks. In the image classification tasks, I-JEPA outperformed other methods on linear evaluation in ImageNet-1K. It also demonstrated high performance on various downstream image classification tasks using linear probes.

In local prediction tasks, I-JEPA outperformed view-invariance-based methods (such as DINO and iBOT) on low-level tasks such as object counting and depth prediction on the Clevr dataset. This confirms that I-JEPA effectively captures low-level image features during pretraining, resulting in excellent performance on low-level and high-density prediction tasks.

Additionally, I-JEPA showed great scalability in terms of model efficiency and data size. It uses less computation than previous methods and achieves high performance without relying on data augmentation. Increasing the size of the pretraining dataset improved the performance of transfer learning on both semantic and low-level tasks.

Conclusion

So, as our story comes to an end, it’s clear that I-JEPA is not just another contender in the field of self-supervised learning. No, folks, it’s a trailblazer, a game-changer. This savvy newcomer has a knack for learning high-quality representations, and it does so without relying on manual data augmentation, a common crutch for many in the field.

In the great battle of tasks, I-JEPA proves its worth time and time again. Whether it’s taking on linear probes of ImageNet-1K, wrestling with semi-supervised learning with just 1% of ImageNet-1K, or diving into semantic transfer tasks, I-JEPA is up to the challenge. It doesn’t just match the performance of view-invariant pre-training methods on semantic tasks; it often surpasses them.

And when it comes to low-level visual tasks, well, let’s just say I-JEPA delivers results that are the talk of the town. This efficiency and scalability, coupled with its ability to reduce the overall computational requirements for self-supervised pre-training, have quickly catapulted I-JEPA to a position of prominence in the field of self-supervised learning.

So here’s to I-JEPA, the rookie that’s shaking up the game and showing us all how it’s done! We can’t wait to see what it will do next.

I’m Joe, and my ambition is to lead the way to industry 5.0 performance. I’m always interested in new opportunities, so don’t hesitate to contact me on my LinkedIn.
