DINO-ViT — Beyond Self-Supervised Classification
Distill Fine-Grained Features Without Supervision
Previously, I have written several articles briefly discussing self-supervised learning and, in particular, contrastive learning. What has not yet been covered, however, is a parallel branch of self-supervised approaches that rely on interactions between multiple networks, one that has recently begun to emerge and excel. As of today, one of the state-of-the-art training methods is DINO, a predominantly knowledge-distillation-based method applied to vision transformers (DINO-ViT). The most surprising element of this architecture, however, is no longer its strong classification performance, but its dense features, which turn out to be capable of much more fine-grained tasks such as part segmentation and even finding correspondences across different objects.
In this article, we will go over how DINO-ViT is trained, followed by a brief tutorial on how to utilise existing libraries for part co-segmentation and finding correspondences.
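Before diving in, it helps to see what "dense features" means in practice. The sketch below loads a pretrained DINO ViT-S/16 through the official facebookresearch/dino torch.hub entry point and extracts both the global [CLS] embedding and the per-patch tokens that co-segmentation and correspondence methods build on; the image path is a placeholder, so substitute any RGB image of your own:

```python
import torch
from PIL import Image
from torchvision import transforms

# Load the official pretrained DINO ViT-S/16 via torch.hub
# (from the facebookresearch/dino repository).
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

# Standard ImageNet preprocessing, as used in the DINO repo's evaluation code.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

# 'cat.jpg' is a placeholder path.
img = preprocess(Image.open('cat.jpg').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    # Global image embedding: the final [CLS] token (384-d for ViT-S).
    cls_feat = model(img)
    # Dense features: all tokens from the last block, shape (1, 197, 384);
    # token 0 is [CLS], the remaining 196 are one feature per 16x16 patch.
    tokens = model.get_intermediate_layers(img, n=1)[0]
    patch_feats = tokens[:, 1:, :]

print(cls_feat.shape)     # torch.Size([1, 384])
print(patch_feats.shape)  # torch.Size([1, 196, 384])
```

It is this grid of 196 patch descriptors, rather than the single [CLS] vector, that carries the fine-grained information used throughout the rest of this article.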
What is DINO-ViT?
The term DINO comes from self-DIstillation with NO labels. As its name suggests, DINO-ViT applies a variant of the traditional knowledge-distillation method to the powerful vision transformer (ViT) architecture. This idea is somewhat inspired by the…
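The heart of that self-distillation setup fits in a few lines. Following the pseudocode published in the DINO paper, the sketch below shows the loss for one pair of augmented views: the teacher's output is centred and sharpened with a low temperature, and the student is trained with cross-entropy to match it. The tensors, batch size, and reduced output dimension K here are toy placeholders (the paper's projection head outputs 65,536 dimensions):

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centred, sharpened teacher distribution
    and the student distribution, for one pair of augmented views."""
    # The teacher only provides targets; no gradients flow through it.
    teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    student_logp = F.log_softmax(student_out / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

# Toy projection-head outputs (batch of 8; K reduced from the paper's 65,536).
K = 1024
student_out = torch.randn(8, K)  # student sees one augmented view
teacher_out = torch.randn(8, K)  # teacher sees another view of the same images
center = torch.zeros(K)          # running mean of teacher outputs

loss = dino_loss(student_out, teacher_out, center)
print(loss.item())

# The teacher has no optimiser: after each student update, its weights are set
# to an exponential moving average of the student's,
#   teacher_param <- l * teacher_param + (1 - l) * student_param,
# and the centre is updated as a running mean of the teacher's outputs.
```

Centring and sharpening pull in opposite directions: one stops a single dimension from dominating, the other stops the output from collapsing to a uniform distribution, and together they let DINO avoid collapse without any negative pairs.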

