Representation Learning with Catalyst and Faces
Authors: Dmytro Doroshenko, Nikita Balagansky, Sergey Kolesnikov — Catalyst Team
During the last two years, there has been enormous progress in representation learning through the face recognition task. Starting from the well-known ArcFace in 2018, a few other “Faces” followed: SubCenterArcFace, CosFace, AdaCos, CurricularFace, and more. The first “Face” layers were introduced with the Catalyst 20.10 release, and now is a great time to give a full introduction to them. In this post, we will dive into the “Faces”, introduce their intuition, and compare them on a small toy task.
Introduction
The original challenge comes from large-scale face recognition and the design of appropriate loss functions that enhance discriminative power during face representation learning. As a result, we want minimal intra-class variance and a maximal inter-class margin for accurate face verification.
Other approaches
Before we dive into the Faces, let’s review other possible solutions for this task and answer the question of whether we could use them instead.
- Train a typical supervised classifier, remove the last linear layer, and use the penultimate-layer features as embeddings.
- Use metric learning approach — triplets, quadruplets, etc.
Speaking of the classification approach — we definitely can, but the representation quality of those embeddings will be worse. The intuition here is quite simple: in a plain supervised learning task, there is no auxiliary objective pushing the neural network to learn good feature representations.
Linear separability for classification ≠ discriminative feature representations.
Speaking of the second one, metric learning is another exciting area of representation learning, with its own advantages and disadvantages. We suggest reading our previous blog post on metric learning (we are working on this branch too) to learn more about this approach.
From the face recognition task perspective, a pure metric learning approach has a few limitations:
- there is a combinatorial explosion in the number of required face triplets, leading to a significant increase in training time,
- sample mining is another difficult problem for effective model training.
Faces
The intuition behind the “Faces” is quite simple: let’s keep the well-known classification approach with softmax and add an extra objective for feature discrimination, maximizing inter-class variance (between classes) and minimizing intra-class variance (within a particular class).
To get some extra intuition, let’s check the toy example from the paper:
While the classification approach focuses only on an approximate “between classes” distinction, Faces also shape the “within a class” distribution.
Faces family
Over the past three months, we have implemented many “Faces” in our contrib:
- ArcFace — incorporated margins in well-established loss functions in order to maximize face class separability.
- CosFace — reformulated the softmax loss as a cosine loss by L2 normalizing both features and weight vectors to remove radial variations, based on which a cosine margin term is introduced to further maximize the decision margin in the angular space.
- AdaCos — a modified version of CosFace with adaptive scaling.
- SubCenterArcFace — the same as ArcFace, with the assumption that each training class has several centroids.
- CurricularFace — embeds the idea of curriculum learning into the loss function to achieve a novel training strategy for deep face recognition, which mainly addresses easy samples in the early training stage and hard ones in the later stage. Specifically, this layer adaptively adjusts the relative importance of easy and hard samples during different training stages.
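To make the margin idea concrete, here is a minimal PyTorch sketch of an ArcFace-style head. This is a simplified illustration, not Catalyst’s actual contrib implementation — the real layers handle more edge cases:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceSketch(nn.Module):
    """Toy ArcFace head: adds an angular margin m to the target-class angle."""

    def __init__(self, in_features: int, num_classes: int,
                 s: float = 64.0, m: float = 0.5):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, features: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # cos(theta): cosine between L2-normalized features and class weights
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the margin m only to the ground-truth class angle
        one_hot = F.one_hot(targets, num_classes=cosine.size(1)).bool()
        logits = torch.where(one_hot, torch.cos(theta + self.m), cosine)
        # rescale so CrossEntropyLoss gets reasonably peaked logits
        return self.s * logits
```

The margin makes the target logit harder to satisfy, so the encoder has to pull same-class embeddings closer together in angular space.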
For a fair benchmark, we will also compare several classification-based methods:
- ArcMarginProduct — an adaptation of the CosFace idea for supervised learning: a linear layer with normalization of both features and weights.
- Classification — just the typical supervised learning approach.
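As a rough sketch (the class name mirrors, but does not reproduce, Catalyst’s contrib code), ArcMarginProduct is just a linear layer over L2-normalized features and weights, so every logit is a cosine similarity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginProductSketch(nn.Module):
    """Linear layer with L2-normalized inputs and weights: logits = cos(theta)."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # each output is a cosine similarity, hence bounded to [-1, 1]
        return F.linear(F.normalize(features), F.normalize(self.weight))
```

Unlike the margin-based heads, it does not need the targets in its forward pass, so it drops into a standard classification pipeline as a one-line change.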
The experiment
To compare different representation learning approaches, let’s check a toy example on the imagewoof2 classification problem and visualize learned embeddings.
Imagewoof is a subset of 10 classes from Imagenet that aren’t so easy to classify, since they’re all dog breeds. The breeds are: Australian terrier, Border terrier, Samoyed, Beagle, Shih-Tzu, English foxhound, Rhodesian ridgeback, Dingo, Golden retriever, Old English sheepdog. Source: fastai/imagenette.
If you would like to run the experiment on your own, please follow the link:
Model architecture
The model architecture with the “Face” layer for the classification task is straightforward:
- encoder to extract representation from the input images,
- “Face” head to map the representation into the feature space and produce discriminative class logits with respect to the targets.
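A minimal sketch of such a model — the `EncoderWithHead` name follows the post, while the encoder and head below are toy stand-ins, not the actual Catalyst modules:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderWithHead(nn.Module):
    """Encoder produces embeddings; the "Face" head turns them into class logits."""

    def __init__(self, encoder: nn.Module, head: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.head = head

    def forward(self, x: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        embeddings = self.encoder(x)
        # margin-based heads need the targets to know where to apply the margin
        return self.head(embeddings, targets)

# toy stand-ins: a flatten+linear "encoder" and a cosine head without margin
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16))

class CosineHead(nn.Module):
    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, in_features))

    def forward(self, emb, targets):
        return F.linear(F.normalize(emb), F.normalize(self.weight))

model = EncoderWithHead(encoder, CosineHead(16, 10))
```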
Model training
The model training with a “Face” looks exactly the same as a typical classification pipeline with the CrossEntropy loss function. The only difference: you should pass both features and targets to our EncoderWithHead during training to get class logits:
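A sketch of one such training step — the model, head, and data here are toy placeholders for the real pipeline:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFaceModel(nn.Module):
    """Placeholder encoder + cosine head that consumes (features, targets)."""

    def __init__(self, in_dim: int = 32, emb_dim: int = 16, num_classes: int = 10):
        super().__init__()
        self.encoder = nn.Linear(in_dim, emb_dim)
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim))

    def forward(self, x, targets):
        emb = self.encoder(x)
        # a real "Face" head would use `targets` to apply its margin here
        return F.linear(F.normalize(emb), F.normalize(self.weight))

model = ToyFaceModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.randn(8, 32)                 # a synthetic batch
targets = torch.randint(0, 10, (8,))

logits = model(x, targets)             # pass features AND targets during training
loss = criterion(logits, targets)      # plain CrossEntropy over the logits
optimizer.zero_grad()
loss.backward()
optimizer.step()
```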
Model inference
The model inference step for representation extraction requires only the encoder part of our model, but do not forget to normalize the output embeddings:
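A sketch of the inference step — the encoder here is a toy placeholder for the trained one:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(32, 16)  # placeholder for the trained encoder part

encoder.eval()
with torch.no_grad():
    # L2-normalize so downstream cosine/KNN search works as expected
    embeddings = F.normalize(encoder(torch.randn(4, 32)))
```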
Results
We used t-SNE and a 2D PCA projection for visualization. PCA is especially convenient because it is interpretable and deterministic. To check our comparison results, we strongly suggest running our colab-notebook and inspecting the learned representations yourself.
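The 2D projection itself takes only a few lines with scikit-learn; the embeddings below are random placeholders for the learned ones:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 16))  # stand-in for learned embeddings

pca = PCA(n_components=2)
points_2d = pca.fit_transform(embeddings)  # (x, y) pairs for the scatter plot
```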
ArcFace
CosFace
AdaCos
SubCenterArcFace
CurricularFace
ArcMarginProduct
Classification
Train data compression — explained PCA variance
2D projections are good, but we can also plot the explained variance ratio against the number of PCA components. Would the “Faces” embeddings get a lower explained PCA variance due to richer representations? Let’s check this out!
By this benchmark, all the approaches look about the same for the train data compression task.
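The curve behind this comparison is just the cumulative `explained_variance_ratio_` of a fitted PCA, roughly like this (with random placeholder embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))  # stand-in for learned embeddings

pca = PCA().fit(embeddings)  # full PCA: as many components as features
cumulative = np.cumsum(pca.explained_variance_ratio_)
# plot `cumulative` against the component index to get the compression curve
```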
Test data performance — KNN Accuracy
Thanks to their embeddings-based nature, we can compare methods in terms of “DNN Accuracy” — model performance during supervised learning validation — and “KNN Accuracy” — KNN-search performance. Let’s check the “Faces” for compatibility with a KNN classifier.
The classification approach is the best in terms of “DNN” performance, as expected. A more interesting story comes with CosFace: while it has the worst “DNN” performance, its learned representations are the best according to our tiny KNN benchmark.
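“KNN Accuracy” here means fitting a KNN classifier on the train embeddings and scoring it on the test embeddings, roughly as follows (random placeholder data in place of real embeddings and labels):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
# stand-ins for train/test embeddings and their labels
train_emb = rng.normal(size=(200, 16))
train_y = rng.integers(0, 10, size=200)
test_emb = rng.normal(size=(50, 16))
test_y = rng.integers(0, 10, size=50)

# L2-normalize so Euclidean KNN behaves like cosine search
train_emb /= np.linalg.norm(train_emb, axis=1, keepdims=True)
test_emb /= np.linalg.norm(test_emb, axis=1, keepdims=True)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_emb, train_y)
knn_accuracy = knn.score(test_emb, test_y)
```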
Discussion
Which “Face” should I use?
First, check them all with our colab-notebook — it’s totally free and very informative. For a first trial, we suggest starting with ArcFace (the winners of Google Landmark Recognition 2020 used this criterion) or CosFace (the best “KNN” performance). If you would like a one-line improvement for your supervised learning pipeline — use ArcMarginProduct to get good performance in both “DNN” and “KNN” setups.
How to select the s and m parameters for a “Face”?
To recap: m is a fixed parameter introduced to control the magnitude of the cosine margin, and s is a parameter that controls the contribution of the L2 norm of a feature vector to the scoring function.
For the first experiments, we highly suggest setting s to sqrt(2) * log(num_classes - 1), as studied in the AdaCos paper.
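For example, for the 10-class ImageWoof setup this heuristic gives:

```python
import math

num_classes = 10
s = math.sqrt(2) * math.log(num_classes - 1)  # AdaCos-suggested scale
print(round(s, 3))  # ≈ 3.107
```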
Conclusion and future plans
In this post, we have introduced the new Catalyst “Faces” contrib and compared the layers on a toy ImageWoof classification task. In future posts, we will cover more metric learning topics and speak about contrastive learning and other cross-area connections. So… stay tuned and follow us on Twitter, Medium, YouTube, or check catalyst-team.com for more examples.
Stay safe and let good representations be with you!