Representation Learning with Catalyst and Faces

Catalyst Team
Feb 4

Authors: Dmytro Doroshenko, Nikita Balagansky, Sergey Kolesnikov — Catalyst Team

During the last two years, there has been enormous progress in representation learning driven by the face recognition task. Starting with the well-known ArcFace in 2018, a few other “Faces” followed: SubCenterArcFace, CosFace, AdaCos, CurricularFace, and more. The first “Face” layers were introduced in the Catalyst 20.10 release, and now is a great time to give them a proper introduction. In this post, we will dive into the “Faces”, explain their intuition, and compare them on a small toy task.

[Figure: An overview of the Face framework. In the training phase, the discriminative features are learned with a large margin between different classes. In the testing phase, the testing data is fed into the Face layer to extract features, which are later used to compute the similarity score to perform verification and identification. Credits: https://arxiv.org/abs/1801.09414]

Introduction

The original challenge comes from large-scale face recognition and the design of appropriate loss functions that enhance the discriminative power of the learned face representations. In other words, we want minimal intra-class variance and a maximal inter-class margin for accurate face verification.

Before we dive into the Faces, let’s review other possible solutions for this task and ask whether we could use them instead.

  1. Train a typical supervised classifier, remove the last linear layer, and use the remaining features as embeddings.
  2. Use a metric learning approach: triplets, quadruplets, etc.
[Figure: Supervised learning as representation learning]

Speaking of the classification approach: we definitely can, but the resulting embeddings will perform worse as representations. The intuition here is quite simple: during a plain supervised learning task, there is no auxiliary objective that pushes the neural network to learn good feature representations.

Linear separability for classification ≠ discriminative feature representations.

Speaking of the second one, metric learning is another exciting topic in the representation learning field, with its own advantages and disadvantages. We suggest reading our previous blog post on Metric Learning (we are working on this direction too) to learn more about this approach.

From the face recognition task perspective, the pure metric learning approach has a few limitations:

  • there is a combinatorial explosion in the number of required face triplets, which leads to a significant increase in training time,
  • sample mining is another difficult problem for effective model training.

Faces

The intuition behind the “Faces” is quite simple: keep the previous, well-known classification approach with softmax and add an extra objective for feature discrimination, maximizing inter-class variance (between classes) and minimizing intra-class variance (within a particular class).
To get some extra intuition, let’s check the toy example from the ArcFace paper:

[Figure: toy example from the ArcFace paper. Credits: https://arxiv.org/abs/1801.07698]

While the classification approach focuses only on an approximate “between classes” distinction, Faces also shape the “within a class” distribution.

[Figure: Softmax loss]
[Figure: “Face” loss. SphereFace, ArcFace, and CosFace in a unified framework with m1, m2, and m3 as the hyper-parameters]
[Figure: Geometric difference]
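Written out explicitly (following the combined-margin formulation from the ArcFace paper), the two losses referenced above are:

```latex
% Plain softmax loss over N samples and n classes
L_{\mathrm{softmax}} = -\frac{1}{N}\sum_{i=1}^{N}
  \log \frac{e^{W_{y_i}^{\top} x_i + b_{y_i}}}
            {\sum_{j=1}^{n} e^{W_j^{\top} x_i + b_j}}

% Unified "Face" loss: features x_i and class weights W_j are L2-normalized,
% so W_j^T x_i = cos(theta_j); s is the scale, and m1, m2, m3 are the
% SphereFace (multiplicative), ArcFace (additive angular), and CosFace
% (additive cosine) margins, respectively
L_{\mathrm{face}} = -\frac{1}{N}\sum_{i=1}^{N}
  \log \frac{e^{s\,(\cos(m_1 \theta_{y_i} + m_2) - m_3)}}
            {e^{s\,(\cos(m_1 \theta_{y_i} + m_2) - m_3)}
             + \sum_{j \ne y_i} e^{s \cos \theta_{j}}}
```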

Over the past three months, we have implemented many “Faces” in our contrib:

  • ArcFace — incorporates an additive angular margin into the well-established softmax loss in order to maximize face class separability (a minimal sketch of the margin mechanism appears after these lists).
  • CosFace — reformulates the softmax loss as a cosine loss by L2-normalizing both features and weight vectors to remove radial variations; on top of that, a cosine margin term is introduced to further maximize the decision margin in the angular space.
  • AdaCos — a modified version of CosFace with adaptive scaling.
  • SubCenterArcFace — the same as ArcFace, but with the assumption that each training class has several centroids (sub-centers).
  • CurricularFace — embeds the idea of curriculum learning into the loss function to achieve a novel training strategy for deep face recognition: it mainly addresses easy samples in the early training stage and hard ones in the later stages. Specifically, this layer adaptively adjusts the relative importance of easy and hard samples during different training stages.

For a fair benchmark, we will also compare several classification-based methods:

  • ArcMarginProduct — an adaptation of the CosFace idea for supervised learning: a linear layer that normalizes both features and weights.
  • Classification — just a typical supervised learning approach.
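To make the margin mechanism concrete, here is a minimal ArcFace-style head in plain PyTorch. It is a sketch rather than the Catalyst implementation: the class name, the default s and m values, and the clamping epsilon are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArcFaceHead(nn.Module):
    """Minimal ArcFace-style head: s * cos(theta + m) logit for the target class."""

    def __init__(self, in_features: int, num_classes: int, s: float = 64.0, m: float = 0.5):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, features: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # cosine similarity between L2-normalized features and class weights
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # add the angular margin m only to the target-class angle
        one_hot = F.one_hot(targets, num_classes=cosine.size(1)).bool()
        logits = torch.where(one_hot, torch.cos(theta + self.m), cosine)
        return self.s * logits  # feed these logits into CrossEntropyLoss
```

CosFace applies the margin in cosine space instead (cos(theta) - m), and the other layers vary the margin or the scale in the same spot.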

The experiment

[Figure: ArcFace training pipeline example]

To compare the different representation learning approaches, let’s run a toy experiment on the imagewoof2 classification problem and visualize the learned embeddings.

Imagewoof is a subset of 10 classes from Imagenet that aren’t so easy to classify, since they’re all dog breeds. The breeds are: Australian terrier, Border terrier, Samoyed, Beagle, Shih-Tzu, English foxhound, Rhodesian ridgeback, Dingo, Golden retriever, Old English sheepdog. Source: fastai/imagenette.

If you would like to run the experiment on your own, please follow the link:

Model architecture

The model architecture with the “Face” layer for the classification task is straightforward:

  • an encoder to extract a representation from the input images,
  • a “Face” head to turn the representation in the feature space into discriminative class logits with respect to the targets (a minimal sketch of the encoder-plus-head wrapper follows).
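A minimal sketch of such a wrapper (the EncoderWithHead name follows the post; the ResNet-18 backbone and the ArcFaceHead from the sketch above are assumptions, not the notebook’s exact setup):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class EncoderWithHead(nn.Module):
    """Encoder that produces embeddings plus a 'Face' head that produces class logits."""

    def __init__(self, head: nn.Module):
        super().__init__()
        backbone = resnet18()        # randomly initialized backbone
        backbone.fc = nn.Identity()  # keep the 512-d pooled features as embeddings
        self.encoder = backbone
        self.head = head             # e.g. the ArcFaceHead sketched above

    def forward(self, images: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        embeddings = self.encoder(images)      # (batch_size, 512)
        return self.head(embeddings, targets)  # class logits for CrossEntropyLoss
```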

Model training

The model training with a “Face” head looks exactly the same as a typical classification pipeline with the CrossEntropy loss function. The only difference is that you pass both the features (input images) and the targets to the EncoderWithHead during training to get the class logits:
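A minimal training step under the assumptions above (plain PyTorch instead of a Catalyst runner; the optimizer and learning rate are illustrative):

```python
import torch
import torch.nn as nn

# reuses the sketches above; ImageWoof has 10 classes
model = EncoderWithHead(ArcFaceHead(in_features=512, num_classes=10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)


def train_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    # unlike a plain classifier, the "Face" head needs the targets at
    # forward time to apply the angular margin to the correct class
    logits = model(images, targets)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```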

Model inference

The model inference step for representation extraction requires only the encoder part of the model, but do not forget to normalize the output embeddings:
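A sketch of the inference step, again under the assumptions above:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def extract_embeddings(model: EncoderWithHead, images: torch.Tensor) -> torch.Tensor:
    model.eval()
    embeddings = model.encoder(images)     # only the encoder; the "Face" head is dropped
    return F.normalize(embeddings, dim=1)  # L2-normalize before any similarity search
```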

Results

[Figure: CurricularFace 3D PCA projection]

We used t-SNE and 2D PCA projections for visualization. PCA is especially handy thanks to its interpretability and deterministic algorithm. To check our comparison results, we strongly suggest running our colab-notebook and inspecting the learned representations yourself.
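A minimal sketch of how such projections can be produced with scikit-learn (not the notebook’s exact plotting code):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def plot_projection(embeddings, labels, method: str = "pca") -> None:
    """Project (num_samples, dim) embeddings to 2D and color the points by class."""
    reducer = PCA(n_components=2) if method == "pca" else TSNE(n_components=2)
    points = reducer.fit_transform(embeddings)
    plt.figure(figsize=(6, 6))
    plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=5)
    plt.title(f"{method.upper()} projection")
    plt.show()
```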

[Figures: t-SNE and PCA projections of the learned embeddings for ArcFace, CosFace, AdaCos, SubCenterArcFace, CurricularFace, ArcMarginProduct, and Classification]

2D projections are nice, but we can also plot the explained variance ratio as a function of the number of PCA components. Would “Faces” embeddings show a lower explained PCA variance due to richer representations? Let’s check this out!
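One way to compute this curve (a scikit-learn sketch; n_components is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA


def explained_variance_curve(embeddings, n_components: int = 128) -> np.ndarray:
    """Cumulative share of variance captured by the first k PCA components."""
    pca = PCA(n_components=n_components).fit(embeddings)
    return np.cumsum(pca.explained_variance_ratio_)
```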

[Figure: Explained variance ratio (up to 16 PCA components)]
[Figure: Explained variance ratio (up to 128 PCA components)]

From this benchmark, we can see that all the approaches look roughly the same on the train data compression task.

Thanks to their embeddings-based nature, we can compare the methods in terms of “DNN Accuracy” (model performance during supervised learning validation) and “KNN Accuracy” (KNN-search performance). Let’s check how well the “Faces” work with a KNN classifier.

KNN evaluation code
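The exact snippet lives in the notebook; a minimal equivalent with scikit-learn could look like this (the number of neighbors is an assumption):

```python
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier


def knn_accuracy(train_emb, train_labels, valid_emb, valid_labels, n_neighbors: int = 5) -> float:
    # embeddings are assumed to be L2-normalized, so Euclidean kNN
    # behaves like cosine-similarity search
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(train_emb, train_labels)
    return accuracy_score(valid_labels, knn.predict(valid_emb))
```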
[Figure: KNN Accuracy]
[Table: Accuracy metrics comparison]

As expected, the classification approach is the best in terms of “DNN” performance. A more interesting story comes with CosFace: while it has the worst “DNN” performance, its learned representations are the best according to our tiny KNN benchmark.

Discussion

First, check them all with our colab-notebook: it’s totally free and very informative. For a first trial, we suggest starting your experiments with ArcFace (the winners of Google Landmark Recognition 2020 used this criterion) or CosFace (the best “KNN” performance in our benchmark). If you would like a one-line improvement for your supervised learning pipeline, use ArcMarginProduct to get good performance in both the “DNN” and “KNN” setups.

To recap:

  • m is a fixed parameter introduced to control the magnitude of the cosine (angular) margin.
  • s is a scale parameter that controls the contribution of the L2 norm of a feature vector to the scoring function.

For the first experiments, we highly suggest setting s to sqrt(2) * log(num_classes - 1), as studied in the AdaCos paper.
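For example, for the 10-class ImageWoof setup this gives:

```python
import math

num_classes = 10  # ImageWoof
s = math.sqrt(2) * math.log(num_classes - 1)
print(round(s, 2))  # 3.11
```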

Conclusion and future plans

In this post, we have walked through the new Catalyst “Faces” contrib and compared the layers on a toy ImageWoof classification task. In future posts, we will cover more metric learning topics and talk about contrastive learning and other cross-area connections. So… stay tuned and follow us on Twitter, Medium, YouTube, or check catalyst-team.com for more examples.

Stay safe and let good representations be with you!

Catalyst Team
