Published in Catalyst Team

Representation Learning with Catalyst and Faces

Authors: Dmytro Doroshenko, Nikita Balagansky, Sergey Kolesnikov — Catalyst Team

During the last two years, there has been enormous progress in representation learning through the face recognition task. Starting from the well-known ArcFace in 2018, several other “Faces” followed: SubCenterArcFace, CosFace, AdaCos, CurricularFace, and more. The first “Face” layers were introduced with the Catalyst 20.10 release, and now is a great time to give a full intro to them. In this post, we will dive into “Faces”, introduce the intuition behind them, and compare them on a small toy task.

An overview of the Face framework. In the training phase, discriminative features are learned with a large margin between different classes. In the testing phase, the testing data is fed into the trained model to extract features, which are later used to compute a similarity score for verification and identification.


The original challenge comes from large-scale face recognition: designing loss functions that enhance discriminative power during face representation learning. As a result, we want minimal intra-class variance and a maximal inter-class margin for accurate face verification.

Other approaches

Before we dive into Faces, let’s review other possible solutions for this task and ask whether we could use them instead.

  1. Train a typical supervised classifier, remove the last linear layer, and use the remaining features as embeddings.
  2. Use a metric learning approach — triplets, quadruplets, etc.

Supervised learning as representation learning

Speaking of the classification approach: we definitely can, but the representation quality of those embeddings will be worse. The intuition here is quite simple: during a plain supervised learning task, there is no auxiliary objective pushing the neural network to learn good feature representations.

Linear separability for classification ≠ discriminative feature representations.

As for the second option, metric learning is another exciting topic in the representation learning field, with its own advantages and disadvantages. We suggest reading our previous blog post on Metric Learning (we are working on this branch too) to learn more about this approach.

From the face recognition task perspective, a pure metric learning approach has a few limitations:

  • there is a combinatorial explosion in the number of required face triplets, leading to a significant increase in training time;
  • sample mining is another difficult problem for effective model training.
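To get a feeling for that combinatorial explosion, here is a quick back-of-the-envelope count for a balanced dataset (a sketch only; real pipelines never enumerate all triplets, which is exactly why mining strategies are needed):

```python
def num_triplets(num_classes: int, images_per_class: int) -> int:
    """Count all (anchor, positive, negative) triplets in a balanced dataset.

    For every anchor there are (images_per_class - 1) positives from the same
    class and (num_classes - 1) * images_per_class negatives from the others.
    """
    anchors = num_classes * images_per_class
    positives = images_per_class - 1
    negatives = (num_classes - 1) * images_per_class
    return anchors * positives * negatives

print(num_triplets(10, 100))    # 1000 * 99 * 900 = 89_100_000
print(num_triplets(1000, 100))  # ~10^12, infeasible to enumerate
```

Even a toy 10-class setup yields tens of millions of triplets, and the count grows roughly cubically with the dataset size.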


The intuition behind the “Faces” is quite simple: let’s keep our previous, well-known classification approach with softmax and add an extra objective for feature discrimination, maximizing inter-class variance (between classes) and minimizing intra-class variance (within a particular class).
To get some extra intuition, let’s check the toy example from the paper:


While the classification approach focuses only on an approximate “between classes” distinction, Faces also shape the “within a class” distribution.

Softmax loss
“Face” loss: SphereFace, ArcFace, and CosFace in a unified framework with m1, m2, and m3 as the hyper-parameters
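The unified formulation can be sketched in a few lines of numpy (an illustration only, not the Catalyst implementation; `face_logits` and its defaults are ours): with L2-normalized features and class weights, the target-class logit cos(θ) is replaced by cos(m1·θ + m2) − m3 and everything is scaled by s. SphereFace corresponds to (m1, m2, m3) = (m, 0, 0), ArcFace to (1, m, 0), and CosFace to (1, 0, m).

```python
import numpy as np

def face_logits(features, weights, target, s=64.0, m1=1.0, m2=0.5, m3=0.0):
    """Unified SphereFace/ArcFace/CosFace logits for a single sample.

    features: (dim,) embedding; weights: (num_classes, dim) class centers.
    ArcFace: m1=1, m2=m, m3=0; CosFace: m1=1, m2=0, m3=m; SphereFace: m1=m, m2=0, m3=0.
    """
    x = features / np.linalg.norm(features)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = w @ x                                  # cosine similarity to every class center
    theta = np.arccos(np.clip(cos, -1.0, 1.0))   # angle to every class center
    logits = cos.copy()
    logits[target] = np.cos(m1 * theta[target] + m2) - m3  # margin only for the target class
    return s * logits  # these logits go straight into a regular cross-entropy loss

rng = np.random.default_rng(0)
print(face_logits(rng.normal(size=8), rng.normal(size=(10, 8)), target=3))
```

Note that only the target-class logit is penalized; the margin makes the classification problem harder during training, which forces the encoder to pull same-class embeddings closer together on the hypersphere.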

Faces family

Geometric difference

Over the past three months, we have implemented many “Faces” in our contrib:

  • ArcFace — incorporates an additive angular margin into the well-established softmax loss in order to maximize face class separability.
  • CosFace — reformulates the softmax loss as a cosine loss by L2-normalizing both features and weight vectors to remove radial variations; on top of that, a cosine margin term is introduced to further maximize the decision margin in angular space.
  • AdaCos — a modified version of CosFace with adaptive scaling.
  • SubCenterArcFace — the same as ArcFace, but with the assumption that each training class has a few centroids (sub-centers).
  • CurricularFace — embeds the idea of curriculum learning into the loss function to achieve a novel training strategy for deep face recognition, which mainly addresses easy samples in the early training stage and hard ones later. Specifically, this layer adaptively adjusts the relative importance of easy and hard samples during different training stages.

For a fair benchmark, we will also compare several classification-based methods:

  • ArcMarginProduct — a CosFace-style adaptation for supervised learning: a linear layer with normalization of both features and weights
  • Classification — just the typical supervised learning approach

The experiment

ArcFace training pipeline example

To compare different representation learning approaches, let’s check a toy example on the imagewoof2 classification problem and visualize the learned embeddings.

Imagewoof is a subset of 10 classes from Imagenet that aren’t so easy to classify, since they’re all dog breeds. The breeds are: Australian terrier, Border terrier, Samoyed, Beagle, Shih-Tzu, English foxhound, Rhodesian ridgeback, Dingo, Golden retriever, Old English sheepdog. Source: fastai/imagenette.

If you would like to run the experiment on your own, please follow the link:

Model architecture

The model architecture with the “Face” layer for the classification task is straightforward:

  • an encoder to extract a representation from the input images,
  • a “Face” head to turn that representation into discriminative class logits with respect to the targets
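The two pieces can be sketched in plain PyTorch (a simplified illustration; `ArcFaceHead` and `EncoderWithHead` below are our own minimal versions, not the exact Catalyst contrib code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Simplified ArcFace head: a normalized linear layer with an additive angular margin."""
    def __init__(self, in_features: int, num_classes: int, s: float = 64.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, features: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # cosine similarity between normalized features and normalized class weights
        cos = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        one_hot = F.one_hot(targets, cos.size(1)).bool()
        # add the angular margin m only to the target-class logits
        logits = torch.where(one_hot, torch.cos(theta + self.m), cos)
        return self.s * logits

class EncoderWithHead(nn.Module):
    """Encoder extracts embeddings; the "Face" head turns them into class logits."""
    def __init__(self, encoder: nn.Module, head: nn.Module):
        super().__init__()
        self.encoder, self.head = encoder, head

    def forward(self, x: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(x), targets)
```

Any backbone (a ResNet, for example) can play the encoder role; the head only assumes it outputs a flat feature vector per image.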

Model training

The model training with a “Face” looks exactly the same as a typical classification pipeline with the CrossEntropy loss function. The only difference is that you should pass both features and targets to our EncoderWithHead during training to get the class logits:
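A minimal sketch of the training step (plain PyTorch for brevity; the `train_step` helper is illustrative — in the real pipeline Catalyst’s runner drives this loop):

```python
def train_step(model, batch, criterion, optimizer):
    """One training step for an EncoderWithHead-style model.

    Unlike a plain classifier, the "Face" head needs the targets at train time,
    because the margin is applied to the target-class logit.
    """
    images, targets = batch
    logits = model(images, targets)    # features -> "Face" head -> class logits
    loss = criterion(logits, targets)  # plain CrossEntropy, nothing special
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Everything else — optimizer, scheduler, augmentations — stays exactly as in a regular classification pipeline.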

Model inference

The model inference step for representation extraction requires only the encoder part of our model, but do not forget to normalize the output embeddings:
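A sketch of that inference step (the `extract_embeddings` helper is our illustrative name):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_embeddings(encoder: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Extract L2-normalized embeddings using the encoder only (no "Face" head)."""
    encoder.eval()
    embeddings = encoder(images)
    # after L2 normalization, dot products between embeddings are cosine similarities
    return F.normalize(embeddings, dim=1)
```

Normalization matters because the “Face” losses operate on the hypersphere: only the direction of an embedding carries class information, not its length.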


CurricularFace 3D PCA projection.

We used t-SNE and 2D PCA projections for visualization. PCA is especially great because of its interpretability and deterministic algorithm. To check our comparison results, we strongly suggest running our colab-notebook and exploring the learned representations yourself.


ArcFace — TSNE
ArcFace — PCA


CosFace — TSNE
CosFace — PCA


AdaCos — TSNE
AdaCos — PCA


SubCenterArcFace — TSNE
SubCenterArcFace — PCA


CurricularFace — TSNE
CurricularFace — PCA


ArcMarginProduct — TSNE
ArcMarginProduct — PCA


Classification — TSNE
Classification — PCA

Train data compression — explained PCA variance

2D projections are good, but we could also plot the dependence of the explained variance ratio on the number of PCA components. Would “Faces” embeddings get a lower explained PCA variance due to richer representations? Let’s check this out!
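For reference, this curve is straightforward to compute from any embedding matrix; a numpy sketch (our own helper, not the notebook code):

```python
import numpy as np

def explained_variance_ratio(embeddings: np.ndarray) -> np.ndarray:
    """Cumulative explained variance ratio as a function of the number of PCA components."""
    centered = embeddings - embeddings.mean(axis=0)
    # singular values of the centered data give the per-component variances
    s = np.linalg.svd(centered, compute_uv=False)
    variance = s ** 2
    return np.cumsum(variance) / variance.sum()

rng = np.random.default_rng(42)
emb = rng.normal(size=(256, 16))  # 256 fake embeddings of dimension 16
print(explained_variance_ratio(emb)[:4])  # grows monotonically towards 1.0
```

A steeper curve means the embeddings are compressible into fewer components; a flatter one suggests the variance is spread across more directions.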

Explained variance ratio (up to 16 PCA components)
Explained variance ratio (up to 128 PCA components)

By this benchmark, we can see that for the train data compression task, all approaches look the same.

Test data performance — KNN Accuracy

Thanks to their embedding-based nature, we can compare the methods in terms of “DNN Accuracy” — model performance during supervised learning validation — and “KNN Accuracy” — KNN-search performance. Let’s check how well “Faces” work with a KNN classifier.
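A minimal numpy sketch of such an evaluation (illustrative; the notebook’s exact code may differ):

```python
import numpy as np

def knn_accuracy(train_emb, train_labels, test_emb, test_labels, k=5):
    """KNN classification accuracy with cosine similarity on L2-normalized embeddings."""
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = test @ train.T                          # (num_test, num_train) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]        # indices of the k nearest train neighbors
    votes = train_labels[topk]                     # their labels
    preds = np.array([np.bincount(v).argmax() for v in votes])  # majority vote
    return (preds == test_labels).mean()
```

Because the embeddings are normalized, cosine similarity here is just a dot product, which is exactly what approximate nearest-neighbor indices optimize in production retrieval systems.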

KNN evaluation code
KNN Accuracy
Accuracy metrics comparison

As expected, the classification approach is the best according to the “DNN” performance. A more interesting story comes with CosFace: while it has the worst “DNN” performance, its learned representations are the best according to our tiny KNN benchmark.


Which “Face” should I use?

First, check them all with our colab-notebook — it’s totally free and very informative. As a first trial, we suggest starting with ArcFace (the winners of Google Landmark Recognition 2020 used this criterion) or CosFace (the best “KNN” performance). If you would like a one-line improvement for your supervised learning pipeline, use ArcMarginProduct to get good performance in both “DNN” and “KNN” setups.

How to select s and m parameters for a “Face”?

To recap:

  • m — a fixed parameter introduced to control the magnitude of the cosine margin.
  • s — a scale parameter that controls the contribution of the L2 norm of a feature vector to the scoring function.

For the first experiments, we highly suggest setting s to sqrt(2) * log(num_classes - 1), as studied in the AdaCos paper.
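For example, this rule gives the following values (`adacos_scale` is just our helper name for the formula above):

```python
import math

def adacos_scale(num_classes: int) -> float:
    # fixed AdaCos scale: s = sqrt(2) * log(C - 1)
    return math.sqrt(2) * math.log(num_classes - 1)

print(round(adacos_scale(10), 3))    # 3.107 for a 10-class task like ImageWoof
print(round(adacos_scale(1000), 3))  # 9.768 for a 1000-class problem
```

The scale grows only logarithmically with the number of classes, so even very large identity sets need a fairly modest s.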

Conclusion and future plans

In this post, we have checked the new Catalyst “Faces” contrib and compared the layers on a toy ImageWoof classification task. In future posts, we will cover extra metric learning topics and speak about contrastive learning and other cross-area connections. So… stay tuned and follow us on Twitter, Medium, or YouTube, or check for more examples.

Stay safe and let good representations be with you!




PyTorch framework for Deep Learning research and development. It focuses on reproducibility, rapid experimentation, and codebase reuse so you can create something new rather than write another regular train loop. Break the cycle — use the Catalyst!
