ArcFace — Architecture and a Practical Example: How to Calculate Face Similarity Between Images

Yuki Shizuya
7 min read · Feb 4, 2024


Introduction

Recently, I worked on a project related to face swapping. The majority of existing face-swapping algorithms use ArcFace [1] as a face feature extractor, and I was surprised by ArcFace’s capability to extract human face features accurately. In this blog, I will briefly introduce the ArcFace architecture and walk through a practical example of calculating face image similarity with Python code.

The image from GHOST [2]

Table of Contents:

1: ArcFace architecture

2: Practical example: How to calculate face similarity with code

— 2.1: Prepare environment

— 2.2: Calculate face similarity using Faiss

1. ArcFace architecture

The image from the original paper [1]

ArcFace is one of the most famous deep face recognition methods today. Its main feature is an Additive Angular Margin Loss that enforces intra-class (same person) compactness and inter-class (other people) discrepancy in the embedding space. Before ArcFace, many proposed methods used the softmax loss as a classification loss in deep face recognition. However, using only the softmax loss has some drawbacks. One of them is that the softmax loss doesn’t optimize the feature embedding to enforce higher similarity among intra-class samples and diversity among inter-class samples, which means the boundaries between people are sometimes ambiguous, and this deteriorates model performance. So, the authors introduced an Additive Angular Margin Loss to further improve the discriminative power of face recognition. In the next paragraphs, we will go through a mathematical comparison between the softmax loss and the Additive Angular Margin loss.

The mathematical formulas of the softmax loss and the Additive Angular Margin loss (AAM for short) are as follows:
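In the notation of the original paper [1], they can be written as:

$$L_{\text{softmax}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_i + b_{j}}} \tag{1}$$

$$L_{\text{AAM}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s\cos\theta_j}} \tag{2}$$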

where 𝑥ᵢ ∈ ℝᵈ denotes the image feature of the i-th sample, belonging to the yᵢ-th class. The feature dimension d is conventionally set to 512. 𝘞ⱼ ∈ ℝᵈ refers to the j-th column of the weight 𝘞 ∈ ℝᵈˣⁿ, b ∈ ℝⁿ is the bias term, and the batch size and the class number are N and n, respectively. In formula (2), m denotes the margin, explained later.

Both formulas look intimidating, but they are quite similar. How did the authors come up with this idea? As a premise, they want to make intra-class points closer while keeping inter-class points away from each other. To do so, they focus on the cosine similarity between data points. To understand this, you need to remember the relationship between the cosine similarity and the dot product.

Returning to formula (1), 𝘞ᵗ𝑥 is the dot product of 𝘞 and 𝑥. So, we can transform it using the relationship between the cosine and the dot product, shown in formulas (3) and (4) below.
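For two vectors with angle 𝜃ⱼ between them, the standard identities are:

$$\cos\theta_j = \frac{W_j^{T} x_i}{\lVert W_j \rVert \, \lVert x_i \rVert} \tag{3}$$

$$W_j^{T} x_i = \lVert W_j \rVert \, \lVert x_i \rVert \cos\theta_j \tag{4}$$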

𝜃ⱼ is the angle between 𝘞ⱼ and 𝑥ᵢ. When we fix the bias term b = 0 for simplicity and L2-normalize both 𝘞ⱼ and 𝑥ᵢ, so that the logit reduces to cos 𝜃ⱼ, formula (1) becomes:
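Splitting the target class out of the denominator sum, the loss over the normalized logits reads:

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\cos\theta_{y_i}}}{e^{\cos\theta_{y_i}} + \sum_{j \neq y_i} e^{\cos\theta_j}} \tag{9}$$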

Formula (9) now looks much closer to formula (2) (I rewrote the summation in the denominator, splitting out the target class, for the following step). Next, if we can make the intra-class samples’ angles smaller and the inter-class samples’ angles larger, that sounds good. To achieve both simultaneously, the authors introduced a margin on the target angle.
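Adding the margin m to the target class’s angle turns formula (9) into:

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\cos(\theta_{y_i} + m)}}{e^{\cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{\cos\theta_j}} \tag{10}$$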

Compared with formula (9), the margin in formula (10) forces the model to learn smaller intra-class angles. If it didn’t, the model could not classify samples correctly because the class regions would overlap.

The image adapted from [5]

Finally, we re-scale the cosine values by a factor s to mitigate the effect that the correct-label logits tend to have small values; you can imagine that the numerator becomes small when there are many classes (the typical face recognition setting). Applying the scale s to formula (10) recovers formula (2).

In a practical implementation, you can insert the class below after the final dense layer. It is based on the Insightface implementation.

import math

import torch
import torch.nn as nn


class AdditiveAngularMarginLoss(nn.Module):
    """
    Additive Angular Margin (ArcFace) logit adjustment.
    Insightface implementation: https://github.com/deepinsight/insightface/blob/master/recognition/arcface_torch/losses.py
    ArcFace paper: https://arxiv.org/pdf/1801.07698v1.pdf
    """
    def __init__(self, s=64.0, margin=0.5):
        super(AdditiveAngularMarginLoss, self).__init__()
        self.s = s            # re-scaling factor s
        self.margin = margin  # angular margin m

        self.cos_m = math.cos(margin)
        self.sin_m = math.sin(margin)
        self.theta = math.cos(math.pi - margin)
        self.sinmm = math.sin(math.pi - margin) * margin
        self.easy_margin = False

    def forward(self, logits: torch.Tensor, labels: torch.Tensor):
        # logits are cosine similarities in [-1, 1] from normalized weights and features
        # get the indices of valid labels
        index = torch.where(labels != -1)[0]
        # cos(theta_{y_i}) in the formula of L_AAM
        target_logit = logits[index, labels[index].view(-1)]

        with torch.no_grad():
            # theta_{y_i}
            target_logit.arccos_()
            # theta_j (including the target y_i)
            logits.arccos_()
            # theta_{y_i} + m
            final_target_logit = target_logit + self.margin
            # insert the margin-added angle into the target position of the denominator
            logits[index, labels[index].view(-1)] = final_target_logit
            # back to the cosine domain
            logits.cos_()

        # re-scale by s
        logits = logits * self.s
        return logits
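
A minimal usage sketch of the module above, assuming the logits come from an L2-normalized linear layer so that each entry is a cosine value in [-1, 1] (the sizes and tensors below are arbitrary placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, feat_dim, batch = 1000, 512, 8                # arbitrary sizes

backbone_out = torch.randn(batch, feat_dim)                # embeddings from a backbone
weight = nn.Parameter(torch.randn(num_classes, feat_dim))  # classification weight W
labels = torch.randint(0, num_classes, (batch,))

# cosine logits: both sides L2-normalized, so each logit equals cos(theta_j)
logits = F.linear(F.normalize(backbone_out), F.normalize(weight))

margin_head = AdditiveAngularMarginLoss(s=64.0, margin=0.5)
margin_logits = margin_head(logits, labels)  # s * cos(theta_j), with m added for j = y_i
loss = F.cross_entropy(margin_logits, labels)
loss.backward()

Note that only the target class’s angle receives the margin, so the cross-entropy over these adjusted logits is exactly formula (2).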

So far, we’ve been through the mathematical formulas of ArcFace, but does it hold in practice? I want to visually check the difference between the image features of a model trained without the Additive Angular Margin loss and one trained with it. I used the MNIST dataset as an example and visualized both image features, following references [3][4]. The environment is the same one described in section 2.1. The graphs below show the image features visualized with t-SNE.

(Left) image features without AAM; (Right) image features with AAM

The left graph shows the image features without the additive angular margin penalty, and the right graph shows the image features with it. As you can see, the features trained with the Additive Angular Margin loss (right) are denser within each class and more discriminative between classes than those trained without it (left). You can reproduce the result using the gist below.
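The gist itself is not reproduced here, but a minimal sketch of the experiment could look like the following. The small CNN, the hyperparameters, and the use of scikit-learn’s t-SNE are my own assumptions, not necessarily the author’s setup:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from sklearn.manifold import TSNE


class SmallNet(nn.Module):
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(64 * 7 * 7, feat_dim))
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, x):
        emb = F.normalize(self.features(x))
        # cosine logits in [-1, 1]
        return F.linear(emb, F.normalize(self.weight)), emb


train_set = datasets.MNIST(".", train=True, download=True,
                           transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)

model = SmallNet()
aam = AdditiveAngularMarginLoss(s=30.0, margin=0.5)  # margin=0.0 gives the softmax-only baseline
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for x, y in loader:
        logits, _ = model(x)
        loss = F.cross_entropy(aam(logits, y), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# embed test images and project the features to 2-D with t-SNE
test_set = datasets.MNIST(".", train=False, download=True,
                          transform=transforms.ToTensor())
x_test = torch.stack([test_set[i][0] for i in range(2000)])
with torch.no_grad():
    _, emb = model(x_test)
xy = TSNE(n_components=2).fit_transform(emb.numpy())

Training once with margin=0.0 and once with margin=0.5, then plotting the two xy arrays colored by digit label, produces the kind of comparison shown above.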

2. Practical example: How to calculate face similarity with code

In this section, we will calculate the similarities between celebrities who look nearly identical. Does ArcFace recognize which images show the same person? Let’s play!

2.1 Prepare environment

I used a Google Colab notebook with a T4 GPU for the implementation below. The library dependencies are as follows:

faiss-cpu==1.7.4
numpy==1.23.5
pandas==1.5.3
Pillow==9.4.0
plotly==5.15.0
torch==2.1.0+cu121
torchvision==0.16.0+cu121
tqdm==4.66.1

Moreover, you need to download the pre-trained ArcFace model weights from here [9].

We’ve done all the preparation for coding.

Next, based on this site [6], I collected seven pairs of celebrities who look nearly identical, with five images of each celebrity.

The pairs of celebrities used to check the face similarities

The whole dataset is shown below:

The images the author collected

2.2 Calculate face similarity using Faiss

I use ArcFace with an ir-se backbone [7], which is frequently used for face swap tasks. Firstly, I want to check the relationships among the image features, so I apply t-SNE to the image features for visualization (the feature-extraction step is sketched after the figure). The result is as shown below:

The image from the author
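The feature-extraction step behind this plot could look roughly like the sketch below. I assume backbone is the pre-trained ir-se model from [7][9], already loaded as a torch.nn.Module that maps aligned 112x112 RGB face crops to 512-dimensional embeddings; the [-1, 1] pixel normalization is the usual convention for this backbone:

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# common ArcFace preprocessing: 112x112 RGB crops, pixel values mapped to [-1, 1]
preprocess = transforms.Compose([
    transforms.Resize((112, 112)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

@torch.no_grad()
def embed(paths, backbone, device="cuda"):
    # `backbone`: the pre-trained ir-se ArcFace model (an assumption, see [7][9])
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    emb = backbone.to(device).eval()(batch.to(device))
    # L2-normalize so that dot products equal cosine similarities
    return F.normalize(emb).cpu().numpy()

The resulting (70, 512) array (7 pairs x 2 celebrities x 5 images) is what t-SNE and Faiss operate on.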

Images of the same person tend to be located close together. However, the data distribution looks more complicated than in the MNIST case. How about the cosine similarity among them? The result is as shown below:

Some sample images from the face similarity comparison

The left column shows the target image, and the other images in the same row are the top-3 most similar images ArcFace picked. I think that ArcFace focuses not only on facial appearance but also on the overall feel of an image, such as hairstyle, facial expression, or lighting conditions. The interesting point is that ArcFace sometimes doesn’t pick the same person’s images as the most similar ones! As a result, even though it doesn’t always retrieve the same person’s other photos or the paired celebrity, it can choose images with similar facial appearances. You may now understand why many face-swapping architectures use ArcFace as a face feature extractor. In the following gist, you can try ArcFace on your own dataset.
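The gist is not reproduced here, but the Faiss part can be sketched as follows. The random embeddings and file names are placeholders for the real ArcFace features and image paths; for unit-length vectors, the inner product equals the cosine similarity:

import faiss
import numpy as np

# placeholders standing in for the real ArcFace features and file names
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(70, 512)).astype("float32")
paths = [f"img_{i:02d}.jpg" for i in range(70)]

faiss.normalize_L2(embeddings)              # unit vectors: inner product == cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

scores, ids = index.search(embeddings, 4)   # top-4, since the best hit is the query itself
for i in range(len(paths)):
    top3 = [(paths[j], round(float(s), 3)) for j, s in zip(ids[i][1:], scores[i][1:])]
    print(paths[i], "->", top3)

IndexFlatIP performs an exact inner-product search, which is all we need at this scale; approximate indexes only pay off for much larger galleries.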
