Exploring Other Face Recognition Approaches (Part 1) — CosFace

Mohd Nayeem · Published in Analytics Vidhya · 7 min read · Aug 17, 2020


After exploring various face detection approaches other than the standard MTCNN or face cascades in a series of four articles (1, 2, 3, 4), let's discuss the next step in a face recognition system: extracting a feature vector from each face so it can be compared against all the other faces present in the database. Here too we won't be exploring the common models, namely FaceNet, which uses the triplet loss, and dlib's ResNet-based face recognition model, which uses a hinge loss. In this series of articles we will explore other methods for obtaining the face feature vector beyond the standard ones. In Part 1 we cover CosFace.

We will be covering three different face recognition approaches:
1. CosFace
2. ArcFace
3. DREAM: Deep Residual Equivariant Mapping

Introduction

Large Margin Cosine Loss (LMCL), referred to as CosFace, reformulates the traditional softmax loss as a cosine loss by L2-normalizing both the features and the weight vectors to remove radial variations. On top of this, a cosine margin term is introduced to further maximize the decision margin in the angular space. As a result we get minimum intra-class variance and maximum inter-class variance, which is exactly what accurate face verification needs.

Overview of the proposed CosFace framework. In the training phase, the discriminative face features are learned with a large margin between different classes. In the testing phase, the testing data is fed into CosFace to extract face features which are later used to compute the cosine similarity score to perform face verification and identification.

Approach

First we will discuss the CosFace loss, then the effect of feature normalization on the results, and finally the effect the margin has on the loss function.

Large Margin Cosine Loss

The softmax loss separates features from different classes by maximizing the posterior probability of the ground-truth class. Given an input feature vector x and its label y, the softmax loss can be formulated as:
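In the notation of the CosFace paper,

L_s = \frac{1}{N}\sum_{i=1}^{N} -\log p_i = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{f_{y_i}}}{\sum_{j=1}^{C} e^{f_j}}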

where p denotes the posterior probability of x being correctly classified, N is the number of training samples and C is the number of classes. fj denotes the activation of the fully-connected layer with weight vector Wj; the bias is kept at 0 for simplicity. As a result, fj is given by:
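f_j = W_j^{T} x = \lVert W_j\rVert\,\lVert x\rVert\cos\theta_j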

where θj is the angle between Wj and x. The formula suggests that both the norm and the angle of the vectors contribute to the posterior probability.

For effective feature learning the norm of W should be invariable, so we fix norm(W) = 1 by L2 normalization. Since at the testing stage we compare two face feature vectors using cosine similarity, the norm of the feature vector plays no role in the scoring function. Thus, while training, we can fix norm(x) = s. The posterior probability then relies only on the cosine of the angle, and the loss can be formulated as:
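L_{ns} = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{s\cos(\theta_{y_i,i})}}{\sum_{j=1}^{C} e^{s\cos(\theta_{j,i})}}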

As we have fixed norm(x) to s, the resulting model learns features that are separable in the angular space; this loss is referred to here as the Normalized version of Softmax Loss (NSL).

But the NSL loss is not sufficient, as it only emphasizes correct classification. To address this issue, a cosine margin is introduced into the loss function.
Consider a binary-class example and let θi denote the angle between the learned feature vector and the weight vector of class Ci (i = 1, 2). The NSL forces cos(θ1) > cos(θ2) for C1, and similarly for C2, so that features from different classes are correctly classified. To develop a large-margin classifier, we further require cos(θ1) − m > cos(θ2) for C1 and cos(θ2) − m > cos(θ1) for C2, where m ≥ 0 is a fixed parameter introduced to control the magnitude of the cosine margin. Since cos(θi) − m is lower than cos(θi), the constraint is more stringent for classification. Hence LMCL is formulated as:
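L_{lmc} = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{s(\cos(\theta_{y_i,i}) - m)}}{e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j\neq y_i} e^{s\cos(\theta_{j,i})}}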

subject to:
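W = \frac{W^{*}}{\lVert W^{*}\rVert},\qquad x = \frac{x^{*}}{\lVert x^{*}\rVert},\qquad \cos(\theta_{j,i}) = W_j^{T} x_i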

where N is the number of training samples, xi is the i-th feature vector with corresponding label yi, Wj is the weight vector of the j-th class, and θj is the angle between Wj and xi.
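To make the loss concrete, here is a minimal PyTorch sketch of an LMCL head (an illustration under the definitions above, not the authors' reference implementation; the name CosFaceHead and the defaults s=64, m=0.35 are illustrative choices in the ballpark of the values reported in the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CosFaceHead(nn.Module):
    # Minimal LMCL (CosFace) classification head, for illustration only.
    def __init__(self, emb_dim, num_classes, s=64.0, m=0.35):
        super().__init__()
        self.s, self.m = s, m
        # One weight vector Wj per class; it is re-normalized on every forward pass.
        self.W = nn.Parameter(torch.randn(num_classes, emb_dim))

    def forward(self, feat, labels):
        # cos(theta_j,i) = normalized Wj . normalized xi
        cos = F.linear(F.normalize(feat), F.normalize(self.W))
        # Subtract the margin m only from the ground-truth class cosine.
        one_hot = F.one_hot(labels, num_classes=cos.size(1)).float()
        logits = self.s * (cos - self.m * one_hot)
        # Softmax cross-entropy over the scaled, margin-adjusted cosines gives LMCL.
        return F.cross_entropy(logits, labels)

The head is only used during training; at test time it is discarded and the embedding network alone produces the feature vectors that are compared by cosine similarity.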

Comparison of decision margins for different loss functions in the binary-class scenario. The dashed line represents the decision boundary, and the gray areas are the decision margins.

The softmax loss defines a decision boundary by:
norm(W1)cos(θ1) = norm(W2)cos(θ2)
Thus the boundary depends on both the magnitudes of the weight vectors and the angles, and hence the decision margins overlap in the cosine space.

NSL normalizes the weight vectors to have magnitude 1, so its decision boundary is given by cos(θ1) = cos(θ2). As can be seen from the figure above, by removing the radial variations it can classify the samples perfectly, but with margin = 0 it is not robust to noise.

A-Softmax improves the softmax loss by introducing an extra margin, making the decision boundaries:
C1 : cos(mθ1) ≥ cos(θ2)
C2 : cos(mθ2) ≥ cos(θ1)
The third plot in the figure above depicts the decision area, where the gray area is the decision margin. However, the margin of A-Softmax is not consistent over all θ values: the margin becomes smaller as θ reduces, and vanishes completely when θ = 0.

LMCL defines the decision margin in the cosine space rather than in the angle space:
C1 : cos(θ1) ≥ cos(θ2) + m
C2 : cos(θ2) ≥ cos(θ1) + m
cos(θ1) is maximized while cos(θ2) is minimized for C1 (and similarly for C2) to perform the large-margin classification. The last subplot of the figure above illustrates the decision boundary of LMCL in the cosine space, where we can see a clear margin (√2 m) in the produced distribution of the cosine of the angle. This shows that LMCL is more robust than NSL.
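Where the √2 m comes from: in the (cos(θ1), cos(θ2)) plane the two LMCL boundaries are parallel lines, so the width of the margin band is just the perpendicular distance between them:

\cos\theta_1 - \cos\theta_2 = m \quad\text{and}\quad \cos\theta_1 - \cos\theta_2 = -m
\;\Longrightarrow\; \text{width} = \frac{|m - (-m)|}{\sqrt{1^2 + 1^2}} = \sqrt{2}\,m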

Feature Normalization

To derive the formulation of the cosine loss and remove radial variations, both the weight vectors and the feature vectors are normalized. As a result, the feature vectors get distributed on a hypersphere, where the scaling parameter s (defined earlier) controls the magnitude of the radius.

Why is feature normalization necessary?
The original softmax loss without feature normalization implicitly learns both the Euclidean norm (L2-norm) of the feature vectors and the cosine value of the angle. The L2-norm is adaptively learned to minimize the overall loss, resulting in a relatively weak cosine constraint. In contrast, LMCL requires the entire set of feature vectors to have the same L2-norm, so that learning depends only on the cosine values to develop discriminative power. Feature vectors from the same class are clustered together and those from different classes are pulled apart on the surface of the hypersphere.
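As a small illustration of why the norm carries no useful information at test time (the 512-dimensional embeddings and the embed() step below are hypothetical): cosine similarity only looks at the angle between embeddings, so rescaling an embedding leaves the verification score unchanged.

import torch
import torch.nn.functional as F

def verification_score(emb_a, emb_b):
    # Cosine similarity ignores the L2 norm of the embeddings,
    # so only the angle between them matters.
    return F.cosine_similarity(emb_a, emb_b, dim=-1)

# e.g. emb_a, emb_b = embed(face_a), embed(face_b)  # hypothetical embedding network
emb_a, emb_b = torch.randn(1, 512), torch.randn(1, 512)
assert torch.allclose(verification_score(emb_a, emb_b),
                      verification_score(3.0 * emb_a, emb_b), atol=1e-6)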

What should be the value of the parameter 's'?
Given the normalized learned feature vector x, the unit weight vector W, and the total number of classes C, suppose that the learned feature vectors lie on the surface of the hypersphere and center around the corresponding weight vectors. Let Pw denote the expected minimum posterior probability of the class center (i.e., W). The lower bound of s is then:
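s \ge \frac{C-1}{C}\,\log\frac{(C-1)\,P_W}{1 - P_W}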

Based on this bound, we can say that s should be enlarged consistently if we expect an optimal Pw for classification with a certain number of classes. The desired s should be larger to deal with more classes, since a growing number of classes increases the difficulty of classification. A hypersphere with a large radius s is therefore required for embedding features with small intra-class distance and large inter-class distance.
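A quick back-of-the-envelope check of how the bound grows with the number of classes (the values of C and Pw used here are purely illustrative):

import math

def s_lower_bound(num_classes, p_w):
    # s >= (C - 1)/C * log((C - 1) * P_W / (1 - P_W))
    c = num_classes
    return (c - 1) / c * math.log((c - 1) * p_w / (1 - p_w))

for c in (100, 10_000, 1_000_000):
    print(c, round(s_lower_bound(c, p_w=0.9), 2))
# Roughly 6.7, 11.4 and 16.0: the required s keeps growing with the number of classes.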

Effect of cosine margin ‘m’

The optimal choice of m potentially leads to more promising learning of highly discriminative face features. A reasonable choice of a larger m ∈ [0, C/(C−1)) (please refer to the paper given in the reference to understand this range of m) should boost the learning of highly discriminative features, since all the feature vectors are pulled towards the weight vector of the corresponding class. In fact, the model fails to converge when m is too large, because the cosine constraint (i.e., cos θ1 − m > cos θ2 or cos θ2 − m > cos θ1 for two classes) becomes stricter and hard to satisfy. Besides, the cosine constraint with an overlarge m makes the training process more sensitive to noisy data. An ever-increasing m starts to degrade the overall performance at some point because of the failure to converge.

A small experiment with different loss functions on 8 identities with 2D features. The first row maps the 2D features onto the Euclidean space, while the second row projects the 2D features onto the angular space. The gap becomes evident as the margin term m increases.

Conclusion

We learned about a new loss function for face recognition which works in the cosine space and helps the model learn highly discriminative features.
