Digging Deeper into Metric Learning — Loss Functions

Hareesh Kolluru
Published in slyce-engineering
Nov 8, 2019

In the previous article, we discussed how recent advances in deep learning make it possible to learn a similarity measure for a set of images using a deep metric learning network: a network that maps visually similar images onto nearby locations in an embedding manifold and visually dissimilar images far apart. Features learned this way are highly discriminative, with compact intra-product variance and well-separated inter-product differences, which are key to building better visual search engines. Learning such discriminative features also helps the network generalize well to unseen product images, which simply form new clusters in the embedding space.

(a) Separable Features (b) Discriminative Features

We train these networks much like other deep learning networks, but with loss functions that explicitly push similar images together in the embedding space and pull dissimilar images away from each other during training. At any instant t during training, we feed images from our training dataset to the network and map them to the embedding space. We compute a penalty on these mappings using a loss function and adjust the network weights with an appropriate optimization technique. At inference time, we extract features from the embedding layer and perform a nearest-neighbor search to identify similar images.

In this article, we will give an overview of some of the widely used loss functions employed to learn metric learning embeddings.

Training and inference routine for metric learning models. Note that although there are various effective training routines available to achieve the same objective, here we are only concerned with the most basic and well-generalized training framework, shown in the picture.
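
As a concrete, if simplified, illustration of this routine, here is a minimal PyTorch-style sketch. The names embedding_net, metric_loss, and nearest_neighbors are placeholders for this sketch, not parts of our actual pipeline:

```python
import torch
import torch.nn.functional as F

def train_step(embedding_net, images, labels, metric_loss, optimizer):
    """One training step: map a mini-batch of images to embeddings and penalize them."""
    embeddings = F.normalize(embedding_net(images), dim=1)  # map images onto the unit hypersphere
    loss = metric_loss(embeddings, labels)                  # e.g. contrastive, triplet, N-pair, ...
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def nearest_neighbors(embedding_net, query_images, gallery_embeddings, k=5):
    """Inference: embed the queries and return the indices of the k closest gallery items."""
    queries = F.normalize(embedding_net(query_images), dim=1)
    distances = torch.cdist(queries, gallery_embeddings)    # pairwise Euclidean distances
    return distances.topk(k, largest=False).indices
```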

Contrastive Loss

To minimize the distance between the features of two similar images and force dissimilar image features apart, the contrastive loss function encodes both measures, similarity and dissimilarity, independently in a single loss. It directly penalizes the distance between the features of two similar images, while the distance between two dissimilar image features contributes to the loss only if they are not separated by a distance margin α. The loss function is defined as follows.
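
Up to the exact normalization used in [1], a standard form of this loss, written in the notation defined just below, is:

\mathcal{L}_{\text{contrastive}} = \frac{1}{2m} \sum_{i,j} \Big( \mathbb{1}[y_{ij} = +1]\, D_{ij} + \mathbb{1}[y_{ij} = -1]\, \big[\alpha - D_{ij}\big]_{+} \Big)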

Here, m is the number of anchor-positive product pairs available in a mini-batch. Dᵢⱼ = ||f(xᵢ) − f(xⱼ)||² is the distance between the deep features f(xᵢ) and f(xⱼ) corresponding to the images xᵢ and xⱼ respectively. yᵢⱼ = ±1 indicates whether the pair (xᵢ, xⱼ) shares the same label or not. [·]⁺ is the hinge function max(0, ·). Each pair of feature vectors independently generates a loss term, as shown in the figure below.

Edges colored red represent example pairs sharing the same label; they contribute to the loss via Dᵢⱼ, a term that addresses intra-class variance. Blue edges represent example pairs with different product labels; they contribute via [α − Dᵢⱼ]⁺, a term that addresses inter-class separation among the feature clusters.
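
A minimal PyTorch sketch of this loss over a mini-batch, consistent with the definitions above (the symmetric distance matrix counts each pair twice, which cancels in the normalization; the helper name is illustrative):

```python
import torch

def contrastive_loss(embeddings, labels, margin=1.0):
    """Contrastive loss over all pairs in a mini-batch; D_ij is the squared Euclidean distance."""
    d = torch.cdist(embeddings, embeddings).pow(2)        # B x B squared distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # True where product labels match
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos, neg = same & ~eye, ~same                         # anchor-positive / anchor-negative masks
    pos_term = d[pos].sum()                               # pull similar images together
    neg_term = torch.clamp(margin - d[neg], min=0).sum()  # push dissimilar images up to the margin
    n_pos = pos.sum().clamp(min=1)                        # counts each positive pair twice (2m)
    return (pos_term + neg_term) / (2 * n_pos)
```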

To compute the loss at every step over a dataset of size N, we have to perform O(N²) pairwise distance computations. If we take an approach similar to stochastic gradient descent with mini-batches of size B, the computations drop to the order of O(B²). However, if we choose mini-batches randomly from the original dataset, we will most likely end up without any positive pairs in a mini-batch, and hence we would not be optimizing intra-class variance at all. We therefore need to explicitly place a sizeable number of positive pairs in each mini-batch, preferably close to 50% of the mini-batch size, to attain both compact intra-class variance and large inter-class separation. As an example, say we set the mini-batch size to B = 32; then

  • We would select 16 products at random.
  • Next, we will pick 2 random images from each of the 16 products.
  • Now, in the mini-batch of size 32, we have 16 positive pairs and 16 × 30 = 480 negative pairs.

This extra step of hand-picking the mini-batch ensures that the loss captures intra-class variance as well, without any additional computational cost.
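
A short sketch of such a balanced sampler (the function name and data layout are illustrative, not our production sampler):

```python
import random
from collections import defaultdict

def sample_balanced_batch(image_paths, labels, n_products=16, images_per_product=2):
    """Build a mini-batch of n_products * images_per_product images so that
    every product in the batch contributes at least one positive pair."""
    by_label = defaultdict(list)
    for path, label in zip(image_paths, labels):
        by_label[label].append(path)
    eligible = [l for l, paths in by_label.items() if len(paths) >= images_per_product]
    batch = []
    for label in random.sample(eligible, n_products):
        batch.extend(random.sample(by_label[label], images_per_product))
    return batch  # for B = 32: 16 positive pairs and 480 negative pairs
```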

Although the contrastive loss function is very intuitive, setting a constant margin α means it does not account for the varying degrees of visual dissimilarity among negative examples. This limits the embedding space to a subset of all possible arrangements during training, placing visually diverse products about the same distance apart as visually similar ones. For example, consider the images of the three products shown below: product A, product B, and product C. Product A and product B are visually more similar to each other than product A and product C. However, with a constant margin α, we push the feature vectors of both product B and product C away from the features of product A by the same margin α, whereas we would like product C to end up farther away from product A than product B.

Products A and B are visually similar but carry different product labels. Product C is visually different from the other two. Despite the different degrees of visual similarity, the margin α is constant for all three products.

Triplet Loss

Triplet loss is probably the most popular of the metric learning loss functions. It takes a triplet of deep features (x_ia, x_ip, x_in), where (x_ia, x_ip) share the same product label and (x_ia, x_in) come from two different products, and tunes the network so that the distance between the anchor x_ia and the positive x_ip, D_ia,ip, is smaller than the distance between the anchor x_ia and the negative x_in, D_ia,in, by at least the margin α.
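
With the distances defined just below, the loss over the triplets in a mini-batch takes the standard hinge form (FaceNet [3] uses squared distances; the structure is the same):

\mathcal{L}_{\text{triplet}} = \sum_{i} \big[\, D_{i_a, i_p} - D_{i_a, i_n} + \alpha \,\big]_{+}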

Here, D_ia,ip = ||f(x_ia) − f(x_ip)||₂ and D_ia,in = ||f(x_ia) − f(x_in)||₂. Each loss term independently considers a triplet relative to a predefined anchor example, as shown in the figure below.

In a triplet (X1, X2, X3), the red line connects the anchor example X1 and the positive example X2. The solid blue line connects the anchor X1 and the negative example X3, which carries a different product label. Unlike contrastive loss, each loss term addresses intra-class variance and inter-class separability together.

Since the margin in the loss is based not on the absolute distance between two embeddings but on the relative distances within the triplet with respect to the anchor, it accommodates arbitrary distortions of the embedding space, unlike the contrastive loss.

While the triplet loss addresses this shortcoming of the contrastive loss function, it is computationally expensive. A dataset of size N yields on the order of O(N³) possible (not necessarily valid) triplets, so computing the loss over all of them is infeasible for a large dataset. For instance, in a simple scenario with 20 product classes of 15 images each, there are 20 × C(15, 2) = 2,100 anchor-positive pairs, each of which can then choose a negative example from any of the other products. Covering just these 2,100 pairs requires 2,100 / m mini-batch iterations for mini-batches of size m. Moreover, as training converges, most triplets already satisfy the margin constraint between the positive-pair and negative-pair distances, so they contribute very little to the loss. This slows down learning and therefore convergence.

To address this slower training convergence, 'semi-hard' and 'hard' negative mining-based approaches are commonplace in most training routines.
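
A hedged sketch of online semi-hard mining inside a mini-batch, in the spirit of [3]; the fallback to the hardest available negative when no semi-hard negative exists is a choice of this sketch, not a reference implementation:

```python
import torch

def semi_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Triplet loss where each anchor-positive pair picks the closest negative that is
    farther away than the positive but still inside the margin (the 'semi-hard' band)."""
    d = torch.cdist(embeddings, embeddings)                 # B x B pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    losses = []
    for a in range(len(labels)):
        neg_d = d[a][~same[a]]                              # distances from anchor a to all negatives
        if neg_d.numel() == 0:
            continue
        for p in torch.nonzero(same[a]).flatten().tolist():
            if p == a:
                continue
            d_ap = d[a, p]
            band = neg_d[(neg_d > d_ap) & (neg_d < d_ap + margin)]
            d_an = band.min() if band.numel() > 0 else neg_d.max()
            losses.append(torch.clamp(d_ap - d_an + margin, min=0))
    return torch.stack(losses).mean() if losses else embeddings.sum() * 0.0
```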

Lifted Structure Loss

When a CNN is trained with the triplet loss objective, it fails to utilize the full information in the mini-batch when generating the loss, mainly because positive and negative product examples are predefined only for the given anchor example. The idea of lifted structure loss is to improve mini-batch optimization by using all O(B²) pairs available in the batch instead of B separate pairs. The following figure shows how the loss is extended to make full use of a mini-batch via the matrix of pairwise distances.

An anchor and positive example pair, (X1, X2) and their interaction with remaining pairs in a mini-batch.

Note that both nodes of a pair independently interact with all the available negative nodes in the mini-batch. This differs from the triplet loss term, which defines negatives only with respect to a predefined anchor and ignores all the other negative points available in the mini-batch. Here, the positive example also acts as an anchor to find its own negatives and contribute to the loss. This helps the CNN train faster and converge better.

It is important to note that 'randomly' chosen negative product examples from a mini-batch, for a given pair X1 and X2 as shown in the figure above, do not promote network convergence as quickly as 'hard' negatives do. To improve convergence speed, the authors suggest mining 'hard' negatives individually for each example in a given pair, as shown in the figure below.

Hard negatives with respect to X1 and X2 are shown with solid blue lines. Although X1 treats X6 as its hard negative, it still interacts with all the available negatives in the mini-batch.

The lifted structure loss minimizes a smooth upper bound of the non-smooth objective for stable network training. Concretely, the lifted structure loss for a given mini-batch of images can be written as follows.
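
Up to notation, the smooth upper bound proposed in [4] is:

\mathcal{L} = \frac{1}{2\lvert P \rvert} \sum_{(i,j) \in P} \big[\, J_{i,j} \,\big]_{+}^{2}, \qquad J_{i,j} = D_{i,j} + \log\!\Big( \sum_{(i,k) \in N} e^{\alpha - D_{i,k}} + \sum_{(j,l) \in N} e^{\alpha - D_{j,l}} \Big)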

where Dᵢⱼ = ||f(xᵢ) − f(xⱼ)||₂, and P and N are the sets of positive and negative pairs available in the mini-batch, respectively.

N-Pair Loss

Multi-class N-pair loss is similar to lifted structure loss in the sense that it recruits multiple negative product examples when generating a loss term within a given mini-batch, and it does not suffer from the slow convergence of triplet loss and contrastive loss.

For each loss term, the anchor example (X1) of a given pair (X1, X2) utilizes the N − 1 negative product examples available in the mini-batch (X4 and X6 here), where N is the number of pairs in the mini-batch.

The loss over N pairs of examples is generated using the equation below.
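
In the notation of [5], where fᵢ is the anchor embedding of the i-th pair and fⱼ⁺ the positive embedding of the j-th pair, the loss reads:

\mathcal{L}_{N\text{-pair}} = \frac{1}{N} \sum_{i=1}^{N} \log\!\Big( 1 + \sum_{j \neq i} \exp\big( f_i^{\top} f_j^{+} - f_i^{\top} f_i^{+} \big) \Big)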

Here, N is the total number of pairs of images sharing the same product label. If we view f as a feature vector, f⁺ as a weight vector, and the denominator on the right-hand side as the likelihood P(y = y⁺), the equation above resembles a multi-class logistic loss (i.e., softmax loss). This loss function also performs better when the training dataset contains a large number of product classes: the larger the value of N (the number of distinct pairs), the more accurate the approximation.

One benefit of using N-pair loss over lifted structure loss is that it optimizes the cosine similarity between an anchor and the positive and negative product samples in a probabilistic way. In other words, it computes the cosine similarity between the features of a pair and tries to increase the probability that those features belong to the same product class, using pairwise comparisons within the mini-batch. Since the cosine similarity metric (and hence the probability) is scale-invariant (illustrated in the figure below), N-pair loss tends to be robust to variations in feature magnitude during training.

N-pair loss directly addresses the cosine similarity between an anchor (x1) and a positive example (x2) and compares it, in a probabilistic manner, to the similarity between the anchor and the other negative examples.
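
A compact sketch of the cosine-similarity variant described above. It relies on the identity that a row-wise softmax cross-entropy equals log(1 + Σⱼ≠ᵢ exp(sᵢⱼ − sᵢᵢ)); the names are illustrative, and the embedding-norm regularizer used in [5] is omitted:

```python
import torch
import torch.nn.functional as F

def n_pair_loss(anchors, positives):
    """Multi-class N-pair loss for N (anchor, positive) pairs, one pair per product.
    Row i of `anchors` and `positives` share a product label; every other positive
    in the batch acts as a negative for anchor i."""
    anchors = F.normalize(anchors, dim=1)       # cosine-similarity variant
    positives = F.normalize(positives, dim=1)
    logits = anchors @ positives.t()            # N x N similarity matrix
    targets = torch.arange(len(anchors), device=anchors.device)
    return F.cross_entropy(logits, targets)     # softmax over each row, target = matching positive
```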

Angular Loss

Contrary to the metric learning methods discussed above, which focus on optimizing absolute distances (contrastive loss) or relative distances (triplet loss, lifted structure loss, N-pair loss), angular loss encodes a third-order relation inside the triplet triangle in terms of the angle at the negative point.

Similar to N-pair loss, it benefits from scale invariance by defining the loss function in terms of angular (cosine) distance. It pushes the negative feature vector away from the positive cluster and drags the positive points closer to each other, as shown in the figure below.

There are several advantages to using this loss function:

  1. Unlike the Euclidean distance, the angular (cosine) distance is a similarity-transform-invariant metric. Considering angular geometry not only provides scale invariance but also introduces rotation invariance. Even though image features get rescaled quite frequently during training, for a fixed margin α the constraint ∠n ≤ α is unaffected. In simple words, the angular view of the loss term is more robust to local variations of the feature map.
  2. By the cosine rule, calculating ∠n requires all three sides of the triangle, whereas the original triplet loss takes only two sides into account. The added constraint improves the robustness and effectiveness of the optimization.
  3. Choosing a distance margin m for the loss term is not straightforward when the Euclidean distance is used as the metric, mainly because intra-class variation among the target product classes fluctuates widely as the dataset grows. Without a meaningful reference, such a hyper-parameter is hard to tune. By comparison, the angle α is simpler to set because of its scale-invariant behavior.

Additionally, the angular loss can easily be combined with traditional metric learning loss functions to boost overall performance. An example where it is combined with the N-pair loss function is given below.
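
Roughly, with xₐ, xₚ, xₙ the anchor, positive, and negative embeddings and x_c their local center, the per-triplet angular term and the combined objective take the following form (the batch-wise formulation in [6] differs slightly):

\ell_{\text{ang}}(x_a, x_p, x_n) = \big[\, \lVert x_a - x_p \rVert^{2} - 4 \tan^{2}\!\alpha \,\lVert x_n - x_c \rVert^{2} \,\big]_{+}, \quad x_c = \tfrac{1}{2}(x_a + x_p), \qquad \mathcal{L} = \mathcal{L}_{N\text{-pair}} + \lambda\, \mathcal{L}_{\text{ang}}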

λ is a trade-off weight between the N-pair loss and the angular loss and is typically set to 2. α can be set between 35 and 60 degrees.

Divergence Loss

Although all the metric learning objective functions explained above embed the given images into an embedding manifold, they do not necessarily look at different aspects of the same image explicitly. Divergence loss explores this perspective through different ensemble modules. It is a regularizing term that we can add to a metric learning loss function for joint supervision. It increases the distance between the features of an image learned by different ensemble modules; in other words, it encourages each learner to focus on a different attribute of the input image, as shown in the figure below.

Here, (xᵢ, yᵢ) is the set of all training samples and their labels. L₁ is the metric learning loss function for the mᵗʰ learner, L₂ is the regularizing term that diversifies the feature embeddings of the different learners, and λ is a weighting parameter that controls the strength of the regularizer.
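
A form consistent with this description, with fₐ(xᵢ) denoting the embedding of image xᵢ produced by learner a (the exact normalization in [7] may differ), is:

\mathcal{L} = \sum_{\text{learners}\ a} \mathcal{L}_{1}^{(a)} + \lambda\, \mathcal{L}_{2}, \qquad \mathcal{L}_{2} = \sum_{i} \sum_{a < p} \big[\, m - D\big(f_a(x_i),\, f_p(x_i)\big) \,\big]_{+}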

where xᵢ is the feature vector corresponding to an image x, and D(fₐ(xᵢ), fₚ(xᵢ)) is a distance measure between the embeddings of the same image produced by two different learners a and p. m is the margin for the divergence loss and is usually set to 1. [·]⁺ denotes the hinge function max(0, ·).

Divergence loss pulls apart the feature embeddings produced by different learners. In other words, it encourages the feature instances learned by different ensemble modules to attend to different parts of the imaged object (X6 here). This results in a diverse embedding space across learners, while each learner individually satisfies the constraint of projecting images with the same product label nearby.

Conclusion

In this article, we discussed a family of metric learning loss functions, which play a crucial role in training distance-metric-learning-based convolutional neural network architectures. These loss functions let the networks address some of the limitations of conventional object recognition pipelines: they can work with product classes that have few image instances and yield a flexible system that can easily be expanded to new product classes.

Although metric learning networks based on these loss functions have shown great success in building effective visual search solutions for our clients, training such systems is computationally expensive and time-consuming. Computing the loss over a dataset of size N and taking a gradient step is on the order of O(N²) or O(N³), which restricts us to mini-batches (of size B) in a gradient descent approach. This reduces the computation to O(B²) or O(B³), but how we construct the mini-batch becomes extremely important.

In stochastic gradient descent, we construct the mini-batch by randomly sampling the dataset with a reasonably sized batch. However, if we construct mini-batches by random sampling for the triplet loss, for example, we most likely will not end up with any positive pairs, or even with negative pairs that violate the margin. In that case, the loss contribution from the mini-batch, and hence the gradient step, will be extremely small. So although the mini-batch gradient approach reduces the computation per step, if the mini-batches do not carry enough loss contribution, we may end up doing more computation overall to reach an optimal point.

Therefore, how we sample the training dataset and construct mini-batches that are informative about the loss, while not being too large, is critical to expediting metric learning training. We conclude with this motivation for the importance of sampling strategies in the successful application of metric learning, which will be our focus in the next article.

References

[1] Hadsell, R., Chopra, S., & LeCun, Y. (2006, June). Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06) (Vol. 2, pp. 1735–1742). IEEE.

[2] Wen, Y., Zhang, K., Li, Z., & Qiao, Y. (2016, October). A discriminative feature learning approach for deep face recognition. In European conference on computer vision (pp. 499–515). Springer, Cham.

[3] Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 815–823).

[4] Oh Song, H., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4004–4012).

[5] Sohn, K. (2016). Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems (pp. 1857–1865).

[6] Wang, J., Zhou, F., Wen, S., Liu, X., & Lin, Y. (2017). Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2593–2601).

[7] Kim, W., Goyal, B., Chawla, K., Lee, J., & Kwon, K. (2018). Attention-based ensemble for deep metric learning. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 736–751).

Contributors: Jay Patel, Jake Buglione
