In defense of triplet loss for person re-identification

A guide to the best recent research on triplet loss.

Deval Shah
VisionWizard
8 min read · May 9, 2020


The focus of this story is an important research paper, In Defense of the Triplet Loss for Person Re-Identification [2], released in 2017, which argues for the importance of the triplet loss in the person re-identification problem. The work in the paper extrapolates to re-identification problems in general.

Person re-identification means identifying the same person from a different viewpoint (maybe a different camera, location, etc.).

If you are unfamiliar with triplet loss, please check out this quick read before going ahead.

Triplet loss makes sure that, given an anchor point x_a, the projection of a positive point x_p belonging to the same class (person) y_a is closer to the anchor's projection than that of a negative point x_n belonging to another class y_n, by at least a margin m.
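In symbols, with f_θ the learned embedding and D a distance metric, the loss over a set of triplets essentially reads:

\mathcal{L}_{\mathrm{tri}}(\theta) = \sum_{\substack{a,p,n \\ y_a = y_p \neq y_n}} \Big[ m + D\big(f_\theta(x_a), f_\theta(x_p)\big) - D\big(f_\theta(x_a), f_\theta(x_n)\big) \Big]_+

Here [·]_+ = max(0, ·) is the hinge, so only triplets that still violate the margin contribute to the loss.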

Source : Link

Table of Contents

  1. Introduction
  2. Proposed Hypothesis
  3. Experiments and Results
  4. Conclusions
  5. References

Introduction

The person re-ID problem has seen significant advancements from the computer vision community in recent years. One prominent contribution came from FaceNet [1] in 2015, the paper that popularized the triplet loss for deep embedding learning.

Re-identification means recognizing a previously seen object from stored information. Face recognition is an example of re-identification: we find a face by matching it against a database of many faces, which we achieve by storing a low-dimensional embedding of each face image.
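As a rough illustration (not from the paper), here is what re-identification by embedding lookup might look like, assuming PyTorch and a fictitious gallery of stored embeddings:

```python
import torch

# Hypothetical gallery: one 128-D embedding per previously seen image, plus its person ID.
gallery_embs = torch.randn(1000, 128)
gallery_ids = torch.randint(0, 200, (1000,))

def re_identify(query_emb):
    """Return the person ID whose stored embedding is closest to the query embedding."""
    dists = torch.cdist(query_emb.unsqueeze(0), gallery_embs)  # (1, 1000) Euclidean distances
    return gallery_ids[dists.argmin()].item()

query_emb = torch.randn(128)  # embedding of a newly observed person crop
print(re_identify(query_emb))
```

The whole point of training is to make these embeddings such that nearest-neighbour lookup returns the right identity.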

The goal of embedding learning, from a deep learning perspective, is to learn a function that maps semantically similar points close together in a lower-dimensional space under a chosen similarity metric.

The authors point out the prevailing idea in the community: that for embedding learning, the triplet loss is inferior to the standard approach of classification and verification losses followed by a separate metric learning step.

The triplet loss has a clustering effect on the data: as you can see in the figure, similar images are pulled closer together.

Source : [2]

What are the authors trying to solve?

The authors saw a problem with the approaches that were successful at the time, which treated classification and verification losses as the ideal strategy for learning embeddings.

Basically, these methods rely on multiple networks: one to learn features and classify them, and another to score how similar those features are. The fundamental problem is the complexity and cost of generating the final embeddings.

Proposed Hypothesis

The authors propose that a plain CNN, trained with a variant of the triplet loss and a smart data mining strategy, can reach state-of-the-art results by optimizing directly for the final task, thereby rendering the additional learning steps obsolete.

Flowchart for triplet loss training

The two main contributions of the paper are as follows:

  1. Evaluation of a variant of the triplet loss named the ‘Batch Hard’ loss, and its soft-margin version.
  2. Showing that the triplet loss with a plain CNN (no special layers or additional networks), using pretrained weights or trained from scratch, can lead to state-of-the-art results on standard benchmark datasets.

The major issue with the triplet loss formulation is that as a dataset grows to n images, the number of possible triplets grows on the order of n³. Computing the loss over every triplet in the dataset is computationally infeasible and is not done in practice.

I have explained in detail how to deal with this issue via a simple yet effective triplet selection strategy in this article.

In a classical implementation, once a certain set of B triplets has been chosen, their images are stacked into a batch of size 3B, for which the 3B embeddings are computed, which are in turn used to create B terms contributing to the loss.

Given that there are up to 6B²-4B possible combinations of these 3B images that are valid triplets, using only B of them seems wasteful. [2]
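To put a number on how wasteful that is: with a batch of B = 32 triplets (96 images), the formula gives up to 6·32² − 4·32 = 6,016 valid triplets, yet only 32 of them contribute to the loss. (B = 32 is just an illustrative choice, not a value from the paper.)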

So the authors proposed three different loss formulations that effectively mine hard triplets online, within each batch, during training.

Batch Hard

  1. Choose P classes (person identities)
  2. Choose K images per class (person)

This results in PK images per batch. Now, for every anchor ‘a’ among the PK images, we find the hardest positive and hardest negative sample within the batch.
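Before the loss itself, here is a minimal sketch of the PK sampling step in plain Python (the function name and data layout are my own, not the paper's):

```python
import random
from collections import defaultdict

def sample_pk_batch(labels, P=32, K=4):
    """Pick P identities at random, then K image indices per identity.
    `labels` is a list with one person ID per image in the dataset."""
    images_by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        images_by_id[pid].append(idx)
    chosen_ids = random.sample(list(images_by_id), P)
    batch = []
    for pid in chosen_ids:
        pool = images_by_id[pid]
        # Sample with replacement when an identity has fewer than K images.
        picks = random.sample(pool, K) if len(pool) >= K else random.choices(pool, k=K)
        batch.extend(picks)
    return batch  # P*K image indices forming one training batch
```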

Source : In defense of triplet loss paper
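In the paper's notation [2], with f_θ the embedding and D the distance metric, the batch-hard loss is:

\mathcal{L}_{\mathrm{BH}}(\theta; X) = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[ m + \max_{p=1,\dots,K} D\big(f_\theta(x^i_a), f_\theta(x^i_p)\big) - \min_{\substack{j=1,\dots,P \\ n=1,\dots,K \\ j \neq i}} D\big(f_\theta(x^i_a), f_\theta(x^j_n)\big) \Big]_+

The max picks the hardest positive and the min picks the hardest negative for each anchor.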

Let me break down the math in the equation above.

We iterate through all ‘PK’ images.

  • For each anchor sample ‘a’, we find the positive sample that is at maximum (max) distance from ‘a’ among all other samples of the same class.
  • Similarly, for the same anchor ‘a’, we go through all samples that do not belong to a’s class and find the negative sample at minimum (min) distance from ‘a’.
  • The margin (m) term in the equation specifies how far apart we want the positives (images of the same class) and the negatives (images of other classes) to be.

After going through the PK images, we end up with a total of PK triplets for training, which can be considered semi-hard triplets with respect to the entire dataset: they are the hardest within each batch, but not the hardest overall.
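For concreteness, here is a minimal batch-hard sketch assuming PyTorch; `emb` is the (PK, d) matrix of batch embeddings and `labels` the corresponding person IDs. This is my own compact implementation, not the authors' code, and the margin value is just a placeholder:

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(emb, labels, margin=0.2):
    """For each anchor in the batch, pick its farthest positive and closest negative."""
    dist = torch.cdist(emb, emb)                        # (PK, PK) pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # True where two images share an identity
    d_pos = (dist * same.float()).max(dim=1).values     # hardest positive (anchor itself has distance 0)
    d_neg = dist.masked_fill(same, float('inf')).min(dim=1).values  # hardest negative
    return F.relu(margin + d_pos - d_neg).mean()        # hinge [m + d_pos - d_neg]_+

# Toy usage: 8 embeddings of 4 identities (K = 2 each).
emb = torch.randn(8, 128, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(batch_hard_triplet_loss(emb, labels))
```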

Batch All

This is an extension of the above strategy to all possible combinations of triplets from a batch of PK images: PK*(PK-K)*(K-1) triplets will contribute to the loss term.

Source : In defense of triplet loss paper
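Again in the paper's notation [2], the batch-all loss sums over every valid (anchor, positive, negative) combination in the batch:

\mathcal{L}_{\mathrm{BA}}(\theta; X) = \sum_{i=1}^{P} \sum_{a=1}^{K} \sum_{\substack{p=1 \\ p \neq a}}^{K} \sum_{\substack{j=1 \\ j \neq i}}^{P} \sum_{n=1}^{K} \Big[ m + D\big(f_\theta(x^i_a), f_\theta(x^i_p)\big) - D\big(f_\theta(x^i_a), f_\theta(x^j_n)\big) \Big]_+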

All those summations might look scary, but trust me, they are not.

Let me break down the math in the equation above.

We iterate through all ‘PK’ images.

  • Each sample ‘a’ acts as an anchor, and we compute its distance to all of its positives and negatives.
  • We repeat the same process for every sample, hence the name batch all. (I’d like to call it the brute-all strategy :P)
  • Here we do not choose one hard positive and one hard negative per anchor; instead, every positive and every negative of each anchor contributes to the loss term, so we end up with PK*(PK-K)*(K-1) triplets.
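With the batch size used later in the paper (P = 32, K = 4, so PK = 128 images), that amounts to 128·(128−4)·(4−1) = 128·124·3 = 47,616 triplet terms from a single batch.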

Lifted Embedding Loss

  • It is similar to batch all; the only difference is that the negatives enter the loss through the logarithm of a sum of exponentiated distances rather than individually. It is based on the lifted structured loss proposed in this paper.
Source : In defense of triplet loss paper

The distance metric D used in the losses above is the non-squared Euclidean distance. While experimenting, the authors found that the squared Euclidean distance made training more prone to collapse early on.

The authors further propose a soft-margin version, which can be applied to any of the above losses. The idea is to replace the hinge function [m + …]+ with the softplus function ln(1+e^x), which behaves similarly to the hinge but decays exponentially instead of imposing a hard cut-off.
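A tiny numerical illustration of the difference, assuming PyTorch (the distance values are arbitrary):

```python
import torch
import torch.nn.functional as F

# One (anchor, positive, negative) pair of distances that already satisfies a margin of 0.2.
d_pos, d_neg = torch.tensor(1.0), torch.tensor(1.3)
hinge = F.relu(0.2 + d_pos - d_neg)   # [m + d_pos - d_neg]_+  -> exactly 0 once the margin is met
soft  = F.softplus(d_pos - d_neg)     # ln(1 + e^(d_pos - d_neg)) -> ~0.55, keeps pulling gently
print(hinge.item(), soft.item())
```

The hinge stops providing gradient once the margin is satisfied, while the softplus keeps pulling positives closer, just with an exponentially decaying force.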

Experiments and Results

The authors did three main experiments as follows:

  1. They experimented with different hyper-parameters and variants of the triplet loss to identify the best settings for training person re-ID networks, evaluating on the MARS validation set.
  2. They compared the performance of the different triplet loss variants.
  3. They reported state-of-the-art results on the CUHK03, Market-1501 and MARS test sets using both pretrained and trained-from-scratch models.

They proposed two model versions

TriNet

  • Uses the ResNet-50 architecture and the pretrained weights provided by He et al. [3]; a rough sketch of such a network follows.
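The sketch below assumes torchvision's ResNet-50; the 1024- and 128-unit head roughly follows the paper's description of the added fully connected layers, but the class name and details are mine:

```python
import torch
import torch.nn as nn
from torchvision import models

class TriNetSketch(nn.Module):
    """ResNet-50 backbone with a small head that outputs a 128-D embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")               # ImageNet-pretrained weights
        self.features = nn.Sequential(*list(backbone.children())[:-1])    # drop the classifier layer
        self.head = nn.Sequential(
            nn.Linear(2048, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, emb_dim),
        )

    def forward(self, x):
        f = self.features(x).flatten(1)   # (N, 2048) pooled backbone features
        return self.head(f)               # (N, 128) embeddings fed to the triplet loss

embs = TriNetSketch()(torch.randn(4, 3, 256, 128))  # four person crops of size 256x128
print(embs.shape)  # torch.Size([4, 128])
```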

LuNet

  • LuNet follows the style of ResNet-v2, but uses leaky ReLU nonlinearities, multiple 3 x 3 max-poolings with stride 2 instead of strided convolutions, and omits the final average pooling of feature-maps in favor of a channel-reducing final res-block. [2]
  • As the network is much more lightweight (5.00M parameters) than its pretrained sibling, the authors sample batches of size 128, containing P = 32 persons with K = 4 images each.
  • An in-depth description of the architecture is given in the supplementary material of the paper [2].
Results from the paper for LuNet on the MARS validation set

The best score was obtained by the soft-margin variation of the batch hard loss.

One important note: the soft-margin formulation works well for person re-ID, but is not necessarily a good choice for other domains.

Results of TriNet and LuNet from the paper

To show that the performance boost indeed comes from the triplet loss and not from other design choices, the authors also trained a ResNet-50 model with classification-style Identification (I) and Verification (V) losses, which underperformed. [2]

Furthermore, I would encourage readers to go through the training and evaluation discussions in the paper to study the ablation experiments the authors ran to reach the optimal training parameters, along with their other findings.

Conclusion

  • In this paper, the authors have shown that, contrary to the prevailing belief, the triplet loss is an excellent tool for person re-identification.
  • The authors propose a variant that performs hard mining within each batch, so offline hard negative mining is no longer required, at almost no additional cost.
  • Combined with a pretrained network, they set a new state of the art (at the time of the paper’s release) on three of the major re-ID datasets. Additionally, they show that training networks from scratch can lead to very competitive scores.

Thank you for reading the article. I hope, as a writer, I was able to convey the topic with the utmost clarity. Please leave a comment if you have any feedback or doubts.

PS: I am trying to make research ideas accessible to all, and it would be a great help if you could spread the word by sharing and following.

References

[1] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In CVPR, 2015

[2] A. Hermans, L. Beyer, and B. Leibe. In Defense of the Triplet Loss for Person Re-Identification. arXiv preprint arXiv:1703.07737, 2017.

[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.

[4] https://www.medium.com/visionwizard
